Challenge 8

challenge_8
PoChunYang
faostat
Joining Data
Author

PoChun Yang

Published

April 26, 2023

library(tidyverse)
library(ggplot2)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. mutate variables as needed (including sanity checks)
  4. join two or more data sets and analyze some aspect of the joined data

(be sure to only include the category tags for the data you use!)

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • faostat ⭐⭐

In the read data, I read all th fao data here. There would be four of data which call egg, livestock, cattle, country.

egg<-read_csv("_data/FAOSTAT_egg_chicken.csv",show_col_types = FALSE)
head(egg,10)
# A tibble: 10 × 14
   Domain Cod…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year
   <chr>        <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl>
 1 QL           Lives…       2 Afgh…    5313 Laying     1062 Eggs…    1961  1961
 2 QL           Lives…       2 Afgh…    5410 Yield      1062 Eggs…    1961  1961
 3 QL           Lives…       2 Afgh…    5510 Produc…    1062 Eggs…    1961  1961
 4 QL           Lives…       2 Afgh…    5313 Laying     1062 Eggs…    1962  1962
 5 QL           Lives…       2 Afgh…    5410 Yield      1062 Eggs…    1962  1962
 6 QL           Lives…       2 Afgh…    5510 Produc…    1062 Eggs…    1962  1962
 7 QL           Lives…       2 Afgh…    5313 Laying     1062 Eggs…    1963  1963
 8 QL           Lives…       2 Afgh…    5410 Yield      1062 Eggs…    1963  1963
 9 QL           Lives…       2 Afgh…    5510 Produc…    1062 Eggs…    1963  1963
10 QL           Lives…       2 Afgh…    5313 Laying     1062 Eggs…    1964  1964
# … with 4 more variables: Unit <chr>, Value <dbl>, Flag <chr>,
#   `Flag Description` <chr>, and abbreviated variable names ¹​`Domain Code`,
#   ²​`Area Code`, ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`
dim(egg)
[1] 38170    14
livestock<-read_csv("_data/FAOSTAT_livestock.csv",show_col_types = FALSE)
head(livestock,10)
# A tibble: 10 × 14
   Domain Cod…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year
   <chr>        <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl>
 1 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1961  1961
 2 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1962  1962
 3 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1963  1963
 4 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1964  1964
 5 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1965  1965
 6 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1966  1966
 7 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1967  1967
 8 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1968  1968
 9 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1969  1969
10 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1970  1970
# … with 4 more variables: Unit <chr>, Value <dbl>, Flag <chr>,
#   `Flag Description` <chr>, and abbreviated variable names ¹​`Domain Code`,
#   ²​`Area Code`, ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`
dim(livestock)
[1] 82116    14
cattle<-read_csv("_data/FAOSTAT_cattle_dairy.csv",show_col_types = FALSE)
head(cattle,10)
# A tibble: 10 × 14
   Domain Cod…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year
   <chr>        <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl>
 1 QL           Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1961  1961
 2 QL           Lives…       2 Afgh…    5420 Yield       882 Milk…    1961  1961
 3 QL           Lives…       2 Afgh…    5510 Produc…     882 Milk…    1961  1961
 4 QL           Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1962  1962
 5 QL           Lives…       2 Afgh…    5420 Yield       882 Milk…    1962  1962
 6 QL           Lives…       2 Afgh…    5510 Produc…     882 Milk…    1962  1962
 7 QL           Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1963  1963
 8 QL           Lives…       2 Afgh…    5420 Yield       882 Milk…    1963  1963
 9 QL           Lives…       2 Afgh…    5510 Produc…     882 Milk…    1963  1963
10 QL           Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1964  1964
# … with 4 more variables: Unit <chr>, Value <dbl>, Flag <chr>,
#   `Flag Description` <chr>, and abbreviated variable names ¹​`Domain Code`,
#   ²​`Area Code`, ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`
dim(cattle)
[1] 36449    14

Briefly describe the data

Tidy Data (as needed)

The data is extremely tidy data. Therefor, I would not use any command to make a clean data. As the above table, I could know 14 of the columns with same column name. Therefore, I could join data easily.

Join Data

Be sure to include a sanity check, and double-check that case count is correct!

The first joins cattle data and livestock. As I used the full_join command here, I got the 118565 rows. I want to prove my answer is correct so I have to sum the rows between those data. In the cattle data, there are 36449 rows. In the livestock data, there are 82116 rows. Therefore, I sum both of the data and got the 118565. This is how I check I am correct.

cattle_live<-full_join(cattle,livestock)
dim(cattle_live)
[1] 118565     14
head(cattle_live)
# A tibble: 6 × 14
  Domai…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QL      Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1961  1961 Head 
2 QL      Lives…       2 Afgh…    5420 Yield       882 Milk…    1961  1961 hg/An
3 QL      Lives…       2 Afgh…    5510 Produc…     882 Milk…    1961  1961 tonn…
4 QL      Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1962  1962 Head 
5 QL      Lives…       2 Afgh…    5420 Yield       882 Milk…    1962  1962 hg/An
6 QL      Lives…       2 Afgh…    5510 Produc…     882 Milk…    1962  1962 tonn…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹​`Domain Code`, ²​`Area Code`,
#   ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`

As the command, I want to double check whether my idea is correct or not. Therefore, I sum the rows of egg and cattle data. 36449 + 38170 = 74619.

egg_cattle<-full_join(egg,cattle)
dim(egg_cattle)
[1] 74619    14
head(egg_cattle)
# A tibble: 6 × 14
  Domai…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QL      Lives…       2 Afgh…    5313 Laying     1062 Eggs…    1961  1961 1000…
2 QL      Lives…       2 Afgh…    5410 Yield      1062 Eggs…    1961  1961 100m…
3 QL      Lives…       2 Afgh…    5510 Produc…    1062 Eggs…    1961  1961 tonn…
4 QL      Lives…       2 Afgh…    5313 Laying     1062 Eggs…    1962  1962 1000…
5 QL      Lives…       2 Afgh…    5410 Yield      1062 Eggs…    1962  1962 100m…
6 QL      Lives…       2 Afgh…    5510 Produc…    1062 Eggs…    1962  1962 tonn…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹​`Domain Code`, ²​`Area Code`,
#   ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`