Challenge 8

challenge_8
faostat
Joining Data
Author

Siddharth Goel

Published

January 31, 2023

library(tidyverse)
library(ggplot2)
library(dplyr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. mutate variables as needed (including sanity checks)
  4. join two or more data sets and analyze some aspect of the joined data

(be sure to only include the category tags for the data you use!)

Read in data

group_df <- read_csv("_data/FAOSTAT_country_groups.csv")
spec(group_df)
cols(
  `Country Group Code` = col_double(),
  `Country Group` = col_character(),
  `Country Code` = col_double(),
  Country = col_character(),
  `M49 Code` = col_character(),
  `ISO2 Code` = col_character(),
  `ISO3 Code` = col_character()
)
head(group_df)
# A tibble: 6 × 7
  `Country Group Code` `Country Group` Country…¹ Country M49 C…² ISO2 …³ ISO3 …⁴
                 <dbl> <chr>               <dbl> <chr>   <chr>   <chr>   <chr>  
1                 5100 Africa                  4 Algeria 012     DZ      DZA    
2                 5100 Africa                  7 Angola  024     AO      AGO    
3                 5100 Africa                 53 Benin   204     BJ      BEN    
4                 5100 Africa                 20 Botswa… 072     BW      BWA    
5                 5100 Africa                233 Burkin… 854     BF      BFA    
6                 5100 Africa                 29 Burundi 108     BI      BDI    
# … with abbreviated variable names ¹​`Country Code`, ²​`M49 Code`, ³​`ISO2 Code`,
#   ⁴​`ISO3 Code`
cattle_df <- read_csv("_data/FAOSTAT_cattle_dairy.csv")
spec(cattle_df)
cols(
  `Domain Code` = col_character(),
  Domain = col_character(),
  `Area Code` = col_double(),
  Area = col_character(),
  `Element Code` = col_double(),
  Element = col_character(),
  `Item Code` = col_double(),
  Item = col_character(),
  `Year Code` = col_double(),
  Year = col_double(),
  Unit = col_character(),
  Value = col_double(),
  Flag = col_character(),
  `Flag Description` = col_character()
)
head(cattle_df)
# A tibble: 6 × 14
  Domai…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QL      Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1961  1961 Head 
2 QL      Lives…       2 Afgh…    5420 Yield       882 Milk…    1961  1961 hg/An
3 QL      Lives…       2 Afgh…    5510 Produc…     882 Milk…    1961  1961 tonn…
4 QL      Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1962  1962 Head 
5 QL      Lives…       2 Afgh…    5420 Yield       882 Milk…    1962  1962 hg/An
6 QL      Lives…       2 Afgh…    5510 Produc…     882 Milk…    1962  1962 tonn…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹​`Domain Code`, ²​`Area Code`,
#   ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`

Briefly describe the data

I have read in 2 datasets -> FAOSTAT_cattle_dairy.csv and FAOSTAT_country_groups.csv which represent the dairy data for areas accross different countries and the country details respectively.

Tidy Data (as needed)

As we can see from the data, the Area Code column in the cattle_df contains values from the Country Code column in group_df. To make it less confusing, I am renaming the Area Code column to Country Code in cattle_df.

cattle_df <- cattle_df %>%  rename ("Country Code" = "Area Code")
head(cattle_df)
# A tibble: 6 × 14
  Domai…¹ Domain Count…² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QL      Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1961  1961 Head 
2 QL      Lives…       2 Afgh…    5420 Yield       882 Milk…    1961  1961 hg/An
3 QL      Lives…       2 Afgh…    5510 Produc…     882 Milk…    1961  1961 tonn…
4 QL      Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1962  1962 Head 
5 QL      Lives…       2 Afgh…    5420 Yield       882 Milk…    1962  1962 hg/An
6 QL      Lives…       2 Afgh…    5510 Produc…     882 Milk…    1962  1962 tonn…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹​`Domain Code`, ²​`Country Code`,
#   ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`

Are there any variables that require mutation to be usable in your analysis stream? For example, do you need to calculate new values in order to graph them? Can string values be represented numerically? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

No variables require mutation here and the datasets are ready for joining.

Join Data

join_df <- left_join(cattle_df, group_df, by = "Country Code" )
head(join_df)
# A tibble: 6 × 20
  Domai…¹ Domain Count…² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QL      Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1961  1961 Head 
2 QL      Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1961  1961 Head 
3 QL      Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1961  1961 Head 
4 QL      Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1961  1961 Head 
5 QL      Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1961  1961 Head 
6 QL      Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1961  1961 Head 
# … with 9 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   `Country Group Code` <dbl>, `Country Group` <chr>, Country <chr>,
#   `M49 Code` <chr>, `ISO2 Code` <chr>, `ISO3 Code` <chr>, and abbreviated
#   variable names ¹​`Domain Code`, ²​`Country Code`, ³​`Element Code`,
#   ⁴​`Item Code`, ⁵​`Year Code`
spec(join_df)
cols(
  `Domain Code` = col_character(),
  Domain = col_character(),
  `Area Code` = col_double(),
  Area = col_character(),
  `Element Code` = col_double(),
  Element = col_character(),
  `Item Code` = col_double(),
  Item = col_character(),
  `Year Code` = col_double(),
  Year = col_double(),
  Unit = col_character(),
  Value = col_double(),
  Flag = col_character(),
  `Flag Description` = col_character()
)
nrow(join_df)
[1] 257061
ncol(join_df)
[1] 20
n_distinct(cattle_df$`Country Code`)
[1] 232
n_distinct(group_df$`Country Code`)
[1] 277

As an analytical example, I will calculate the total value of items per country group per year

output_df <- join_df %>% group_by(`Country Group`) %>%
                    summarise(total_value = sum(Value))
head(output_df)
# A tibble: 6 × 2
  `Country Group`           total_value
  <chr>                           <dbl>
1 Africa                     3300602669
2 Americas                           NA
3 Annex I countries                  NA
4 Asia                               NA
5 Australia and New Zealand  1371659056
6 Caribbean                          NA