library(tidyverse)
library(ggplot2)
library(dplyr)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Challenge 8
Challenge Overview
Today’s challenge is to:
- read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data (as needed, including sanity checks)
- mutate variables as needed (including sanity checks)
- join two or more data sets and analyze some aspect of the joined data
(be sure to only include the category tags for the data you use!)
Read in data
<- read_csv("_data/FAOSTAT_country_groups.csv")
group_df spec(group_df)
cols(
`Country Group Code` = col_double(),
`Country Group` = col_character(),
`Country Code` = col_double(),
Country = col_character(),
`M49 Code` = col_character(),
`ISO2 Code` = col_character(),
`ISO3 Code` = col_character()
)
head(group_df)
# A tibble: 6 × 7
`Country Group Code` `Country Group` Country…¹ Country M49 C…² ISO2 …³ ISO3 …⁴
<dbl> <chr> <dbl> <chr> <chr> <chr> <chr>
1 5100 Africa 4 Algeria 012 DZ DZA
2 5100 Africa 7 Angola 024 AO AGO
3 5100 Africa 53 Benin 204 BJ BEN
4 5100 Africa 20 Botswa… 072 BW BWA
5 5100 Africa 233 Burkin… 854 BF BFA
6 5100 Africa 29 Burundi 108 BI BDI
# … with abbreviated variable names ¹`Country Code`, ²`M49 Code`, ³`ISO2 Code`,
# ⁴`ISO3 Code`
<- read_csv("_data/FAOSTAT_cattle_dairy.csv")
cattle_df spec(cattle_df)
cols(
`Domain Code` = col_character(),
Domain = col_character(),
`Area Code` = col_double(),
Area = col_character(),
`Element Code` = col_double(),
Element = col_character(),
`Item Code` = col_double(),
Item = col_character(),
`Year Code` = col_double(),
Year = col_double(),
Unit = col_character(),
Value = col_double(),
Flag = col_character(),
`Flag Description` = col_character()
)
head(cattle_df)
# A tibble: 6 × 14
Domai…¹ Domain Area …² Area Eleme…³ Element Item …⁴ Item Year …⁵ Year Unit
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr>
1 QL Lives… 2 Afgh… 5318 Milk A… 882 Milk… 1961 1961 Head
2 QL Lives… 2 Afgh… 5420 Yield 882 Milk… 1961 1961 hg/An
3 QL Lives… 2 Afgh… 5510 Produc… 882 Milk… 1961 1961 tonn…
4 QL Lives… 2 Afgh… 5318 Milk A… 882 Milk… 1962 1962 Head
5 QL Lives… 2 Afgh… 5420 Yield 882 Milk… 1962 1962 hg/An
6 QL Lives… 2 Afgh… 5510 Produc… 882 Milk… 1962 1962 tonn…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
# and abbreviated variable names ¹`Domain Code`, ²`Area Code`,
# ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`
Briefly describe the data
I have read in 2 datasets -> FAOSTAT_cattle_dairy.csv and FAOSTAT_country_groups.csv which represent the dairy data for areas accross different countries and the country details respectively.
Tidy Data (as needed)
As we can see from the data, the Area Code
column in the cattle_df contains values from the Country Code
column in group_df. To make it less confusing, I am renaming the Area Code
column to Country Code
in cattle_df.
<- cattle_df %>% rename ("Country Code" = "Area Code")
cattle_df head(cattle_df)
# A tibble: 6 × 14
Domai…¹ Domain Count…² Area Eleme…³ Element Item …⁴ Item Year …⁵ Year Unit
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr>
1 QL Lives… 2 Afgh… 5318 Milk A… 882 Milk… 1961 1961 Head
2 QL Lives… 2 Afgh… 5420 Yield 882 Milk… 1961 1961 hg/An
3 QL Lives… 2 Afgh… 5510 Produc… 882 Milk… 1961 1961 tonn…
4 QL Lives… 2 Afgh… 5318 Milk A… 882 Milk… 1962 1962 Head
5 QL Lives… 2 Afgh… 5420 Yield 882 Milk… 1962 1962 hg/An
6 QL Lives… 2 Afgh… 5510 Produc… 882 Milk… 1962 1962 tonn…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
# and abbreviated variable names ¹`Domain Code`, ²`Country Code`,
# ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`
Are there any variables that require mutation to be usable in your analysis stream? For example, do you need to calculate new values in order to graph them? Can string values be represented numerically? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
No variables require mutation here and the datasets are ready for joining.
Join Data
<- left_join(cattle_df, group_df, by = "Country Code" )
join_df head(join_df)
# A tibble: 6 × 20
Domai…¹ Domain Count…² Area Eleme…³ Element Item …⁴ Item Year …⁵ Year Unit
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr>
1 QL Lives… 2 Afgh… 5318 Milk A… 882 Milk… 1961 1961 Head
2 QL Lives… 2 Afgh… 5318 Milk A… 882 Milk… 1961 1961 Head
3 QL Lives… 2 Afgh… 5318 Milk A… 882 Milk… 1961 1961 Head
4 QL Lives… 2 Afgh… 5318 Milk A… 882 Milk… 1961 1961 Head
5 QL Lives… 2 Afgh… 5318 Milk A… 882 Milk… 1961 1961 Head
6 QL Lives… 2 Afgh… 5318 Milk A… 882 Milk… 1961 1961 Head
# … with 9 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
# `Country Group Code` <dbl>, `Country Group` <chr>, Country <chr>,
# `M49 Code` <chr>, `ISO2 Code` <chr>, `ISO3 Code` <chr>, and abbreviated
# variable names ¹`Domain Code`, ²`Country Code`, ³`Element Code`,
# ⁴`Item Code`, ⁵`Year Code`
spec(join_df)
cols(
`Domain Code` = col_character(),
Domain = col_character(),
`Area Code` = col_double(),
Area = col_character(),
`Element Code` = col_double(),
Element = col_character(),
`Item Code` = col_double(),
Item = col_character(),
`Year Code` = col_double(),
Year = col_double(),
Unit = col_character(),
Value = col_double(),
Flag = col_character(),
`Flag Description` = col_character()
)
nrow(join_df)
[1] 257061
ncol(join_df)
[1] 20
n_distinct(cattle_df$`Country Code`)
[1] 232
n_distinct(group_df$`Country Code`)
[1] 277
As an analytical example, I will calculate the total value of items per country group per year
<- join_df %>% group_by(`Country Group`) %>%
output_df summarise(total_value = sum(Value))
head(output_df)
# A tibble: 6 × 2
`Country Group` total_value
<chr> <dbl>
1 Africa 3300602669
2 Americas NA
3 Annex I countries NA
4 Asia NA
5 Australia and New Zealand 1371659056
6 Caribbean NA