Challenge 8 Instructions

challenge_8
FNU Avinesh Krishnan
faostat
Joining Data
Author

FNU Avinesh Krishnan

Published

May 16, 2023

library(tidyverse)
library(ggplot2)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. mutate variables as needed (including sanity checks)
  4. join two or more data sets and analyze some aspect of the joined data

(be sure to only include the category tags for the data you use!)

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • military marriages ⭐⭐
  • faostat ⭐⭐
  • railroads ⭐⭐⭐
  • fed_rate ⭐⭐⭐
  • debt ⭐⭐⭐
  • us_hh ⭐⭐⭐⭐
  • snl ⭐⭐⭐⭐⭐
eggchickdata<-read_csv("~/Desktop/601_Spring_2023/posts/_data/FAOSTAT_egg_chicken.csv",show_col_types = FALSE)
stockdata<-read_csv("~/Desktop/601_Spring_2023/posts/_data/FAOSTAT_livestock.csv",show_col_types = FALSE)
cattledata<-read_csv("~/Desktop/601_Spring_2023/posts/_data/FAOSTAT_cattle_dairy.csv",show_col_types = FALSE)

Briefly describe the data

head(eggchickdata)
# A tibble: 6 × 14
  Domai…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QL      Lives…       2 Afgh…    5313 Laying     1062 Eggs…    1961  1961 1000…
2 QL      Lives…       2 Afgh…    5410 Yield      1062 Eggs…    1961  1961 100m…
3 QL      Lives…       2 Afgh…    5510 Produc…    1062 Eggs…    1961  1961 tonn…
4 QL      Lives…       2 Afgh…    5313 Laying     1062 Eggs…    1962  1962 1000…
5 QL      Lives…       2 Afgh…    5410 Yield      1062 Eggs…    1962  1962 100m…
6 QL      Lives…       2 Afgh…    5510 Produc…    1062 Eggs…    1962  1962 tonn…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹​`Domain Code`, ²​`Area Code`,
#   ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`
dim(eggchickdata)
[1] 38170    14
str(eggchickdata)
spc_tbl_ [38,170 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Domain Code     : chr [1:38170] "QL" "QL" "QL" "QL" ...
 $ Domain          : chr [1:38170] "Livestock Primary" "Livestock Primary" "Livestock Primary" "Livestock Primary" ...
 $ Area Code       : num [1:38170] 2 2 2 2 2 2 2 2 2 2 ...
 $ Area            : chr [1:38170] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ Element Code    : num [1:38170] 5313 5410 5510 5313 5410 ...
 $ Element         : chr [1:38170] "Laying" "Yield" "Production" "Laying" ...
 $ Item Code       : num [1:38170] 1062 1062 1062 1062 1062 ...
 $ Item            : chr [1:38170] "Eggs, hen, in shell" "Eggs, hen, in shell" "Eggs, hen, in shell" "Eggs, hen, in shell" ...
 $ Year Code       : num [1:38170] 1961 1961 1961 1962 1962 ...
 $ Year            : num [1:38170] 1961 1961 1961 1962 1962 ...
 $ Unit            : chr [1:38170] "1000 Head" "100mg/An" "tonnes" "1000 Head" ...
 $ Value           : num [1:38170] 4000 25000 10000 4400 25000 11000 4600 25000 11500 4800 ...
 $ Flag            : chr [1:38170] "F" "Fc" "F" "F" ...
 $ Flag Description: chr [1:38170] "FAO estimate" "Calculated data" "FAO estimate" "FAO estimate" ...
 - attr(*, "spec")=
  .. cols(
  ..   `Domain Code` = col_character(),
  ..   Domain = col_character(),
  ..   `Area Code` = col_double(),
  ..   Area = col_character(),
  ..   `Element Code` = col_double(),
  ..   Element = col_character(),
  ..   `Item Code` = col_double(),
  ..   Item = col_character(),
  ..   `Year Code` = col_double(),
  ..   Year = col_double(),
  ..   Unit = col_character(),
  ..   Value = col_double(),
  ..   Flag = col_character(),
  ..   `Flag Description` = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
head(stockdata)
# A tibble: 6 × 14
  Domai…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1961  1961 Head 
2 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1962  1962 Head 
3 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1963  1963 Head 
4 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1964  1964 Head 
5 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1965  1965 Head 
6 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1966  1966 Head 
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹​`Domain Code`, ²​`Area Code`,
#   ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`
dim(stockdata)
[1] 82116    14
str(stockdata)
spc_tbl_ [82,116 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Domain Code     : chr [1:82116] "QA" "QA" "QA" "QA" ...
 $ Domain          : chr [1:82116] "Live Animals" "Live Animals" "Live Animals" "Live Animals" ...
 $ Area Code       : num [1:82116] 2 2 2 2 2 2 2 2 2 2 ...
 $ Area            : chr [1:82116] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ Element Code    : num [1:82116] 5111 5111 5111 5111 5111 ...
 $ Element         : chr [1:82116] "Stocks" "Stocks" "Stocks" "Stocks" ...
 $ Item Code       : num [1:82116] 1107 1107 1107 1107 1107 ...
 $ Item            : chr [1:82116] "Asses" "Asses" "Asses" "Asses" ...
 $ Year Code       : num [1:82116] 1961 1962 1963 1964 1965 ...
 $ Year            : num [1:82116] 1961 1962 1963 1964 1965 ...
 $ Unit            : chr [1:82116] "Head" "Head" "Head" "Head" ...
 $ Value           : num [1:82116] 1300000 851850 1001112 1150000 1300000 ...
 $ Flag            : chr [1:82116] NA NA NA "F" ...
 $ Flag Description: chr [1:82116] "Official data" "Official data" "Official data" "FAO estimate" ...
 - attr(*, "spec")=
  .. cols(
  ..   `Domain Code` = col_character(),
  ..   Domain = col_character(),
  ..   `Area Code` = col_double(),
  ..   Area = col_character(),
  ..   `Element Code` = col_double(),
  ..   Element = col_character(),
  ..   `Item Code` = col_double(),
  ..   Item = col_character(),
  ..   `Year Code` = col_double(),
  ..   Year = col_double(),
  ..   Unit = col_character(),
  ..   Value = col_double(),
  ..   Flag = col_character(),
  ..   `Flag Description` = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
head(cattledata)
# A tibble: 6 × 14
  Domai…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QL      Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1961  1961 Head 
2 QL      Lives…       2 Afgh…    5420 Yield       882 Milk…    1961  1961 hg/An
3 QL      Lives…       2 Afgh…    5510 Produc…     882 Milk…    1961  1961 tonn…
4 QL      Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1962  1962 Head 
5 QL      Lives…       2 Afgh…    5420 Yield       882 Milk…    1962  1962 hg/An
6 QL      Lives…       2 Afgh…    5510 Produc…     882 Milk…    1962  1962 tonn…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹​`Domain Code`, ²​`Area Code`,
#   ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`
dim(cattledata)
[1] 36449    14
str(cattledata)
spc_tbl_ [36,449 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Domain Code     : chr [1:36449] "QL" "QL" "QL" "QL" ...
 $ Domain          : chr [1:36449] "Livestock Primary" "Livestock Primary" "Livestock Primary" "Livestock Primary" ...
 $ Area Code       : num [1:36449] 2 2 2 2 2 2 2 2 2 2 ...
 $ Area            : chr [1:36449] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ Element Code    : num [1:36449] 5318 5420 5510 5318 5420 ...
 $ Element         : chr [1:36449] "Milk Animals" "Yield" "Production" "Milk Animals" ...
 $ Item Code       : num [1:36449] 882 882 882 882 882 882 882 882 882 882 ...
 $ Item            : chr [1:36449] "Milk, whole fresh cow" "Milk, whole fresh cow" "Milk, whole fresh cow" "Milk, whole fresh cow" ...
 $ Year Code       : num [1:36449] 1961 1961 1961 1962 1962 ...
 $ Year            : num [1:36449] 1961 1961 1961 1962 1962 ...
 $ Unit            : chr [1:36449] "Head" "hg/An" "tonnes" "Head" ...
 $ Value           : num [1:36449] 700000 5000 350000 700000 5000 ...
 $ Flag            : chr [1:36449] "F" "Fc" "F" "F" ...
 $ Flag Description: chr [1:36449] "FAO estimate" "Calculated data" "FAO estimate" "FAO estimate" ...
 - attr(*, "spec")=
  .. cols(
  ..   `Domain Code` = col_character(),
  ..   Domain = col_character(),
  ..   `Area Code` = col_double(),
  ..   Area = col_character(),
  ..   `Element Code` = col_double(),
  ..   Element = col_character(),
  ..   `Item Code` = col_double(),
  ..   Item = col_character(),
  ..   `Year Code` = col_double(),
  ..   Year = col_double(),
  ..   Unit = col_character(),
  ..   Value = col_double(),
  ..   Flag = col_character(),
  ..   `Flag Description` = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

Are there any variables that require mutation to be usable in your analysis stream? For example, do you need to calculate new values in order to graph them? Can string values be represented numerically? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

Document your work here.

Join Data

Be sure to include a sanity check, and double-check that case count is correct!

cattle_join<-full_join(cattledata,stockdata)
dim(cattle_join)
[1] 118565     14

The dim is correct, 36449+82116 = 118565

joindata<-full_join(eggchickdata,cattledata)
dim(joindata)
[1] 74619    14

The dim is correct, 36449+38170 = 74619