library(tidyverse)
library(ggplot2)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Challenge 8 Instructions
Challenge Overview
Today’s challenge is to:
- read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data (as needed, including sanity checks)
- mutate variables as needed (including sanity checks)
- join two or more data sets and analyze some aspect of the joined data
(be sure to only include the category tags for the data you use!)
Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- military marriages ⭐⭐
- faostat ⭐⭐
- railroads ⭐⭐⭐
- fed_rate ⭐⭐⭐
- debt ⭐⭐⭐
- us_hh ⭐⭐⭐⭐
- snl ⭐⭐⭐⭐⭐
<-read_csv("~/Desktop/601_Spring_2023/posts/_data/FAOSTAT_egg_chicken.csv",show_col_types = FALSE)
eggchickdata<-read_csv("~/Desktop/601_Spring_2023/posts/_data/FAOSTAT_livestock.csv",show_col_types = FALSE)
stockdata<-read_csv("~/Desktop/601_Spring_2023/posts/_data/FAOSTAT_cattle_dairy.csv",show_col_types = FALSE) cattledata
Briefly describe the data
head(eggchickdata)
# A tibble: 6 × 14
Domai…¹ Domain Area …² Area Eleme…³ Element Item …⁴ Item Year …⁵ Year Unit
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr>
1 QL Lives… 2 Afgh… 5313 Laying 1062 Eggs… 1961 1961 1000…
2 QL Lives… 2 Afgh… 5410 Yield 1062 Eggs… 1961 1961 100m…
3 QL Lives… 2 Afgh… 5510 Produc… 1062 Eggs… 1961 1961 tonn…
4 QL Lives… 2 Afgh… 5313 Laying 1062 Eggs… 1962 1962 1000…
5 QL Lives… 2 Afgh… 5410 Yield 1062 Eggs… 1962 1962 100m…
6 QL Lives… 2 Afgh… 5510 Produc… 1062 Eggs… 1962 1962 tonn…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
# and abbreviated variable names ¹`Domain Code`, ²`Area Code`,
# ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`
dim(eggchickdata)
[1] 38170 14
str(eggchickdata)
spc_tbl_ [38,170 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Domain Code : chr [1:38170] "QL" "QL" "QL" "QL" ...
$ Domain : chr [1:38170] "Livestock Primary" "Livestock Primary" "Livestock Primary" "Livestock Primary" ...
$ Area Code : num [1:38170] 2 2 2 2 2 2 2 2 2 2 ...
$ Area : chr [1:38170] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ Element Code : num [1:38170] 5313 5410 5510 5313 5410 ...
$ Element : chr [1:38170] "Laying" "Yield" "Production" "Laying" ...
$ Item Code : num [1:38170] 1062 1062 1062 1062 1062 ...
$ Item : chr [1:38170] "Eggs, hen, in shell" "Eggs, hen, in shell" "Eggs, hen, in shell" "Eggs, hen, in shell" ...
$ Year Code : num [1:38170] 1961 1961 1961 1962 1962 ...
$ Year : num [1:38170] 1961 1961 1961 1962 1962 ...
$ Unit : chr [1:38170] "1000 Head" "100mg/An" "tonnes" "1000 Head" ...
$ Value : num [1:38170] 4000 25000 10000 4400 25000 11000 4600 25000 11500 4800 ...
$ Flag : chr [1:38170] "F" "Fc" "F" "F" ...
$ Flag Description: chr [1:38170] "FAO estimate" "Calculated data" "FAO estimate" "FAO estimate" ...
- attr(*, "spec")=
.. cols(
.. `Domain Code` = col_character(),
.. Domain = col_character(),
.. `Area Code` = col_double(),
.. Area = col_character(),
.. `Element Code` = col_double(),
.. Element = col_character(),
.. `Item Code` = col_double(),
.. Item = col_character(),
.. `Year Code` = col_double(),
.. Year = col_double(),
.. Unit = col_character(),
.. Value = col_double(),
.. Flag = col_character(),
.. `Flag Description` = col_character()
.. )
- attr(*, "problems")=<externalptr>
head(stockdata)
# A tibble: 6 × 14
Domai…¹ Domain Area …² Area Eleme…³ Element Item …⁴ Item Year …⁵ Year Unit
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr>
1 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1961 1961 Head
2 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1962 1962 Head
3 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1963 1963 Head
4 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1964 1964 Head
5 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1965 1965 Head
6 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1966 1966 Head
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
# and abbreviated variable names ¹`Domain Code`, ²`Area Code`,
# ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`
dim(stockdata)
[1] 82116 14
str(stockdata)
spc_tbl_ [82,116 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Domain Code : chr [1:82116] "QA" "QA" "QA" "QA" ...
$ Domain : chr [1:82116] "Live Animals" "Live Animals" "Live Animals" "Live Animals" ...
$ Area Code : num [1:82116] 2 2 2 2 2 2 2 2 2 2 ...
$ Area : chr [1:82116] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ Element Code : num [1:82116] 5111 5111 5111 5111 5111 ...
$ Element : chr [1:82116] "Stocks" "Stocks" "Stocks" "Stocks" ...
$ Item Code : num [1:82116] 1107 1107 1107 1107 1107 ...
$ Item : chr [1:82116] "Asses" "Asses" "Asses" "Asses" ...
$ Year Code : num [1:82116] 1961 1962 1963 1964 1965 ...
$ Year : num [1:82116] 1961 1962 1963 1964 1965 ...
$ Unit : chr [1:82116] "Head" "Head" "Head" "Head" ...
$ Value : num [1:82116] 1300000 851850 1001112 1150000 1300000 ...
$ Flag : chr [1:82116] NA NA NA "F" ...
$ Flag Description: chr [1:82116] "Official data" "Official data" "Official data" "FAO estimate" ...
- attr(*, "spec")=
.. cols(
.. `Domain Code` = col_character(),
.. Domain = col_character(),
.. `Area Code` = col_double(),
.. Area = col_character(),
.. `Element Code` = col_double(),
.. Element = col_character(),
.. `Item Code` = col_double(),
.. Item = col_character(),
.. `Year Code` = col_double(),
.. Year = col_double(),
.. Unit = col_character(),
.. Value = col_double(),
.. Flag = col_character(),
.. `Flag Description` = col_character()
.. )
- attr(*, "problems")=<externalptr>
head(cattledata)
# A tibble: 6 × 14
Domai…¹ Domain Area …² Area Eleme…³ Element Item …⁴ Item Year …⁵ Year Unit
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr>
1 QL Lives… 2 Afgh… 5318 Milk A… 882 Milk… 1961 1961 Head
2 QL Lives… 2 Afgh… 5420 Yield 882 Milk… 1961 1961 hg/An
3 QL Lives… 2 Afgh… 5510 Produc… 882 Milk… 1961 1961 tonn…
4 QL Lives… 2 Afgh… 5318 Milk A… 882 Milk… 1962 1962 Head
5 QL Lives… 2 Afgh… 5420 Yield 882 Milk… 1962 1962 hg/An
6 QL Lives… 2 Afgh… 5510 Produc… 882 Milk… 1962 1962 tonn…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
# and abbreviated variable names ¹`Domain Code`, ²`Area Code`,
# ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`
dim(cattledata)
[1] 36449 14
str(cattledata)
spc_tbl_ [36,449 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Domain Code : chr [1:36449] "QL" "QL" "QL" "QL" ...
$ Domain : chr [1:36449] "Livestock Primary" "Livestock Primary" "Livestock Primary" "Livestock Primary" ...
$ Area Code : num [1:36449] 2 2 2 2 2 2 2 2 2 2 ...
$ Area : chr [1:36449] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ Element Code : num [1:36449] 5318 5420 5510 5318 5420 ...
$ Element : chr [1:36449] "Milk Animals" "Yield" "Production" "Milk Animals" ...
$ Item Code : num [1:36449] 882 882 882 882 882 882 882 882 882 882 ...
$ Item : chr [1:36449] "Milk, whole fresh cow" "Milk, whole fresh cow" "Milk, whole fresh cow" "Milk, whole fresh cow" ...
$ Year Code : num [1:36449] 1961 1961 1961 1962 1962 ...
$ Year : num [1:36449] 1961 1961 1961 1962 1962 ...
$ Unit : chr [1:36449] "Head" "hg/An" "tonnes" "Head" ...
$ Value : num [1:36449] 700000 5000 350000 700000 5000 ...
$ Flag : chr [1:36449] "F" "Fc" "F" "F" ...
$ Flag Description: chr [1:36449] "FAO estimate" "Calculated data" "FAO estimate" "FAO estimate" ...
- attr(*, "spec")=
.. cols(
.. `Domain Code` = col_character(),
.. Domain = col_character(),
.. `Area Code` = col_double(),
.. Area = col_character(),
.. `Element Code` = col_double(),
.. Element = col_character(),
.. `Item Code` = col_double(),
.. Item = col_character(),
.. `Year Code` = col_double(),
.. Year = col_double(),
.. Unit = col_character(),
.. Value = col_double(),
.. Flag = col_character(),
.. `Flag Description` = col_character()
.. )
- attr(*, "problems")=<externalptr>
Tidy Data (as needed)
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
Are there any variables that require mutation to be usable in your analysis stream? For example, do you need to calculate new values in order to graph them? Can string values be represented numerically? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Document your work here.
Join Data
Be sure to include a sanity check, and double-check that case count is correct!
<-full_join(cattledata,stockdata)
cattle_joindim(cattle_join)
[1] 118565 14
The dim is correct, 36449+82116 = 118565
<-full_join(eggchickdata,cattledata)
joindatadim(joindata)
[1] 74619 14
The dim is correct, 36449+38170 = 74619