library(tidyverse)
library(ggplot2)
library(readr)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Challenge 8 Instructions
Challenge Overview
- mutate variables as needed (including sanity checks)
- join two or more data sets and analyze some aspect of the joined data
(be sure to only include the category tags for the data you use!)
Read in data
The required data is initially read.
library(readr)
<- read_csv("_data/FAOSTAT_egg_chicken.csv")
egg_chicken <- read_csv("_data/FAOSTAT_cattle_dairy.csv")
cattle_dairy <- read_csv("_data/FAOSTAT_livestock.csv") livestock
Briefly describe the data
While we make efforts to understand the data, we see all the column names present, the sample head of the three datasets.
colnames(egg_chicken)
[1] "Domain Code" "Domain" "Area Code" "Area"
[5] "Element Code" "Element" "Item Code" "Item"
[9] "Year Code" "Year" "Unit" "Value"
[13] "Flag" "Flag Description"
colnames(cattle_dairy)
[1] "Domain Code" "Domain" "Area Code" "Area"
[5] "Element Code" "Element" "Item Code" "Item"
[9] "Year Code" "Year" "Unit" "Value"
[13] "Flag" "Flag Description"
colnames(livestock)
[1] "Domain Code" "Domain" "Area Code" "Area"
[5] "Element Code" "Element" "Item Code" "Item"
[9] "Year Code" "Year" "Unit" "Value"
[13] "Flag" "Flag Description"
head(cattle_dairy)
# A tibble: 6 × 14
Domai…¹ Domain Area …² Area Eleme…³ Element Item …⁴ Item Year …⁵ Year Unit
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr>
1 QL Lives… 2 Afgh… 5318 Milk A… 882 Milk… 1961 1961 Head
2 QL Lives… 2 Afgh… 5420 Yield 882 Milk… 1961 1961 hg/An
3 QL Lives… 2 Afgh… 5510 Produc… 882 Milk… 1961 1961 tonn…
4 QL Lives… 2 Afgh… 5318 Milk A… 882 Milk… 1962 1962 Head
5 QL Lives… 2 Afgh… 5420 Yield 882 Milk… 1962 1962 hg/An
6 QL Lives… 2 Afgh… 5510 Produc… 882 Milk… 1962 1962 tonn…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
# and abbreviated variable names ¹`Domain Code`, ²`Area Code`,
# ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`
head(egg_chicken)
# A tibble: 6 × 14
Domai…¹ Domain Area …² Area Eleme…³ Element Item …⁴ Item Year …⁵ Year Unit
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr>
1 QL Lives… 2 Afgh… 5313 Laying 1062 Eggs… 1961 1961 1000…
2 QL Lives… 2 Afgh… 5410 Yield 1062 Eggs… 1961 1961 100m…
3 QL Lives… 2 Afgh… 5510 Produc… 1062 Eggs… 1961 1961 tonn…
4 QL Lives… 2 Afgh… 5313 Laying 1062 Eggs… 1962 1962 1000…
5 QL Lives… 2 Afgh… 5410 Yield 1062 Eggs… 1962 1962 100m…
6 QL Lives… 2 Afgh… 5510 Produc… 1062 Eggs… 1962 1962 tonn…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
# and abbreviated variable names ¹`Domain Code`, ²`Area Code`,
# ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`
head(livestock)
# A tibble: 6 × 14
Domai…¹ Domain Area …² Area Eleme…³ Element Item …⁴ Item Year …⁵ Year Unit
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr>
1 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1961 1961 Head
2 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1962 1962 Head
3 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1963 1963 Head
4 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1964 1964 Head
5 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1965 1965 Head
6 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1966 1966 Head
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
# and abbreviated variable names ¹`Domain Code`, ²`Area Code`,
# ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`
table(egg_chicken$Item)
Eggs, hen, in shell
38170
table(cattle_dairy$Item)
Milk, whole fresh cow
36449
table(livestock$Item)
Asses Buffaloes Camels Cattle Goats Horses Mules Pigs
8571 3505 3265 13086 12498 11104 6153 12015
Sheep
11919
we observe that only in live stock table we have multiple items.
all(egg_chicken$Year==egg_chicken$`Year Code`)
[1] TRUE
all(cattle_dairy$Year==cattle_dairy$`Year Code`)
[1] TRUE
all(livestock$Year==livestock$`Year Code`)
[1] TRUE
we observe duplicated columns in the data sets.
sapply(egg_chicken,function(x)sum(is.na(x)))
Domain Code Domain Area Code Area
0 0 0 0
Element Code Element Item Code Item
0 0 0 0
Year Code Year Unit Value
0 0 0 40
Flag Flag Description
7548 0
we see that there are 40 missing values of value and 7548 missing values of flag description.
sapply(cattle_dairy,function(x)sum(is.na(x)))
Domain Code Domain Area Code Area
0 0 0 0
Element Code Element Item Code Item
0 0 0 0
Year Code Year Unit Value
0 0 0 74
Flag Flag Description
10044 0
sapply(livestock,function(x)sum(is.na(x)))
Domain Code Domain Area Code Area
0 0 0 0
Element Code Element Item Code Item
0 0 0 0
Year Code Year Unit Value
0 0 0 1301
Flag Flag Description
38270 0
In live_stock there are 1301 missing values in values and 38270 missing values in flags
print(summarytools::dfSummary(egg_chicken,
varnumbers = FALSE,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.70,
valid.col = FALSE),
method = 'render',
table.classes = 'table-condensed')
Data Frame Summary
egg_chicken
Dimensions: 38170 x 14Duplicates: 0
Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Domain Code [character] | 1. QL |
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Domain [character] | 1. Livestock Primary |
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Area Code [numeric] |
|
245 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Area [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Element Code [numeric] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Element [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Item Code [numeric] | 1 distinct value |
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Item [character] | 1. Eggs, hen, in shell |
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Year Code [numeric] |
|
58 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Year [numeric] |
|
58 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Unit [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Value [numeric] |
|
21325 distinct values | 40 (0.1%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Flag [character] |
|
|
7548 (19.8%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Flag Description [character] |
|
|
0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-21
print(summarytools::dfSummary(livestock,
varnumbers = FALSE,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.70,
valid.col = FALSE),
method = 'render',
table.classes = 'table-condensed')
Data Frame Summary
livestock
Dimensions: 82116 x 14Duplicates: 0
Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Domain Code [character] | 1. QA |
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Domain [character] | 1. Live Animals |
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Area Code [numeric] |
|
253 distinct values | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Area [character] |
|
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Element Code [numeric] | 1 distinct value |
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Element [character] | 1. Stocks |
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Item Code [numeric] |
|
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Item [character] |
|
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Year Code [numeric] |
|
58 distinct values | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Year [numeric] |
|
58 distinct values | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Unit [character] | 1. Head |
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Value [numeric] |
|
43667 distinct values | 1301 (1.6%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Flag [character] |
|
|
38270 (46.6%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Flag Description [character] |
|
|
0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-21
print(summarytools::dfSummary(cattle_dairy,
varnumbers = FALSE,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.70,
valid.col = FALSE),
method = 'render',
table.classes = 'table-condensed')
Data Frame Summary
cattle_dairy
Dimensions: 36449 x 14Duplicates: 0
Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Domain Code [character] | 1. QL |
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Domain [character] | 1. Livestock Primary |
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Area Code [numeric] |
|
232 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Area [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Element Code [numeric] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Element [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Item Code [numeric] | 1 distinct value |
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Item [character] | 1. Milk, whole fresh cow |
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Year Code [numeric] |
|
58 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Year [numeric] |
|
58 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Unit [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Value [numeric] |
|
24088 distinct values | 74 (0.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Flag [character] |
|
|
10044 (27.6%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Flag Description [character] |
|
|
0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-21
Tidy the data
First we need to remove the duplicate columns from each of the data set
<- subset(cattle_dairy,select= -c(`Year Code`)) cattle_diary_new
head(cattle_diary_new)
# A tibble: 6 × 13
Domain …¹ Domain Area …² Area Eleme…³ Element Item …⁴ Item Year Unit Value
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 QL Lives… 2 Afgh… 5318 Milk A… 882 Milk… 1961 Head 7 e5
2 QL Lives… 2 Afgh… 5420 Yield 882 Milk… 1961 hg/An 5 e3
3 QL Lives… 2 Afgh… 5510 Produc… 882 Milk… 1961 tonn… 3.5e5
4 QL Lives… 2 Afgh… 5318 Milk A… 882 Milk… 1962 Head 7 e5
5 QL Lives… 2 Afgh… 5420 Yield 882 Milk… 1962 hg/An 5 e3
6 QL Lives… 2 Afgh… 5510 Produc… 882 Milk… 1962 tonn… 3.5e5
# … with 2 more variables: Flag <chr>, `Flag Description` <chr>, and
# abbreviated variable names ¹`Domain Code`, ²`Area Code`, ³`Element Code`,
# ⁴`Item Code`
<- subset(livestock,select= -c(`Year Code`))
livestock_new <- subset(egg_chicken,select= -c(`Year Code`)) egg_new
<- cattle_diary_new%>%
cattle_diary_newmutate(Item=str_replace_all(Item,"Milk, whole fresh cow", "Milk"))
<- egg_new%>%
egg_new mutate(Item=str_replace_all(Item,"Eggs, hen, in shell","Poultry"))
table(cattle_diary_mutate$Item)
Error in table(cattle_diary_mutate$Item): object 'cattle_diary_mutate' not found
table(egg_new$Item)
Poultry
38170
Now that we have renamed the columns, now we perform a join on the tidied data.
Join Data
we have same columns in cattle, diary and live stock tables. so when we join them the final data must have the rows summated together in all the tables.
<- full_join(egg_new,cattle_diary_new)
cattle_egg <- full_join(cattle_egg,livestock_new) joined_data
dim(joined_data)
[1] 156735 13
we now confirm that they are all joined together.
Visualizations
ggplot(joined_data,aes(x=Item))+geom_bar(fill = "lightpink") +
labs(title = "Item and count", x = "Item",
y = "Count")+geom_text(stat='count', aes(label=..count..), vjust=1)
Now, we have understood that we can perform visualizations too on joined data.