library(tidyverse)
library(ggplot2)
library(readr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)Challenge 8 Instructions
Challenge Overview
- mutate variables as needed (including sanity checks)
- join two or more data sets and analyze some aspect of the joined data
(be sure to only include the category tags for the data you use!)
Read in data
The required data is initially read.
library(readr)
egg_chicken <- read_csv("_data/FAOSTAT_egg_chicken.csv")
cattle_dairy <- read_csv("_data/FAOSTAT_cattle_dairy.csv")
livestock <- read_csv("_data/FAOSTAT_livestock.csv")Briefly describe the data
While we make efforts to understand the data, we see all the column names present, the sample head of the three datasets.
colnames(egg_chicken) [1] "Domain Code"      "Domain"           "Area Code"        "Area"            
 [5] "Element Code"     "Element"          "Item Code"        "Item"            
 [9] "Year Code"        "Year"             "Unit"             "Value"           
[13] "Flag"             "Flag Description"colnames(cattle_dairy) [1] "Domain Code"      "Domain"           "Area Code"        "Area"            
 [5] "Element Code"     "Element"          "Item Code"        "Item"            
 [9] "Year Code"        "Year"             "Unit"             "Value"           
[13] "Flag"             "Flag Description"colnames(livestock) [1] "Domain Code"      "Domain"           "Area Code"        "Area"            
 [5] "Element Code"     "Element"          "Item Code"        "Item"            
 [9] "Year Code"        "Year"             "Unit"             "Value"           
[13] "Flag"             "Flag Description"head(cattle_dairy)# A tibble: 6 × 14
  Domai…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QL      Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1961  1961 Head 
2 QL      Lives…       2 Afgh…    5420 Yield       882 Milk…    1961  1961 hg/An
3 QL      Lives…       2 Afgh…    5510 Produc…     882 Milk…    1961  1961 tonn…
4 QL      Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1962  1962 Head 
5 QL      Lives…       2 Afgh…    5420 Yield       882 Milk…    1962  1962 hg/An
6 QL      Lives…       2 Afgh…    5510 Produc…     882 Milk…    1962  1962 tonn…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹`Domain Code`, ²`Area Code`,
#   ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`head(egg_chicken)# A tibble: 6 × 14
  Domai…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QL      Lives…       2 Afgh…    5313 Laying     1062 Eggs…    1961  1961 1000…
2 QL      Lives…       2 Afgh…    5410 Yield      1062 Eggs…    1961  1961 100m…
3 QL      Lives…       2 Afgh…    5510 Produc…    1062 Eggs…    1961  1961 tonn…
4 QL      Lives…       2 Afgh…    5313 Laying     1062 Eggs…    1962  1962 1000…
5 QL      Lives…       2 Afgh…    5410 Yield      1062 Eggs…    1962  1962 100m…
6 QL      Lives…       2 Afgh…    5510 Produc…    1062 Eggs…    1962  1962 tonn…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹`Domain Code`, ²`Area Code`,
#   ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`head(livestock)# A tibble: 6 × 14
  Domai…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1961  1961 Head 
2 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1962  1962 Head 
3 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1963  1963 Head 
4 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1964  1964 Head 
5 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1965  1965 Head 
6 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1966  1966 Head 
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹`Domain Code`, ²`Area Code`,
#   ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`table(egg_chicken$Item)
Eggs, hen, in shell 
              38170 table(cattle_dairy$Item)
Milk, whole fresh cow 
                36449 table(livestock$Item)
    Asses Buffaloes    Camels    Cattle     Goats    Horses     Mules      Pigs 
     8571      3505      3265     13086     12498     11104      6153     12015 
    Sheep 
    11919 we observe that only in live stock table we have multiple items.
all(egg_chicken$Year==egg_chicken$`Year Code`)[1] TRUEall(cattle_dairy$Year==cattle_dairy$`Year Code`)[1] TRUEall(livestock$Year==livestock$`Year Code`)[1] TRUEwe observe duplicated columns in the data sets.
sapply(egg_chicken,function(x)sum(is.na(x)))     Domain Code           Domain        Area Code             Area 
               0                0                0                0 
    Element Code          Element        Item Code             Item 
               0                0                0                0 
       Year Code             Year             Unit            Value 
               0                0                0               40 
            Flag Flag Description 
            7548                0 we see that there are 40 missing values of value and 7548 missing values of flag description.
sapply(cattle_dairy,function(x)sum(is.na(x)))     Domain Code           Domain        Area Code             Area 
               0                0                0                0 
    Element Code          Element        Item Code             Item 
               0                0                0                0 
       Year Code             Year             Unit            Value 
               0                0                0               74 
            Flag Flag Description 
           10044                0 sapply(livestock,function(x)sum(is.na(x)))     Domain Code           Domain        Area Code             Area 
               0                0                0                0 
    Element Code          Element        Item Code             Item 
               0                0                0                0 
       Year Code             Year             Unit            Value 
               0                0                0             1301 
            Flag Flag Description 
           38270                0 In live_stock there are 1301 missing values in values and 38270 missing values in flags
print(summarytools::dfSummary(egg_chicken,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')Data Frame Summary
egg_chicken
Dimensions: 38170 x 14Duplicates: 0
| Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Domain Code [character] | 1. QL | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Domain [character] | 1. Livestock Primary | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Area Code [numeric] | 
 | 245 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Area [character] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Element Code [numeric] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Element [character] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Item Code [numeric] | 1 distinct value | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Item [character] | 1. Eggs, hen, in shell | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Year Code [numeric] | 
 | 58 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Year [numeric] | 
 | 58 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Unit [character] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Value [numeric] | 
 | 21325 distinct values | 40 (0.1%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Flag [character] | 
 | 
 | 7548 (19.8%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Flag Description [character] | 
 | 
 | 0 (0.0%) | 
Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-21
print(summarytools::dfSummary(livestock,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')Data Frame Summary
livestock
Dimensions: 82116 x 14Duplicates: 0
| Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Domain Code [character] | 1. QA | 
 | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Domain [character] | 1. Live Animals | 
 | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Area Code [numeric] | 
 | 253 distinct values | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Area [character] | 
 | 
 | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Element Code [numeric] | 1 distinct value | 
 | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Element [character] | 1. Stocks | 
 | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Item Code [numeric] | 
 | 
 | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Item [character] | 
 | 
 | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Year Code [numeric] | 
 | 58 distinct values | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Year [numeric] | 
 | 58 distinct values | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Unit [character] | 1. Head | 
 | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Value [numeric] | 
 | 43667 distinct values | 1301 (1.6%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Flag [character] | 
 | 
 | 38270 (46.6%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Flag Description [character] | 
 | 
 | 0 (0.0%) | 
Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-21
print(summarytools::dfSummary(cattle_dairy,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')Data Frame Summary
cattle_dairy
Dimensions: 36449 x 14Duplicates: 0
| Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Domain Code [character] | 1. QL | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Domain [character] | 1. Livestock Primary | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Area Code [numeric] | 
 | 232 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Area [character] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Element Code [numeric] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Element [character] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Item Code [numeric] | 1 distinct value | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Item [character] | 1. Milk, whole fresh cow | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Year Code [numeric] | 
 | 58 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Year [numeric] | 
 | 58 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Unit [character] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Value [numeric] | 
 | 24088 distinct values | 74 (0.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Flag [character] | 
 | 
 | 10044 (27.6%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Flag Description [character] | 
 | 
 | 0 (0.0%) | 
Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-21
Tidy the data
First we need to remove the duplicate columns from each of the data set
cattle_diary_new <- subset(cattle_dairy,select= -c(`Year Code`))head(cattle_diary_new)# A tibble: 6 × 13
  Domain …¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item   Year Unit  Value
  <chr>     <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr> <dbl> <chr> <dbl>
1 QL        Lives…       2 Afgh…    5318 Milk A…     882 Milk…  1961 Head  7  e5
2 QL        Lives…       2 Afgh…    5420 Yield       882 Milk…  1961 hg/An 5  e3
3 QL        Lives…       2 Afgh…    5510 Produc…     882 Milk…  1961 tonn… 3.5e5
4 QL        Lives…       2 Afgh…    5318 Milk A…     882 Milk…  1962 Head  7  e5
5 QL        Lives…       2 Afgh…    5420 Yield       882 Milk…  1962 hg/An 5  e3
6 QL        Lives…       2 Afgh…    5510 Produc…     882 Milk…  1962 tonn… 3.5e5
# … with 2 more variables: Flag <chr>, `Flag Description` <chr>, and
#   abbreviated variable names ¹`Domain Code`, ²`Area Code`, ³`Element Code`,
#   ⁴`Item Code`livestock_new <- subset(livestock,select= -c(`Year Code`))
egg_new <- subset(egg_chicken,select= -c(`Year Code`))cattle_diary_new<- cattle_diary_new%>%
  mutate(Item=str_replace_all(Item,"Milk, whole fresh cow", "Milk"))
egg_new <- egg_new%>%
  mutate(Item=str_replace_all(Item,"Eggs, hen, in shell","Poultry"))
table(cattle_diary_mutate$Item)Error in table(cattle_diary_mutate$Item): object 'cattle_diary_mutate' not foundtable(egg_new$Item)
Poultry 
  38170 Now that we have renamed the columns, now we perform a join on the tidied data.
Join Data
we have same columns in cattle, diary and live stock tables. so when we join them the final data must have the rows summated together in all the tables.
cattle_egg <- full_join(egg_new,cattle_diary_new)
joined_data <- full_join(cattle_egg,livestock_new)dim(joined_data)[1] 156735     13we now confirm that they are all joined together.
Visualizations
ggplot(joined_data,aes(x=Item))+geom_bar(fill = "lightpink") +
  labs(title = "Item and count", x = "Item", 
     y = "Count")+geom_text(stat='count', aes(label=..count..), vjust=1)
Now, we have understood that we can perform visualizations too on joined data.