Challenge 2 - Adithya Parupudi

challenge_2
hw3
faostat_cattle_diary.csv
dplyr
tidyverse
Data wrangling: using group() and summarise()
Author

Adithya Parupudi

Published

August 16, 2022

Code
library(tidyverse)
library(summarytools)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Focusing on FAOSTAT_cattle_dairy.csv

Code
cattle <- read_csv("_data/FAOSTAT_cattle_dairy.csv")
cattle

Looks like this is an extensive data gathered by faostat regarding the cattle’s products from different countries between 1961 and 2018. A quick glance at the last column tells whether this data is estimated or calculated by faostat. I am interested to see whether over the years, did faostat collect more data or did it unofficially made entries in this data set.

Describe the data

Code
dim(cattle)
[1] 36449    14
Code
colnames(cattle)
 [1] "Domain Code"      "Domain"           "Area Code"        "Area"            
 [5] "Element Code"     "Element"          "Item Code"        "Item"            
 [9] "Year Code"        "Year"             "Unit"             "Value"           
[13] "Flag"             "Flag Description"
Code
cattle %>% 
  group_by(`Flag Description`) %>% 
  summarise()
Code
cattle %>% 
  group_by(Area) %>% 
  summarise()

By filtering the unique values of the Flag Description column, I notice there are some unofficial figures and “Data not available” entries. I want to see how many entries of these were since 1961 and did their count reduce. Maybe this will tell the data gathering techniques have improved over the years.

There are 232 unique countries from the data set with their domain = “Primary Livestock” common to all. (FAOSTAT focused on collecting livestock information from all the countries)

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

Code
all_countries <- cattle %>%
  group_by(Area,`Flag Description`, Year) %>%
  summarise(median_val=median(Value,na.rm = TRUE), .groups = 'drop') %>% 
  arrange(Year, `Flag Description`) %>% 
  select(Year, Area, `Flag Description`)
all_countries

For analysis purpose, I place my focus on these columns (Year, Area, Flag Description) and arranged the list in ascending order with respect to Year column

Code
all_grouped <- all_countries %>%
  group_by(`Flag Description`, Year) %>%
  summarise(count = n(), ) %>% 
  arrange(desc(Year)) %>% 
  select(Year, `Flag Description`, count)
all_grouped

We have grouped the data by year and flag-description to see the number of flag-description entries in each year. Now lets pick first and last years of the data set and compare the results.

Code
yr_2018 <- all_grouped %>% 
  filter(Year == '2018') %>% 
  arrange(desc(count))
yr_2000 <- all_grouped %>% 
  filter(Year == '2000') %>% 
  arrange(desc(count))
yr_2018
Code
yr_2000

I wanted to compare the entries of “unofficial figure” of two different years. Looks like there were efficient data gathering methods in place, when we look at “Calculated Data” and “Official Data”. There is a rise in “FAO data based on imputation methodology” from 45 to 125 in 18 years.

Code
all_grouped %>% 
  filter(`Flag Description` == 'Unofficial figure') %>% 
  arrange(desc(count))

Unofficial figures were high in 1981. This observation didnt have a steady decline in number, but it varied was rather varied each year.

Code
india <- cattle %>%
  filter(Area == 'India') %>% 
  select(Area, Year, Item, Element, `Element Code` ,`Flag Description`) %>% 
  group_by(`Flag Description`) %>%
  summarise(count = n())
india
Code
cattle %>%
  filter(Area == 'Afghanistan') %>% 
  select(Area, Year, Item, Element, `Element Code` ,`Flag Description`) %>% 
  group_by(`Flag Description`) %>%
  summarise(count = n())

Wanted to see India’s contribution to FAOSTAT since 1961. I’ve compared the details with Afghanistan and noticed India’s overall contribution to the faostat list is less, and unofficial figure numbers are higher. In India, there was no use of “FAO data based on imputation methodology” as well.

Explain and Interpret

I began this analysis, think to group Area, Flag Description and Year and probably tell that over the years (1961 to 2018) the numbers will tell a different story in terms of types of entries made. Though I didn’t uncover groundbreaking observations, it was interesting to see that not all countries have contributed to this FAOSTAT list equally in terms of how data was added.