challenge_2
Aritra Basu
faostat
Data wrangling: using group() and summarise()
Author

Aritra Basu

Published

March 18, 2023

Code
library(tidyverse)
library(dplyr)
library(descr)
library(readr)
library(summarytools)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Reading in the Data

In this way, I have read in the data:

Code
faostat <- read.csv("_data/FAOSTAT_cattle_dairy.csv")

Describing the Data

Code
view(dfSummary(faostat, style="grid", graph.magnif = 0.75, valid.col = FALSE))

Using dfsummary from the package summarytools, we can see that the dataset has 36449 rows and 14 columns. From dfsummary, we can see that the Area variable includes both unique region names as well as greater geographical aggregations. We can see that there are three Elements: milk animals, production and yield. The units are Head, hg/An and tonnes respectively. The data is for years 1961 to 2018, that is, 58 distinct years.

Code
flag_description <- faostat %>% distinct(Flag.Description)

It can be observed that

Let us see what the unique regional units are:

Code
unique_units <- faostat %>% distinct(Area)

It can be seen that there are 204 unique region names. ` I will locate the distinct regions to eliminate overlaps.

Code
distinct_regions_identify <- faostat %>% distinct(Area, Area.Code)

I use this to only retain the unique regions. We have seen that the area codes of the unique regional units have area code less than 5000.

Code
faostat_filter <- faostat %>% filter(Area.Code < 5000)
Code
faostat_summary1 <- faostat_filter  %>% 
  filter(Element == "Milk Animals")  %>% 
  group_by(Area)  %>% 
  summarise(
    Mean=mean(Value, na.rm = TRUE),
    Quantile1 = quantile(Value, c(0.25), q1 = c(0.25), na.rm = TRUE),
    Median=median(Value, na.rm = TRUE),
    Quantile3 = quantile(Value, c(0.75), q3 = c(0.75), na.rm = TRUE),
    SD=sd(Value, na.rm = TRUE),
    min=min(Value, na.rm = TRUE),
    max=max(Value, na.rm = TRUE),
    )
faostat_summary1
# A tibble: 205 × 8
   Area                     Mean Quantile1   Median Quant…¹     SD    min    max
   <chr>                   <dbl>     <dbl>    <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
 1 Afghanistan         1790863.  1029500   1150000   2.6 e6 1.08e6 7   e5 4.05e6
 2 Albania              288774.   167625    284650   3.59e5 1.13e5 1.17e5 4.88e5
 3 Algeria              659719.   455000    640000   8.69e5 2.50e5 2.63e5 1.11e6
 4 American Samoa           25.8      22.2      24   2.88e1 6.05e0 2   e1 4   e1
 5 Angola               324717.   285163.   320000   3.54e5 5.53e4 2.2 e5 4.36e5
 6 Antigua and Barbuda    5196.     3205      5525   6.6 e3 2.18e3 2.3 e3 9.07e3
 7 Argentina           2458013.  2194750   2400000   2.75e6 3.89e5 1.60e6 3.45e6
 8 Armenia              281229.   267382.   277840   2.97e5 2.05e4 2.51e5 3.19e5
 9 Australia           2080423.  1691265.  1882060.  2.35e6 4.91e5 1.52e6 3.26e6
10 Austria              825008.   548830    885372   1.02e6 2.34e5 5.25e5 1.14e6
# … with 195 more rows, and abbreviated variable name ¹​Quantile3

First, I have filtered Milk Animals, and then I grouped it using Area. Grouping by area is quite intuitive as we have observation for region for multiple years. We can observe the mean, values for the three quantiles, standard deviation, minimum and maximum. Next I also explicitly show the three quantiles by area.

Code
three_quantiles <- faostat_filter  %>% 
  filter(Element == "Milk Animals")  %>% 
  group_by(Area) %>% 
  summarise(  Quantile1 = quantile(Value, c(0.25, 0.5, 0.75), na.rm = TRUE), q = c(0.25, 0.5, 0.75))
three_quantiles
# A tibble: 615 × 3
# Groups:   Area [205]
   Area           Quantile1     q
   <chr>              <dbl> <dbl>
 1 Afghanistan    1029500    0.25
 2 Afghanistan    1150000    0.5 
 3 Afghanistan    2600000    0.75
 4 Albania         167625    0.25
 5 Albania         284650    0.5 
 6 Albania         359491.   0.75
 7 Algeria         455000    0.25
 8 Algeria         640000    0.5 
 9 Algeria         868881.   0.75
10 American Samoa      22.2  0.25
# … with 605 more rows

A similar grouping could be done by filtering the data with elements production as well as yield.

Grouping again by Area, but filtering using Yield would give us:

Code
faostat_summary3 <- faostat_filter  %>% 
  filter(Element == "Yield")  %>% 
  group_by(Area)  %>% 
  summarise(
    Mean=mean(Value, na.rm = TRUE),
    Quantile1 = quantile(Value, c(0.25), q1 = c(0.25), na.rm = TRUE),
    Median=median(Value, na.rm = TRUE),
    Quantile3 = quantile(Value, c(0.75), q3 = c(0.75), na.rm = TRUE),
    SD=sd(Value, na.rm = TRUE),
    min=min(Value, na.rm = TRUE),
    max=max(Value, na.rm = TRUE),
    ) %>% 
  arrange(desc(Mean))
faostat_summary3
# A tibble: 204 × 8
   Area                       Mean Quantile1 Median Quanti…¹     SD   min    max
   <chr>                     <dbl>     <dbl>  <dbl>    <dbl>  <dbl> <int>  <int>
 1 Israel                   82874.    60312. 86202.   97929  25084. 42941 134121
 2 United States of America 67685.    47596. 66011    85774. 21965. 33068 104633
 3 Luxembourg               65046.    61606. 64045    67626   6235. 54412  77429
 4 Republic of Korea        64932.    39392. 59580    97440. 29276. 17144 106757
 5 Czechia                  63992.    52134  64825    76416. 14758. 39513  87559
 6 Belgium                  63894.    57585  60777    72116   8099. 55091  77537
 7 Denmark                  62537.    45102. 62164    77245  18897. 35943  98952
 8 Sweden                   62221.    47403. 60110.   81047  17808. 31900  88172
 9 Netherlands              59649.    46500. 60230.   71319  13211. 40885  85870
10 Estonia                  59447.    42630  57508    73334. 19527. 31838  93615
# … with 194 more rows, and abbreviated variable name ¹​Quantile3

Here, I have arranged the output in descending order of mean. This allows us to arrange the countries according to the highest mean yields. We can see that Israel occupies the top of the list, while Cote d’Ivorie is at the bottom.

Code
faostat_summary4 <- faostat_filter  %>% 
  filter(Element == "Yield")  %>% 
  group_by(Year)  %>% 
  summarise(
    Mean=mean(Value, na.rm = TRUE),
    Quantile1 = quantile(Value, c(0.25), q1 = c(0.25), na.rm = TRUE),
    Median=median(Value, na.rm = TRUE),
    Quantile3 = quantile(Value, c(0.75), q3 = c(0.75), na.rm = TRUE),
    SD=sd(Value, na.rm = TRUE),
    min=min(Value, na.rm = TRUE),
    max=max(Value, na.rm = TRUE),
    ) %>% 
  arrange(desc(SD))
faostat_summary4
# A tibble: 58 × 8
    Year   Mean Quantile1 Median Quantile3     SD   min    max
   <int>  <dbl>     <dbl>  <dbl>     <dbl>  <dbl> <int>  <int>
 1  2017 27727.     7941  14205     39320  27443.   955 131817
 2  2018 27565.     7808  14115     39315  27428.   999 134121
 3  2016 27442.     7944  14359     39302  27016.   994 129531
 4  2015 26874.     7948  14186     37444  26324.   960 119815
 5  2014 26749.     8001  14127     39546  26146.  1000 122949
 6  2012 26284.     7532. 13906.    35807. 25614.  1000 115553
 7  2013 26295.     7707  13908     38600. 25551.  1000 116872
 8  2011 26103.     7338. 13901     36252. 25307.  1000 113931
 9  2010 25605.     7288. 13890     37084. 24863.  1000 107672
10  2009 25175.     7151. 13892.    36079  24419.  1000 106757
# … with 48 more rows

This would be another alternative way of grouping the data, where Yield has been filtered, but instead of grouping the dataset by country, we have grouped it by year. This allows the possibility of looking at yield over the years. For example, we can see that 2017 was the year with the highest variance in yield while 1961 was the year with the lowest variance.