Code
library(tidyverse)
library(dplyr)
library(descr)
library(readr)
library(summarytools)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Aritra Basu
March 18, 2023
In this way, I have read in the data:
Using dfsummary from the package summarytools, we can see that the dataset has 36449 rows and 14 columns. From dfsummary, we can see that the Area variable includes both unique region names as well as greater geographical aggregations. We can see that there are three Elements: milk animals, production and yield. The units are Head, hg/An and tonnes respectively. The data is for years 1961 to 2018, that is, 58 distinct years.
It can be observed that
Let us see what the unique regional units are:
It can be seen that there are 204 unique region names. ` I will locate the distinct regions to eliminate overlaps.
I use this to only retain the unique regions. We have seen that the area codes of the unique regional units have area code less than 5000.
faostat_summary1 <- faostat_filter %>%
filter(Element == "Milk Animals") %>%
group_by(Area) %>%
summarise(
Mean=mean(Value, na.rm = TRUE),
Quantile1 = quantile(Value, c(0.25), q1 = c(0.25), na.rm = TRUE),
Median=median(Value, na.rm = TRUE),
Quantile3 = quantile(Value, c(0.75), q3 = c(0.75), na.rm = TRUE),
SD=sd(Value, na.rm = TRUE),
min=min(Value, na.rm = TRUE),
max=max(Value, na.rm = TRUE),
)
faostat_summary1
# A tibble: 205 × 8
Area Mean Quantile1 Median Quant…¹ SD min max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 1790863. 1029500 1150000 2.6 e6 1.08e6 7 e5 4.05e6
2 Albania 288774. 167625 284650 3.59e5 1.13e5 1.17e5 4.88e5
3 Algeria 659719. 455000 640000 8.69e5 2.50e5 2.63e5 1.11e6
4 American Samoa 25.8 22.2 24 2.88e1 6.05e0 2 e1 4 e1
5 Angola 324717. 285163. 320000 3.54e5 5.53e4 2.2 e5 4.36e5
6 Antigua and Barbuda 5196. 3205 5525 6.6 e3 2.18e3 2.3 e3 9.07e3
7 Argentina 2458013. 2194750 2400000 2.75e6 3.89e5 1.60e6 3.45e6
8 Armenia 281229. 267382. 277840 2.97e5 2.05e4 2.51e5 3.19e5
9 Australia 2080423. 1691265. 1882060. 2.35e6 4.91e5 1.52e6 3.26e6
10 Austria 825008. 548830 885372 1.02e6 2.34e5 5.25e5 1.14e6
# … with 195 more rows, and abbreviated variable name ¹Quantile3
First, I have filtered Milk Animals, and then I grouped it using Area. Grouping by area is quite intuitive as we have observation for region for multiple years. We can observe the mean, values for the three quantiles, standard deviation, minimum and maximum. Next I also explicitly show the three quantiles by area.
# A tibble: 615 × 3
# Groups: Area [205]
Area Quantile1 q
<chr> <dbl> <dbl>
1 Afghanistan 1029500 0.25
2 Afghanistan 1150000 0.5
3 Afghanistan 2600000 0.75
4 Albania 167625 0.25
5 Albania 284650 0.5
6 Albania 359491. 0.75
7 Algeria 455000 0.25
8 Algeria 640000 0.5
9 Algeria 868881. 0.75
10 American Samoa 22.2 0.25
# … with 605 more rows
A similar grouping could be done by filtering the data with elements production as well as yield.
Grouping again by Area, but filtering using Yield would give us:
faostat_summary3 <- faostat_filter %>%
filter(Element == "Yield") %>%
group_by(Area) %>%
summarise(
Mean=mean(Value, na.rm = TRUE),
Quantile1 = quantile(Value, c(0.25), q1 = c(0.25), na.rm = TRUE),
Median=median(Value, na.rm = TRUE),
Quantile3 = quantile(Value, c(0.75), q3 = c(0.75), na.rm = TRUE),
SD=sd(Value, na.rm = TRUE),
min=min(Value, na.rm = TRUE),
max=max(Value, na.rm = TRUE),
) %>%
arrange(desc(Mean))
faostat_summary3
# A tibble: 204 × 8
Area Mean Quantile1 Median Quanti…¹ SD min max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 Israel 82874. 60312. 86202. 97929 25084. 42941 134121
2 United States of America 67685. 47596. 66011 85774. 21965. 33068 104633
3 Luxembourg 65046. 61606. 64045 67626 6235. 54412 77429
4 Republic of Korea 64932. 39392. 59580 97440. 29276. 17144 106757
5 Czechia 63992. 52134 64825 76416. 14758. 39513 87559
6 Belgium 63894. 57585 60777 72116 8099. 55091 77537
7 Denmark 62537. 45102. 62164 77245 18897. 35943 98952
8 Sweden 62221. 47403. 60110. 81047 17808. 31900 88172
9 Netherlands 59649. 46500. 60230. 71319 13211. 40885 85870
10 Estonia 59447. 42630 57508 73334. 19527. 31838 93615
# … with 194 more rows, and abbreviated variable name ¹Quantile3
Here, I have arranged the output in descending order of mean. This allows us to arrange the countries according to the highest mean yields. We can see that Israel occupies the top of the list, while Cote d’Ivorie is at the bottom.
faostat_summary4 <- faostat_filter %>%
filter(Element == "Yield") %>%
group_by(Year) %>%
summarise(
Mean=mean(Value, na.rm = TRUE),
Quantile1 = quantile(Value, c(0.25), q1 = c(0.25), na.rm = TRUE),
Median=median(Value, na.rm = TRUE),
Quantile3 = quantile(Value, c(0.75), q3 = c(0.75), na.rm = TRUE),
SD=sd(Value, na.rm = TRUE),
min=min(Value, na.rm = TRUE),
max=max(Value, na.rm = TRUE),
) %>%
arrange(desc(SD))
faostat_summary4
# A tibble: 58 × 8
Year Mean Quantile1 Median Quantile3 SD min max
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 2017 27727. 7941 14205 39320 27443. 955 131817
2 2018 27565. 7808 14115 39315 27428. 999 134121
3 2016 27442. 7944 14359 39302 27016. 994 129531
4 2015 26874. 7948 14186 37444 26324. 960 119815
5 2014 26749. 8001 14127 39546 26146. 1000 122949
6 2012 26284. 7532. 13906. 35807. 25614. 1000 115553
7 2013 26295. 7707 13908 38600. 25551. 1000 116872
8 2011 26103. 7338. 13901 36252. 25307. 1000 113931
9 2010 25605. 7288. 13890 37084. 24863. 1000 107672
10 2009 25175. 7151. 13892. 36079 24419. 1000 106757
# … with 48 more rows
This would be another alternative way of grouping the data, where Yield has been filtered, but instead of grouping the dataset by country, we have grouped it by year. This allows the possibility of looking at yield over the years. For example, we can see that 2017 was the year with the highest variance in yield while 1961 was the year with the lowest variance.
---
title: "Challenge 2"
author: "Aritra Basu"
description: "Data wrangling: using group() and summarise()"
date: "03/18/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- Aritra Basu
- faostat
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(dplyr)
library(descr)
library(readr)
library(summarytools)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Reading in the Data
In this way, I have read in the data:
```{r}
faostat <- read.csv("_data/FAOSTAT_cattle_dairy.csv")
```
## Describing the Data
```{r}
view(dfSummary(faostat, style="grid", graph.magnif = 0.75, valid.col = FALSE))
```
Using dfsummary from the package summarytools, we can see that the dataset has 36449 rows and 14 columns. From dfsummary, we can see that the Area variable includes both unique region names as well as greater geographical aggregations. We can see that there are three Elements: milk animals, production and yield. The units are Head, hg/An and tonnes respectively. The data is for years 1961 to 2018, that is, 58 distinct years.
```{r}
flag_description <- faostat %>% distinct(Flag.Description)
```
It can be observed that
Let us see what the unique regional units are:
```{r}
unique_units <- faostat %>% distinct(Area)
```
It can be seen that there are 204 unique region names.
`
I will locate the distinct regions to eliminate overlaps.
```{r}
distinct_regions_identify <- faostat %>% distinct(Area, Area.Code)
```
I use this to only retain the unique regions. We have seen that the area codes of the unique regional units have area code less than 5000.
```{r}
faostat_filter <- faostat %>% filter(Area.Code < 5000)
```
```{r}
faostat_summary1 <- faostat_filter %>%
filter(Element == "Milk Animals") %>%
group_by(Area) %>%
summarise(
Mean=mean(Value, na.rm = TRUE),
Quantile1 = quantile(Value, c(0.25), q1 = c(0.25), na.rm = TRUE),
Median=median(Value, na.rm = TRUE),
Quantile3 = quantile(Value, c(0.75), q3 = c(0.75), na.rm = TRUE),
SD=sd(Value, na.rm = TRUE),
min=min(Value, na.rm = TRUE),
max=max(Value, na.rm = TRUE),
)
faostat_summary1
```
First, I have filtered Milk Animals, and then I grouped it using Area. Grouping by area is quite intuitive as we have observation for region for multiple years. We can observe the mean, values for the three quantiles, standard deviation, minimum and maximum. Next I also explicitly show the three quantiles by area.
```{r}
three_quantiles <- faostat_filter %>%
filter(Element == "Milk Animals") %>%
group_by(Area) %>%
summarise( Quantile1 = quantile(Value, c(0.25, 0.5, 0.75), na.rm = TRUE), q = c(0.25, 0.5, 0.75))
three_quantiles
```
A similar grouping could be done by filtering the data with elements production as well as yield.
Grouping again by Area, but filtering using Yield would give us:
```{r}
faostat_summary3 <- faostat_filter %>%
filter(Element == "Yield") %>%
group_by(Area) %>%
summarise(
Mean=mean(Value, na.rm = TRUE),
Quantile1 = quantile(Value, c(0.25), q1 = c(0.25), na.rm = TRUE),
Median=median(Value, na.rm = TRUE),
Quantile3 = quantile(Value, c(0.75), q3 = c(0.75), na.rm = TRUE),
SD=sd(Value, na.rm = TRUE),
min=min(Value, na.rm = TRUE),
max=max(Value, na.rm = TRUE),
) %>%
arrange(desc(Mean))
faostat_summary3
```
Here, I have arranged the output in descending order of mean. This allows us to arrange the countries according to the highest mean yields. We can see that Israel occupies the top of the list, while Cote d'Ivorie is at the bottom.
```{r}
faostat_summary4 <- faostat_filter %>%
filter(Element == "Yield") %>%
group_by(Year) %>%
summarise(
Mean=mean(Value, na.rm = TRUE),
Quantile1 = quantile(Value, c(0.25), q1 = c(0.25), na.rm = TRUE),
Median=median(Value, na.rm = TRUE),
Quantile3 = quantile(Value, c(0.75), q3 = c(0.75), na.rm = TRUE),
SD=sd(Value, na.rm = TRUE),
min=min(Value, na.rm = TRUE),
max=max(Value, na.rm = TRUE),
) %>%
arrange(desc(SD))
faostat_summary4
```
This would be another alternative way of grouping the data, where Yield has been filtered, but instead of grouping the dataset by country, we have grouped it by year. This allows the possibility of looking at yield over the years. For example, we can see that 2017 was the year with the highest variance in yield while 1961 was the year with the lowest variance.