library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Challenge 2
Setup
Read in the Data
For this challenge I will be using the birds.csv dataset.
Code
birds <- read_csv("_data/birds.csv")
head(birds)
# A tibble: 6 × 14
`Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 QA Live Anima… 2 Afgh… 5112 Stocks 1057
2 QA Live Anima… 2 Afgh… 5112 Stocks 1057
3 QA Live Anima… 2 Afgh… 5112 Stocks 1057
4 QA Live Anima… 2 Afgh… 5112 Stocks 1057
5 QA Live Anima… 2 Afgh… 5112 Stocks 1057
6 QA Live Anima… 2 Afgh… 5112 Stocks 1057
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
# Value <dbl>, Flag <chr>, `Flag Description` <chr>
Let's print out the column names to check which columns we are interested in.
Code
colnames(birds)
[1] "Domain Code" "Domain" "Area Code" "Area"
[5] "Element Code" "Element" "Item Code" "Item"
[9] "Year Code" "Year" "Unit" "Value"
[13] "Flag" "Flag Description"
All the columns that contain "Code" in their name can be removed, as we also have the corresponding descriptive columns. For example, since we have the "Area" column, we can remove the "Area Code" column.
Now, let's go through the unique values of certain columns to see if we need them.
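As an aside, the "Code" columns could also be dropped in one step with dplyr's select() and the contains() helper, rather than listing the columns to keep by hand. A minimal sketch on a toy tibble whose column names mirror the dataset's:

```r
library(dplyr)

# Toy tibble mimicking the structure of the birds data
toy <- tibble(
  `Domain Code` = "QA", Domain = "Live Animals",
  `Area Code` = 2, Area = "Afghanistan",
  `Item Code` = 1057, Item = "Chickens"
)

# Drop every column whose name contains "Code"
toy_clean <- select(toy, -contains("Code"))
names(toy_clean)
# Expected: "Domain" "Area" "Item"
```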
Code
unique(birds$`Flag Description`)
[1] "FAO estimate"
[2] "Official data"
[3] "FAO data based on imputation methodology"
[4] "Data not available"
[5] "Unofficial figure"
[6] "Aggregate, may include official, semi-official, estimated or calculated data"
The "Flag Description" and "Flag" columns give us information about how the data was collected and its authenticity, and can be removed for our analysis.
Code
unique(birds$Unit)
[1] "1000 Head"
Code
unique(birds$Domain)
[1] "Live Animals"
The “Unit” and “Domain” columns contain only one unique value and hence can be removed.
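One way to find such single-valued columns programmatically, instead of inspecting each with unique(), is to count the distinct values per column. A base-R sketch on toy data:

```r
# Toy data frame with two constant columns and one varying column
toy <- data.frame(
  Unit   = c("1000 Head", "1000 Head", "1000 Head"),
  Domain = c("Live Animals", "Live Animals", "Live Animals"),
  Value  = c(4700, 4900, 5000)
)

# Flag columns that contain only a single unique value (candidates for removal)
constant_cols <- names(toy)[sapply(toy, function(col) length(unique(col)) == 1)]
constant_cols
# Expected: "Unit" "Domain"
```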
Now we will take a subset of the data with only the columns that we require and perform our analysis.
Code
birds_subset <- birds[c("Area", "Element", "Item", "Year", "Value")]
head(birds_subset)
# A tibble: 6 × 5
Area Element Item Year Value
<chr> <chr> <chr> <dbl> <dbl>
1 Afghanistan Stocks Chickens 1961 4700
2 Afghanistan Stocks Chickens 1962 4900
3 Afghanistan Stocks Chickens 1963 5000
4 Afghanistan Stocks Chickens 1964 5300
5 Afghanistan Stocks Chickens 1965 5500
6 Afghanistan Stocks Chickens 1966 5800
Code
birds_subset <- na.omit(birds_subset)
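Note that na.omit() drops any row containing an NA in any column. If we only cared about missing values in specific columns, tidyr's drop_na() offers finer control. A sketch on toy data:

```r
library(tidyr)

toy <- tibble::tibble(
  Area  = c("Afghanistan", "Albania", NA),
  Value = c(4700, NA, 5000)
)

# Drop rows with an NA anywhere (equivalent to na.omit): keeps only row 1
drop_na(toy)

# Drop rows only where Value is missing: keeps rows 1 and 3
drop_na(toy, Value)
```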
Describe the data
Looking at the data, we can infer that it records the stock of various poultry birds for areas around the world, for each year. Let's take an example of a row in the data set:
Code
birds_subset[1, ]
# A tibble: 1 × 5
Area Element Item Year Value
<chr> <chr> <chr> <dbl> <dbl>
1 Afghanistan Stocks Chickens 1961 4700
This shows that, for the year 1961, the stock of Chickens in Afghanistan was 4700 (in thousands of head, per the "Unit" column).
Code
summary(birds_subset)
Area Element Item Year
Length:29941 Length:29941 Length:29941 Min. :1961
Class :character Class :character Class :character 1st Qu.:1976
Mode :character Mode :character Mode :character Median :1992
Mean :1991
3rd Qu.:2005
Max. :2018
Value
Min. : 0
1st Qu.: 171
Median : 1800
Mean : 99411
3rd Qu.: 15404
Max. :23707134
From the above summary we see that the range of years in this data set goes from 1961 to 2018.
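For reference, these facts can also be pulled out directly without summary(). A sketch using stand-in vectors (with the real data, these would be birds_subset$Year and birds_subset$Area):

```r
library(dplyr)

# Stand-in columns for illustration only
Year <- c(1961, 1992, 2018, 2018)
Area <- c("Afghanistan", "Albania", "Albania", "World")

range(Year)        # span of the data: 1961 2018
n_distinct(Area)   # number of unique areas: 3
```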
We will now check the unique values of the remaining columns.
Areas:
Code
length(unique(birds_subset$Area))
[1] 248
Items:
Code
unique(birds_subset$Item)
[1] "Chickens" "Ducks" "Geese and guinea fowls"
[4] "Turkeys" "Pigeons, other birds"
Grouped Summary Statistics
Analysis of Items
We will first group the data by Item to see which poultry item has been the most popular, sorting the result in descending order of average stocks.
Code
birds_subset %>%
  group_by(Item) %>%
  summarise(Avg_stocks = mean(Value)) %>%
  arrange(desc(Avg_stocks))
# A tibble: 5 × 2
Item Avg_stocks
<chr> <dbl>
1 Chickens 207931.
2 Ducks 23072.
3 Turkeys 15228.
4 Geese and guinea fowls 10292.
5 Pigeons, other birds 6163.
We can see that Chickens have been the most popular poultry item across the world, as their average stock is significantly higher than that of the other items. In fact, the sum of the average stocks of all the other items combined is still less than the average stock of Chickens.
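We can verify that claim arithmetically from the grouped summary above (values in 1000 Head, rounded to the integers shown):

```r
# Average stocks taken from the grouped summary table above
avg_stocks <- c(Chickens = 207931, Ducks = 23072, Turkeys = 15228,
                `Geese and guinea fowls` = 10292, `Pigeons, other birds` = 6163)

sum(avg_stocks[-1])   # combined average of all non-chicken items: 54755

unname(avg_stocks["Chickens"] > sum(avg_stocks[-1]))
# Expected: TRUE
```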
Analysis of Areas
Code
birds_subset %>%
  filter(Year == 2010) %>%
  group_by(Area) %>%
  summarise(Total_stocks = sum(Value)) %>%
  arrange(desc(Total_stocks))
# A tibble: 238 × 2
Area Total_stocks
<chr> <dbl>
1 World 22311214
2 Asia 12535841
3 Eastern Asia 6906604
4 China, mainland 6312000
5 Americas 5635098
6 South-eastern Asia 2570583
7 Northern America 2379199
8 South America 2299071
9 Europe 2265162
10 Southern Asia 2222019
# ℹ 228 more rows
Here, we can see that the larger continental aggregates have larger total stocks. This makes sense, as the aggregate population of a continent is greater than that of any single country, and so is its consumption.
Asia has the highest poultry stock among the continents. This could be due to the high population in Asia, which adds to the higher consumption.
Among individual countries in the 2010 output above, China, mainland has the highest total stock, while Falkland Islands (Malvinas) has the lowest.
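Since the FAO data mixes aggregate regions and individual countries in the same Area column, a country-only ranking requires filtering the aggregates out first. A sketch, where the aggregates vector is a hand-picked assumption based on the names visible above (the real data likely contains more aggregate rows), applied to a toy stand-in built from the 2010 totals shown:

```r
library(dplyr)

# Hand-picked aggregate names seen in the output above; assumption, not exhaustive
aggregates <- c("World", "Asia", "Eastern Asia", "Americas", "South-eastern Asia",
                "Northern America", "South America", "Europe", "Southern Asia")

# Toy stand-in using totals from the output above; with the real data,
# start from the birds_subset pipeline instead
totals <- tibble(
  Area = c("World", "Asia", "Eastern Asia", "China, mainland"),
  Total_stocks = c(22311214, 12535841, 6906604, 6312000)
)

totals %>%
  filter(!Area %in% aggregates) %>%
  arrange(desc(Total_stocks))
# Only "China, mainland" remains
```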
Analysis by decade
We can see the trends of poultry as we go through various decades.
Code
birds_subset %>%
  mutate(Decade = 10 * floor(Year / 10)) %>%
  group_by(Item, Decade) %>%
  summarise(Avg_stocks = mean(Value),
            Median_stocks = median(Value),
            Standard_Deviation = sd(Value))
# A tibble: 30 × 5
# Groups: Item [5]
Item Decade Avg_stocks Median_stocks Standard_Deviation
<chr> <dbl> <dbl> <dbl> <dbl>
1 Chickens 1960 83286. 4745 360089.
2 Chickens 1970 111927. 7000 481532.
3 Chickens 1980 162130. 9200 703217.
4 Chickens 1990 216889. 12000 1007193.
5 Chickens 2000 287001. 15375 1375191.
6 Chickens 2010 361437. 18719 1732596.
7 Ducks 1960 8323. 372. 31291.
8 Ducks 1970 11006. 430. 42341.
9 Ducks 1980 16187. 560. 63333.
10 Ducks 1990 26600. 485 113131.
# ℹ 20 more rows
Here we observe that, for each item, the average stock has gradually increased over the decades. The increase in the standard deviation (and the widening gap between mean and median) tells us that a few countries account for a large chunk of the world's poultry stocks.
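To make these decade trends easier to see, the grouped averages could be plotted with ggplot2. A sketch using only the Chickens averages from the table above; with the real data, the full grouped pipeline would supply all five items:

```r
library(ggplot2)
library(dplyr)

# Decade averages for Chickens, taken from the grouped summary above
decade_avgs <- tibble(
  Item = "Chickens",
  Decade = c(1960, 1970, 1980, 1990, 2000, 2010),
  Avg_stocks = c(83286, 111927, 162130, 216889, 287001, 361437)
)

p <- ggplot(decade_avgs, aes(x = Decade, y = Avg_stocks, colour = Item)) +
  geom_line() +
  geom_point() +
  labs(y = "Average stock (1000 Head)",
       title = "Average poultry stocks by decade")
p
```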