Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
knitr
options(dplyr.summarise.inform = FALSE)
Vishnupriya Varadharaju
October 12, 2022
Today’s challenge is to
# A tibble: 6 × 14
Domai…¹ Domain Area …² Area Eleme…³ Element Item …⁴ Item Year …⁵ Year Unit
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr>
1 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1961 1961 1000…
2 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1962 1962 1000…
3 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1963 1963 1000…
4 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1964 1964 1000…
5 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1965 1965 1000…
6 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1966 1966 1000…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
# and abbreviated variable names ¹`Domain Code`, ²`Area Code`,
# ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
[1] 30977 14
spc_tbl_ [30,977 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Domain Code : chr [1:30977] "QA" "QA" "QA" "QA" ...
$ Domain : chr [1:30977] "Live Animals" "Live Animals" "Live Animals" "Live Animals" ...
$ Area Code : num [1:30977] 2 2 2 2 2 2 2 2 2 2 ...
$ Area : chr [1:30977] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ Element Code : num [1:30977] 5112 5112 5112 5112 5112 ...
$ Element : chr [1:30977] "Stocks" "Stocks" "Stocks" "Stocks" ...
$ Item Code : num [1:30977] 1057 1057 1057 1057 1057 ...
$ Item : chr [1:30977] "Chickens" "Chickens" "Chickens" "Chickens" ...
$ Year Code : num [1:30977] 1961 1962 1963 1964 1965 ...
$ Year : num [1:30977] 1961 1962 1963 1964 1965 ...
$ Unit : chr [1:30977] "1000 Head" "1000 Head" "1000 Head" "1000 Head" ...
$ Value : num [1:30977] 4700 4900 5000 5300 5500 5800 6600 6290 6300 6000 ...
$ Flag : chr [1:30977] "F" "F" "F" "F" ...
$ Flag Description: chr [1:30977] "FAO estimate" "FAO estimate" "FAO estimate" "FAO estimate" ...
- attr(*, "spec")=
.. cols(
.. `Domain Code` = col_character(),
.. Domain = col_character(),
.. `Area Code` = col_double(),
.. Area = col_character(),
.. `Element Code` = col_double(),
.. Element = col_character(),
.. `Item Code` = col_double(),
.. Item = col_character(),
.. `Year Code` = col_double(),
.. Year = col_double(),
.. Unit = col_character(),
.. Value = col_double(),
.. Flag = col_character(),
.. `Flag Description` = col_character()
.. )
- attr(*, "problems")=<externalptr>
# A tibble: 6 × 8
Domain Area Element Item Year Unit Value Flag Descrip…¹
<chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr>
1 Live Animals Afghanistan Stocks Chickens 1961 1000 Head 4700 FAO estimate
2 Live Animals Afghanistan Stocks Chickens 1962 1000 Head 4900 FAO estimate
3 Live Animals Afghanistan Stocks Chickens 1963 1000 Head 5000 FAO estimate
4 Live Animals Afghanistan Stocks Chickens 1964 1000 Head 5300 FAO estimate
5 Live Animals Afghanistan Stocks Chickens 1965 1000 Head 5500 FAO estimate
6 Live Animals Afghanistan Stocks Chickens 1966 1000 Head 5800 FAO estimate
# … with abbreviated variable name ¹`Flag Description`
Domain Area Element Item
Length:30977 Length:30977 Length:30977 Length:30977
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Year Unit Value Flag Description
Min. :1961 Length:30977 Min. : 0 Length:30977
1st Qu.:1976 Class :character 1st Qu.: 171 Class :character
Median :1992 Mode :character Median : 1800 Mode :character
Mean :1991 Mean : 99411
3rd Qu.:2005 3rd Qu.: 15404
Max. :2018 Max. :23707134
NA's :1036
[1] 1
Live Animals
30977
[1] 248
Afghanistan Africa
58 290
Albania Algeria
232 232
American Samoa Americas
58 232
Angola Antigua and Barbuda
58 58
Argentina Armenia
232 54
Aruba Asia
29 290
Australia Australia and New Zealand
174 232
Austria Azerbaijan
232 54
Bahamas Bahrain
58 58
Bangladesh Barbados
116 116
Belarus Belgium
81 76
Belgium-Luxembourg Belize
156 174
Benin Bermuda
58 110
Bhutan Bolivia (Plurinational State of)
58 174
Bosnia and Herzegovina Botswana
108 58
Brazil Brunei Darussalam
174 116
Bulgaria Burkina Faso
232 58
Burundi Cabo Verde
86 58
Cambodia Cameroon
116 58
Canada Caribbean
232 232
Cayman Islands Central African Republic
53 116
Central America Central Asia
174 108
Chad Chile
58 116
China, Hong Kong SAR China, Macao SAR
279 58
China, mainland China, Taiwan Province of
174 232
Colombia Comoros
58 58
Congo Cook Islands
58 109
Costa Rica Côte d'Ivoire
58 86
Croatia Cuba
108 58
Cyprus Czechia
277 104
Czechoslovakia Democratic People's Republic of Korea
128 116
Democratic Republic of the Congo Denmark
58 232
Dominica Dominican Republic
58 58
Eastern Africa Eastern Asia
232 290
Eastern Europe Ecuador
232 232
Egypt El Salvador
290 58
Equatorial Guinea Eritrea
116 26
Estonia Eswatini
108 58
Ethiopia Ethiopia PDR
26 32
Europe Falkland Islands (Malvinas)
290 58
Fiji Finland
174 166
France French Guyana
290 116
French Polynesia Gabon
116 58
Gambia Georgia
58 54
Germany Ghana
232 58
Greece Grenada
290 58
Guadeloupe Guam
154 58
Guatemala Guinea
58 58
Guinea-Bissau Guyana
58 58
Haiti Honduras
232 58
[1] 1
Stocks
30977
[1] 5
Chickens Ducks Geese and guinea fowls
13074 6909 4136
Pigeons, other birds Turkeys
1165 5693
[1] 58
1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976
493 493 493 493 494 495 495 495 498 498 498 498 498 499 499 499
1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992
498 498 497 496 498 498 495 498 499 499 500 502 503 512 514 569
1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
574 574 574 574 574 574 574 575 575 575 575 575 575 576 576 576
2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
576 576 576 577 577 577 577 577 577 577
[1] 0
< table of extent 0 >
[1] 3.344417
[1] 29941 8
ANALYSIS:
From the above data analysis, we can see that the total number of observations are 30977 and 14 different fields. Out of the 14 only the important 8 fields were selected for further analysis as the remaining columns were redundant. In the 8 fields, 7 fields are categorical and 1 field is numerical. This data contains entries of the quantities of livestock in different countries around the world across different years. All the entries are “Live Animals” and “Stocks” with Unit of “1000 Head”. There are 248 different countries starting from Afghanistan going all the way to Zimbabwe. Out of the 30977 entries, there are Chickens (13074) Ducks (6909), Geese and Guinea Fowls (4136), Pigeons, other birds (1165) and Turkeys (5693). The Year spans from 1961 to 2018 with values increasing from 493 to 577. The Value for each entry is highly varying as it is a numerical field. Under the Flag.Description field, there are 6 sub-groups. While checking for null values in the Value field it’s seen that there are 1036 entries with null values. These entries can be removed as they will not be intuitive to the analysis. After omitting null values, the total number of entries are 29941.
Conduct some exploratory data analysis, using dplyr commands such as group_by()
, select()
, filter()
, and summarise()
. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
# A tibble: 5 × 4
Item `mean(Value)` `median(Value)` `sd(Value)`
<chr> <dbl> <dbl> <dbl>
1 Chickens 207931. 10784. 1081629.
2 Ducks 23072. 510 110621.
3 Geese and guinea fowls 10292. 258 44489.
4 Pigeons, other birds 6163. 2800 8481.
5 Turkeys 15228. 528 56416.
# A tibble: 20 × 3
# Groups: Area [7]
Area Item `mean(Value)`
<chr> <chr> <dbl>
1 Afghanistan Chickens 8099.
2 Africa Chickens 936779.
3 Africa Ducks 13639.
4 Africa Geese and guinea fowls 12164.
5 Africa Pigeons, other birds 11222.
6 Africa Turkeys 9004.
7 Albania Chickens 4055.
8 Albania Ducks 558.
9 Albania Geese and guinea fowls 396.
10 Albania Turkeys 750.
11 Algeria Chickens 70378.
12 Algeria Ducks 36.3
13 Algeria Geese and guinea fowls 19.9
14 Algeria Turkeys 50.6
15 American Samoa Chickens 41.4
16 Americas Chickens 3163543.
17 Americas Ducks 20757.
18 Americas Geese and guinea fowls 689.
19 Americas Turkeys 240435.
20 Angola Chickens 9453.
# A tibble: 2 × 3
# Groups: Area [1]
Area Item `mean(Value)`
<chr> <chr> <dbl>
1 India Chickens 335921.
2 India Ducks 19518.
# A tibble: 10 × 3
# Groups: Area [7]
Area Item mode
<chr> <chr> <int>
1 Aruba Chickens 3
2 South Sudan Chickens 7
3 Sudan Chickens 7
4 Lebanon Turkeys 8
5 Montenegro Chickens 13
6 Serbia Chickens 13
7 Serbia Ducks 13
8 Serbia Geese and guinea fowls 13
9 Serbia Turkeys 13
10 Belgium Geese and guinea fowls 14
# A tibble: 10 × 3
# Groups: Area [4]
Area Item mode
<chr> <chr> <int>
1 Afghanistan Chickens 58
2 Africa Chickens 58
3 Africa Ducks 58
4 Africa Geese and guinea fowls 58
5 Africa Pigeons, other birds 58
6 Africa Turkeys 58
7 Albania Chickens 58
8 Algeria Chickens 58
9 Algeria Ducks 58
10 Algeria Geese and guinea fowls 58
# Calculating the dispersion - min, max, std
# Finding the min, max and std of Value for Items for each Year globally
new_birds_data %>%
select(Year, Area, Item, Value)%>%
group_by(Year, Item) %>%
summarize(minVal = min(Value, na.rm = TRUE), maxVal = max(Value, na.rm = TRUE),
stdVal = sd(Value, na.rm = TRUE)) %>%
arrange(desc(Year)) %>%
head(20)
# A tibble: 20 × 5
# Groups: Year [4]
Year Item minVal maxVal stdVal
<dbl> <chr> <dbl> <dbl> <dbl>
1 2018 Chickens 4 23707134 1910971.
2 2018 Ducks 1 1124917 163342.
3 2018 Geese and guinea fowls 0 365485 73359.
4 2018 Pigeons, other birds 153 28795 7013.
5 2018 Turkeys 0 466787 63609.
6 2017 Chickens 4 23212565 1867386.
7 2017 Ducks 1 1110234 162060.
8 2017 Geese and guinea fowls 0 359384 72359.
9 2017 Pigeons, other birds 152 29333 7156.
10 2017 Turkeys 0 464073 63422.
11 2016 Chickens 4 22826754 1838043.
12 2016 Ducks 1 1119590 164051.
13 2016 Geese and guinea fowls 0 361012 72914.
14 2016 Pigeons, other birds 150 28737 7029.
15 2016 Turkeys 0 465359 63495.
16 2015 Chickens 4 21678753 1732607.
17 2015 Ducks 0 1096515 158538.
18 2015 Geese and guinea fowls 0 355167 69373.
19 2015 Pigeons, other birds 137 31290 7691.
20 2015 Turkeys 0 445796 60814.
# Calculating the dispersion - quantile
# Finding the min, max and std of Value for Items for each Year globally
new_birds_data %>%
select(Area, Item, Value)%>%
group_by(Area, Item) %>%
summarize(quantVal = quantile(Value, na.rm = TRUE)) %>%
filter(Area == "Bahamas" || Area == "Zimbabwe" || Area == "Seychelles") %>%
arrange(desc(Area))
# A tibble: 30 × 3
# Groups: Area, Item [6]
Area Item quantVal
<chr> <chr> <dbl>
1 Zimbabwe Chickens 6500
2 Zimbabwe Chickens 8025
3 Zimbabwe Chickens 11500
4 Zimbabwe Chickens 19937.
5 Zimbabwe Chickens 30000
6 Zimbabwe Ducks 2
7 Zimbabwe Ducks 7.5
8 Zimbabwe Ducks 42.5
9 Zimbabwe Ducks 75
10 Zimbabwe Ducks 86
# … with 20 more rows
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
ANALYSIS:
Exploratory data analysis is performed on the cleaned data. First, the dataset is grouped by Item to find the mean, median and standard deviation of Values for each of the Item entries across all Areas and Years. Mean, median and std is the highest for the Item Chickens. Looking at the data, the mean values seem to be higher than the median indicating that they’re all right skewed. Next, group by Area and Item is done and the mean Value of each category of items for each Country is calculated. Then on filtering specifically for the Area ‘India’, the mean Values for the item “Chickens” and “Ducks” are shown. Then, the mode of the Item for different Areas is calculated and arranged in ascending order. The Item Chicken occurs the least in Aruba with a mode of 3 and plenty of values with the maximum mode of 58 across Areas and Items. Next, the data is grouped by Year and Item and the minimum, maximum and standard deviation of the Values across all Area is calculated. On arranging them in descending order of Years, we can see that the minimum values start from 0 and the maximum values go beyond 7 digits as well. The standard deviation of these values is also very large, especially for the Item “Chickens”. Finally, the Quantile is calculated after grouping by Area and Item. The data is then filtered to show the entries for the Area - “Bahamas, Zimbabwe, and Seychelles” arranged in descending order.
---
title: "Challenge 2 Solutions"
author: "Vishnupriya Varadharaju"
desription: "Data wrangling: using group() and summarise()"
date: "10/12/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- railroads
- faostat
- hotel_bookings
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
options(dplyr.summarise.inform = FALSE)
```
### Working with Birds data set
## Challenge Overview
Today's challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## 1. Read in the Data
```{r}
# Reading the birds data into variable birds_data
birds_data <- read_csv("_data/birds.csv")
head(birds_data)
```
## 2. Describe the data
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
```{r}
#| label: summary
# Finding the dimensions of the data
dim(birds_data)
# Finding the columns of the data and their respective datatypes
str(birds_data)
```
```{r}
# Removing the code columns and retaining the columns with detailed description
# for better readability
new_birds_data <- select(birds_data, !ends_with('Code') & !('Flag'))
head(new_birds_data)
# Getting the summary of the new data
summary(new_birds_data)
```
```{r}
# Finding the number of distinct elements in each column
# Domain
n_distinct(new_birds_data$Domain)
table(new_birds_data$Domain)
```
```{r}
# Area
# Printing only the first 100 entries to save on space
n_distinct(new_birds_data$Area)
table(new_birds_data$Area) %>%
head(100)
```
```{r}
# Element
n_distinct(new_birds_data$Element)
table(new_birds_data$Element)
```
```{r}
# Item
n_distinct(new_birds_data$Item)
table(new_birds_data$Item)
```
```{r}
# Year
n_distinct(new_birds_data$Year)
table(new_birds_data$Year)
```
```{r}
# Unit
n_distinct(new_birds_data$Unit)
table(new_birds_data$Unit)
```
```{r}
# Value
# n_distinct(new_birds_data$Value)
# table(new_birds_data$Value)
```
```{r}
# Flag Description
n_distinct(new_birds_data$Flag.Description)
table(new_birds_data$Flag.Description)
```
```{r}
# Checking for null values in Value column
# Percentage of Null Values in Value Field
nulsval <- ((sum(is.na(new_birds_data$Value)))/length(new_birds_data$Value)) * 100
nulsval
# Omit all the entries with null values
new_birds_data <- na.omit(new_birds_data)
dim(new_birds_data)
```
ANALYSIS:
From the above data analysis, we can see that the total number of observations are 30977 and 14 different fields. Out of the 14 only the important 8 fields were selected for further analysis as the remaining columns were redundant. In the 8 fields, 7 fields are categorical and 1 field is numerical. This data contains entries of the quantities of livestock in different countries around the world across different years. All the entries are "Live Animals" and "Stocks" with Unit of "1000 Head". There are 248 different countries starting from Afghanistan going all the way to Zimbabwe. Out of the 30977 entries, there are Chickens (13074) Ducks (6909), Geese and Guinea Fowls (4136), Pigeons, other birds (1165) and Turkeys (5693). The Year spans from 1961 to 2018 with values increasing from 493 to 577. The Value for each entry is highly varying as it is a numerical field. Under the Flag.Description field, there are 6 sub-groups. While checking for null values in the Value field it's seen that there are 1036 entries with null values. These entries can be removed as they will not be intuitive to the analysis. After omitting null values, the total number of entries are 29941.
## Provide Grouped Summary Statistics
Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
```{r}
# Finding the central tendency - mean, median and std for subgroup Item
# Not calculating mode here as Value is numeric
new_birds_data %>%
select(Item, Value)%>%
group_by(Item) %>%
summarize(mean(Value), median(Value), sd(Value))
```
```{r}
# Finding the mean of Values for subgroups Area and Item
new_birds_data %>%
select(Year, Area, Item, Value)%>%
group_by(Area, Item) %>%
summarize(mean(Value)) %>%
head(20)
```
```{r}
# Finding the mean of Values for subgroups Area and Item
# Getting the mean value of items specifically for India
new_birds_data %>%
select(Year, Area, Item, Value)%>%
group_by(Area, Item) %>%
summarize(mean(Value)) %>%
filter(Area == "India")
```
```{r}
# Finding the mode of Item for subgroups Area and Item and arranging in
# ascending order
new_birds_data %>%
select(Year, Area, Item, Value)%>%
group_by(Area, Item) %>%
summarize(mode = sum(n())) %>%
arrange(mode) %>%
head(10)
new_birds_data %>%
select(Year, Area, Item, Value)%>%
group_by(Area, Item) %>%
summarize(mode = sum(n())) %>%
arrange(desc(mode)) %>%
head(10)
```
```{r}
# Calculating the dispersion - min, max, std
# Finding the min, max and std of Value for Items for each Year globally
new_birds_data %>%
select(Year, Area, Item, Value)%>%
group_by(Year, Item) %>%
summarize(minVal = min(Value, na.rm = TRUE), maxVal = max(Value, na.rm = TRUE),
stdVal = sd(Value, na.rm = TRUE)) %>%
arrange(desc(Year)) %>%
head(20)
```
```{r}
# Calculating the dispersion - quantile
# Finding the min, max and std of Value for Items for each Year globally
new_birds_data %>%
select(Area, Item, Value)%>%
group_by(Area, Item) %>%
summarize(quantVal = quantile(Value, na.rm = TRUE)) %>%
filter(Area == "Bahamas" || Area == "Zimbabwe" || Area == "Seychelles") %>%
arrange(desc(Area))
```
### Explain and Interpret
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
ANALYSIS:
Exploratory data analysis is performed on the cleaned data. First, the dataset is grouped by Item to find the mean, median and standard deviation of Values for each of the Item entries across all Areas and Years. Mean, median and std is the highest for the Item Chickens. Looking at the data, the mean values seem to be higher than the median indicating that they’re all right skewed.
Next, group by Area and Item is done and the mean Value of each category of items for each Country is calculated. Then on filtering specifically for the Area ‘India’, the mean Values for the item “Chickens” and “Ducks” are shown.
Then, the mode of the Item for different Areas is calculated and arranged in ascending order. The Item Chicken occurs the least in Aruba with a mode of 3 and plenty of values with the maximum mode of 58 across Areas and Items.
Next, the data is grouped by Year and Item and the minimum, maximum and standard deviation of the Values across all Area is calculated. On arranging them in descending order of Years, we can see that the minimum values start from 0 and the maximum values go beyond 7 digits as well. The standard deviation of these values is also very large, especially for the Item “Chickens”.
Finally, the Quantile is calculated after grouping by Area and Item. The data is then filtered to show the entries for the Area - “Bahamas, Zimbabwe, and Seychelles” arranged in descending order.