challenge_2
Anirudh Lakkaraju
birds.csv
Data wrangling: using group() and summarise()
Author

Anirudh Lakkaraju

Published

April 30, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

  • railroad*.csv or StateCounty2012.xls ⭐
  • FAOstat*.csv or birds.csv ⭐⭐⭐
  • hotel_bookings.csv ⭐⭐⭐⭐
Code
df <- read_csv("_data/birds.csv")
df
# A tibble: 30,977 × 14
   `Domain Code` Domain     `Area Code` Area  `Element Code` Element `Item Code`
   <chr>         <chr>            <dbl> <chr>          <dbl> <chr>         <dbl>
 1 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 2 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 3 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 4 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 5 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 6 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 7 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 8 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 9 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
10 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
# ℹ 30,967 more rows
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
#   Value <dbl>, Flag <chr>, `Flag Description` <chr>
Code
head(df)
# A tibble: 6 × 14
  `Domain Code` Domain      `Area Code` Area  `Element Code` Element `Item Code`
  <chr>         <chr>             <dbl> <chr>          <dbl> <chr>         <dbl>
1 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
2 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
3 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
4 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
5 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
6 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
#   Value <dbl>, Flag <chr>, `Flag Description` <chr>
Code
tail(df)
# A tibble: 6 × 14
  `Domain Code` Domain      `Area Code` Area  `Element Code` Element `Item Code`
  <chr>         <chr>             <dbl> <chr>          <dbl> <chr>         <dbl>
1 QA            Live Anima…        5504 Poly…           5112 Stocks         1068
2 QA            Live Anima…        5504 Poly…           5112 Stocks         1068
3 QA            Live Anima…        5504 Poly…           5112 Stocks         1068
4 QA            Live Anima…        5504 Poly…           5112 Stocks         1068
5 QA            Live Anima…        5504 Poly…           5112 Stocks         1068
6 QA            Live Anima…        5504 Poly…           5112 Stocks         1068
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
#   Value <dbl>, Flag <chr>, `Flag Description` <chr>
Code
nrow(df)
[1] 30977
Code
summary(df)
 Domain Code           Domain            Area Code        Area          
 Length:30977       Length:30977       Min.   :   1   Length:30977      
 Class :character   Class :character   1st Qu.:  79   Class :character  
 Mode  :character   Mode  :character   Median : 156   Mode  :character  
                                       Mean   :1202                     
                                       3rd Qu.: 231                     
                                       Max.   :5504                     
                                                                        
  Element Code    Element            Item Code        Item          
 Min.   :5112   Length:30977       Min.   :1057   Length:30977      
 1st Qu.:5112   Class :character   1st Qu.:1057   Class :character  
 Median :5112   Mode  :character   Median :1068   Mode  :character  
 Mean   :5112                      Mean   :1066                     
 3rd Qu.:5112                      3rd Qu.:1072                     
 Max.   :5112                      Max.   :1083                     
                                                                    
   Year Code         Year          Unit               Value         
 Min.   :1961   Min.   :1961   Length:30977       Min.   :       0  
 1st Qu.:1976   1st Qu.:1976   Class :character   1st Qu.:     171  
 Median :1992   Median :1992   Mode  :character   Median :    1800  
 Mean   :1991   Mean   :1991                      Mean   :   99411  
 3rd Qu.:2005   3rd Qu.:2005                      3rd Qu.:   15404  
 Max.   :2018   Max.   :2018                      Max.   :23707134  
                                                  NA's   :1036      
     Flag           Flag Description  
 Length:30977       Length:30977      
 Class :character   Class :character  
 Mode  :character   Mode  :character  
                                      
                                      
                                      
                                      
Code
str(df)
spc_tbl_ [30,977 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Domain Code     : chr [1:30977] "QA" "QA" "QA" "QA" ...
 $ Domain          : chr [1:30977] "Live Animals" "Live Animals" "Live Animals" "Live Animals" ...
 $ Area Code       : num [1:30977] 2 2 2 2 2 2 2 2 2 2 ...
 $ Area            : chr [1:30977] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ Element Code    : num [1:30977] 5112 5112 5112 5112 5112 ...
 $ Element         : chr [1:30977] "Stocks" "Stocks" "Stocks" "Stocks" ...
 $ Item Code       : num [1:30977] 1057 1057 1057 1057 1057 ...
 $ Item            : chr [1:30977] "Chickens" "Chickens" "Chickens" "Chickens" ...
 $ Year Code       : num [1:30977] 1961 1962 1963 1964 1965 ...
 $ Year            : num [1:30977] 1961 1962 1963 1964 1965 ...
 $ Unit            : chr [1:30977] "1000 Head" "1000 Head" "1000 Head" "1000 Head" ...
 $ Value           : num [1:30977] 4700 4900 5000 5300 5500 5800 6600 6290 6300 6000 ...
 $ Flag            : chr [1:30977] "F" "F" "F" "F" ...
 $ Flag Description: chr [1:30977] "FAO estimate" "FAO estimate" "FAO estimate" "FAO estimate" ...
 - attr(*, "spec")=
  .. cols(
  ..   `Domain Code` = col_character(),
  ..   Domain = col_character(),
  ..   `Area Code` = col_double(),
  ..   Area = col_character(),
  ..   `Element Code` = col_double(),
  ..   Element = col_character(),
  ..   `Item Code` = col_double(),
  ..   Item = col_character(),
  ..   `Year Code` = col_double(),
  ..   Year = col_double(),
  ..   Unit = col_character(),
  ..   Value = col_double(),
  ..   Flag = col_character(),
  ..   `Flag Description` = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

Cleaning the data

We clean all variables with the word "code" in the name as these variables are only unique identifiers for other variables.

Code
df_cleaned<-df%>%select(-c(contains("Code")))
head(df_cleaned)
# A tibble: 6 × 9
  Domain       Area     Element Item   Year Unit  Value Flag  `Flag Description`
  <chr>        <chr>    <chr>   <chr> <dbl> <chr> <dbl> <chr> <chr>             
1 Live Animals Afghani… Stocks  Chic…  1961 1000…  4700 F     FAO estimate      
2 Live Animals Afghani… Stocks  Chic…  1962 1000…  4900 F     FAO estimate      
3 Live Animals Afghani… Stocks  Chic…  1963 1000…  5000 F     FAO estimate      
4 Live Animals Afghani… Stocks  Chic…  1964 1000…  5300 F     FAO estimate      
5 Live Animals Afghani… Stocks  Chic…  1965 1000…  5500 F     FAO estimate      
6 Live Animals Afghani… Stocks  Chic…  1966 1000…  5800 F     FAO estimate      

Describe the data

We now try to describe the data and further filter the data if the necessary.

We first find the number of unique values in each variable

Code
length(unique(df_cleaned$Domain))
[1] 1
Code
length(unique(df_cleaned$Area))
[1] 248
Code
length(unique(df_cleaned$Element))
[1] 1
Code
length(unique(df_cleaned$Item))
[1] 5
Code
length(unique(df_cleaned$Year))
[1] 58
Code
length(unique(df_cleaned$Unit))
[1] 1
Code
length(unique(df_cleaned$Value))
[1] 11496
Code
length(unique(df_cleaned$Flag))
[1] 6
Code
length(unique(df_cleaned$`Flag Description`))
[1] 6

As we can see that Domain, Element and Unit have only 1 unique value, they can also be dropped

Code
df_cleaned<-df_cleaned%>%select(-c(Domain, Element, Unit))
unique(df_cleaned$Item)
[1] "Chickens"               "Ducks"                  "Geese and guinea fowls"
[4] "Turkeys"                "Pigeons, other birds"  

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

Code
df_cleaned %>%
  group_by(Item)%>%
  summarise(avg_stocks = mean(Value, na.rm = TRUE))
# A tibble: 5 × 2
  Item                   avg_stocks
  <chr>                       <dbl>
1 Chickens                  207931.
2 Ducks                      23072.
3 Geese and guinea fowls     10292.
4 Pigeons, other birds        6163.
5 Turkeys                    15228.

Explain and Interpret

We try to understand the sizes of the five livestock in the datset and which countries have how much poultry during the given time frame. By grouping in the following way, it can be seen that the countries have a lot more chicken as livestock compared to the other poultry.

Code
df_cleaned %>%
  group_by(Area)%>%
  summarize(avg_stocks = mean(Value, na.rm = TRUE))
# A tibble: 248 × 2
   Area                avg_stocks
   <chr>                    <dbl>
 1 Afghanistan             8099. 
 2 Africa                196561. 
 3 Albania                 2278. 
 4 Algeria                17621. 
 5 American Samoa            41.4
 6 Americas              856356. 
 7 Angola                  9453. 
 8 Antigua and Barbuda       93.6
 9 Argentina              18844. 
10 Armenia                 2062. 
# ℹ 238 more rows