Challenge 2 Solutions

challenge_2

railroads

faostat

hotel_bookings

Author

Vishnupriya Varadharaju

Published

October 12, 2022

Code

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

options(dplyr.summarise.inform = FALSE)

Working with Birds data set

Challenge Overview

Today’s challenge is to

read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
provide summary statistics for different interesting groups within the data, and interpret those statistics

1. Read in the Data

Code

# Reading the birds data into variable birds_data
birds_data <- read_csv("_data/birds.csv")
head(birds_data)

# A tibble: 6 × 14
  Domai…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1961  1961 1000…
2 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1962  1962 1000…
3 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1963  1963 1000…
4 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1964  1964 1000…
5 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1965  1965 1000…
6 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1966  1966 1000…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹`Domain Code`, ²`Area Code`,
#   ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`

2. Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

Code

# Finding the dimensions of the data
dim(birds_data)

[1] 30977    14

Code

# Finding the columns of the data and their respective datatypes
str(birds_data)

spc_tbl_ [30,977 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Domain Code     : chr [1:30977] "QA" "QA" "QA" "QA" ...
 $ Domain          : chr [1:30977] "Live Animals" "Live Animals" "Live Animals" "Live Animals" ...
 $ Area Code       : num [1:30977] 2 2 2 2 2 2 2 2 2 2 ...
 $ Area            : chr [1:30977] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ Element Code    : num [1:30977] 5112 5112 5112 5112 5112 ...
 $ Element         : chr [1:30977] "Stocks" "Stocks" "Stocks" "Stocks" ...
 $ Item Code       : num [1:30977] 1057 1057 1057 1057 1057 ...
 $ Item            : chr [1:30977] "Chickens" "Chickens" "Chickens" "Chickens" ...
 $ Year Code       : num [1:30977] 1961 1962 1963 1964 1965 ...
 $ Year            : num [1:30977] 1961 1962 1963 1964 1965 ...
 $ Unit            : chr [1:30977] "1000 Head" "1000 Head" "1000 Head" "1000 Head" ...
 $ Value           : num [1:30977] 4700 4900 5000 5300 5500 5800 6600 6290 6300 6000 ...
 $ Flag            : chr [1:30977] "F" "F" "F" "F" ...
 $ Flag Description: chr [1:30977] "FAO estimate" "FAO estimate" "FAO estimate" "FAO estimate" ...
 - attr(*, "spec")=
  .. cols(
  ..   `Domain Code` = col_character(),
  ..   Domain = col_character(),
  ..   `Area Code` = col_double(),
  ..   Area = col_character(),
  ..   `Element Code` = col_double(),
  ..   Element = col_character(),
  ..   `Item Code` = col_double(),
  ..   Item = col_character(),
  ..   `Year Code` = col_double(),
  ..   Year = col_double(),
  ..   Unit = col_character(),
  ..   Value = col_double(),
  ..   Flag = col_character(),
  ..   `Flag Description` = col_character()
  .. )
 - attr(*, "problems")=<externalptr>

Code

# Removing the code columns and retaining the columns with detailed description
# for better readability
new_birds_data <- select(birds_data, !ends_with('Code') & !('Flag'))
head(new_birds_data)

# A tibble: 6 × 8
  Domain       Area        Element Item      Year Unit      Value Flag Descrip…¹
  <chr>        <chr>       <chr>   <chr>    <dbl> <chr>     <dbl> <chr>         
1 Live Animals Afghanistan Stocks  Chickens  1961 1000 Head  4700 FAO estimate  
2 Live Animals Afghanistan Stocks  Chickens  1962 1000 Head  4900 FAO estimate  
3 Live Animals Afghanistan Stocks  Chickens  1963 1000 Head  5000 FAO estimate  
4 Live Animals Afghanistan Stocks  Chickens  1964 1000 Head  5300 FAO estimate  
5 Live Animals Afghanistan Stocks  Chickens  1965 1000 Head  5500 FAO estimate  
6 Live Animals Afghanistan Stocks  Chickens  1966 1000 Head  5800 FAO estimate  
# … with abbreviated variable name ¹`Flag Description`

Code

# Getting the summary of the new data
summary(new_birds_data)

    Domain              Area             Element              Item          
 Length:30977       Length:30977       Length:30977       Length:30977      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
      Year          Unit               Value          Flag Description  
 Min.   :1961   Length:30977       Min.   :       0   Length:30977      
 1st Qu.:1976   Class :character   1st Qu.:     171   Class :character  
 Median :1992   Mode  :character   Median :    1800   Mode  :character  
 Mean   :1991                      Mean   :   99411                     
 3rd Qu.:2005                      3rd Qu.:   15404                     
 Max.   :2018                      Max.   :23707134                     
                                   NA's   :1036

Code

# Finding the number of distinct elements in each column 

# Domain
n_distinct(new_birds_data$Domain)

[1] 1

Code

table(new_birds_data$Domain)


Live Animals 
       30977

Code

# Area
# Printing only the first 100 entries to save on space
n_distinct(new_birds_data$Area)

[1] 248

Code

table(new_birds_data$Area) %>%
  head(100)


                          Afghanistan                                Africa 
                                   58                                   290 
                              Albania                               Algeria 
                                  232                                   232 
                       American Samoa                              Americas 
                                   58                                   232 
                               Angola                   Antigua and Barbuda 
                                   58                                    58 
                            Argentina                               Armenia 
                                  232                                    54 
                                Aruba                                  Asia 
                                   29                                   290 
                            Australia             Australia and New Zealand 
                                  174                                   232 
                              Austria                            Azerbaijan 
                                  232                                    54 
                              Bahamas                               Bahrain 
                                   58                                    58 
                           Bangladesh                              Barbados 
                                  116                                   116 
                              Belarus                               Belgium 
                                   81                                    76 
                   Belgium-Luxembourg                                Belize 
                                  156                                   174 
                                Benin                               Bermuda 
                                   58                                   110 
                               Bhutan      Bolivia (Plurinational State of) 
                                   58                                   174 
               Bosnia and Herzegovina                              Botswana 
                                  108                                    58 
                               Brazil                     Brunei Darussalam 
                                  174                                   116 
                             Bulgaria                          Burkina Faso 
                                  232                                    58 
                              Burundi                            Cabo Verde 
                                   86                                    58 
                             Cambodia                              Cameroon 
                                  116                                    58 
                               Canada                             Caribbean 
                                  232                                   232 
                       Cayman Islands              Central African Republic 
                                   53                                   116 
                      Central America                          Central Asia 
                                  174                                   108 
                                 Chad                                 Chile 
                                   58                                   116 
                 China, Hong Kong SAR                      China, Macao SAR 
                                  279                                    58 
                      China, mainland             China, Taiwan Province of 
                                  174                                   232 
                             Colombia                               Comoros 
                                   58                                    58 
                                Congo                          Cook Islands 
                                   58                                   109 
                           Costa Rica                         Côte d'Ivoire 
                                   58                                    86 
                              Croatia                                  Cuba 
                                  108                                    58 
                               Cyprus                               Czechia 
                                  277                                   104 
                       Czechoslovakia Democratic People's Republic of Korea 
                                  128                                   116 
     Democratic Republic of the Congo                               Denmark 
                                   58                                   232 
                             Dominica                    Dominican Republic 
                                   58                                    58 
                       Eastern Africa                          Eastern Asia 
                                  232                                   290 
                       Eastern Europe                               Ecuador 
                                  232                                   232 
                                Egypt                           El Salvador 
                                  290                                    58 
                    Equatorial Guinea                               Eritrea 
                                  116                                    26 
                              Estonia                              Eswatini 
                                  108                                    58 
                             Ethiopia                          Ethiopia PDR 
                                   26                                    32 
                               Europe           Falkland Islands (Malvinas) 
                                  290                                    58 
                                 Fiji                               Finland 
                                  174                                   166 
                               France                         French Guyana 
                                  290                                   116 
                     French Polynesia                                 Gabon 
                                  116                                    58 
                               Gambia                               Georgia 
                                   58                                    54 
                              Germany                                 Ghana 
                                  232                                    58 
                               Greece                               Grenada 
                                  290                                    58 
                           Guadeloupe                                  Guam 
                                  154                                    58 
                            Guatemala                                Guinea 
                                   58                                    58 
                        Guinea-Bissau                                Guyana 
                                   58                                    58 
                                Haiti                              Honduras 
                                  232                                    58

Code

# Element
n_distinct(new_birds_data$Element)

[1] 1

Code

table(new_birds_data$Element)


Stocks 
 30977

Code

# Item
n_distinct(new_birds_data$Item)

[1] 5

Code

table(new_birds_data$Item)


              Chickens                  Ducks Geese and guinea fowls 
                 13074                   6909                   4136 
  Pigeons, other birds                Turkeys 
                  1165                   5693

Code

# Year
n_distinct(new_birds_data$Year)

[1] 58

Code

table(new_birds_data$Year)


1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 
 493  493  493  493  494  495  495  495  498  498  498  498  498  499  499  499 
1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 
 498  498  497  496  498  498  495  498  499  499  500  502  503  512  514  569 
1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 
 574  574  574  574  574  574  574  575  575  575  575  575  575  576  576  576 
2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 
 576  576  576  577  577  577  577  577  577  577

Code

# Unit
n_distinct(new_birds_data$Unit)

[1] 1

Code

table(new_birds_data$Unit)


1000 Head 
    30977

Code

# Value
# n_distinct(new_birds_data$Value)
# table(new_birds_data$Value)

Code

# Flag Description
n_distinct(new_birds_data$Flag.Description)

[1] 0

Code

table(new_birds_data$Flag.Description)

< table of extent 0 >

Code

# Checking for null values in Value column
# Percentage of Null Values in Value Field
nulsval <- ((sum(is.na(new_birds_data$Value)))/length(new_birds_data$Value)) * 100
nulsval

[1] 3.344417

Code

# Omit all the entries with null values
new_birds_data <- na.omit(new_birds_data)
dim(new_birds_data)

[1] 29941     8

ANALYSIS:

From the above data analysis, we can see that the total number of observations are 30977 and 14 different fields. Out of the 14 only the important 8 fields were selected for further analysis as the remaining columns were redundant. In the 8 fields, 7 fields are categorical and 1 field is numerical. This data contains entries of the quantities of livestock in different countries around the world across different years. All the entries are “Live Animals” and “Stocks” with Unit of “1000 Head”. There are 248 different countries starting from Afghanistan going all the way to Zimbabwe. Out of the 30977 entries, there are Chickens (13074) Ducks (6909), Geese and Guinea Fowls (4136), Pigeons, other birds (1165) and Turkeys (5693). The Year spans from 1961 to 2018 with values increasing from 493 to 577. The Value for each entry is highly varying as it is a numerical field. Under the Flag.Description field, there are 6 sub-groups. While checking for null values in the Value field it’s seen that there are 1036 entries with null values. These entries can be removed as they will not be intuitive to the analysis. After omitting null values, the total number of entries are 29941.

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

Code

# Finding the central tendency - mean, median and std for subgroup Item
# Not calculating mode here as Value is numeric
new_birds_data %>%
  select(Item, Value)%>%
  group_by(Item) %>%
  summarize(mean(Value), median(Value), sd(Value))

# A tibble: 5 × 4
  Item                   `mean(Value)` `median(Value)` `sd(Value)`
  <chr>                          <dbl>           <dbl>       <dbl>
1 Chickens                     207931.          10784.    1081629.
2 Ducks                         23072.            510      110621.
3 Geese and guinea fowls        10292.            258       44489.
4 Pigeons, other birds           6163.           2800        8481.
5 Turkeys                       15228.            528       56416.

Code

# Finding the mean of Values for subgroups Area and Item
new_birds_data %>%
  select(Year, Area, Item, Value)%>%
  group_by(Area, Item) %>%
  summarize(mean(Value)) %>%
  head(20)

# A tibble: 20 × 3
# Groups:   Area [7]
   Area           Item                   `mean(Value)`
   <chr>          <chr>                          <dbl>
 1 Afghanistan    Chickens                      8099. 
 2 Africa         Chickens                    936779. 
 3 Africa         Ducks                        13639. 
 4 Africa         Geese and guinea fowls       12164. 
 5 Africa         Pigeons, other birds         11222. 
 6 Africa         Turkeys                       9004. 
 7 Albania        Chickens                      4055. 
 8 Albania        Ducks                          558. 
 9 Albania        Geese and guinea fowls         396. 
10 Albania        Turkeys                        750. 
11 Algeria        Chickens                     70378. 
12 Algeria        Ducks                           36.3
13 Algeria        Geese and guinea fowls          19.9
14 Algeria        Turkeys                         50.6
15 American Samoa Chickens                        41.4
16 Americas       Chickens                   3163543. 
17 Americas       Ducks                        20757. 
18 Americas       Geese and guinea fowls         689. 
19 Americas       Turkeys                     240435. 
20 Angola         Chickens                      9453.

Code

# Finding the mean of Values for subgroups Area and Item
# Getting the mean value of items specifically for India
new_birds_data %>%
  select(Year, Area, Item, Value)%>%
  group_by(Area, Item) %>%
  summarize(mean(Value)) %>%
  filter(Area == "India")

# A tibble: 2 × 3
# Groups:   Area [1]
  Area  Item     `mean(Value)`
  <chr> <chr>            <dbl>
1 India Chickens       335921.
2 India Ducks           19518.

Code

# Finding the mode of Item for subgroups Area and Item and arranging in 
# ascending order

new_birds_data %>%
  select(Year, Area, Item, Value)%>%
  group_by(Area, Item) %>%
  summarize(mode = sum(n())) %>%
  arrange(mode) %>%
  head(10)

# A tibble: 10 × 3
# Groups:   Area [7]
   Area        Item                    mode
   <chr>       <chr>                  <int>
 1 Aruba       Chickens                   3
 2 South Sudan Chickens                   7
 3 Sudan       Chickens                   7
 4 Lebanon     Turkeys                    8
 5 Montenegro  Chickens                  13
 6 Serbia      Chickens                  13
 7 Serbia      Ducks                     13
 8 Serbia      Geese and guinea fowls    13
 9 Serbia      Turkeys                   13
10 Belgium     Geese and guinea fowls    14

Code

new_birds_data %>%
  select(Year, Area, Item, Value)%>%
  group_by(Area, Item) %>%
  summarize(mode = sum(n())) %>%
  arrange(desc(mode)) %>%
  head(10)

# A tibble: 10 × 3
# Groups:   Area [4]
   Area        Item                    mode
   <chr>       <chr>                  <int>
 1 Afghanistan Chickens                  58
 2 Africa      Chickens                  58
 3 Africa      Ducks                     58
 4 Africa      Geese and guinea fowls    58
 5 Africa      Pigeons, other birds      58
 6 Africa      Turkeys                   58
 7 Albania     Chickens                  58
 8 Algeria     Chickens                  58
 9 Algeria     Ducks                     58
10 Algeria     Geese and guinea fowls    58

Code

# Calculating the dispersion - min, max, std
# Finding the min, max and std of Value for Items for each Year globally

new_birds_data %>%
  select(Year, Area, Item, Value)%>%
  group_by(Year, Item) %>%
  summarize(minVal = min(Value, na.rm = TRUE), maxVal = max(Value, na.rm = TRUE),
            stdVal = sd(Value, na.rm = TRUE)) %>%
  arrange(desc(Year)) %>%
  head(20)

# A tibble: 20 × 5
# Groups:   Year [4]
    Year Item                   minVal   maxVal   stdVal
   <dbl> <chr>                   <dbl>    <dbl>    <dbl>
 1  2018 Chickens                    4 23707134 1910971.
 2  2018 Ducks                       1  1124917  163342.
 3  2018 Geese and guinea fowls      0   365485   73359.
 4  2018 Pigeons, other birds      153    28795    7013.
 5  2018 Turkeys                     0   466787   63609.
 6  2017 Chickens                    4 23212565 1867386.
 7  2017 Ducks                       1  1110234  162060.
 8  2017 Geese and guinea fowls      0   359384   72359.
 9  2017 Pigeons, other birds      152    29333    7156.
10  2017 Turkeys                     0   464073   63422.
11  2016 Chickens                    4 22826754 1838043.
12  2016 Ducks                       1  1119590  164051.
13  2016 Geese and guinea fowls      0   361012   72914.
14  2016 Pigeons, other birds      150    28737    7029.
15  2016 Turkeys                     0   465359   63495.
16  2015 Chickens                    4 21678753 1732607.
17  2015 Ducks                       0  1096515  158538.
18  2015 Geese and guinea fowls      0   355167   69373.
19  2015 Pigeons, other birds      137    31290    7691.
20  2015 Turkeys                     0   445796   60814.

Code

# Calculating the dispersion - quantile
# Finding the min, max and std of Value for Items for each Year globally

new_birds_data %>%
  select(Area, Item, Value)%>%
  group_by(Area, Item) %>%
  summarize(quantVal = quantile(Value, na.rm = TRUE)) %>%
  filter(Area == "Bahamas" || Area == "Zimbabwe" || Area == "Seychelles") %>%
  arrange(desc(Area))

# A tibble: 30 × 3
# Groups:   Area, Item [6]
   Area     Item     quantVal
   <chr>    <chr>       <dbl>
 1 Zimbabwe Chickens   6500  
 2 Zimbabwe Chickens   8025  
 3 Zimbabwe Chickens  11500  
 4 Zimbabwe Chickens  19937. 
 5 Zimbabwe Chickens  30000  
 6 Zimbabwe Ducks         2  
 7 Zimbabwe Ducks         7.5
 8 Zimbabwe Ducks        42.5
 9 Zimbabwe Ducks        75  
10 Zimbabwe Ducks        86  
# … with 20 more rows

Explain and Interpret

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.

ANALYSIS:

Exploratory data analysis is performed on the cleaned data. First, the dataset is grouped by Item to find the mean, median and standard deviation of Values for each of the Item entries across all Areas and Years. Mean, median and std is the highest for the Item Chickens. Looking at the data, the mean values seem to be higher than the median indicating that they’re all right skewed. Next, group by Area and Item is done and the mean Value of each category of items for each Country is calculated. Then on filtering specifically for the Area ‘India’, the mean Values for the item “Chickens” and “Ducks” are shown. Then, the mode of the Item for different Areas is calculated and arranged in ascending order. The Item Chicken occurs the least in Aruba with a mode of 3 and plenty of values with the maximum mode of 58 across Areas and Items. Next, the data is grouped by Year and Item and the minimum, maximum and standard deviation of the Values across all Area is calculated. On arranging them in descending order of Years, we can see that the minimum values start from 0 and the maximum values go beyond 7 digits as well. The standard deviation of these values is also very large, especially for the Item “Chickens”. Finally, the Quantile is calculated after grouping by Area and Item. The data is then filtered to show the entries for the Area - “Bahamas, Zimbabwe, and Seychelles” arranged in descending order.