Challenge 2_PriyankaThatikonda

challenge_2
railroads
faostat
hotel_bookings
Data wrangling: using group_by() and summarise()
Author

Meredith Rolfe

Published

August 16, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

  • railroad*.csv or StateCounty2012.xls ⭐
  • FAOstat*.csv or birds.csv ⭐⭐⭐
  • hotel_bookings.csv ⭐⭐⭐⭐
Code
data <- read_csv("_data/birds.csv", show_col_types = FALSE)
print(data)
# A tibble: 30,977 × 14
   Domain Cod…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year
   <chr>        <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl>
 1 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1961  1961
 2 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1962  1962
 3 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1963  1963
 4 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1964  1964
 5 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1965  1965
 6 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1966  1966
 7 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1967  1967
 8 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1968  1968
 9 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1969  1969
10 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1970  1970
# … with 30,967 more rows, 4 more variables: Unit <chr>, Value <dbl>,
#   Flag <chr>, `Flag Description` <chr>, and abbreviated variable names
#   ¹​`Domain Code`, ²​`Area Code`, ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`
Code
head(data)
# A tibble: 6 × 14
  Domai…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1961  1961 1000…
2 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1962  1962 1000…
3 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1963  1963 1000…
4 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1964  1964 1000…
5 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1965  1965 1000…
6 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1966  1966 1000…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹​`Domain Code`, ²​`Area Code`,
#   ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

Code
summary(data)
 Domain Code           Domain            Area Code        Area          
 Length:30977       Length:30977       Min.   :   1   Length:30977      
 Class :character   Class :character   1st Qu.:  79   Class :character  
 Mode  :character   Mode  :character   Median : 156   Mode  :character  
                                       Mean   :1202                     
                                       3rd Qu.: 231                     
                                       Max.   :5504                     
                                                                        
  Element Code    Element            Item Code        Item          
 Min.   :5112   Length:30977       Min.   :1057   Length:30977      
 1st Qu.:5112   Class :character   1st Qu.:1057   Class :character  
 Median :5112   Mode  :character   Median :1068   Mode  :character  
 Mean   :5112                      Mean   :1066                     
 3rd Qu.:5112                      3rd Qu.:1072                     
 Max.   :5112                      Max.   :1083                     
                                                                    
   Year Code         Year          Unit               Value         
 Min.   :1961   Min.   :1961   Length:30977       Min.   :       0  
 1st Qu.:1976   1st Qu.:1976   Class :character   1st Qu.:     171  
 Median :1992   Median :1992   Mode  :character   Median :    1800  
 Mean   :1991   Mean   :1991                      Mean   :   99411  
 3rd Qu.:2005   3rd Qu.:2005                      3rd Qu.:   15404  
 Max.   :2018   Max.   :2018                      Max.   :23707134  
                                                  NA's   :1036      
     Flag           Flag Description  
 Length:30977       Length:30977      
 Class :character   Class :character  
 Mode  :character   Mode  :character  
                                      
                                      
                                      
                                      
Code
dim(data)
[1] 30977    14
Code
colnames(data)
 [1] "Domain Code"      "Domain"           "Area Code"        "Area"            
 [5] "Element Code"     "Element"          "Item Code"        "Item"            
 [9] "Year Code"        "Year"             "Unit"             "Value"           
[13] "Flag"             "Flag Description"
Code
nrow(data)
[1] 30977
Code
num_unique_items <- length(unique(data$Item))
message(sprintf("There are %d unique varieties of animals present in the dataset\n", num_unique_items))
cat(paste(num_unique_items, collapse = ", "))
5
Code
num_unique_areas <- length(unique(data$Area))
message("Total number of areas present in the dataset:\n", num_unique_areas)
cat(paste(num_unique_areas, collapse = ", "))
248
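
The distinct-value checks above can be run for every column at once. The following is a minimal sketch on a small hypothetical tibble (`toy` and its values are invented for illustration and only mimic three columns of birds.csv); the same calls should apply to `data` unchanged.

```r
library(dplyr)

# Hypothetical toy data mimicking three columns of birds.csv
toy <- tibble(
  Area  = c("Albania", "Albania", "Angola", "Angola", "Angola"),
  Item  = c("Chickens", "Ducks", "Chickens", "Chickens", "Turkeys"),
  Value = c(1300, 200, 6075, 6100, NA)
)

# Number of distinct values in every column at once
distinct_counts <- toy %>%
  summarise(across(everything(), n_distinct))

# Number of rows per Item, most frequent first
item_counts <- toy %>%
  count(Item, sort = TRUE)

# Number of missing values in every column
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

print(distinct_counts)
print(item_counts)
print(na_counts)
```

Counting rows per Item and missing values per column is a quick way to see which categories dominate the data and where `na.rm = TRUE` will matter in later summaries.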

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.

Code
data%>%
  select(`Area`) %>%
  n_distinct(.)
[1] 248
Code
cat("The above shows the number of distinct areas\n")
The above shows the number of distinct areas
Code
data%>%
  select(`Year`) %>%
  n_distinct(.)
[1] 58
Code
cat("The above shows the number of distinct years\n")
The above shows the number of distinct years
Code
library(dplyr)
BirdsGrouped <- data %>%
  group_by(Year)
print("Data grouped by year:")
[1] "Data grouped by year:"
Code
view(BirdsGrouped)

ChickenOnly <- BirdsGrouped %>%
  filter(Item == "Chickens")
cat("Data filtered to include only chickens:\n")
Data filtered to include only chickens:
Code
view(ChickenOnly)

ChickenSummary <- ChickenOnly %>%
  summarise(
    mean_stocks = mean(Value, na.rm = TRUE),
    median_stocks = median(Value, na.rm = TRUE),
    # mode: most frequent non-missing value within each year
    mode_stocks = {
      v <- Value[!is.na(Value)]
      unique(v)[which.max(tabulate(match(v, unique(v))))]
    })
cat("Mean, median, and mode values of the chicken stocks\n")
Mean, median, and mode values of the chicken stocks
Code
view(ChickenSummary)
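
Base R has no built-in function for the statistical mode, which is why the summary above builds it from `tabulate()` and `match()`. A small helper makes that logic easier to read and reuse. This is a sketch only: `stat_mode` is a hypothetical name, and returning the first value on ties is a design choice, not something the data requires.

```r
# Hypothetical helper: most frequent non-missing value (first one wins on ties)
stat_mode <- function(x) {
  x <- x[!is.na(x)]                       # drop missing values
  if (length(x) == 0) return(x[NA_integer_])  # typed NA when nothing is left
  ux <- unique(x)                         # candidate values
  # match() maps each element to its position in ux; tabulate() counts them
  ux[which.max(tabulate(match(x, ux)))]
}

example_mode <- stat_mode(c(4, 7, 7, 2, NA))
print(example_mode)
```

Inside `summarise()` the helper could then be called as `mode_stocks = stat_mode(Value)`. For a continuous measure such as stock counts, most values occur only once, so the mode is often just the first value in the group and carries little information.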
Code
GroupByArea <- data %>%
  group_by(`Area`) %>%
  summarise(mean_stock_value = mean(Value, na.rm=TRUE),
            median_stock_value = median(Value, na.rm=TRUE),
            stock_value_sd = sd(Value, na.rm=TRUE),
            min_stock_value = min(Value, na.rm=TRUE),
            max_stock_value = max(Value, na.rm=TRUE),
            first_quartile_stock_value = quantile(Value, 0.25, na.rm=TRUE),
            third_quartile_stock_value = quantile(Value, 0.75, na.rm=TRUE))
print(GroupByArea)
# A tibble: 248 × 8
   Area                mean_st…¹ media…² stock…³ min_s…⁴ max_s…⁵ first…⁶ third…⁷
   <chr>                   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 Afghanistan            8099.   6700    2.82e3    4700   14414  6222.    10565
 2 Africa               196561.  12910.   4.36e5    1213 1897326  5926.    25395
 3 Albania                2278.   1300    2.27e3     200    9494   520      3752
 4 Algeria               17621.     42.5  3.88e4      10  136078    24      2072
 5 American Samoa           41.4    38    1.40e1      24      80    35        40
 6 Americas             856356.  66924.   1.54e6     553 5796289  7497    552649
 7 Angola                 9453.   6075    8.93e3    3400   36500  4925      6930
 8 Antigua and Barbuda      93.6    85    3.97e1      43     160    60.2     130
 9 Argentina             18844.   2355    3.36e4      75  118300   530     10350
10 Armenia                2062.   1528.   2.04e3     120    8934   209.     3865
# … with 238 more rows, and abbreviated variable names ¹​mean_stock_value,
#   ²​median_stock_value, ³​stock_value_sd, ⁴​min_stock_value, ⁵​max_stock_value,
#   ⁶​first_quartile_stock_value, ⁷​third_quartile_stock_value
Code
view(GroupByArea)

GroupByYear <- data %>%
  group_by(`Year`) %>%
  summarise(mean_stock_value = mean(Value, na.rm=TRUE),
            median_stock_value = median(Value, na.rm=TRUE),
            stock_value_sd = sd(Value, na.rm=TRUE),
            min_stock_value = min(Value, na.rm=TRUE),
            max_stock_value = max(Value, na.rm=TRUE),
            first_quartile_stock_value = quantile(Value, 0.25, na.rm=TRUE),
            third_quartile_stock_value = quantile(Value, 0.75, na.rm=TRUE))
print(GroupByYear)
# A tibble: 58 × 8
    Year mean_stock_value median_stock…¹ stock…² min_s…³ max_s…⁴ first…⁵ third…⁶
   <dbl>            <dbl>          <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1  1961           36752.          1033  216931.       1 3906690    102.   6223.
 2  1962           37787.          1014  224935.       1 4048728    105.   6441.
 3  1963           38736.          1106  230985.       1 4163131    106.   6530.
 4  1964           39325.          1103  234108.       0 4231221    106.   6818 
 5  1965           40334.          1104  240537.       0 4349674    106    7000 
 6  1966           41229.          1088. 245576.       0 4445629    109.   7463.
 7  1967           43240.          1193  257592.       0 4666511    119    7935.
 8  1968           44420.          1252. 265750.       0 4823170    117.   7915 
 9  1969           45607.          1267  273871.       0 4988438    106    8255 
10  1970           47706.          1259  285751.       0 5209733    119    8676.
# … with 48 more rows, and abbreviated variable names ¹​median_stock_value,
#   ²​stock_value_sd, ³​min_stock_value, ⁴​max_stock_value,
#   ⁵​first_quartile_stock_value, ⁶​third_quartile_stock_value
Code
view(GroupByYear)
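
The summaries above group by Area and by Year; comparing the bird types themselves calls for grouping by Item. Below is a minimal sketch on a hypothetical tibble (`toy` and its values are invented for illustration); with the real data, `toy` would be replaced by `data`.

```r
library(dplyr)

# Hypothetical toy data with the same column names as birds.csv
toy <- tibble(
  Item  = c("Chickens", "Chickens", "Chickens", "Ducks", "Ducks", "Turkeys"),
  Year  = c(1961, 1962, 1963, 1961, 1962, 1961),
  Value = c(4700, 4900, 5000, 30, NA, 12)
)

# One row per bird type, largest mean stock first
item_summary <- toy %>%
  group_by(Item) %>%
  summarise(
    n_obs        = sum(!is.na(Value)),        # non-missing observations
    mean_value   = mean(Value, na.rm = TRUE),
    median_value = median(Value, na.rm = TRUE),
    sd_value     = sd(Value, na.rm = TRUE)    # NA when only one observation
  ) %>%
  arrange(desc(mean_value))

print(item_summary)
```

Reporting `n_obs` alongside the statistics shows how many values each summary rests on, which matters because `sd()` returns NA for a group with a single observation.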

Explain and Interpret

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.

The goal was to compare the sizes of the five poultry categories in the dataset and to see how stock counts differ across countries and across the period covered, 1961 to 2018. I grouped by Year to follow trends over time and by Area to compare locations. Filtering to chickens reflects my observation that countries report far larger chicken stocks than stocks of the other poultry types.

ChickenSummary gives the mean, median, and mode of chicken stocks for each year, while GroupByYear and GroupByArea add the dispersion measures (standard deviation, minimum, maximum, and quartiles) across all five bird types. In GroupByYear the mean is far above the median in every year shown (36,752 against 1,033 in 1961), so the distribution of stock values is strongly right-skewed, and the mean rises steadily from 36,752 in 1961 to 47,706 in 1970. Part of that skew likely comes from the Area column itself: it mixes individual countries with aggregate regions such as Africa and Americas, so the largest values are probably regional totals rather than country-level counts. Any comparison between countries should keep that in mind.
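
Because the Area column contains regional aggregates (Africa and Americas both appear in the GroupByArea output), it may help to drop them before comparing countries. The sketch below assumes that aggregates carry Area Code values of 5000 and above. That threshold is an assumption, not something confirmed in this post, and should be checked against the actual codes in the data. The tibble `toy` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; codes and values are for illustration only
toy <- tibble(
  `Area Code` = c(2, 3, 5100, 5200),
  Area        = c("Afghanistan", "Albania", "Africa", "Americas"),
  Value       = c(6700, 1300, 12910, 66924)
)

# ASSUMPTION: aggregate regions use Area Code >= 5000; verify before relying on it
countries_only <- toy %>%
  filter(`Area Code` < 5000)

print(countries_only)
```

Running the grouped summaries on `countries_only` instead of the full table would keep regional totals from inflating the means and maxima.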