challenge_2
Jaswanth Reddy Kommuru
birds
Data wrangling: using group() and summarise()
Author

Jaswanth Reddy Kommuru

Published

May 8, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

  • railroad*.csv or StateCounty2012.xls ⭐
  • FAOstat*.csv or birds.csv ⭐⭐⭐
  • hotel_bookings.csv ⭐⭐⭐⭐
Code
birdsdata<-read_csv("~/Documents/601/601_Spring_2023/posts/_data/birds.csv")

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

The data appears to be related to live animal stocks, specifically chickens, in various countries over multiple years. Here’s a high-level description of the data:

The dataset consists of cases representing different observations and variables providing specific information about those observations. Each case represents a particular combination of country, year, and other attributes. The variables in the dataset are as follows:

Domain Code: Code indicating the domain or category of the data. Domain: Descriptive label for the domain. Area Code: Code representing the country or area. Area: Name of the country or area. Element Code: Code indicating the element or attribute being measured. Element: Description of the element or attribute. Item Code: Code representing the specific item or category being measured. Item: Description of the item or category. Year Code: Code representing the year of observation. Year: Specific year of observation. Unit: Unit of measurement for the values. Value: Numerical value or measurement for the specific observation. Flag: Flag indicating any additional information or peculiarities about the observation. Flag Description: Description providing further details about the flag.

It seems that the data was likely gathered by the Food and Agriculture Organization (FAO) as estimates of live animal stocks, focusing on chickens. The observations include the country, year, and the stock of chickens measured in units of 1000 head. The dataset also includes flags that might indicate specific circumstances or notes related to the observations.

Code
head(birdsdata)
# A tibble: 6 × 14
  `Domain Code` Domain      `Area Code` Area  `Element Code` Element `Item Code`
  <chr>         <chr>             <dbl> <chr>          <dbl> <chr>         <dbl>
1 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
2 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
3 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
4 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
5 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
6 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
#   Value <dbl>, Flag <chr>, `Flag Description` <chr>
Code
dim(birdsdata)
[1] 30977    14
Code
tail(birdsdata)
# A tibble: 6 × 14
  `Domain Code` Domain      `Area Code` Area  `Element Code` Element `Item Code`
  <chr>         <chr>             <dbl> <chr>          <dbl> <chr>         <dbl>
1 QA            Live Anima…        5504 Poly…           5112 Stocks         1068
2 QA            Live Anima…        5504 Poly…           5112 Stocks         1068
3 QA            Live Anima…        5504 Poly…           5112 Stocks         1068
4 QA            Live Anima…        5504 Poly…           5112 Stocks         1068
5 QA            Live Anima…        5504 Poly…           5112 Stocks         1068
6 QA            Live Anima…        5504 Poly…           5112 Stocks         1068
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
#   Value <dbl>, Flag <chr>, `Flag Description` <chr>
Code
colnames(birdsdata)
 [1] "Domain Code"      "Domain"           "Area Code"        "Area"            
 [5] "Element Code"     "Element"          "Item Code"        "Item"            
 [9] "Year Code"        "Year"             "Unit"             "Value"           
[13] "Flag"             "Flag Description"
Code
nrow(birdsdata)
[1] 30977
Code
str(birdsdata)
spc_tbl_ [30,977 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Domain Code     : chr [1:30977] "QA" "QA" "QA" "QA" ...
 $ Domain          : chr [1:30977] "Live Animals" "Live Animals" "Live Animals" "Live Animals" ...
 $ Area Code       : num [1:30977] 2 2 2 2 2 2 2 2 2 2 ...
 $ Area            : chr [1:30977] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ Element Code    : num [1:30977] 5112 5112 5112 5112 5112 ...
 $ Element         : chr [1:30977] "Stocks" "Stocks" "Stocks" "Stocks" ...
 $ Item Code       : num [1:30977] 1057 1057 1057 1057 1057 ...
 $ Item            : chr [1:30977] "Chickens" "Chickens" "Chickens" "Chickens" ...
 $ Year Code       : num [1:30977] 1961 1962 1963 1964 1965 ...
 $ Year            : num [1:30977] 1961 1962 1963 1964 1965 ...
 $ Unit            : chr [1:30977] "1000 Head" "1000 Head" "1000 Head" "1000 Head" ...
 $ Value           : num [1:30977] 4700 4900 5000 5300 5500 5800 6600 6290 6300 6000 ...
 $ Flag            : chr [1:30977] "F" "F" "F" "F" ...
 $ Flag Description: chr [1:30977] "FAO estimate" "FAO estimate" "FAO estimate" "FAO estimate" ...
 - attr(*, "spec")=
  .. cols(
  ..   `Domain Code` = col_character(),
  ..   Domain = col_character(),
  ..   `Area Code` = col_double(),
  ..   Area = col_character(),
  ..   `Element Code` = col_double(),
  ..   Element = col_character(),
  ..   `Item Code` = col_double(),
  ..   Item = col_character(),
  ..   `Year Code` = col_double(),
  ..   Year = col_double(),
  ..   Unit = col_character(),
  ..   Value = col_double(),
  ..   Flag = col_character(),
  ..   `Flag Description` = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
Code
summary(birdsdata)
 Domain Code           Domain            Area Code        Area          
 Length:30977       Length:30977       Min.   :   1   Length:30977      
 Class :character   Class :character   1st Qu.:  79   Class :character  
 Mode  :character   Mode  :character   Median : 156   Mode  :character  
                                       Mean   :1202                     
                                       3rd Qu.: 231                     
                                       Max.   :5504                     
                                                                        
  Element Code    Element            Item Code        Item          
 Min.   :5112   Length:30977       Min.   :1057   Length:30977      
 1st Qu.:5112   Class :character   1st Qu.:1057   Class :character  
 Median :5112   Mode  :character   Median :1068   Mode  :character  
 Mean   :5112                      Mean   :1066                     
 3rd Qu.:5112                      3rd Qu.:1072                     
 Max.   :5112                      Max.   :1083                     
                                                                    
   Year Code         Year          Unit               Value         
 Min.   :1961   Min.   :1961   Length:30977       Min.   :       0  
 1st Qu.:1976   1st Qu.:1976   Class :character   1st Qu.:     171  
 Median :1992   Median :1992   Mode  :character   Median :    1800  
 Mean   :1991   Mean   :1991                      Mean   :   99411  
 3rd Qu.:2005   3rd Qu.:2005                      3rd Qu.:   15404  
 Max.   :2018   Max.   :2018                      Max.   :23707134  
                                                  NA's   :1036      
     Flag           Flag Description  
 Length:30977       Length:30977      
 Class :character   Class :character  
 Mode  :character   Mode  :character  
                                      
                                      
                                      
                                      

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

Code
 birdsdata%>%
  select(`Area`) %>%
  n_distinct(.)
[1] 248
Code
 birdsdata%>%
  select(`Year`) %>%
  n_distinct(.)
[1] 58
Code
distinct(birdsdata,`Domain`)
# A tibble: 1 × 1
  Domain      
  <chr>       
1 Live Animals
Code
req_data<-select(birdsdata,Area,Item,Year,Value)
req_data %>%
  group_by(Year) %>%
  summarise(Total_sum=sum(Value,na.rm=TRUE)) %>% 
  arrange(desc(`Total_sum`))
# A tibble: 58 × 2
    Year Total_sum
   <dbl>     <dbl>
 1  2018 102772478
 2  2017 100702358
 3  2016  99205812
 4  2015  94430088
 5  2014  92055902
 6  2013  91572785
 7  2012  89857937
 8  2010  89244853
 9  2011  87518486
10  2009  86817163
# ℹ 48 more rows

Population of the live animals in the available data of all the countries in different years and we can observe from the above table that as the years are increasing generally the count of the live animals is also increasing.

Code
birdsdata %>%
  group_by(Item)%>%
  summarise(average_value = mean(Value, na.rm = TRUE), Medain_value=median(Value, na.rm = TRUE))
# A tibble: 5 × 3
  Item                   average_value Medain_value
  <chr>                          <dbl>        <dbl>
1 Chickens                     207931.       10784.
2 Ducks                         23072.         510 
3 Geese and guinea fowls        10292.         258 
4 Pigeons, other birds           6163.        2800 
5 Turkeys                       15228.         528 
Code
Item_group <-
birdsdata %>%
  group_by(`Item`) %>%
  select(Item,Value) %>%
  summarise(mean_stock_value = mean(Value, na.rm=TRUE),
            median_stock_value = median(Value, na.rm=TRUE),
            stock_value_sd = sd(Value, na.rm=TRUE),
            min_stock_value = min(Value, na.rm=TRUE),
            max_stock_value = max(Value, na.rm=TRUE),
            first_quartile_stock_value = quantile(Value, 0.25, na.rm=TRUE),
            third_quartile_stock_value = quantile(Value, 0.75, na.rm=TRUE))
Item_group
# A tibble: 5 × 8
  Item        mean_stock_value median_stock_value stock_value_sd min_stock_value
  <chr>                  <dbl>              <dbl>          <dbl>           <dbl>
1 Chickens             207931.             10784.       1081629.               0
2 Ducks                 23072.               510         110621.               0
3 Geese and …           10292.               258          44489.               0
4 Pigeons, o…            6163.              2800           8481.               0
5 Turkeys               15228.               528          56416.               0
# ℹ 3 more variables: max_stock_value <dbl>, first_quartile_stock_value <dbl>,
#   third_quartile_stock_value <dbl>
Code
Area_group <-
birdsdata %>%
  group_by(`Area`) %>%
  summarise(mean_stock_value = mean(Value, na.rm=TRUE),
            median_stock_value = median(Value, na.rm=TRUE),
            stock_value_sd = sd(Value, na.rm=TRUE),
            min_stock_value = min(Value, na.rm=TRUE),
            max_stock_value = max(Value, na.rm=TRUE),
            first_quartile_stock_value = quantile(Value, 0.25, na.rm=TRUE),
            third_quartile_stock_value = quantile(Value, 0.75, na.rm=TRUE))
Area_group
# A tibble: 248 × 8
   Area       mean_stock_value median_stock_value stock_value_sd min_stock_value
   <chr>                 <dbl>              <dbl>          <dbl>           <dbl>
 1 Afghanist…           8099.              6700           2819.             4700
 2 Africa             196561.             12910.        435740.             1213
 3 Albania              2278.              1300           2268.              200
 4 Algeria             17621.                42.5        38830.               10
 5 American …             41.4               38             14.0              24
 6 Americas           856356.             66924.       1539316.              553
 7 Angola               9453.              6075           8928.             3400
 8 Antigua a…             93.6               85             39.7              43
 9 Argentina           18844.              2355          33620.               75
10 Armenia              2062.              1528.          2044.              120
# ℹ 238 more rows
# ℹ 3 more variables: max_stock_value <dbl>, first_quartile_stock_value <dbl>,
#   third_quartile_stock_value <dbl>
Code
Year_group <- birdsdata %>%
  group_by(`Year`) %>%
  summarise(mean_stock_value = mean(Value, na.rm=TRUE),
            median_stock_value = median(Value, na.rm=TRUE),
            stock_value_sd = sd(Value, na.rm=TRUE),
            min_stock_value = min(Value, na.rm=TRUE),
            max_stock_value = max(Value, na.rm=TRUE),
            first_quartile_stock_value = quantile(Value, 0.25, na.rm=TRUE),
            third_quartile_stock_value = quantile(Value, 0.75, na.rm=TRUE))
print(Year_group)
# A tibble: 58 × 8
    Year mean_stock_value median_stock_value stock_value_sd min_stock_value
   <dbl>            <dbl>              <dbl>          <dbl>           <dbl>
 1  1961           36752.              1033         216931.               1
 2  1962           37787.              1014         224935.               1
 3  1963           38736.              1106         230985.               1
 4  1964           39325.              1103         234108.               0
 5  1965           40334.              1104         240537.               0
 6  1966           41229.              1088.        245576.               0
 7  1967           43240.              1193         257592.               0
 8  1968           44420.              1252.        265750.               0
 9  1969           45607.              1267         273871.               0
10  1970           47706.              1259         285751.               0
# ℹ 48 more rows
# ℹ 3 more variables: max_stock_value <dbl>, first_quartile_stock_value <dbl>,
#   third_quartile_stock_value <dbl>

Explain and Interpret

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.

Our objective was to analyze the sizes of the five livestock categories within the dataset and determine the poultry quantities in different countries within the specified time period. Through a specific grouping approach, we observed a notable disparity between the presence of chickens and other types of poultry in these countries. It became evident that the countries had significantly higher numbers of chicken livestock compared to other poultry varieties.

Code
birdsdata %>%
  group_by(Area)%>%
  summarize(mean_stock_value = mean(Value, na.rm = TRUE))
# A tibble: 248 × 2
   Area                mean_stock_value
   <chr>                          <dbl>
 1 Afghanistan                   8099. 
 2 Africa                      196561. 
 3 Albania                       2278. 
 4 Algeria                      17621. 
 5 American Samoa                  41.4
 6 Americas                    856356. 
 7 Angola                        9453. 
 8 Antigua and Barbuda             93.6
 9 Argentina                    18844. 
10 Armenia                       2062. 
# ℹ 238 more rows