challenge_2
Data wrangling: using group() and summarise()
Author

Daniel Manning

Published

December 28, 2022

Code
library(tidyverse)
library(here)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

  • railroad*.csv or StateCounty2012.xlsx ⭐
  • FAOstat*.csv ⭐⭐⭐
  • hotel_bookings ⭐⭐⭐⭐
Code
livestock <- here("posts","_data","FAOSTAT_livestock.csv")%>%
  read_csv()
livestock
# A tibble: 82,116 × 14
   Domain Cod…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year
   <chr>        <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl>
 1 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1961  1961
 2 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1962  1962
 3 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1963  1963
 4 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1964  1964
 5 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1965  1965
 6 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1966  1966
 7 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1967  1967
 8 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1968  1968
 9 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1969  1969
10 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1970  1970
# … with 82,106 more rows, 4 more variables: Unit <chr>, Value <dbl>,
#   Flag <chr>, `Flag Description` <chr>, and abbreviated variable names
#   ¹​`Domain Code`, ²​`Area Code`, ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

In the above cell, I used read.csv() to store the FAOSTAT dataset within my livestock variable. Based on the output when calling the variable livestock, the variable types are as follows:

Domain.Code: String Domain: String Area.Code:Integer Area:String Element.Code: Integer Element:String Item.Code: Integer Item: String Year.Code: Integer Year: Integer Unit: String Value: Integer Flag: String Flag.Description String

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

First I used the head() function to determine that the dataset has fourteen variables. This dataset was likely gathered by survey. Then I used dim() to show that their are 82116 rows, representing the total number of values of livestock counts/estimates spanning various countries (Areas), Types (Items), and years (Years). Next I stored the columns into individual variables for ease of use later on. I used the n_distinct() and distinct() functions to show the number of unique entries in each column and to ouput these values.

Code
head(livestock)
# A tibble: 6 × 14
  Domai…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1961  1961 Head 
2 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1962  1962 Head 
3 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1963  1963 Head 
4 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1964  1964 Head 
5 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1965  1965 Head 
6 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1966  1966 Head 
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹​`Domain Code`, ²​`Area Code`,
#   ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`
Code
dim(livestock)
[1] 82116    14
Code
Domain <- livestock$Domain
Area <- livestock$Area
Element <- livestock$Element
Item <- livestock$Item
Year <- livestock$Year
Unit <- livestock$Unit
Value <- livestock$Value


livestock %>% 
  select(Domain) %>%
  n_distinct(.)
[1] 1
Code
livestock %>% 
  select(Domain) %>%
  distinct()
# A tibble: 1 × 1
  Domain      
  <chr>       
1 Live Animals
Code
livestock %>% 
  select(Area) %>%
  n_distinct(.)
[1] 253
Code
livestock %>% 
  select(Area) %>%
  distinct()
# A tibble: 253 × 1
   Area               
   <chr>              
 1 Afghanistan        
 2 Albania            
 3 Algeria            
 4 American Samoa     
 5 Angola             
 6 Antigua and Barbuda
 7 Argentina          
 8 Armenia            
 9 Aruba              
10 Australia          
# … with 243 more rows
Code
livestock %>% 
  select(Element) %>%
  n_distinct(.)
[1] 1
Code
livestock %>% 
  select(Element) %>%
  distinct()
# A tibble: 1 × 1
  Element
  <chr>  
1 Stocks 
Code
livestock %>% 
  select(Item) %>%
  n_distinct(.)
[1] 9
Code
livestock %>% 
  select(Item) %>%
  distinct()
# A tibble: 9 × 1
  Item     
  <chr>    
1 Asses    
2 Camels   
3 Cattle   
4 Goats    
5 Horses   
6 Mules    
7 Sheep    
8 Buffaloes
9 Pigs     
Code
livestock %>% 
  select(Year) %>%
  n_distinct(.)
[1] 58
Code
livestock %>% 
  select(Year) %>%
  distinct()
# A tibble: 58 × 1
    Year
   <dbl>
 1  1961
 2  1962
 3  1963
 4  1964
 5  1965
 6  1966
 7  1967
 8  1968
 9  1969
10  1970
# … with 48 more rows
Code
livestock %>% 
  select(Unit) %>%
  n_distinct(.)
[1] 1
Code
livestock %>% 
  select(Unit) %>%
  distinct()
# A tibble: 1 × 1
  Unit 
  <chr>
1 Head 

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

I first used the select() function to remove any columns/variables that were redundant or contained only 1 distinct value. I then used the head() to show the first 10 rows our new dataset to confirm that the desired variables were removed. Next, I stored the value (number of a given livestock) in a new variable. I calculated summary statistics of Value, and included the ‘na.rm = TRUE’ parameter so NA values were ignored. Following this, I created a new variable (Area) using the group_by() function. Using the summarise function, I then computed the mean, median, standard deviation, and IQR for the number of livestock, grouped by country. I calculated the summary statistics of the the data grouped by type of animal (Item) and year using the same method.

Code
livestock_new <- livestock %>% select(-c('Domain', 'Domain.Code', 'Area.Code', 'Element', 'Element.Code', 'Item.Code', 'Year.Code', 'Unit', 'Flag'))
Error in `select()`:
! Can't subset columns that don't exist.
✖ Column `Domain.Code` doesn't exist.
Code
head(livestock_new)
Error in head(livestock_new): object 'livestock_new' not found
Code
Value <- livestock_new$Value
Error in eval(expr, envir, enclos): object 'livestock_new' not found
Code
mean(Value, na.rm = TRUE)
[1] 11625569
Code
median(Value, na.rm = TRUE)
[1] 224667
Code
min(Value, na.rm = TRUE)
[1] 0
Code
max(Value, na.rm = TRUE)
[1] 1489744504
Code
sd(Value, na.rm = TRUE)
[1] 64779790
Code
var(Value, na.rm = TRUE)
[1] 4.196421e+15
Code
IQR(Value, na.rm = TRUE)
[1] 2364200
Code
Area <- livestock_new %>% group_by(Area)
Error in group_by(., Area): object 'livestock_new' not found
Code
Area %>% summarise(Value_mean = mean(Value, na.rm = TRUE), Value_median = median(Value, na.rm = TRUE), Value_sd = sd(Value, na.rm = TRUE), Value_IQR = IQR(Value, na.rm = TRUE))
Error in UseMethod("summarise"): no applicable method for 'summarise' applied to an object of class "character"
Code
Item <- livestock_new %>% group_by(Item)
Error in group_by(., Item): object 'livestock_new' not found
Code
Item %>% summarise(Value_mean = mean(Value, na.rm = TRUE), Value_median = median(Value, na.rm = TRUE), Value_sd = sd(Value, na.rm = TRUE), Value_IQR = IQR(Value, na.rm = TRUE))
Error in UseMethod("summarise"): no applicable method for 'summarise' applied to an object of class "character"
Code
Year <- livestock_new %>% group_by(Year)
Error in group_by(., Year): object 'livestock_new' not found
Code
Year %>% summarise(Value_mean = mean(Value, na.rm = TRUE), Value_median = median(Value, na.rm = TRUE), Value_sd = sd(Value, na.rm = TRUE), Value_IQR = IQR(Value, na.rm = TRUE))
Error in UseMethod("summarise"): no applicable method for 'summarise' applied to an object of class "c('double', 'numeric')"

Explain and Interpret

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.

I chose to group the data by Area, Unit, and Year in order to see how the values differed across these observations. I noticed that the mean Values differed by country, year, and animal type. One interesting observation was that greatest average Value was the number of Cattle while the smallest average Value was mules. In addition, the total Value tended to increase every year.