Challenge 2

challenge_2

Data wrangling: using group() and summarise()

Author

Daniel Manning

Published

December 28, 2022

Code

library(tidyverse)
library(here)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

railroad*.csv or StateCounty2012.xlsx ⭐
FAOstat*.csv ⭐⭐⭐
hotel_bookings ⭐⭐⭐⭐

Code

livestock <- here("posts","_data","FAOSTAT_livestock.csv")%>%
  read_csv()
livestock

# A tibble: 82,116 × 14
   Domain Cod…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year
   <chr>        <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl>
 1 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1961  1961
 2 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1962  1962
 3 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1963  1963
 4 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1964  1964
 5 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1965  1965
 6 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1966  1966
 7 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1967  1967
 8 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1968  1968
 9 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1969  1969
10 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1970  1970
# … with 82,106 more rows, 4 more variables: Unit <chr>, Value <dbl>,
#   Flag <chr>, `Flag Description` <chr>, and abbreviated variable names
#   ¹`Domain Code`, ²`Area Code`, ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

In the above cell, I used read.csv() to store the FAOSTAT dataset within my livestock variable. Based on the output when calling the variable livestock, the variable types are as follows:

Domain.Code: String Domain: String Area.Code:Integer Area:String Element.Code: Integer Element:String Item.Code: Integer Item: String Year.Code: Integer Year: Integer Unit: String Value: Integer Flag: String Flag.Description String

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

First I used the head() function to determine that the dataset has fourteen variables. This dataset was likely gathered by survey. Then I used dim() to show that their are 82116 rows, representing the total number of values of livestock counts/estimates spanning various countries (Areas), Types (Items), and years (Years). Next I stored the columns into individual variables for ease of use later on. I used the n_distinct() and distinct() functions to show the number of unique entries in each column and to ouput these values.

Code

head(livestock)

# A tibble: 6 × 14
  Domai…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1961  1961 Head 
2 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1962  1962 Head 
3 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1963  1963 Head 
4 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1964  1964 Head 
5 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1965  1965 Head 
6 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1966  1966 Head 
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹`Domain Code`, ²`Area Code`,
#   ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`

Code

dim(livestock)

[1] 82116    14

Code

Domain <- livestock$Domain
Area <- livestock$Area
Element <- livestock$Element
Item <- livestock$Item
Year <- livestock$Year
Unit <- livestock$Unit
Value <- livestock$Value


livestock %>% 
  select(Domain) %>%
  n_distinct(.)

[1] 1

Code

livestock %>% 
  select(Domain) %>%
  distinct()

# A tibble: 1 × 1
  Domain      
  <chr>       
1 Live Animals

Code

livestock %>% 
  select(Area) %>%
  n_distinct(.)

[1] 253

Code

livestock %>% 
  select(Area) %>%
  distinct()

# A tibble: 253 × 1
   Area               
   <chr>              
 1 Afghanistan        
 2 Albania            
 3 Algeria            
 4 American Samoa     
 5 Angola             
 6 Antigua and Barbuda
 7 Argentina          
 8 Armenia            
 9 Aruba              
10 Australia          
# … with 243 more rows

Code

livestock %>% 
  select(Element) %>%
  n_distinct(.)

[1] 1

Code

livestock %>% 
  select(Element) %>%
  distinct()

# A tibble: 1 × 1
  Element
  <chr>  
1 Stocks

Code

livestock %>% 
  select(Item) %>%
  n_distinct(.)

[1] 9

Code

livestock %>% 
  select(Item) %>%
  distinct()

# A tibble: 9 × 1
  Item     
  <chr>    
1 Asses    
2 Camels   
3 Cattle   
4 Goats    
5 Horses   
6 Mules    
7 Sheep    
8 Buffaloes
9 Pigs

Code

livestock %>% 
  select(Year) %>%
  n_distinct(.)

[1] 58

Code

livestock %>% 
  select(Year) %>%
  distinct()

# A tibble: 58 × 1
    Year
   <dbl>
 1  1961
 2  1962
 3  1963
 4  1964
 5  1965
 6  1966
 7  1967
 8  1968
 9  1969
10  1970
# … with 48 more rows

Code

livestock %>% 
  select(Unit) %>%
  n_distinct(.)

[1] 1

Code

livestock %>% 
  select(Unit) %>%
  distinct()

# A tibble: 1 × 1
  Unit 
  <chr>
1 Head

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

I first used the select() function to remove any columns/variables that were redundant or contained only 1 distinct value. I then used the head() to show the first 10 rows our new dataset to confirm that the desired variables were removed. Next, I stored the value (number of a given livestock) in a new variable. I calculated summary statistics of Value, and included the ‘na.rm = TRUE’ parameter so NA values were ignored. Following this, I created a new variable (Area) using the group_by() function. Using the summarise function, I then computed the mean, median, standard deviation, and IQR for the number of livestock, grouped by country. I calculated the summary statistics of the the data grouped by type of animal (Item) and year using the same method.

Code

livestock_new <- livestock %>% select(-c('Domain', 'Domain.Code', 'Area.Code', 'Element', 'Element.Code', 'Item.Code', 'Year.Code', 'Unit', 'Flag'))

Error in `select()`:
! Can't subset columns that don't exist.
✖ Column `Domain.Code` doesn't exist.

Code

head(livestock_new)

Error in head(livestock_new): object 'livestock_new' not found

Code

Value <- livestock_new$Value

Error in eval(expr, envir, enclos): object 'livestock_new' not found

Code

mean(Value, na.rm = TRUE)

[1] 11625569

Code

median(Value, na.rm = TRUE)

[1] 224667

Code

min(Value, na.rm = TRUE)

[1] 0

Code

max(Value, na.rm = TRUE)

[1] 1489744504

Code

sd(Value, na.rm = TRUE)

[1] 64779790

Code

var(Value, na.rm = TRUE)

[1] 4.196421e+15

Code

IQR(Value, na.rm = TRUE)

[1] 2364200

Code

Area <- livestock_new %>% group_by(Area)

Error in group_by(., Area): object 'livestock_new' not found

Code

Area %>% summarise(Value_mean = mean(Value, na.rm = TRUE), Value_median = median(Value, na.rm = TRUE), Value_sd = sd(Value, na.rm = TRUE), Value_IQR = IQR(Value, na.rm = TRUE))

Error in UseMethod("summarise"): no applicable method for 'summarise' applied to an object of class "character"

Code

Item <- livestock_new %>% group_by(Item)

Error in group_by(., Item): object 'livestock_new' not found

Code

Item %>% summarise(Value_mean = mean(Value, na.rm = TRUE), Value_median = median(Value, na.rm = TRUE), Value_sd = sd(Value, na.rm = TRUE), Value_IQR = IQR(Value, na.rm = TRUE))

Error in UseMethod("summarise"): no applicable method for 'summarise' applied to an object of class "character"

Code

Year <- livestock_new %>% group_by(Year)

Error in group_by(., Year): object 'livestock_new' not found

Code

Year %>% summarise(Value_mean = mean(Value, na.rm = TRUE), Value_median = median(Value, na.rm = TRUE), Value_sd = sd(Value, na.rm = TRUE), Value_IQR = IQR(Value, na.rm = TRUE))

Error in UseMethod("summarise"): no applicable method for 'summarise' applied to an object of class "c('double', 'numeric')"

Explain and Interpret

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.

I chose to group the data by Area, Unit, and Year in order to see how the values differed across these observations. I noticed that the mean Values differed by country, year, and animal type. One interesting observation was that greatest average Value was the number of Cattle while the smallest average Value was mules. In addition, the total Value tended to increase every year.