Challenge 2 Final Solutions

challenge_2
railroads
faostat
hotel_bookings
Data wrangling: using group() and summarise()
Author

Moira AChiong

Published

June 7, 2023

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

  • railroad*.csv or StateCounty2012.xls ⭐
  • FAOstat*.csv or birds.csv ⭐⭐⭐
  • hotel_bookings.csv ⭐⭐⭐⭐

#install packages to read in an excel file

install.packages(“readxl”) library(“readxl”) birds <- read_excel(“C:.xlsx”)

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

#This data set was a much larger dataset than in Challenge 1. There was a mix of categorical and numerical data. There were 14 columns of variables, with 30977 rows of observations. The variable, BIrds$Value had the following summary statistics:

# Value Summary Statistics #Mean 99410.63 #Median 1800 #IQR 15233

#I did some simple proportion statistics on Birds$Item, or seeing the proportionality of chickens, ducks, geese and guinea fowls, pigeons and other birds and turkeys:

# #Chickens Ducks Geese and guinea fowls Pigeons, other birds #0.42205507 0.22303645 0.13351842 0.03760855 #Turkeys #0.18378152

#From this analysis it can be seen that Chickens at 42%, then Ducks at 22% followed by Turkeys (18%) were most prevalent in the dataset. I also tabulated the data by Area and Item, and found that

#In 2018, there were the following instances of Chickens Ducks Geese and guinea fowls Pigeons, other birds Turkeys:

# Chickens Ducks Geese and guinea fowls Pigeons, other birds Turkeys #2018 240 128 78 21 110

#The summary statistics on the item category for the Value variable is as follows. Note that Chickens had the highest mean and median, followed by Ducks and Turkeys.

#$Chickens #Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s # 0 1136 10784 207931 53794 23707134 104

#$Ducks # Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s #0 61 510 23072 4700 1195783 417

#$Geese and guinea fowls #Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s # 0 41 258 10292 1561 390848 139

#$Pigeons, other birds` # Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s #0 1034 2800 6163 7600 57909 58

#$Turkeys #Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s # 0 93 528 15228 3186 473583 318

#Here are some more summary statistics, by the categorical variable Item:

#birds$Item Mean Median IQR #1 Chickens 207930.808 10783.5 52657.0 #2 Ducks 23071.673 510.0 4639.0 #3 Geese and guinea fowls 10291.937 258.0 1520.0 #4 Pigeons, other birds 6163.375 2800.0 6566.5 #5 Turkeys 15227.919 528.0 3093.5

#It is clear from the data set that Chickens, Ducks and Turkeys are the most consumed in the world.

Provide Grouped Summary Statistics

#See above Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

Explain and Interpret

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

#Code # count the number of observations (rows) nrow(birds)

#count the number of variables (columns) ncol(birds)

#drop code columnns, flag, flag description,unit birds_df=subset(birds, select=-c(Year Code, Domain Code, Area Code, Element Code, Item Code, Flag,Flag Description,Unit))

#see what type of data I have

str(birds$Value)

#convert value variable to numeric as.numeric(birds$Value)

#find the mean of the value column mean(birds$Value,na.rm=TRUE)

#find the median of the value column median(birds$Value,na.rm=TRUE)

#find the IQR of the value column IQR(birds$Value, na.rm=TRUE)

#group the data install.packages(“dplyr”) read(“dplyr”) library(dplyr)

#Count frequencies for Domain table1 <-table(birds$Domain)

#count frequencies for Area table2 <-table(birds\(Area) #count percentages by Area table2a <- table(birds\)Area) prop.table(table2a)

#count frequencies for Item table3 <-table(birds\(Item) #count percentages by Item table3a <- table(birds\)Item) prop.table(table3a)

#count frequencies for Area by item table4 <-table(birds\(Area, birds\)Item) ftable(table4) #sort `data

#count frequencies by year table5 <-table(birds\(Year, birds\)Item) ftable(table5) install.packages(“vtable”) library(vtable) st(birds) print(st(birds_df))

#install dplyr install.packages(‘dplyr’) library(dplyr) #group data grp_tbl <- birds %>% group_by(birds$Item) grp_tbl

#summarize on grouped data, run summary statistics birds_grouped <-tapply(birds\(Value, birds\)Item, summary, na.rm=TRUE) birds_grouped

#run more summary statistics on grouped data agg_tbl_1 <- grp_tbl %>% summarize(Mean =mean(Value, na.rm=TRUE), Median =median(Value, na.rm=TRUE), IQR = IQR(Value, na.rm=TRUE)) agg_tbl_1

#END

Provide Grouped Summary Statistics

#Answered above.

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

Explain and Interpret

#Answered Above

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.