Challenge 2 Solutions

challenge_2
solution
Data wrangling: using group() and summarise()
Author

Meredith Rolfe

Published

March 1, 2023

library(tidyverse)
library(readxl)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE,
                      message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

The railroad data contain 2931 county-level aggregated counts of the number of railroad employees in 2012. Counties are embedded within States, and all 50 states plus Canada, overseas addresses in Asia and Europe, and Washington, DC are represented.

Data Summaries

This is a concise summary of more extensive work we did in Challenge 1, and is an example of how you should describe data in public-facing work. A “public-facing” version of your work contains all critical details needed to replicate your work, but doesn’t contain a point-by-point rundown of the mental process you went through to get to that point. (Your internal analysis file should probably walk through that mental process, however!)

Read the data

Here, we are just reusing the code from Challenge 1. We are using the excel version, to ensure that we get Canada, and are renaming the missing data in county for Canada so that we don’t accidently filter that observation out.

railroad<-read_excel("_data/StateCounty2012.xls",
                     skip = 4,
                     col_names= c("state", "delete",  "county",
                                  "delete", "employees"))%>%
  select(!contains("delete"))%>%
  filter(!str_detect(state, "Total"))

railroad<-head(railroad, -2)%>%
  mutate(county = ifelse(state=="CANADA", "CANADA", county))

railroad

How many values does X take on?

Now, lets practice grouping our data and using other dplyr commands that make data wrangling super easy. First, lets take a closer look at how we counted the number of unique states last week. First, we selected the state column. Then we used the n_distinct command - which replicates the base R commands length(unique(var)).

across()

Instead of counting the number of distinct values one at a time, I am doing an operation on two columns at the same time using across.

railroad%>%
  summarise(across(c(state,county), n_distinct))

Check this out - many counties have the same name! There are 2931 state-county cases, but only 1710 distinct county names. This is one reason it is so critical to understand “what is a case” when you are working with your data - otherwise you might accidentally collapse or group information that isn’t intended to be grouped.

How many total X are in group Y?

Suppose we want to know the total number of railroad employees was in 2012, what is the best way to sum up all of the values in the data? The summarize function is useful for doing calculations across some or all of a data set.

railroad%>%
  summarise(total_employees = sum(employees))

Around a quarter of a million people were employed in the railroad industry in 2012. While this may seem like a lot, it was a significant decrease in employment from a few decades earlier, according to official Bureau of Labor Statistics (BLS) estimates.

You may notice that the BLS estimates are significantly lower than the ones we are using, provided by the Railroad Retirement Board. Given that the Railroad Retirement Board has “gold-standard” data on railroad employees, this discrepancy suggests that many people who work in the railroad industry are being classified in a different way by BLS statistics.

Which X have the most Y?

Suppose we are interested in which county names are duplicated most often, or in which states have the most railroad employees. We can use the same basic approach to answer both “Which X have the most Y?” style questions

df-print: paged (YAML)

When you are using df-print: paged in your yaml header, or are using tibbles, there is no need to rely on the head(data) command to limit your results to the top 10 of a list.

railroad%>%
  group_by(state)%>%
  summarise(total_employees = sum(employees),
            num_counties = n())%>%
  arrange(desc(total_employees))

Looking at the top 10 states in terms of total railroad employment, a few trends emerge. Several of the top 10 states with geographical activity are highly populous and geographically large. California, Texas, New York, Pennsylvania, Ohio, Illinois, and Georgia are all amonst the top-10 largest states - so it would make sense if there are more railroad employees in large states.

But railroads are spread out along geography, and thus we might also expect square mileage within a state to be related to state railroad employment - not just state population. For example, Texas is around 65% larger (in area) than California, and has around 50% more railroad employees.

There appear to be multiple exceptions to both rules, however. If geography plus population were the primary factors explaining railroad employment, then California would be ranked higher than New York and Illinois, and New York would likely rank higher than Illinois. However, Illinois - Chicago in particular - is a hub of railroad activity, and thus Illinois’ higher ranking is likely reflecting hub activity and employment. New York is a hub for the East Coast in particular. While California may have hubs of train activity in Los Angeles or San Francisco, the Northeast has a higher density of train stations and almost certainly generates more passenger and freight miles than the larger and more populous California.

This final factor - the existence of heavily used train routes probably explains the high railroad employment in states like Nebraska, Indiana and Missouri - all of which lay along a major railway route between New York and Chicago, and then out to California. Anyway who has played Ticket to Ride probably recognizes many of these routes!

Go further

A fun exercise once you are comfortable with joins and map-based visualizatinos would be to join the railroad employment data to information about state population and geographic area, and also mapping information about railway routes, to get more insight into the factors driving railroad employment.

The FAOSTAT sheets are excerpts of the FAOSTAT database provided by the Food and Agriculture Association, an agency of the United Nations. We are using the file birds.csv that includes estimates of the stock of five different types of poultry (Chickens, Ducks, Geese and guinea fowls, Turkeys, and Pigeons/Others) for 248 areas for 58 years between 1961-2018. Estimated stocks are given in 1000 head.

Because we know (from challenge 1) that several of those areas include aggregated data (e.g., ) we are going to remove the aggregations, remove the unnecessary variables, and only work with the grouping variables available in the data. In a future challenge, we will join back on more data from the FAO to recreate regional groupings.

birds<-read_csv("_data/birds.csv")%>%
  select(-c(contains("Code"), Element, Domain, Unit))%>%
  filter(Flag!="A")
birds

What is the average of Y for X groups?

Lets suppose we are starting off and know nothing about poultry stocks around the world, where could we start? Perhaps we could try to get a sense of the relative sizes of stocks of each of the five types of poultry, identified in the variable Item. Additionally, because some of the values may be missing, lets find out how many of the estimates are missing.

birds%>%
  group_by(Item)%>%
  summarise(avg_stocks = mean(Value, na.rm=TRUE),
            med_stocks = median(Value, na.rm=TRUE),
            n_missing = sum(is.na(Value)))

On average, we can see that countries have far more chickens as livestock (\(\bar{x}\)=58.4million head) than other livestock birds (average stocks range between 2 and 10 million head). However, the information from the median stock counts suggest that there is significant variation across countries along with a strong right hand skew with regards to chicken stocks. The median number of chickens in a country is 3.8 million head - significantly less than the mean of almost 60 million. Overall, missing data doesn’t seem to be a huge issue, so we will just use na.rm=TRUE and not worry too much about the missingness for now.

It could be that stock head counts have changed over time, so lets try selecting two points in time and seeing whether or not average livestock counts are changing.

pivot-wider

It can be difficult to visually report data in tidy format. For example, it is tough to compare two values when they are on different rows. In this example, I use pivot-wider to swap a tidy grouping variable into multiple columns to be more “table-like.” I then do some manual formatting to make it easy to compare the grouped estimates.

t1<-birds%>%
  filter(Year %in% c(1966, 2016))%>%
  group_by(Year, Item)%>%
  summarise(avg_stocks = mean(Value, na.rm=TRUE),
            med_stocks = median(Value, na.rm=TRUE))%>%
  pivot_wider(names_from = Year, values_from = c(avg_stocks, med_stocks))

knitr::kable(t1,
             digits=0,format.args = list(big.mark = ","),
             col.names = c("Type", "1966", "2016",
                           "1996", "2016"))%>%
  kableExtra::kable_styling(htmltable_class = "lightable-minimal")%>%
  kableExtra::add_header_above(c(" " = 1, "Mean Stock (1000)" = 2,
                                 "Median Stock (1000)" = 2))
Error in loadNamespace(x): there is no package called 'kableExtra'
kable and kableExtra

I manually adjust table formatting (column names, setting significant digits, adding a comma) using kable and add a header row using kableExtra. Because df-print=paged option is set to make it easier to scroll through longer data frames, I need to directly specify that I want to produce an htmltable - not a default kable/rmarkdown table - when I use kable and kableExtra formatting directly.

Sure enough, it does look like stocks have changed significantly over time. The expansion of country-level chicken stocks over five decades between 1966 and 2016 are most noteworthy, with both average and median stock count going up by a factor of 4. Pigeons have never been very popular, and average stocks have actually decreased over the same time period while the other less popular bird - turkeys - saw significant incrases in stock count. Some countries increased specialization in goose and/or guinea fowl production, as the average stock count went up but the median went down over the same period.

Go further

It could be really interesting to graph the rise and fall of poultry stocks overtime with these data, and match these changes to changes in population size and country GDP. Another option would be to match the countries back to regional groupings available from the UN FAO, a future “data join” challenge.

This data set contains 119,390 hotel bookings from two hotels (“City Hotel” and “Resort Hotel”) with an arrival date between July 2015 and August 2017 (more detail needed), including bookings that were later cancelled. Each row contains extensive information about a single booking:

  • the booking process (e.g., lead time, booking agent, deposit, changes made)
  • booking details (e.g., scheduled arrival date, length of stay)
  • guest requests (e.g., type of room, meal(s) included, car parking)
  • booking channel (e.g., distribution, market segment, corporate affiliation for )
  • guest information (e.g., child/adult, passport country)
  • guest prior bookings (e.g., repeat customer, prior cancellations)

The data are a de-identified extract of real hotel demand data, made available by the authors.

Read and make sense of the data

The hotel bookings data set is new to challenge 2, so we need to go through the same process we did during challenge 1 to find out more about the data. Lets read in the data and use the summmaryTools package to get an overview of the data set.

bookings<-read_csv("_data/hotel_bookings.csv")

print(summarytools::dfSummary(bookings,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

bookings

Dimensions: 119390 x 32
Duplicates: 31994
Variable Stats / Values Freqs (% of Valid) Graph Missing
hotel [character]
1. City Hotel
2. Resort Hotel
79330 ( 66.4% )
40060 ( 33.6% )
0 (0.0%)
is_canceled [numeric]
Min : 0
Mean : 0.4
Max : 1
0 : 75166 ( 63.0% )
1 : 44224 ( 37.0% )
0 (0.0%)
lead_time [numeric]
Mean (sd) : 104 (106.9)
min ≤ med ≤ max:
0 ≤ 69 ≤ 737
IQR (CV) : 142 (1)
479 distinct values 0 (0.0%)
arrival_date_year [numeric]
Mean (sd) : 2016.2 (0.7)
min ≤ med ≤ max:
2015 ≤ 2016 ≤ 2017
IQR (CV) : 1 (0)
2015 : 21996 ( 18.4% )
2016 : 56707 ( 47.5% )
2017 : 40687 ( 34.1% )
0 (0.0%)
arrival_date_month [character]
1. August
2. July
3. May
4. October
5. April
6. June
7. September
8. March
9. February
10. November
[ 2 others ]
13877 ( 11.6% )
12661 ( 10.6% )
11791 ( 9.9% )
11160 ( 9.3% )
11089 ( 9.3% )
10939 ( 9.2% )
10508 ( 8.8% )
9794 ( 8.2% )
8068 ( 6.8% )
6794 ( 5.7% )
12709 ( 10.6% )
0 (0.0%)
arrival_date_week_number [numeric]
Mean (sd) : 27.2 (13.6)
min ≤ med ≤ max:
1 ≤ 28 ≤ 53
IQR (CV) : 22 (0.5)
53 distinct values 0 (0.0%)
arrival_date_day_of_month [numeric]
Mean (sd) : 15.8 (8.8)
min ≤ med ≤ max:
1 ≤ 16 ≤ 31
IQR (CV) : 15 (0.6)
31 distinct values 0 (0.0%)
stays_in_weekend_nights [numeric]
Mean (sd) : 0.9 (1)
min ≤ med ≤ max:
0 ≤ 1 ≤ 19
IQR (CV) : 2 (1.1)
17 distinct values 0 (0.0%)
stays_in_week_nights [numeric]
Mean (sd) : 2.5 (1.9)
min ≤ med ≤ max:
0 ≤ 2 ≤ 50
IQR (CV) : 2 (0.8)
35 distinct values 0 (0.0%)
adults [numeric]
Mean (sd) : 1.9 (0.6)
min ≤ med ≤ max:
0 ≤ 2 ≤ 55
IQR (CV) : 0 (0.3)
14 distinct values 0 (0.0%)
children [numeric]
Mean (sd) : 0.1 (0.4)
min ≤ med ≤ max:
0 ≤ 0 ≤ 10
IQR (CV) : 0 (3.8)
0 : 110796 ( 92.8% )
1 : 4861 ( 4.1% )
2 : 3652 ( 3.1% )
3 : 76 ( 0.1% )
10 : 1 ( 0.0% )
4 (0.0%)
babies [numeric]
Mean (sd) : 0 (0.1)
min ≤ med ≤ max:
0 ≤ 0 ≤ 10
IQR (CV) : 0 (12.3)
0 : 118473 ( 99.2% )
1 : 900 ( 0.8% )
2 : 15 ( 0.0% )
9 : 1 ( 0.0% )
10 : 1 ( 0.0% )
0 (0.0%)
meal [character]
1. BB
2. FB
3. HB
4. SC
5. Undefined
92310 ( 77.3% )
798 ( 0.7% )
14463 ( 12.1% )
10650 ( 8.9% )
1169 ( 1.0% )
0 (0.0%)
country [character]
1. PRT
2. GBR
3. FRA
4. ESP
5. DEU
6. ITA
7. IRL
8. BEL
9. BRA
10. NLD
[ 168 others ]
48590 ( 40.7% )
12129 ( 10.2% )
10415 ( 8.7% )
8568 ( 7.2% )
7287 ( 6.1% )
3766 ( 3.2% )
3375 ( 2.8% )
2342 ( 2.0% )
2224 ( 1.9% )
2104 ( 1.8% )
18590 ( 15.6% )
0 (0.0%)
market_segment [character]
1. Aviation
2. Complementary
3. Corporate
4. Direct
5. Groups
6. Offline TA/TO
7. Online TA
8. Undefined
237 ( 0.2% )
743 ( 0.6% )
5295 ( 4.4% )
12606 ( 10.6% )
19811 ( 16.6% )
24219 ( 20.3% )
56477 ( 47.3% )
2 ( 0.0% )
0 (0.0%)
distribution_channel [character]
1. Corporate
2. Direct
3. GDS
4. TA/TO
5. Undefined
6677 ( 5.6% )
14645 ( 12.3% )
193 ( 0.2% )
97870 ( 82.0% )
5 ( 0.0% )
0 (0.0%)
is_repeated_guest [numeric]
Min : 0
Mean : 0
Max : 1
0 : 115580 ( 96.8% )
1 : 3810 ( 3.2% )
0 (0.0%)
previous_cancellations [numeric]
Mean (sd) : 0.1 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 26
IQR (CV) : 0 (9.7)
15 distinct values 0 (0.0%)
previous_bookings_not_canceled [numeric]
Mean (sd) : 0.1 (1.5)
min ≤ med ≤ max:
0 ≤ 0 ≤ 72
IQR (CV) : 0 (10.9)
73 distinct values 0 (0.0%)
reserved_room_type [character]
1. A
2. B
3. C
4. D
5. E
6. F
7. G
8. H
9. L
10. P
85994 ( 72.0% )
1118 ( 0.9% )
932 ( 0.8% )
19201 ( 16.1% )
6535 ( 5.5% )
2897 ( 2.4% )
2094 ( 1.8% )
601 ( 0.5% )
6 ( 0.0% )
12 ( 0.0% )
0 (0.0%)
assigned_room_type [character]
1. A
2. D
3. E
4. F
5. G
6. C
7. B
8. H
9. I
10. K
[ 2 others ]
74053 ( 62.0% )
25322 ( 21.2% )
7806 ( 6.5% )
3751 ( 3.1% )
2553 ( 2.1% )
2375 ( 2.0% )
2163 ( 1.8% )
712 ( 0.6% )
363 ( 0.3% )
279 ( 0.2% )
13 ( 0.0% )
0 (0.0%)
booking_changes [numeric]
Mean (sd) : 0.2 (0.7)
min ≤ med ≤ max:
0 ≤ 0 ≤ 21
IQR (CV) : 0 (2.9)
21 distinct values 0 (0.0%)
deposit_type [character]
1. No Deposit
2. Non Refund
3. Refundable
104641 ( 87.6% )
14587 ( 12.2% )
162 ( 0.1% )
0 (0.0%)
agent [character]
1. 9
2. NULL
3. 240
4. 1
5. 14
6. 7
7. 6
8. 250
9. 241
10. 28
[ 324 others ]
31961 ( 26.8% )
16340 ( 13.7% )
13922 ( 11.7% )
7191 ( 6.0% )
3640 ( 3.0% )
3539 ( 3.0% )
3290 ( 2.8% )
2870 ( 2.4% )
1721 ( 1.4% )
1666 ( 1.4% )
33250 ( 27.8% )
0 (0.0%)
company [character]
1. NULL
2. 40
3. 223
4. 67
5. 45
6. 153
7. 174
8. 219
9. 281
10. 154
[ 343 others ]
112593 ( 94.3% )
927 ( 0.8% )
784 ( 0.7% )
267 ( 0.2% )
250 ( 0.2% )
215 ( 0.2% )
149 ( 0.1% )
141 ( 0.1% )
138 ( 0.1% )
133 ( 0.1% )
3793 ( 3.2% )
0 (0.0%)
days_in_waiting_list [numeric]
Mean (sd) : 2.3 (17.6)
min ≤ med ≤ max:
0 ≤ 0 ≤ 391
IQR (CV) : 0 (7.6)
128 distinct values 0 (0.0%)
customer_type [character]
1. Contract
2. Group
3. Transient
4. Transient-Party
4076 ( 3.4% )
577 ( 0.5% )
89613 ( 75.1% )
25124 ( 21.0% )
0 (0.0%)
adr [numeric]
Mean (sd) : 101.8 (50.5)
min ≤ med ≤ max:
-6.4 ≤ 94.6 ≤ 5400
IQR (CV) : 56.7 (0.5)
8879 distinct values 0 (0.0%)
required_car_parking_spaces [numeric]
Mean (sd) : 0.1 (0.2)
min ≤ med ≤ max:
0 ≤ 0 ≤ 8
IQR (CV) : 0 (3.9)
0 : 111974 ( 93.8% )
1 : 7383 ( 6.2% )
2 : 28 ( 0.0% )
3 : 3 ( 0.0% )
8 : 2 ( 0.0% )
0 (0.0%)
total_of_special_requests [numeric]
Mean (sd) : 0.6 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 5
IQR (CV) : 1 (1.4)
0 : 70318 ( 58.9% )
1 : 33226 ( 27.8% )
2 : 12969 ( 10.9% )
3 : 2497 ( 2.1% )
4 : 340 ( 0.3% )
5 : 40 ( 0.0% )
0 (0.0%)
reservation_status [character]
1. Canceled
2. Check-Out
3. No-Show
43017 ( 36.0% )
75166 ( 63.0% )
1207 ( 1.0% )
0 (0.0%)
reservation_status_date [Date]
min : 2014-10-17
med : 2016-08-07
max : 2017-09-14
range : 2y 10m 28d
926 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-03-03

Wow - there is a lot of information available here. Lets scan it and see what jumps out. First we can see that the summary function claims that there are almost 32,000 duplicates in the data. However, this is likely an artifact of the way that the bookings have been de-identified, and may reflect bookings with identical details but different individuals who made the bookings.

We can see that we are provided with limited information about the hotel. Hotels are identified only as “City” Hotel” or a “Resort Hotel”. Maybe we have bookings from only two hotels? Lets tentatively add that to our data description.

There is a flag for whether a booking is cancelled. This means that our universe of cases includes bookings where the guests showed up, as well as bookings that were later cancelled - we can add that to our data description.

There are multiple fields with the arrival date - year, month, etc. For now, we can tell that the arrival date of the bookings ranges between 2015 and 2017. More precise identification of the date range could be more easily done next challenge when we can recode the arrival date information using lubridate.But maybe it is possible to find out which values of month co-occur with specific years?

Which values of Y are nested within X?

To approach this question, we can narrow the dataset down to just the two variables of interest, and then use the distinct command.

bookings%>%
  select(arrival_date_year, arrival_date_month)%>%
  distinct()

Great - now we now that all bookings have arrival dates between June 2015 and August 2017, and can add that to the data description. Just for fun, lets see if we can confirm that the dates are the same for both hotels.

slice()

This would be easier to investigate with proper date variables, but I am using slice to find the first and last row for each hotel, by position. This avoids printing out a long data list we have to scroll through, but would fail if the hotels had different sets of arrival month-year pairs.

d<-bookings%>%
  select(arrival_date_year, arrival_date_month)%>%
  n_distinct()

bookings%>%
  select(hotel, arrival_date_year, arrival_date_month)%>%
  distinct()%>%
  slice(c(1, d, d+1, d*2))

Lets suppose we want to know whether or not the two hotels offer the same types of rooms? This is another query of the sort Which values of X are nested in y?

bookings%>%
  group_by(hotel)%>%
  count(reserved_room_type)

In this case, however, it is tough to directly compare - it appears that the hotel-roomtype pairs are not as consistent as the year-month pairs for the same hotels. A quick pivot-wider makes this comparison a little easier to visualize. Here we can see that the Resort Hotel has two additional room types: “H” and “L”.

bookings%>%
  group_by(hotel)%>%
  count(reserved_room_type)%>%
  pivot_wider(names_from= hotel, values_from = n)

What is the average of Y for group X?

The breakdown of rooms by hotel doesn’t shed much light on the room codes and what they might mean. Lets see if we can find average number of occupants and average price for each room type, and see if we can learn more about our data.

mean(., na.rm=TRUE)

I am using the mean function with the option na.rm=TRUE to deal with the four NA values in the children field, identified in the summary table above.

t1<-bookings%>%
  group_by(hotel, reserved_room_type)%>%
  summarise(price = mean(adr),
            adults = mean(adults),
            children = mean(children+babies, na.rm=TRUE)
            )%>%
  pivot_wider(names_from= hotel, 
              values_from = c(price, adults, children))

knitr::kable(t1,
             digits=1,
             col.names = c("Type", "City", "Resort",
                           "City", "Resort", "City", "Resort"))%>%
  kableExtra::kable_styling(htmltable_class = "lightable-minimal")%>%
  kableExtra::add_header_above(c("Room" = 1, "Price" = 2,
                                 "Adults" = 2, "Children & Babies" = 2))
Error in loadNamespace(x): there is no package called 'kableExtra'
kable & kableExtra

I manually adjust table formatting (column names, plus adding a header row) using kable and kableExtra package, respectively. Also, because df-print=paged is the option set in the YAML header, I need to directly specify that I want to produce an htmltable - not a default kable/rmarkdown table.

Based on these descriptives broken down by hotel and room type, we can speculate that the “H” and “L” room types at the resort are likely some sort of multi-bedroom suite (because the average number of adults is over 2.) Similarly, we can speculate that the difference between ABC and DEF may be something related to room size or quality (e.g., number and size of beds) and/or related to meals included with the rooms - but this would require further investigation to pin down!

Go further

There is lots more to explore in the hotel bookings dataset, but it will be a lot easier once we recode the date fields using lubridate.