Challenge 2 Hotel Bookings

challenge_2
hotel_bookings
Author

Kekai Liu

Published

February 26, 2023

Code
library(tidyverse)
library(kableExtra)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

  • railroad*.csv or StateCounty2012.xls ⭐
  • FAOstat*.csv or birds.csv ⭐⭐⭐
  • hotel_bookings.csv ⭐⭐⭐⭐
Code
bookings <- read_csv("_data/hotel_bookings.csv") #read in the data; no special procedures are required as data is clean

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

The hotel_bookings.csv contains 119,390 rows and 32 columns. A case corresponds to a hotel booking at a hotel, and the data covers hotel bookings at the City Hotel and Resort Hotel from 2015 to 2017. The columns correspond to variables providing descriptive and numeric information on each booking, such as the lead time (which indicates how far ahead of the intended stay date the booking was made), length of stay, the reservation history of the guest, the number of guests, the type of guests, country of origin of guests, additional perks requested, the status and type of the booking, the agent and/or company used to book (if applicable), and etc.

Code
str(bookings) #obtain a summary of data's variables
spc_tbl_ [119,390 × 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ hotel                         : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
 $ is_canceled                   : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
 $ lead_time                     : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
 $ arrival_date_year             : num [1:119390] 2015 2015 2015 2015 2015 ...
 $ arrival_date_month            : chr [1:119390] "July" "July" "July" "July" ...
 $ arrival_date_week_number      : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
 $ arrival_date_day_of_month     : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
 $ stays_in_weekend_nights       : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ stays_in_week_nights          : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
 $ adults                        : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
 $ children                      : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ babies                        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ meal                          : chr [1:119390] "BB" "BB" "BB" "BB" ...
 $ country                       : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
 $ market_segment                : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
 $ distribution_channel          : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
 $ is_repeated_guest             : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ previous_cancellations        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ reserved_room_type            : chr [1:119390] "C" "C" "A" "A" ...
 $ assigned_room_type            : chr [1:119390] "C" "C" "C" "A" ...
 $ booking_changes               : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
 $ deposit_type                  : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
 $ agent                         : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
 $ company                       : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
 $ days_in_waiting_list          : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ customer_type                 : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
 $ adr                           : num [1:119390] 0 0 75 75 98 ...
 $ required_car_parking_spaces   : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ total_of_special_requests     : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
 $ reservation_status            : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
 $ reservation_status_date       : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
 - attr(*, "spec")=
  .. cols(
  ..   hotel = col_character(),
  ..   is_canceled = col_double(),
  ..   lead_time = col_double(),
  ..   arrival_date_year = col_double(),
  ..   arrival_date_month = col_character(),
  ..   arrival_date_week_number = col_double(),
  ..   arrival_date_day_of_month = col_double(),
  ..   stays_in_weekend_nights = col_double(),
  ..   stays_in_week_nights = col_double(),
  ..   adults = col_double(),
  ..   children = col_double(),
  ..   babies = col_double(),
  ..   meal = col_character(),
  ..   country = col_character(),
  ..   market_segment = col_character(),
  ..   distribution_channel = col_character(),
  ..   is_repeated_guest = col_double(),
  ..   previous_cancellations = col_double(),
  ..   previous_bookings_not_canceled = col_double(),
  ..   reserved_room_type = col_character(),
  ..   assigned_room_type = col_character(),
  ..   booking_changes = col_double(),
  ..   deposit_type = col_character(),
  ..   agent = col_character(),
  ..   company = col_character(),
  ..   days_in_waiting_list = col_double(),
  ..   customer_type = col_character(),
  ..   adr = col_double(),
  ..   required_car_parking_spaces = col_double(),
  ..   total_of_special_requests = col_double(),
  ..   reservation_status = col_character(),
  ..   reservation_status_date = col_date(format = "")
  .. )
 - attr(*, "problems")=<externalptr> 
Code
#create for loop of table() to create frequency tables of all variables in data

#for (i in colnames(bookings)){
  #print(table(select(bookings, i)))
#}

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

Code
#Recode arrival_date_month to numeric variable for convenience

bookings$arrival_date_month <- recode(bookings$arrival_date_month, "January" = 1, "February" = 2, "March" = 3, "April" = 4, "May" = 5, "June" = 6, "July" = 7, "August" = 8, "September" = 9, "October" = 10, "November" = 11, "December" = 12)

#Concatenate arrival_date_month and arrival_date_year

bookings$arrival_date_month <- as.character(bookings$arrival_date_month)
bookings$arrival_date_year <- as.character(bookings$arrival_date_year)
bookings$arrival_date <- paste(bookings$arrival_date_year, bookings$arrival_date_month)

#Calculate total days booked

bookings$total_days_stays = bookings$stays_in_weekend_nights + bookings$stays_in_week_nights

#Calculate total guests

bookings$total_guests = bookings$adults + bookings$children + bookings$babies

#Calculate total number of children and babies

bookings$under_18_guests = bookings$children + bookings$babies

City Hotel has had more bookings than Resort Hotel for the entire period of the data; although the number of bookings differ, bookings at both hotels follow a similar seasonal pattern.

Code
#Find the total number of bookings at each hotel per each year-month

bookings %>%
  group_by(hotel, arrival_date) %>% #want to count the number of reservations at each hotelin a year and month
  mutate(count = n()) %>% #this does the counting
  select(hotel, arrival_date, count) %>% #this subsets the data into the columns we are interested in
  ungroup() %>%
  group_by(hotel, arrival_date) %>% #only want to keep top country per hotel per year and month
  arrange(hotel, arrival_date, desc(count)) %>% #order from most reservations to least by hotel, year, month
  unique() %>% #eliminate duplicate rows 
  filter(row_number()==1) %>%
  ggplot(aes(x=arrival_date, y=count, group=hotel, color=hotel)) +
  geom_line() +
  theme(axis.text=element_text(size=4.5)) +
  ggtitle("Total Bookings in Year/Month") +
  xlab("Year/Month") + ylab("Number of Bookings")

City Hotel has had more cancellations than Resort Hotel for the entire period of the data; although the number of cancellations differ, cancellations at both hotels follow a similar seasonal pattern.

Code
#Find the total number of cancellations at each hotel per each year-month

bookings_canceled <- bookings[bookings$is_canceled==1,] #subset to cancellations

bookings_canceled %>%
  group_by(hotel, arrival_date, is_canceled) %>% #want to count the number of cancellations
  mutate(count = n()) %>% #this does the counting
  select(hotel, arrival_date, is_canceled, count) %>% #this subsets the data into the columns we are interested in
  ungroup() %>%
  group_by(hotel, arrival_date) %>% 
  arrange(hotel, arrival_date, desc(count)) %>% #order from most reservations to least by hotel, year, month
  unique() %>% #eliminate duplicate rows 
  filter(row_number()==1) %>%
  ggplot(aes(x=arrival_date, y=count, group=hotel, color=hotel)) +
  geom_line() +
  theme(axis.text=element_text(size=4.5)) +
  ggtitle("Total Cancelled Bookings in Year/Month") +
  xlab("Year/Month") + ylab("Number of Cancellations")

Portugal top bookings for both City Hotel and Resort Hotel.

Code
#Find the top ten countries that book the most at each hotel

city <- bookings[bookings$hotel=="City Hotel",]

resort <- bookings[bookings$hotel=="Resort Hotel",]

city %>%
   group_by(country) %>% #want to count the number of reservations per country
   mutate(count = n()) %>% #this does the counting
   select(country, count) %>% #this subsets the data into the columns we are interested in
  arrange(desc(count)) %>% #order from most reservations to least
   unique() %>% #eliminate duplicate rows 
  ungroup() %>%
   slice(1:10) %>% #identify the top five countries with the most reservations in a year and month for each hotel %>%
   ggplot(aes(y=count, x=reorder(country, -count))) +
  geom_bar(stat="identity") +
  ggtitle("Top 10 Bookings by Country of Origin at City Hotel") +
  xlab("Year/Month") + ylab("Number of Bookings")

Code
resort %>%
   group_by(country) %>% #want to count the number of reservations per country
   mutate(count = n()) %>% #this does the counting
   select(country, count) %>% #this subsets the data into the columns we are interested in
  arrange(desc(count)) %>% #order from most reservations to least
   unique() %>% #eliminate duplicate rows 
  ungroup() %>%
   slice(1:10) %>% #identify the top five countries with the most reservations in a year and month for each hotel %>%
   ggplot(aes(y=count, x=reorder(country, -count))) +
  geom_bar(stat="identity") +
  ggtitle("Top 10 Bookings by Country of Origin at Resort Hotel") +
  xlab("Year/Month") + ylab("Number of Bookings")

For both City Hotel and Resort Hotel, transient customers book the most, followed by transient-party, contract, and group.

Code
#Find the most common customer_type at each hotel

city %>%
  group_by(customer_type) %>%
  mutate(count = n()) %>%
  select(customer_type, count) %>%
  arrange(desc(count)) %>%
  unique() %>% #eliminate duplicate rows
  ggplot(aes(y=count, x=reorder(customer_type, -count))) +
  geom_bar(stat="identity") +
  ggtitle("Bookings by Customer Type at City Hotel") +
  xlab("Year/Month") + ylab("Number of Bookings")

Code
resort %>%
  group_by(customer_type) %>%
  mutate(count = n()) %>%
  select(customer_type, count) %>%
  arrange(desc(count)) %>%
  unique() %>% #eliminate duplicate rows
  ungroup() %>%
  ggplot(aes(y=count, x=reorder(customer_type, -count))) +
  geom_bar(stat="identity") +
  ggtitle("Bookings by Customer Type at Resort Hotel") +
  xlab("Year/Month") + ylab("Number of Bookings")

Code
#Find the mean and standard deviation days_in_waiting_list for hotel

bookings %>% 
  group_by(hotel) %>% #group data by hotel, year, month
  select(hotel, days_in_waiting_list) %>% #keep the variables we are interested in
  summarise(mean=mean(days_in_waiting_list), sd=sd(days_in_waiting_list), na.rm=TRUE) %>% #calculate the mean by hotel
  kbl()
hotel mean sd na.rm
City Hotel 3.2267742 20.87089 TRUE
Resort Hotel 0.5277584 7.42858 TRUE
Code
boxplot(days_in_waiting_list~hotel,data=bookings, main="Days in Waiting List",xlab="Customer Type", ylab="Days")

Code
bookings %>% 
  group_by(hotel) %>% #group data by hotel, year, month
  select(hotel, days_in_waiting_list) %>% #keep the variables we are interested in
  summarise(n = n(),
            min = fivenum(days_in_waiting_list)[1],
            Q1 = fivenum(days_in_waiting_list)[2],
            median = fivenum(days_in_waiting_list)[3],
            Q3 = fivenum(days_in_waiting_list)[4],
            max = fivenum(days_in_waiting_list)[5]) %>%
  kbl()
hotel n min Q1 median Q3 max
City Hotel 79330 0 0 0 0 391
Resort Hotel 40060 0 0 0 0 185
Code
#Find the mean and standard deviation days_in_waiting_list for hotel by customer_type

bookings %>% 
  group_by(hotel, customer_type) %>%
    select(hotel, customer_type, days_in_waiting_list) %>% #keep the variables we are interested in
  summarise(mean=mean(days_in_waiting_list), sd=sd(days_in_waiting_list), na.rm=TRUE) %>% #calculate the mean by hotel and customer_type
  kbl()
hotel customer_type mean sd na.rm
City Hotel Contract 0.0000000 0.000000 TRUE
City Hotel Group 0.3003413 3.629016 TRUE
City Hotel Transient 1.9799845 15.139183 TRUE
City Hotel Transient-Party 7.9774419 34.329947 TRUE
Resort Hotel Contract 0.0906532 2.070096 TRUE
Resort Hotel Group 0.4401408 7.417385 TRUE
Resort Hotel Transient 0.0129101 1.003255 TRUE
Resort Hotel Transient-Party 2.6268772 16.474777 TRUE
Code
bookings %>% 
  group_by(hotel, customer_type) %>%
    select(hotel, customer_type, days_in_waiting_list) %>% #keep the variables we are interested in
   summarise(n = n(),
            min = fivenum(days_in_waiting_list)[1],
            Q1 = fivenum(days_in_waiting_list)[2],
            median = fivenum(days_in_waiting_list)[3],
            Q3 = fivenum(days_in_waiting_list)[4],
            max = fivenum(days_in_waiting_list)[5]) %>%
  kbl()
hotel customer_type n min Q1 median Q3 max
City Hotel Contract 2300 0 0 0 0 0
City Hotel Group 293 0 0 0 0 44
City Hotel Transient 59404 0 0 0 0 391
City Hotel Transient-Party 17333 0 0 0 0 379
Resort Hotel Contract 1776 0 0 0 0 50
Resort Hotel Group 284 0 0 0 0 125
Resort Hotel Transient 30209 0 0 0 0 122
Resort Hotel Transient-Party 7791 0 0 0 0 185

Explain and Interpret

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.

First, we compared the mean and standard deviation of days_in_waiting_list between City and Resort Hotel. Comparing the number of days in the waiting list between hotels is helpful, as it provides customers information on how quickly they can expect to get a room. The mean tells us that on average, customers can expect to wait three days at City Hotel versus half a day at Resort Hotel. The standard deviation of days_in_waiting_list for City Hotel is about three times higher than Resort Hotel. This implies that the wait time at City Hotel is also less consistent than Resort Hotel; it can vary widely.

The plot shows that City Hotel has a wider range and higher max days in waiting list value. This corresponds to what was observed from the standard deviation. So, customers at the City Hotel can expect to wait significantly longer than customers at the Resort Hotel.

Second, we compared the mean and standard deviation of days_in_waiting_list between City and Resort Hotel broken down by customer type. Breaking this comparison down by customer type is helpful, because it reveals whether certain customer types are more likely to wait longer. The means reveal that Transient-Party customers wait the longest for both City Hotel and Resort Hotel; City Hotel Transient-Party customers wait on average 8 days while Resort Hotel Transient-Party customers wait on average 2 and a half days. The standard deviation for City Hotel Transient-Party customers is twice that of the Resort Hotel Transient-Party customers, meaning that the average wait time for City Hotel Transient-Party customers varies by a large margin.

Finally, the five number summary consisting of the minimum, first quartile, median, third quartile, and maximum reveals that the majority of customers (irregardless of customer type) do not wait. The third quartile is the 75th percentile, meaning that 75% of customers fall under this value, which happens to be 0 across all customer types and hotels. This means 75% of customers do not wait. The maximum reveals that Transient and Transient-Party customers at City Hotel account for the largest outliers - 391 and 379 days in the waiting list respectively - compared to 122 and 185 at Resort Hotel. These outliers are pulling up the mean days in waiting list upward for City Hotel; this explains why the mean and standard deviation at City Hotel is much higher than Resort Hotel.