challenge_2
hotel_bookings
Data wrangling: using group() and summarise()
Author

Ananya Pujary

Published

August 16, 2022

Code
library(tidyverse)
library(dplyr)
library(skimr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in the Data

I’ll be working with the ‘hotel_bookings.csv’ dataset.

Code
hotelbookings <- read_csv("_data/hotel_bookings.csv")

Describe the data

First, we’ll generate a broad overview of the data.

Code
skim(hotelbookings)
Data summary
Name hotelbookings
Number of rows 119390
Number of columns 32
_______________________
Column type frequency:
character 13
Date 1
numeric 18
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
hotel 0 1 10 12 0 2 0
arrival_date_month 0 1 3 9 0 12 0
meal 0 1 2 9 0 5 0
country 0 1 2 4 0 178 0
market_segment 0 1 6 13 0 8 0
distribution_channel 0 1 3 9 0 5 0
reserved_room_type 0 1 1 1 0 10 0
assigned_room_type 0 1 1 1 0 12 0
deposit_type 0 1 10 10 0 3 0
agent 0 1 1 4 0 334 0
company 0 1 1 4 0 353 0
customer_type 0 1 5 15 0 4 0
reservation_status 0 1 7 9 0 3 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
reservation_status_date 0 1 2014-10-17 2017-09-14 2016-08-07 926

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
is_canceled 0 1 0.37 0.48 0.00 0.00 0.00 1 1 ▇▁▁▁▅
lead_time 0 1 104.01 106.86 0.00 18.00 69.00 160 737 ▇▂▁▁▁
arrival_date_year 0 1 2016.16 0.71 2015.00 2016.00 2016.00 2017 2017 ▃▁▇▁▆
arrival_date_week_number 0 1 27.17 13.61 1.00 16.00 28.00 38 53 ▅▇▇▇▅
arrival_date_day_of_month 0 1 15.80 8.78 1.00 8.00 16.00 23 31 ▇▇▇▇▆
stays_in_weekend_nights 0 1 0.93 1.00 0.00 0.00 1.00 2 19 ▇▁▁▁▁
stays_in_week_nights 0 1 2.50 1.91 0.00 1.00 2.00 3 50 ▇▁▁▁▁
adults 0 1 1.86 0.58 0.00 2.00 2.00 2 55 ▇▁▁▁▁
children 4 1 0.10 0.40 0.00 0.00 0.00 0 10 ▇▁▁▁▁
babies 0 1 0.01 0.10 0.00 0.00 0.00 0 10 ▇▁▁▁▁
is_repeated_guest 0 1 0.03 0.18 0.00 0.00 0.00 0 1 ▇▁▁▁▁
previous_cancellations 0 1 0.09 0.84 0.00 0.00 0.00 0 26 ▇▁▁▁▁
previous_bookings_not_canceled 0 1 0.14 1.50 0.00 0.00 0.00 0 72 ▇▁▁▁▁
booking_changes 0 1 0.22 0.65 0.00 0.00 0.00 0 21 ▇▁▁▁▁
days_in_waiting_list 0 1 2.32 17.59 0.00 0.00 0.00 0 391 ▇▁▁▁▁
adr 0 1 101.83 50.54 -6.38 69.29 94.58 126 5400 ▇▁▁▁▁
required_car_parking_spaces 0 1 0.06 0.25 0.00 0.00 0.00 0 8 ▇▁▁▁▁
total_of_special_requests 0 1 0.57 0.79 0.00 0.00 0.00 1 5 ▇▁▁▁▁

This dataset has 32 variables and 119390 cases. There are 14 columns of character type, 1 of date type (‘reservation_status_date’), and 18 of numeric type. Only the ‘children’ column has missing values (4).

Code
distinct(hotelbookings,arrival_date_year) 
# A tibble: 3 × 1
  arrival_date_year
              <dbl>
1              2015
2              2016
3              2017
Code
distinct(hotelbookings,is_canceled)
# A tibble: 2 × 1
  is_canceled
        <dbl>
1           0
2           1

From the variable names, this dataset seems to be logging the data of a hotel’s bookings for the years 2015-2017, such as their customers’ arrival date, reservation status, and the number of adults/children/babies checking in. The variable ‘is_canceled’ has binary values and indicates whether a booking has been canceled (0) or not (1).

Code
distinct(hotelbookings,reservation_status)
# A tibble: 3 × 1
  reservation_status
  <chr>             
1 Check-Out         
2 Canceled          
3 No-Show           
Code
distinct(hotelbookings,hotel) 
# A tibble: 2 × 1
  hotel       
  <chr>       
1 Resort Hotel
2 City Hotel  
Code
hotelbookings %>%
  group_by(hotel) %>%
  tally()
# A tibble: 2 × 2
  hotel            n
  <chr>        <int>
1 City Hotel   79330
2 Resort Hotel 40060

The ‘reservation_status’ variable has three values: ‘Check-Out’, ‘Canceled’, and ‘No-Show’. There are two types of hotels from which data are collected: ‘Resort Hotel’ and ‘City Hotel’. 79330 of the rows contain data related to the City Hotel, while 40060 rows are related to the Resort Hotel.

Provide Grouped Summary Statistics

The following command returns central tendency and dispersion values for the numeric values in the dataset.

Code
summary(hotelbookings)
    hotel            is_canceled       lead_time   arrival_date_year
 Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
 Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
 Mode  :character   Median :0.0000   Median : 69   Median :2016     
                    Mean   :0.3704   Mean   :104   Mean   :2016     
                    3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
                    Max.   :1.0000   Max.   :737   Max.   :2017     
                                                                    
 arrival_date_month arrival_date_week_number arrival_date_day_of_month
 Length:119390      Min.   : 1.00            Min.   : 1.0             
 Class :character   1st Qu.:16.00            1st Qu.: 8.0             
 Mode  :character   Median :28.00            Median :16.0             
                    Mean   :27.17            Mean   :15.8             
                    3rd Qu.:38.00            3rd Qu.:23.0             
                    Max.   :53.00            Max.   :31.0             
                                                                      
 stays_in_weekend_nights stays_in_week_nights     adults      
 Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
 1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
 Median : 1.0000         Median : 2.0         Median : 2.000  
 Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
 3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
 Max.   :19.0000         Max.   :50.0         Max.   :55.000  
                                                              
    children           babies              meal             country         
 Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
 1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
 Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
 Mean   : 0.1039   Mean   : 0.007949                                        
 3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
 Max.   :10.0000   Max.   :10.000000                                        
 NA's   :4                                                                  
 market_segment     distribution_channel is_repeated_guest
 Length:119390      Length:119390        Min.   :0.00000  
 Class :character   Class :character     1st Qu.:0.00000  
 Mode  :character   Mode  :character     Median :0.00000  
                                         Mean   :0.03191  
                                         3rd Qu.:0.00000  
                                         Max.   :1.00000  
                                                          
 previous_cancellations previous_bookings_not_canceled reserved_room_type
 Min.   : 0.00000       Min.   : 0.0000                Length:119390     
 1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
 Median : 0.00000       Median : 0.0000                Mode  :character  
 Mean   : 0.08712       Mean   : 0.1371                                  
 3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
 Max.   :26.00000       Max.   :72.0000                                  
                                                                         
 assigned_room_type booking_changes   deposit_type          agent          
 Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
 Class :character   1st Qu.: 0.0000   Class :character   Class :character  
 Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
                    Mean   : 0.2211                                        
                    3rd Qu.: 0.0000                                        
                    Max.   :21.0000                                        
                                                                           
   company          days_in_waiting_list customer_type           adr         
 Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
 Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
 Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
                    Mean   :  2.321                         Mean   : 101.83  
                    3rd Qu.:  0.000                         3rd Qu.: 126.00  
                    Max.   :391.000                         Max.   :5400.00  
                                                                             
 required_car_parking_spaces total_of_special_requests reservation_status
 Min.   :0.00000             Min.   :0.0000            Length:119390     
 1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
 Median :0.00000             Median :0.0000            Mode  :character  
 Mean   :0.06252             Mean   :0.5714                              
 3rd Qu.:0.00000             3rd Qu.:1.0000                              
 Max.   :8.00000             Max.   :5.0000                              
                                                                         
 reservation_status_date
 Min.   :2014-10-17     
 1st Qu.:2016-02-01     
 Median :2016-08-07     
 Mean   :2016-07-30     
 3rd Qu.:2017-02-08     
 Max.   :2017-09-14     
                        
Code
# calculating the number of cancellations and number of repeat guests grouped by hotel
hotel_stats<-hotelbookings %>%
  select(hotel,is_canceled,is_repeated_guest) %>%
  group_by(hotel)%>%
  summarize(cancellations = sum(is_canceled),guests_repeated = sum(is_repeated_guest))

hotelbookings$room_assignment = (ifelse(hotelbookings$reserved_room_type==hotelbookings$assigned_room_type, 'Same', 'Different'))

rooms <- hotelbookings %>%
  select(hotel,room_assignment) %>%
  group_by(hotel,room_assignment) %>%
  tally()

City Hotel saw 33102 cancellations and Resort Hotel had 11122 cancellations during this time period. City Hotel also had more repeat guests (2032) than the Resort Hotel (1778). Also, I created a new column in the dataset called ‘room_assignment’ that indicates whether guests got their preferred room assignment (‘Same’) or not (‘Different’). City Hotel guests were less likely to be assigned a room not of their preference (7192) than Resort Hotel guests (7725).

Code
hotel_reservations <- hotelbookings %>%
  select(hotel,reservation_status)%>%
  group_by(hotel, reservation_status)%>%
  tally()

Then, I compared reservation statuses (‘Canceled, ’Check-Out’,‘No Show’) across hotels. For both, the number of guests checked out were more than those that cancelled or didn’t show up.

Code
hotelbookings %>% 
  select(hotel,arrival_date_year,arrival_date_month) %>% 
  group_by(hotel) %>% 
  count(arrival_date_year,arrival_date_month) %>%
  slice(which.max(n))
# A tibble: 2 × 4
# Groups:   hotel [2]
  hotel        arrival_date_year arrival_date_month     n
  <chr>                    <dbl> <chr>              <int>
1 City Hotel                2017 May                 4556
2 Resort Hotel              2016 October             1984

I also wanted to look at which month-year was the busiest in terms of bookings for both hotels. The busiest month was May 2017 for City Hotel (4556), and October 2016 for Resort Hotel (1984).

Explain and Interpret

I chose certain variables that would objectively indicate which hotel (City Hotel or Resort Hotel) was more successful, such as the number of cancellations and reservation status. However, City Hotel had more entries in the dataset than Resort Hotel, which could skew the results. I also wanted to compare which hotel assigned most customers their preferred room type (reserved_room_type versus assigned_room_type) which could be contributing to customer satisfaction, and found that City Hotel was better in this aspect. It would also be interesting to look at both hotels’ statistics through the years (2015-2017) to check whether their businesses have improved or declined.