challenge_2
Teresa Lardo
hotel_bookings
Grouping a dataset on hotel bookings and providing summary statistics
Author

Teresa Lardo

Published

February 22, 2023

Read in the Data

For this challenge, I’ll read in the csv file on hotel bookings.

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
library(readr)
hotel_bookings <- read_csv("_data/hotel_bookings.csv", show_col_types = FALSE)

First 6 Rows

Code
head(hotel_bookings)
# A tibble: 6 × 32
  hotel   is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
  <chr>     <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
1 Resort…       0     342    2015 July         27       1       0       0      2
2 Resort…       0     737    2015 July         27       1       0       0      2
3 Resort…       0       7    2015 July         27       1       0       1      1
4 Resort…       0      13    2015 July         27       1       0       1      1
5 Resort…       0      14    2015 July         27       1       0       2      2
6 Resort…       0      14    2015 July         27       1       0       2      2
# … with 22 more variables: children <dbl>, babies <dbl>, meal <chr>,
#   country <chr>, market_segment <chr>, distribution_channel <chr>,
#   is_repeated_guest <dbl>, previous_cancellations <dbl>,
#   previous_bookings_not_canceled <dbl>, reserved_room_type <chr>,
#   assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>,
#   agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …

Last 6 Rows

Code
tail(hotel_bookings)
# A tibble: 6 × 32
  hotel   is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
  <chr>     <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
1 City H…       0      21    2017 August       35      30       2       5      2
2 City H…       0      23    2017 August       35      30       2       5      2
3 City H…       0     102    2017 August       35      31       2       5      3
4 City H…       0      34    2017 August       35      31       2       5      2
5 City H…       0     109    2017 August       35      31       2       5      2
6 City H…       0     205    2017 August       35      29       2       7      2
# … with 22 more variables: children <dbl>, babies <dbl>, meal <chr>,
#   country <chr>, market_segment <chr>, distribution_channel <chr>,
#   is_repeated_guest <dbl>, previous_cancellations <dbl>,
#   previous_bookings_not_canceled <dbl>, reserved_room_type <chr>,
#   assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>,
#   agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …

Describe the data

Dimensions

Code
#Find out the dimensions of the tibble
dim(hotel_bookings)
[1] 119390     32

Unique Hotels

Code
#Find out which hotels are listed in the hotel column
unique(hotel_bookings$hotel)
[1] "Resort Hotel" "City Hotel"  

The cases in this dataset are just under 120,000 individual bookings at two hotels (Resort Hotel and City Hotel) over the time period July 2015 through August 2017. There are 32 variables in this dataset, including:

- meal packages,

- how far in advance the booking was created (lead time),

- average daily rate,

- whether a booking was canceled or not,

- type of room reserved/assigned, and

- how many adults, children, and babies are associated with each booking.

Because each variable ought to have a column, we can look at all the variables in this dataset by using the colnames() function.

Code
colnames(hotel_bookings)
 [1] "hotel"                          "is_canceled"                   
 [3] "lead_time"                      "arrival_date_year"             
 [5] "arrival_date_month"             "arrival_date_week_number"      
 [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
 [9] "stays_in_week_nights"           "adults"                        
[11] "children"                       "babies"                        
[13] "meal"                           "country"                       
[15] "market_segment"                 "distribution_channel"          
[17] "is_repeated_guest"              "previous_cancellations"        
[19] "previous_bookings_not_canceled" "reserved_room_type"            
[21] "assigned_room_type"             "booking_changes"               
[23] "deposit_type"                   "agent"                         
[25] "company"                        "days_in_waiting_list"          
[27] "customer_type"                  "adr"                           
[29] "required_car_parking_spaces"    "total_of_special_requests"     
[31] "reservation_status"             "reservation_status_date"       

The “is_canceled” and “reservation_status” columns clue us in to the fact that the cases are bookings made but not necessarily actual stays at the hotels, as some customers canceled or no-showed. This indicates that the data was most likely gathered from the booking systems of the hotels themselves.

Code
# Show unique values under is_canceled column
unique(hotel_bookings$is_canceled)
[1] 0 1
Code
# Show unique values under reservation_status column
unique(hotel_bookings$reservation_status)
[1] "Check-Out" "Canceled"  "No-Show"  

Grouped Summary Statistics

Because this dataset has so many variables, there are many potential ways to group data. We could group our data by hotel, by bookings that turned into actual hotel stays (where reservation_status is ‘Check-Out’), by bookings that include children or babies (where the children or babies column is not ‘0’), by group bookings (where market_segment is ‘groups’), by bookings with a full deposit (where deposit_type is ‘Non Refund’), by bookings for which the reserved and assign room types differ (where reserved_room_type is not the same as the value in assigned_room_type), by bookings with different types of meal packages, etc.

When are the hotels in demand?

Let’s look at average lead time (number of days between booking and anticipated arrival date) to get clues into when during the year each hotel is more likely to be booked far in advanced. Let’s also filter out bookings that were cancelled.

Code
library(dplyr)
hotel_bookings %>%
  filter(reservation_status == "Check-Out") %>%
  group_by(arrival_date_month, hotel) %>%
  select(lead_time) %>%
  summarize_all(mean) %>%
  ungroup() %>%
  arrange(desc(lead_time))
# A tibble: 24 × 3
   arrival_date_month hotel        lead_time
   <chr>              <chr>            <dbl>
 1 September          Resort Hotel     146. 
 2 July               City Hotel       133. 
 3 June               Resort Hotel     126. 
 4 May                Resort Hotel     107. 
 5 June               City Hotel       103. 
 6 October            Resort Hotel     103. 
 7 August             City Hotel       101. 
 8 July               Resort Hotel      96.2
 9 August             Resort Hotel      92.8
10 May                City Hotel        90.0
# … with 14 more rows

Reservations for September at Resort Hotel have the highest average lead time, with July reservations at City Hotel following behind. The lowest average lead times for City Hotel are in January, February, and Dcember. The lowest average lead times for Resort Hotel are in November, February, and January.

Central Tendency and Dispersion

Despite the 32 different variables in this dataset, only 13 columns can be directly used to get means, medians, and modes. Other columns may appear to be numeric but answer Yes/No questions with 0 or 1, or list numbers referring to dates (such as week of the year or day of the month), which would not benefit from being averaged.

Hotel Stays

I will filter for bookings that actually turned into hotel stays (where the reservation status is “Check-Out” as opposed to “No show” or “Canceled”).

Code
#Create a vector containing the indices of the columns for summary statistics
stat_cols <- c(3, 8:12, 18, 19, 22, 26, 28:30)

#Load summarytools package to us descr()
library(summarytools)

hotel_stays <- hotel_bookings %>% 
  filter(reservation_status == "Check-Out")

descr(hotel_stays[stat_cols])
Descriptive Statistics  
hotel_stays  
N: 75166  

                         adr     adults     babies   booking_changes   children
----------------- ---------- ---------- ---------- ----------------- ----------
             Mean      99.99       1.83       0.01              0.29       0.10
          Std.Dev      49.21       0.51       0.11              0.74       0.39
              Min      -6.38       0.00       0.00              0.00       0.00
               Q1      67.50       2.00       0.00              0.00       0.00
           Median      92.50       2.00       0.00              0.00       0.00
               Q3     125.00       2.00       0.00              0.00       0.00
              Max     510.00       4.00      10.00             21.00       3.00
              MAD      40.77       0.00       0.00              0.00       0.00
              IQR      57.50       0.00       0.00              0.00       0.00
               CV       0.49       0.28      10.89              2.51       3.82
         Skewness       0.96      -0.39      23.40              5.40       4.05
      SE.Skewness       0.01       0.01       0.01              0.01       0.01
         Kurtosis       2.04       0.86    1416.23             66.61      16.20
          N.Valid   75166.00   75166.00   75166.00          75166.00   75166.00
        Pct.Valid     100.00     100.00     100.00            100.00     100.00

Table: Table continues below

 

                    days_in_waiting_list   lead_time   previous_bookings_not_canceled
----------------- ---------------------- ----------- --------------------------------
             Mean                   1.59       79.98                             0.20
          Std.Dev                  14.78       91.11                             1.81
              Min                   0.00        0.00                             0.00
               Q1                   0.00        9.00                             0.00
           Median                   0.00       45.00                             0.00
               Q3                   0.00      124.00                             0.00
              Max                 379.00      737.00                            72.00
              MAD                   0.00       62.27                             0.00
              IQR                   0.00      115.00                             0.00
               CV                   9.30        1.14                             8.92
         Skewness                  12.60        1.53                            19.60
      SE.Skewness                   0.01        0.01                             0.01
         Kurtosis                 192.72        2.29                           534.16
          N.Valid               75166.00    75166.00                         75166.00
        Pct.Valid                 100.00      100.00                           100.00

Table: Table continues below

 

                    previous_cancellations   required_car_parking_spaces   stays_in_week_nights
----------------- ------------------------ ----------------------------- ----------------------
             Mean                     0.02                          0.10                   2.46
          Std.Dev                     0.27                          0.30                   1.92
              Min                     0.00                          0.00                   0.00
               Q1                     0.00                          0.00                   1.00
           Median                     0.00                          0.00                   2.00
               Q3                     0.00                          0.00                   3.00
              Max                    13.00                          8.00                  50.00
              MAD                     0.00                          0.00                   1.48
              IQR                     0.00                          0.00                   2.00
               CV                    17.25                          3.05                   0.78
         Skewness                    29.14                          3.14                   2.75
      SE.Skewness                     0.01                          0.01                   0.01
         Kurtosis                  1043.46                         17.84                  25.61
          N.Valid                 75166.00                      75166.00               75166.00
        Pct.Valid                   100.00                        100.00                 100.00

Table: Table continues below

 

                    stays_in_weekend_nights   total_of_special_requests
----------------- ------------------------- ---------------------------
             Mean                      0.93                        0.71
          Std.Dev                      0.99                        0.83
              Min                      0.00                        0.00
               Q1                      0.00                        0.00
           Median                      1.00                        1.00
               Q3                      2.00                        1.00
              Max                     19.00                        5.00
              MAD                      1.48                        1.48
              IQR                      2.00                        1.00
               CV                      1.07                        1.17
         Skewness                      1.40                        1.08
      SE.Skewness                      0.01                        0.01
         Kurtosis                      8.36                        0.90
          N.Valid                  75166.00                    75166.00
        Pct.Valid                    100.00                      100.00

Cancelled Bookings

To contrast with the summary statistics of the bookings that turned into hotel stays, let’s filter for only cancelled bookings at the two hotels.

Code
canceled_bookings <- hotel_bookings %>%
  filter(is_canceled == 1) 

descr(canceled_bookings[stat_cols])
Descriptive Statistics  
canceled_bookings  
N: 44224  

                         adr     adults     babies   booking_changes   children
----------------- ---------- ---------- ---------- ----------------- ----------
             Mean     104.96       1.90       0.00              0.10       0.11
          Std.Dev      52.57       0.68       0.06              0.45       0.41
              Min       0.00       0.00       0.00              0.00       0.00
               Q1      72.41       2.00       0.00              0.00       0.00
           Median      96.20       2.00       0.00              0.00       0.00
               Q3     127.62       2.00       0.00              0.00       0.00
              Max    5400.00      55.00       2.00             16.00      10.00
              MAD      40.22       0.00       0.00              0.00       0.00
              IQR      55.21       0.00       0.00              0.00       0.00
               CV       0.50       0.36      16.34              4.59       3.86
         Skewness      23.89      31.03      16.64              7.75       4.19
      SE.Skewness       0.01       0.01       0.01              0.01       0.01
         Kurtosis    2327.31    1935.19     286.17            111.72      21.95
          N.Valid   44224.00   44224.00   44224.00          44224.00   44220.00
        Pct.Valid     100.00     100.00     100.00            100.00      99.99

Table: Table continues below

 

                    days_in_waiting_list   lead_time   previous_bookings_not_canceled
----------------- ---------------------- ----------- --------------------------------
             Mean                   3.56      144.85                             0.03
          Std.Dev                  21.49      118.62                             0.68
              Min                   0.00        0.00                             0.00
               Q1                   0.00       48.00                             0.00
           Median                   0.00      113.00                             0.00
               Q3                   0.00      214.00                             0.00
              Max                 391.00      629.00                            58.00
              MAD                   0.00      111.19                             0.00
              IQR                   0.00      166.00                             0.00
               CV                   6.03        0.82                            27.03
         Skewness                  10.64        1.03                            46.75
      SE.Skewness                   0.01        0.01                             0.01
         Kurtosis                 151.19        0.78                          2794.70
          N.Valid               44224.00    44224.00                         44224.00
        Pct.Valid                 100.00      100.00                           100.00

Table: Table continues below

 

                    previous_cancellations   required_car_parking_spaces   stays_in_week_nights
----------------- ------------------------ ----------------------------- ----------------------
             Mean                     0.21                          0.00                   2.56
          Std.Dev                     1.33                          0.00                   1.88
              Min                     0.00                          0.00                   0.00
               Q1                     0.00                          0.00                   1.00
           Median                     0.00                          0.00                   2.00
               Q3                     0.00                          0.00                   3.00
              Max                    26.00                          0.00                  40.00
              MAD                     0.00                          0.00                   1.48
              IQR                     0.00                          0.00                   2.00
               CV                     6.39                           NaN                   0.73
         Skewness                    16.12                           NaN                   3.08
      SE.Skewness                     0.01                          0.01                   0.01
         Kurtosis                   282.86                           NaN                  21.82
          N.Valid                 44224.00                      44224.00               44224.00
        Pct.Valid                   100.00                        100.00                 100.00

Table: Table continues below

 

                    stays_in_weekend_nights   total_of_special_requests
----------------- ------------------------- ---------------------------
             Mean                      0.93                        0.33
          Std.Dev                      1.01                        0.65
              Min                      0.00                        0.00
               Q1                      0.00                        0.00
           Median                      1.00                        0.00
               Q3                      2.00                        0.00
              Max                     16.00                        5.00
              MAD                      1.48                        0.00
              IQR                      2.00                        0.00
               CV                      1.09                        1.97
         Skewness                      1.35                        2.06
      SE.Skewness                      0.01                        0.01
         Kurtosis                      5.26                        3.93
          N.Valid                  44224.00                    44224.00
        Pct.Valid                    100.00                      100.00

I notice the mean & median values for ADR is different in the summary stats for hotel stays vs. cancelled bookings. The max ADR value in cancelled bookings is 5400 - possibly a reason that reservation was cancelled.

I also notice that the required car parking spaces variable seems to only have values of zero for the canceled bookings subset. This suggests that information about parking spaces is not collected until the check-in process when customers physically arrive to the hotels.