challenge_2
hotel_bookings
Author

Miranda Manka

Published

August 16, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Code
hotel_bookings = read_csv("_data/hotel_bookings.csv", show_col_types = FALSE)

Describe the data

Code
#Looking at the data
view(hotel_bookings)

#Dimensions of the data
dim(hotel_bookings)
[1] 119390     32
Code
#Summary of the variables in the dataset
summary(hotel_bookings)
    hotel            is_canceled       lead_time   arrival_date_year
 Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
 Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
 Mode  :character   Median :0.0000   Median : 69   Median :2016     
                    Mean   :0.3704   Mean   :104   Mean   :2016     
                    3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
                    Max.   :1.0000   Max.   :737   Max.   :2017     
                                                                    
 arrival_date_month arrival_date_week_number arrival_date_day_of_month
 Length:119390      Min.   : 1.00            Min.   : 1.0             
 Class :character   1st Qu.:16.00            1st Qu.: 8.0             
 Mode  :character   Median :28.00            Median :16.0             
                    Mean   :27.17            Mean   :15.8             
                    3rd Qu.:38.00            3rd Qu.:23.0             
                    Max.   :53.00            Max.   :31.0             
                                                                      
 stays_in_weekend_nights stays_in_week_nights     adults      
 Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
 1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
 Median : 1.0000         Median : 2.0         Median : 2.000  
 Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
 3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
 Max.   :19.0000         Max.   :50.0         Max.   :55.000  
                                                              
    children           babies              meal             country         
 Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
 1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
 Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
 Mean   : 0.1039   Mean   : 0.007949                                        
 3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
 Max.   :10.0000   Max.   :10.000000                                        
 NA's   :4                                                                  
 market_segment     distribution_channel is_repeated_guest
 Length:119390      Length:119390        Min.   :0.00000  
 Class :character   Class :character     1st Qu.:0.00000  
 Mode  :character   Mode  :character     Median :0.00000  
                                         Mean   :0.03191  
                                         3rd Qu.:0.00000  
                                         Max.   :1.00000  
                                                          
 previous_cancellations previous_bookings_not_canceled reserved_room_type
 Min.   : 0.00000       Min.   : 0.0000                Length:119390     
 1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
 Median : 0.00000       Median : 0.0000                Mode  :character  
 Mean   : 0.08712       Mean   : 0.1371                                  
 3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
 Max.   :26.00000       Max.   :72.0000                                  
                                                                         
 assigned_room_type booking_changes   deposit_type          agent          
 Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
 Class :character   1st Qu.: 0.0000   Class :character   Class :character  
 Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
                    Mean   : 0.2211                                        
                    3rd Qu.: 0.0000                                        
                    Max.   :21.0000                                        
                                                                           
   company          days_in_waiting_list customer_type           adr         
 Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
 Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
 Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
                    Mean   :  2.321                         Mean   : 101.83  
                    3rd Qu.:  0.000                         3rd Qu.: 126.00  
                    Max.   :391.000                         Max.   :5400.00  
                                                                             
 required_car_parking_spaces total_of_special_requests reservation_status
 Min.   :0.00000             Min.   :0.0000            Length:119390     
 1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
 Median :0.00000             Median :0.0000            Mode  :character  
 Mean   :0.06252             Mean   :0.5714                              
 3rd Qu.:0.00000             3rd Qu.:1.0000                              
 Max.   :8.00000             Max.   :5.0000                              
                                                                         
 reservation_status_date
 Min.   :2014-10-17     
 1st Qu.:2016-02-01     
 Median :2016-08-07     
 Mean   :2016-07-30     
 3rd Qu.:2017-02-08     
 Max.   :2017-09-14     
                        

This dataset has 32 variables and 119,390 observations. Each observation/case is a single hotel booking, and the variables describe that booking. They include the hotel type (city hotel vs resort hotel), whether the booking was canceled, the arrival date, the number of nights stayed (week and weekend), the number of adults, children, and babies, the market segment, whether the guest is a repeat guest, and the room type (there are more, this just points out a few). Some of the variables are categorical (hotel type: city vs resort), some are numeric counts or measurements (lead time in days, for example 14), and some are numeric but binary (is_canceled, 0 or 1). The data likely came from a hotel chain or from multiple hotels, since there are two hotel types. The country variable may point to hotels in different countries, although it could instead record each guest's country of origin; the output above does not say which.
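
The variable types described above can also be checked directly. Below is a minimal sketch that uses a small made-up tibble (`toy_bookings` and its values are hypothetical, not rows from the real file) with a few of the same column names, so it runs without the CSV. Swapping in `hotel_bookings` should work the same way.

```r
library(dplyr)

# Hypothetical toy rows that mimic a few hotel_bookings columns
toy_bookings <- tibble(
  hotel       = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  is_canceled = c(0, 1, 0, 0),
  lead_time   = c(14, 120, 0, 45),
  children    = c(0, NA, 2, 0)
)

# Storage type of each column (character vs numeric)
col_types <- sapply(toy_bookings, class)

# Number of missing values in each column
na_counts <- colSums(is.na(toy_bookings))

# Number of bookings for each hotel type
hotel_counts <- count(toy_bookings, hotel)

col_types
na_counts
hotel_counts
```

Counting the levels of a categorical variable and the missing values per column is a quick complement to `summary()`, which only reports length and class for character columns.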

Provide Grouped Summary Statistics & Explain and Interpret

Code
#Find mean and sd for number of stays in week nights grouped by hotel type
hotel_bookings %>%
  group_by(hotel) %>%
  summarise(mean = mean(stays_in_week_nights), sd = sd(stays_in_week_nights))
# A tibble: 2 × 3
  hotel         mean    sd
  <chr>        <dbl> <dbl>
1 City Hotel    2.18  1.46
2 Resort Hotel  3.13  2.46
Code
#Find mean and sd for number of stays in weekend nights grouped by hotel type
hotel_bookings %>%
  group_by(hotel) %>%
  summarise(mean = mean(stays_in_weekend_nights), sd = sd(stays_in_weekend_nights))
# A tibble: 2 × 3
  hotel         mean    sd
  <chr>        <dbl> <dbl>
1 City Hotel   0.795 0.885
2 Resort Hotel 1.19  1.15 
Code
#Find mean and sd for number of stays in week nights grouped by whether the guest is a repeat guest
hotel_bookings %>%
  group_by(is_repeated_guest) %>%
  summarise(mean = mean(stays_in_week_nights), sd = sd(stays_in_week_nights))
# A tibble: 2 × 3
  is_repeated_guest  mean    sd
              <dbl> <dbl> <dbl>
1                 0  2.53  1.91
2                 1  1.48  1.62

I started by picking out a few interesting variables and looking at them. First, I grouped by hotel and looked at the number of nights stayed during the week to see if there was any difference. The resort hotels had a higher mean (3.1 compared to 2.2 for the city hotels), which was interesting. I thought maybe people staying at resorts plan about an extra day of their trip during the week. I also used the same hotel grouping for weekend nights, and the mean for resort hotels was still higher (1.2 vs 0.8 for city), so maybe people staying at resort hotels simply stay longer. The standard deviations were also larger for resort hotels (2.46 vs 1.46 for week nights and 1.15 vs 0.885 for weekend nights), so their stay lengths vary more. This could be explored more in the future. I also grouped by whether someone is a repeated guest (0 for no, 1 for yes) and examined the mean number of week nights stayed. Repeat guests tend to stay fewer week nights (1.48 vs 2.53 for non-repeat guests). I thought this was interesting because people who come back aren’t staying as long (maybe more business travelers staying for a night rather than families on vacation).
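
One way to follow up on the idea that resort guests simply stay longer is to add week and weekend nights together before grouping. This is only a sketch on made-up rows (`toy_bookings`, `total_nights`, and `stay_summary` are hypothetical names, and the values are invented for illustration), not a result from the real data.

```r
library(dplyr)

# Hypothetical toy rows with the same column names as hotel_bookings
toy_bookings <- tibble(
  hotel                   = c("City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"),
  stays_in_week_nights    = c(2, 1, 4, 3),
  stays_in_weekend_nights = c(0, 1, 2, 1)
)

stay_summary <- toy_bookings %>%
  # Total length of stay = week nights + weekend nights
  mutate(total_nights = stays_in_week_nights + stays_in_weekend_nights) %>%
  group_by(hotel) %>%
  # Group size plus mean and sd of the total stay length
  summarise(
    n          = n(),
    mean_total = mean(total_nights),
    sd_total   = sd(total_nights)
  )

stay_summary
```

Including `n()` next to the mean and sd shows how many bookings each group mean is based on, which matters when one group is much smaller than the other.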

Code
#Find summary statistics for lead time for booking grouped by hotel type
hotel_bookings %>%
  group_by(hotel) %>%
  summarise(across(lead_time,
                   list(mean = mean, median = median, min = min, max = max, sd = sd, var = var, IQR = IQR),
                   .names = "{.fn}"))
# A tibble: 2 × 8
  hotel         mean median   min   max    sd    var   IQR
  <chr>        <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
1 City Hotel   110.      74     0   629 111.  12310.   140
2 Resort Hotel  92.7     57     0   737  97.3  9464.   145

I wanted to look at the lead time for each type of hotel. Lead time is how many days ahead of their stay someone booked; for example, 7 would mean they booked their hotel a week before they showed up. The mean lead time for city hotels is 109.7 days (a little over 3.5 months ahead of time), while the mean lead time for resort hotels is 92.7 days (about 3 months ahead of time). That is interesting, but the difference is only about 17 days. The medians for both groups (74 and 57 days) were much lower than the means, which indicates the data are skewed (positively, or towards the right): most lead times are relatively short, and a smaller number of very long ones pull the mean up. The maximums were high, with 629 days for city hotels and 737 days for resort hotels, while both had minimums of 0 (same day or walk-in). The standard deviations (111 and 97.3 days) are about as large as the means, and the other measures of dispersion (variance and IQR) were also large, indicating the data are spread out (looking at the maximums and minimums, this makes sense).
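
The link between skew and a mean that sits above the median can be seen in a tiny example. The vector below is made up for illustration (`lead_times` is hypothetical, not taken from the data): most values are short, and one very long lead time pulls the mean up while the median barely moves.

```r
# Hypothetical lead times in days: mostly short, one very long
lead_times <- c(0, 3, 7, 14, 30, 400)

mean_lead   <- mean(lead_times)    # pulled up by the 400-day booking
median_lead <- median(lead_times)  # middle of the sorted values

# A mean above the median is a rough hint of right (positive) skew
right_skew_hint <- mean_lead > median_lead

mean_lead
median_lead
right_skew_hint
```

Comparing the mean with the median is only a rough check; a histogram of `lead_time` would show the shape of the distribution more directly.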

Code
#Creating variable to see if people got the room they booked
different_room = ifelse(hotel_bookings$reserved_room_type != hotel_bookings$assigned_room_type, 1, 0)

#Looking at the results
prop.table(table(different_room))
different_room
        0         1 
0.8750565 0.1249435 

Finally, I thought it would be interesting to look at how many people got the room they booked. I made a binary indicator variable to do this. If the assigned room type differed from the reserved room type, the booking got a 1 for different_room; otherwise it got a 0, indicating they got the room type they booked. The proportion table shows that 87.5% of bookings got the room type that was reserved, and 12.5% did not.
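
The same indicator can also be built inside a pipeline and broken down by hotel type. This is a sketch on made-up rows (`toy_bookings`, `room_summary`, and `prop_different` are hypothetical names with invented values). It relies on the fact that the mean of a logical column is the proportion of TRUE values.

```r
library(dplyr)

# Hypothetical toy rows with the same column names as hotel_bookings
toy_bookings <- tibble(
  hotel              = c("City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"),
  reserved_room_type = c("A", "A", "D", "E"),
  assigned_room_type = c("A", "B", "D", "E")
)

room_summary <- toy_bookings %>%
  # TRUE when the assigned room type differs from the reserved one
  mutate(different_room = reserved_room_type != assigned_room_type) %>%
  group_by(hotel) %>%
  # Mean of a logical column = share of bookings with a different room
  summarise(prop_different = mean(different_room))

room_summary
```

Keeping the indicator as a column in the data frame, rather than as a separate vector, makes it easy to group it by other variables such as hotel type.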