challenge_2
hotel_bookings
Author

Steve O’Neill

Published

August 16, 2022

Code
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Introduction

I’m going to try to do hotel_bookings: ⭐⭐⭐⭐

Code
hotel_bookings <- read_csv("_data/hotel_bookings.csv")
hotel_bookings

Reading in the data is straightforward.

Describe the data

This dataset is available on Kaggle and was originally published in the journal Data in Brief (Antonio, Almeida, and Nunes, 2019).

The data describes two hotels - one ‘city’ and one ‘resort’-style.

Before I begin, I should establish some terminology based on research online:

Term Meaning
TA Travel Agent
TO Tour Operator (distribution channel)
HB Half Board (breakfast + other meal)
FB Full Board (3 meals a day)

Provide Grouped Summary Statistics

There are just two de-identified hotels analyzed in this whole dataset:

Code
hotel_bookings %>% distinct(hotel)

It is straightforward to give basic descriptive statistics on the Average Daily Rate (ADR) across both hotels combined:

Code
hotel_bookings %>% select(adr) %>% summary()
      adr         
 Min.   :  -6.38  
 1st Qu.:  69.29  
 Median :  94.58  
 Mean   : 101.83  
 3rd Qu.: 126.00  
 Max.   :5400.00  
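
The summary above pools the two hotels. As a minimal sketch of a per-hotel breakdown, the following uses a small made-up tibble (toy_bookings, with invented values) in place of the real file; the same group_by() and summarise() pattern should carry over to hotel_bookings.

```r
library(dplyr)

# Hypothetical stand-in for hotel_bookings; the values are invented.
toy_bookings <- tibble(
  hotel = c("City Hotel", "City Hotel", "City Hotel",
            "Resort Hotel", "Resort Hotel"),
  adr   = c(80, 100, 120, 50, 150)
)

# One row of ADR statistics per hotel.
adr_by_hotel <- toy_bookings %>%
  group_by(hotel) %>%
  summarise(
    min_adr    = min(adr, na.rm = TRUE),
    median_adr = median(adr, na.rm = TRUE),
    mean_adr   = mean(adr, na.rm = TRUE),
    max_adr    = max(adr, na.rm = TRUE)
  )

adr_by_hotel
```

Splitting by hotel also makes it easier to see which property an extreme value, such as a negative or very large rate, belongs to.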

I can also demonstrate some more interesting stats based on included values like visitor nationality and lead time:

Code
# Summarise each hotel separately
by_hotel <- hotel_bookings %>%
  group_by(hotel) %>%
  summarise(
    average.lead.time = mean(lead_time),
    # table() counts each value; which.max()/which.min() return the first extreme
    busiest.year = names(which.max(table(arrival_date_year))),
    busiest.month = names(which.max(table(arrival_date_month))),
    most.freq.nationality = names(which.max(table(country))),
    most.infreq.nationality = names(which.min(table(country)))
  )

by_hotel

As we can see,

  • The city hotel has a longer lead time on average compared to the resort - could this be based on conferences and business travel?
  • Both hotels’ busiest year in the dataset was 2016, and their busiest month, counted across all years, was August.
  • The most frequent guest nationality at both hotels was PRT (Portugal). That makes sense because the hotels are in Portugal.
  • The least frequent nationalities reported for the City and Resort hotels were Anguilla and Burundi, respectively, although which.min() returns only the first of any tied countries.
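
To list every country tied for the fewest bookings rather than only the first one, one option is to count bookings per hotel and country and keep the rows that match each hotel's minimum. This is a sketch on a made-up tibble (toy_bookings, with invented country codes and counts), not a result from the real data.

```r
library(dplyr)

# Hypothetical bookings; the country codes and counts are invented.
toy_bookings <- tibble(
  hotel   = c(rep("City Hotel", 5), rep("Resort Hotel", 4)),
  country = c("PRT", "PRT", "FRA", "AIA", "ZWE",
              "PRT", "PRT", "BDI", "GBR")
)

# which.min() reports only the first minimum; table() sorts names
# alphabetically, so the alphabetically first tied country wins.
city_countries <- toy_bookings$country[toy_bookings$hotel == "City Hotel"]
first_min <- names(which.min(table(city_countries)))

# Keep every country tied for the fewest bookings within each hotel.
rarest <- toy_bookings %>%
  count(hotel, country, name = "bookings") %>%
  group_by(hotel) %>%
  filter(bookings == min(bookings)) %>%
  ungroup()

first_min
rarest
```

If rarest has more than one row for a hotel, the single country named by which.min() is an artifact of alphabetical ordering rather than a uniquely rare nationality.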

To demonstrate select(), I pull out a few values:

Code
visits <- select(hotel_bookings, hotel, market_segment, children, babies, country, reservation_status, reservation_status_date, arrival_date_month, adr)
visits

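Using filter(), here are visits that brought along a higher-than-median number of babies compared to other visits in the same market segment (Direct, Online TA, Corporate, and so on). Since the median is most likely 0 in every segment, this effectively keeps the bookings with at least one baby: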

Code
visits_with_baby <- visits %>% group_by(market_segment) %>% filter(babies > median(babies, na.rm = TRUE))
visits_with_baby
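
A grouped filter() compares each row against the median of its own group, not the overall median. The sketch below, on a made-up tibble (toy_visits, with invented values), checks the per-segment medians and shows that when they are all 0 the grouped filter keeps the same rows as a plain babies > 0.

```r
library(dplyr)

# Hypothetical stand-in for visits; the values are invented.
toy_visits <- tibble(
  market_segment = c("Direct", "Direct", "Direct",
                     "Corporate", "Corporate", "Corporate"),
  babies         = c(0, 0, 2, 0, 0, 0)
)

# Median number of babies within each market segment.
segment_medians <- toy_visits %>%
  group_by(market_segment) %>%
  summarise(median_babies = median(babies, na.rm = TRUE))

# Grouped filter: each row is compared with its own segment's median.
above_median <- toy_visits %>%
  group_by(market_segment) %>%
  filter(babies > median(babies, na.rm = TRUE)) %>%
  ungroup()

# Ungrouped equivalent when every segment median is 0.
any_baby <- toy_visits %>% filter(babies > 0)

segment_medians
above_median
```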

So let’s compare the general make-up of travelers:

Code
table(visits$market_segment)

     Aviation Complementary     Corporate        Direct        Groups 
          237           743          5295         12606         19811 
Offline TA/TO     Online TA     Undefined 
        24219         56477             2 

…with those traveling with a baby:

Code
table(visits_with_baby$market_segment)

Complementary     Corporate        Direct        Groups Offline TA/TO 
           22             8           281            13           177 
    Online TA 
          416 

Somewhat expectedly, there aren’t many business travelers bringing babies along: just 8 of the 5,295 Corporate bookings include one!
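
Raw counts are hard to compare because the segments differ so much in size. A minimal sketch of computing the share of bookings with a baby per segment, again on a made-up tibble (toy_visits, with invented values) rather than the real data:

```r
library(dplyr)

# Hypothetical stand-in for visits; the values are invented.
toy_visits <- tibble(
  market_segment = c("Corporate", "Corporate", "Corporate", "Corporate",
                     "Direct", "Direct", "Online TA", "Online TA"),
  babies         = c(0, 0, 0, 1, 0, 1, 0, 0)
)

# Bookings, bookings with at least one baby, and their ratio per segment.
baby_share <- toy_visits %>%
  group_by(market_segment) %>%
  summarise(
    bookings        = n(),
    with_baby       = sum(babies > 0, na.rm = TRUE),
    share_with_baby = with_baby / bookings
  )

baby_share
```

Summarising instead of filtering first keeps segments with no babies at all in the result, with a share of 0, so they do not silently drop out of the comparison.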