Challenge 2 - Investigating Hotel Bookings

challenge_2
hotel_bookings
Data wrangling: using group() and summarise()
Author

Audrey Bertin

Published

June 1, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

For this challenge I looked at the hotel_bookings.csv ⭐⭐⭐⭐ dataset:

Code
hotels <- readr::read_csv("_data/hotel_bookings.csv")

Data Overview

Code
glimpse(hotels)
Rows: 119,390
Columns: 32
$ hotel                          <chr> "Resort Hotel", "Resort Hotel", "Resort…
$ is_canceled                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, …
$ lead_time                      <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, …
$ arrival_date_year              <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201…
$ arrival_date_month             <chr> "July", "July", "July", "July", "July",…
$ arrival_date_week_number       <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
$ arrival_date_day_of_month      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ stays_in_weekend_nights        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ stays_in_week_nights           <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, …
$ adults                         <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ children                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ meal                           <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB…
$ country                        <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR…
$ market_segment                 <chr> "Direct", "Direct", "Direct", "Corporat…
$ distribution_channel           <chr> "Direct", "Direct", "Direct", "Corporat…
$ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ reserved_room_type             <chr> "C", "C", "A", "A", "A", "A", "C", "C",…
$ assigned_room_type             <chr> "C", "C", "C", "A", "A", "A", "C", "C",…
$ booking_changes                <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ deposit_type                   <chr> "No Deposit", "No Deposit", "No Deposit…
$ agent                          <chr> "NULL", "NULL", "NULL", "304", "240", "…
$ company                        <chr> "NULL", "NULL", "NULL", "NULL", "NULL",…
$ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ customer_type                  <chr> "Transient", "Transient", "Transient", …
$ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,…
$ required_car_parking_spaces    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ total_of_special_requests      <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, …
$ reservation_status             <chr> "Check-Out", "Check-Out", "Check-Out", …
$ reservation_status_date        <date> 2015-07-01, 2015-07-01, 2015-07-02, 20…

This dataset has 119,300 rows and 32 columns. It contains many different data points on hotel reservations, including information about the customer (how many adults/children are staying, customer type, previous bookings by them, etc), the reservation (arrival date, length of stay, room type, etc), as well as the hotel (country location, distribution channel, hotel type, etc). Each row appears to refer to a single booking at a single hotel.

The hotels appear to be located all around the world and data covers the years 2015-2017:

Code
unique(hotels$country)
  [1] "PRT"  "GBR"  "USA"  "ESP"  "IRL"  "FRA"  "NULL" "ROU"  "NOR"  "OMN" 
 [11] "ARG"  "POL"  "DEU"  "BEL"  "CHE"  "CN"   "GRC"  "ITA"  "NLD"  "DNK" 
 [21] "RUS"  "SWE"  "AUS"  "EST"  "CZE"  "BRA"  "FIN"  "MOZ"  "BWA"  "LUX" 
 [31] "SVN"  "ALB"  "IND"  "CHN"  "MEX"  "MAR"  "UKR"  "SMR"  "LVA"  "PRI" 
 [41] "SRB"  "CHL"  "AUT"  "BLR"  "LTU"  "TUR"  "ZAF"  "AGO"  "ISR"  "CYM" 
 [51] "ZMB"  "CPV"  "ZWE"  "DZA"  "KOR"  "CRI"  "HUN"  "ARE"  "TUN"  "JAM" 
 [61] "HRV"  "HKG"  "IRN"  "GEO"  "AND"  "GIB"  "URY"  "JEY"  "CAF"  "CYP" 
 [71] "COL"  "GGY"  "KWT"  "NGA"  "MDV"  "VEN"  "SVK"  "FJI"  "KAZ"  "PAK" 
 [81] "IDN"  "LBN"  "PHL"  "SEN"  "SYC"  "AZE"  "BHR"  "NZL"  "THA"  "DOM" 
 [91] "MKD"  "MYS"  "ARM"  "JPN"  "LKA"  "CUB"  "CMR"  "BIH"  "MUS"  "COM" 
[101] "SUR"  "UGA"  "BGR"  "CIV"  "JOR"  "SYR"  "SGP"  "BDI"  "SAU"  "VNM" 
[111] "PLW"  "QAT"  "EGY"  "PER"  "MLT"  "MWI"  "ECU"  "MDG"  "ISL"  "UZB" 
[121] "NPL"  "BHS"  "MAC"  "TGO"  "TWN"  "DJI"  "STP"  "KNA"  "ETH"  "IRQ" 
[131] "HND"  "RWA"  "KHM"  "MCO"  "BGD"  "IMN"  "TJK"  "NIC"  "BEN"  "VGB" 
[141] "TZA"  "GAB"  "GHA"  "TMP"  "GLP"  "KEN"  "LIE"  "GNB"  "MNE"  "UMI" 
[151] "MYT"  "FRO"  "MMR"  "PAN"  "BFA"  "LBY"  "MLI"  "NAM"  "BOL"  "PRY" 
[161] "BRB"  "ABW"  "AIA"  "SLV"  "DMA"  "PYF"  "GUY"  "LCA"  "ATA"  "GTM" 
[171] "ASM"  "MRT"  "NCL"  "KIR"  "SDN"  "ATF"  "SLE"  "LAO" 
Code
unique(hotels$arrival_date_year)
[1] 2015 2016 2017

Beyond this, it is difficult to tell much. Does this cover all hotels in these regions? Only certain hotels that decided to provide their data? Is this just a sample of bookings or is it all of the bookings. The number of data points is too low to cover all hotel bookings around the world over three years.

Additional Details

We can further investigate to get a sense of patterns in the data.

First, we can look into hotel type and customer type. In theory, different types of customers might stay at different varieties of hotels. We can verify if this is true.

Code
xtabs(~hotel + customer_type, hotels)
              customer_type
hotel          Contract Group Transient Transient-Party
  City Hotel       2300   293     59404           17333
  Resort Hotel     1776   284     30209            7791

Groups appear more likely than other customer types to book resort hotels, booking them around half the time. Groups appear to be the standout here, so we will look into them further.

Let’s start by looking at how many people are registered for each group reservation, compared with other customer types:

Mean number of people, by type

Code
hotels %>%
  group_by(customer_type) %>%
  mutate(total_people = adults + children + babies) %>%
  select(adults, children, babies, total_people) %>%
  summarise_all(funs(mean), na.rm = TRUE)

Max number of people, by type

Code
hotels %>%
  group_by(customer_type) %>%
  mutate(total_people = adults + children + babies) %>%
  select(adults, children, babies, total_people) %>%
  summarise_all(funs(max), na.rm = TRUE)

Min number of people, by type

Code
hotels %>%
  group_by(customer_type) %>%
  mutate(total_people = adults + children + babies) %>%
  select(adults, children, babies, total_people) %>%
  summarise_all(funs(min), na.rm = TRUE)

SD of number of people, by type

Code
hotels %>%
  group_by(customer_type) %>%
  mutate(total_people = adults + children + babies) %>%
  select(adults, children, babies, total_people) %>%
  summarise_all(funs(sd), na.rm = TRUE)

Groups appear to have the highest total average number of people, as well as a much higher maximum number of people. Transient and contract contract customers appear to have a higher maximum number of babies and children than group reservations, which appear to be adult-heavy.

Looking at outliers, we can see for Group reservations that there are some stays with 0 total recorded people staying there for a recorded stay (we see this across all types of customers as well), and the maximum number of people is 55 for a single reservation. Both of those numbers are very odd. It does not make sense to reserve a hotel for no one, and 55 people would require a massive space and is bigger than any normal reservation, even a standard group reservation. 55 is MUCH higher than the mean number of people per group reservation, though we note that the standard deviation of group size is way bigger than the standard deviation for any other customer type.

Looking into group reservations further, we can also look into length of stay by hotel type:

Code
hotels %>%
  filter(customer_type == "Group") %>%
  mutate(total_nights = stays_in_weekend_nights + stays_in_week_nights) %>%
  group_by(hotel) %>%
  summarise(n_stays = n(), max_stay = max(total_nights), min_stay = min(total_nights), mean_stay = mean(total_nights), sd_stay = sd(total_nights))

Remember from earlier that groups have an approximately equal number of stays at each hotel type: City and Resort. Maximum stay length is relatively similar between the two (12 vs 14 days). On average, stays are longer at resorts, and there is also a higher standard deviation there. This makes sense logically, as vacations often last longer than, say, a work trip, and they tend to be pretty varied in length depending on the customer.

Again, we note a strange pattern with min stay length of 0 nights.

We’ve noted several outliers here so far. 0 night stays, as well as stays with 0 people and 55 people. Let’s look into those outliers further to try and understand what is going on. A few columns appear specifically relevant/useful in identifying a pattern, so we will just look at those.

Code
hotels %>%
  mutate(total_nights = stays_in_weekend_nights + stays_in_week_nights) %>%
  filter(customer_type == "Group", total_nights==0) %>%
  select(hotel, lead_time, adults, children, reservation_status, country, adr)

Almost all of the 0 night stays belong to resort hotels, many with very short lead time (indicating that potentially people just showed up on the day for a “spa” day or similar). One of them is canceled, which may impact why the result was 0. Perhaps most interestingly, all of those 0 day stays which we have a record for are in the same country: Portugal. Potentially the record keeping in Portugal is off or they allow/record 0 night stays for some reason, which other countries don’t do. The most strange thing about these reservations is that they all have an “average daily rate” (adr) of 0, meaning no charge.

Next, we look at the stays with 0 and 55 people.

Code
hotels %>%
  mutate(total_people = adults + children + babies) %>%
  filter(customer_type == "Group", total_people %in% c(0, 55)) %>%
  select(hotel, country, lead_time, market_segment, booking_changes, deposit_type, reservation_status, total_people, adr)

We have 1 stay with 55 people and 3 with 0 people. There do not appear to be any obvious patterns linking these together. They are from all different countries and market segments. The 55-person hotel booking was cancelled and had no deposit, but long lead time. The two zero-person reservations that were not cancelled had many booking changes, which could potentially indicate an unusual booking in some other way. Again, the average daily rates are mostly 0 with one that is only 6.4, which is extremely low. There is something mysterious going on with these bookings.

In order to understand what happened with these outliers further, and better understand the data more generally, some additional information would be helpful. Are 0 day stays possible at some of these places–for example, spa resorts where you just go in for the day? What does it mean to have a 0 person reservation? It is possible that that is a bug or maybe 0 was used instead of NA to represent missing data, causing confusion. More information on type of hotel would be useful as well. Could “city hotel” include something like a hostel, which would explain super low rates for some cases?