Challenge 2

challenge_2

hotel_bookings.csv

tidyverse

readr

Data Wrangling

Author

Saaradhaa M

Published

August 16, 2022

Code

library(tidyverse)
library(readr)

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

Reading in the data

Code

# Read in and view the dataset.
hotel <- read.csv("_data/hotel_bookings.csv")
hotel

Description of data

Code

# Get rows and columns.
dim(hotel)

[1] 119390     32

Code

# Find which columns have missing data.
# Note: which() returns each column's position, not the number of NAs.
which(colSums(is.na(hotel))>0)

children 
      11

In the hotel bookings dataset, there are 119390 rows (bookings) and 32 columns. Only the children column has missing data. The 11 in the output above is that column's position in the data frame, not the number of missing values, because which() returns indices. Interesting columns include assigned_room_type, previous_cancellations, days_in_waiting_list and is_canceled. There are also columns for country and hotel, which suggests that the data was gathered from more than one hotel, possibly in different parts of the world.
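To get the number of missing values per column rather than the column positions, the result of colSums() can be subset directly. This is a minimal sketch on a toy data frame; `toy` and its values are made up for illustration and stand in for `hotel`.

```r
# Toy data frame standing in for `hotel` (values are made up for illustration).
toy <- data.frame(
  adults   = c(2, 1, 2, 2, 3),
  children = c(0, NA, 1, NA, NA),
  babies   = c(0, 0, 0, 0, 0)
)

# Count missing values in every column.
na_counts <- colSums(is.na(toy))

# Keep only the columns with at least one missing value.
# The names are the columns and the values are the NA counts.
na_counts[na_counts > 0]

# For comparison, which() reports the position of each such column instead.
which(na_counts > 0)
```

On the real data, replacing `toy` with `hotel` should give the count of missing values in children.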

Grouped summary statistics #1

I first want to examine the relationship between number of days in the waiting list and whether the booking was cancelled.

Code

# Check if is_canceled is binary.
# sapply() checks each column as-is, without first converting the data frame to a matrix.
sapply(hotel, function(x) all(x %in% 0:1))

                         hotel                    is_canceled 
                         FALSE                           TRUE 
                     lead_time              arrival_date_year 
                         FALSE                          FALSE 
            arrival_date_month       arrival_date_week_number 
                         FALSE                          FALSE 
     arrival_date_day_of_month        stays_in_weekend_nights 
                         FALSE                          FALSE 
          stays_in_week_nights                         adults 
                         FALSE                          FALSE 
                      children                         babies 
                         FALSE                          FALSE 
                          meal                        country 
                         FALSE                          FALSE 
                market_segment           distribution_channel 
                         FALSE                          FALSE 
             is_repeated_guest         previous_cancellations 
                          TRUE                          FALSE 
previous_bookings_not_canceled             reserved_room_type 
                         FALSE                          FALSE 
            assigned_room_type                booking_changes 
                         FALSE                          FALSE 
                  deposit_type                          agent 
                         FALSE                          FALSE 
                       company           days_in_waiting_list 
                         FALSE                          FALSE 
                 customer_type                            adr 
                         FALSE                          FALSE 
   required_car_parking_spaces      total_of_special_requests 
                         FALSE                          FALSE 
            reservation_status        reservation_status_date 
                         FALSE                          FALSE

Code

# Check mean and median for days in waiting list.
summarise(hotel, diwl_mean = mean(days_in_waiting_list), diwl_median = median(days_in_waiting_list))

Code

# Check mean of is_canceled, grouped by days in waiting list.
hotel %>%
  group_by(days_in_waiting_list) %>%
  summarise(is_canceled_mean = mean(is_canceled))

Common sense tells me that those who wait longer are more likely to cancel their reservations, but I want to check whether this can actually be observed in the data. The first code chunk above shows that is_canceled is a binary variable (with values 0 and 1), so its mean within a group is that group's cancellation rate. The summary table shows that, on average, bookings spend about 2 days on the waiting list. However, some were on the waiting list for over a year!

The mean cancellation rate is 1 for those who waited 391 days (understandable), and 0.36 for those who didn’t wait at all.
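A group mean is easier to interpret when the group size is shown next to it, since a rate of 1 could come from a single booking. The sketch below adds a count to the grouped summary. It runs on a toy tibble with made-up values; `toy`, `waiting_summary` and `n_bookings` are hypothetical names, not part of the original analysis.

```r
library(dplyr)

# Toy bookings standing in for `hotel` (values are made up for illustration).
toy <- tibble(
  days_in_waiting_list = c(0, 0, 0, 0, 50, 50, 391),
  is_canceled          = c(0, 1, 0, 0, 1, 0, 1)
)

# Cancellation rate per waiting time, with the group size alongside it.
waiting_summary <- toy %>%
  group_by(days_in_waiting_list) %>%
  summarise(
    n_bookings       = n(),
    is_canceled_mean = mean(is_canceled)
  )

waiting_summary
```

Running the same pipeline on `hotel` would show how many bookings sit behind each waiting time before reading too much into any single rate.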

Grouped summary statistics #2

I also want to examine the relationship between assigned room type and previous cancellations.

Code

# Check median previous cancellations, grouped by assigned room type.
hotel %>%
  group_by(assigned_room_type) %>%
  summarise(previous_cancellations_median = median(previous_cancellations))

For room types A to K and type P, the median number of previous cancellations is 0, so at least half of the people assigned these rooms had never cancelled a booking before. For type L, the median is 1, so at least half of those people had previously cancelled at least once. If we can find the data source, it would be interesting to uncover how type L differs from the other room types (were these people given smaller rooms?).
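The median alone hides how many bookings each room type has and how common previous cancellations are within it. As a rough sketch, the summary can be extended with a count and the share of bookings with at least one previous cancellation. The tibble below is made up for illustration; `toy`, `room_summary`, `n_bookings` and `share_with_previous` are hypothetical names.

```r
library(dplyr)

# Toy bookings standing in for `hotel` (values are made up for illustration).
toy <- tibble(
  assigned_room_type     = c("A", "A", "A", "A", "L", "L", "L"),
  previous_cancellations = c(0, 0, 0, 3, 1, 1, 0)
)

# Per room type: group size, median, and share with any previous cancellation.
room_summary <- toy %>%
  group_by(assigned_room_type) %>%
  summarise(
    n_bookings                    = n(),
    previous_cancellations_median = median(previous_cancellations),
    share_with_previous           = mean(previous_cancellations > 0)
  )

room_summary
```

If a room type turns out to have only a handful of bookings in `hotel`, its median of 1 may say more about sample size than about the room type itself.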