challenge_2
hotel_bookings.csv
Data wrangling: using group() and summarise()
Author

Siddharth Goel

Published

January 21, 2023

Code
library(tidyverse)
library(readr)
library(dplyr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Dataset Description

I am reading the dataset hotel_bookings which contains booking information of 2 hotels Resort Hotel and City Hotel. This data contains information ranging from the booking agency of used by the guest to their arrival and departure time. In this challenge, I will attempt to provide more statistical information relevant to our inpu dataset.

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Code
bookings<-read_csv("_data/hotel_bookings.csv")

Describe the data

Code
spec(bookings)
cols(
  hotel = col_character(),
  is_canceled = col_double(),
  lead_time = col_double(),
  arrival_date_year = col_double(),
  arrival_date_month = col_character(),
  arrival_date_week_number = col_double(),
  arrival_date_day_of_month = col_double(),
  stays_in_weekend_nights = col_double(),
  stays_in_week_nights = col_double(),
  adults = col_double(),
  children = col_double(),
  babies = col_double(),
  meal = col_character(),
  country = col_character(),
  market_segment = col_character(),
  distribution_channel = col_character(),
  is_repeated_guest = col_double(),
  previous_cancellations = col_double(),
  previous_bookings_not_canceled = col_double(),
  reserved_room_type = col_character(),
  assigned_room_type = col_character(),
  booking_changes = col_double(),
  deposit_type = col_character(),
  agent = col_character(),
  company = col_character(),
  days_in_waiting_list = col_double(),
  customer_type = col_character(),
  adr = col_double(),
  required_car_parking_spaces = col_double(),
  total_of_special_requests = col_double(),
  reservation_status = col_character(),
  reservation_status_date = col_date(format = "")
)
Code
head(bookings)
# A tibble: 6 × 32
  hotel   is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
  <chr>     <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
1 Resort…       0     342    2015 July         27       1       0       0      2
2 Resort…       0     737    2015 July         27       1       0       0      2
3 Resort…       0       7    2015 July         27       1       0       1      1
4 Resort…       0      13    2015 July         27       1       0       1      1
5 Resort…       0      14    2015 July         27       1       0       2      2
6 Resort…       0      14    2015 July         27       1       0       2      2
# … with 22 more variables: children <dbl>, babies <dbl>, meal <chr>,
#   country <chr>, market_segment <chr>, distribution_channel <chr>,
#   is_repeated_guest <dbl>, previous_cancellations <dbl>,
#   previous_bookings_not_canceled <dbl>, reserved_room_type <chr>,
#   assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>,
#   agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
Code
unique(bookings$hotel)
[1] "Resort Hotel" "City Hotel"  
Code
unique(bookings$arrival_date_year)
[1] 2015 2016 2017
Code
unique(bookings$distribution_channel)
[1] "Direct"    "Corporate" "TA/TO"     "Undefined" "GDS"      
Code
unique(bookings$customer_type)
[1] "Transient"       "Contract"        "Transient-Party" "Group"          

The input data contains 119390 rows and 32 columns. It contains the booking data for 2 hotels over 3 years (2015 - 2017)

Provide Grouped Summary Statistics

I will be providing Grouped Summary Statistics as well as their interpretation and explanation in this section.

Number of bookings per hotel per year

Code
count(bookings, hotel, arrival_date_year)
# A tibble: 6 × 3
  hotel        arrival_date_year     n
  <chr>                    <dbl> <int>
1 City Hotel                2015 13682
2 City Hotel                2016 38140
3 City Hotel                2017 27508
4 Resort Hotel              2015  8314
5 Resort Hotel              2016 18567
6 Resort Hotel              2017 13179

Starting with a basic metric here to see which hotel gets more bookings and the trend of bookings with year.

Number of bookings per market segment

Code
count(bookings, hotel, market_segment) %>%
  arrange(hotel, desc(n))
# A tibble: 14 × 3
   hotel        market_segment     n
   <chr>        <chr>          <int>
 1 City Hotel   Online TA      38748
 2 City Hotel   Offline TA/TO  16747
 3 City Hotel   Groups         13975
 4 City Hotel   Direct          6093
 5 City Hotel   Corporate       2986
 6 City Hotel   Complementary    542
 7 City Hotel   Aviation         237
 8 City Hotel   Undefined          2
 9 Resort Hotel Online TA      17729
10 Resort Hotel Offline TA/TO   7472
11 Resort Hotel Direct          6513
12 Resort Hotel Groups          5836
13 Resort Hotel Corporate       2309
14 Resort Hotel Complementary    201

This information can give us the count of complementary bookings per hotel, which is a very important number considering that complementary bookings are loss making for the hotels. Moreover, we also find out which market segment is contributing the most towards bookings at the hotels (Online bookings seem to account for most bookings in both the hotels)

Percentage of canceled bookings per year

Code
canceled_df <- count(bookings, hotel, arrival_date_year, is_canceled)

canceled_df %>%
  group_by(hotel, arrival_date_year) %>%
  mutate(percent = n / sum(n)) %>%
  subset(select = -c(n)) %>%
  filter(is_canceled==1)
# A tibble: 6 × 4
# Groups:   hotel, arrival_date_year [6]
  hotel        arrival_date_year is_canceled percent
  <chr>                    <dbl>       <dbl>   <dbl>
1 City Hotel                2015           1   0.439
2 City Hotel                2016           1   0.404
3 City Hotel                2017           1   0.425
4 Resort Hotel              2015           1   0.257
5 Resort Hotel              2016           1   0.266
6 Resort Hotel              2017           1   0.308

This metric gives us the percentage of bookings canceled every year per hotel. From the metric, we can see that the percentage of canceled bookings has been high in City Hotel as compared to Resort Hotel

Percentage of canceled bookings per year

Code
country_df <- count(bookings, hotel, arrival_date_year, arrival_date_month)
country_df %>% group_by(hotel, arrival_date_year) %>% slice_max(order_by = arrival_date_month, n = 5)
# A tibble: 30 × 4
# Groups:   hotel, arrival_date_year [6]
   hotel      arrival_date_year arrival_date_month     n
   <chr>                  <dbl> <chr>              <int>
 1 City Hotel              2015 September           3529
 2 City Hotel              2015 October             3386
 3 City Hotel              2015 November            1235
 4 City Hotel              2015 July                1398
 5 City Hotel              2015 December            1654
 6 City Hotel              2016 September           3871
 7 City Hotel              2016 October             4219
 8 City Hotel              2016 November            3122
 9 City Hotel              2016 May                 3676
10 City Hotel              2016 March               3046
# … with 20 more rows

This metric gives us the months of the year where the bookings are maximum. This helps the hotels prepare for the busy months of the year in terms of pricing, capacity, staffing, etc.