hw3
hotel_bookings
Author

Siddharth Goel

Published

January 31, 2023

Importing the libraries

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(ggplot2)
library(dplyr)
library(psych)

Attaching package: 'psych'

The following objects are masked from 'package:ggplot2':

    %+%, alpha
Code
library(readr)
library(summarytools)
Warning: no DISPLAY variable so Tk is not available
system might not have X11 capabilities; in case of errors when using dfSummary(), set st_options(use.x11 = FALSE)

Attaching package: 'summarytools'

The following object is masked from 'package:tibble':

    view
Code
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Reading the dataset

Code
bookings <- readr::read_csv("_data/hotel_bookings.csv")

Taking an initial look into the data

Code
# looking at the dimensions
dim(bookings)
[1] 119390     32
Code
# looking at the schema
spec(bookings)
cols(
  hotel = col_character(),
  is_canceled = col_double(),
  lead_time = col_double(),
  arrival_date_year = col_double(),
  arrival_date_month = col_character(),
  arrival_date_week_number = col_double(),
  arrival_date_day_of_month = col_double(),
  stays_in_weekend_nights = col_double(),
  stays_in_week_nights = col_double(),
  adults = col_double(),
  children = col_double(),
  babies = col_double(),
  meal = col_character(),
  country = col_character(),
  market_segment = col_character(),
  distribution_channel = col_character(),
  is_repeated_guest = col_double(),
  previous_cancellations = col_double(),
  previous_bookings_not_canceled = col_double(),
  reserved_room_type = col_character(),
  assigned_room_type = col_character(),
  booking_changes = col_double(),
  deposit_type = col_character(),
  agent = col_character(),
  company = col_character(),
  days_in_waiting_list = col_double(),
  customer_type = col_character(),
  adr = col_double(),
  required_car_parking_spaces = col_double(),
  total_of_special_requests = col_double(),
  reservation_status = col_character(),
  reservation_status_date = col_date(format = "")
)
Code
# looking at the data values
head(bookings)

As we can see from the spec and head command outputs, this is the hotel booking data for 2 hotels over 3 years. We can also see that the dataset has 119390 total rows and 32 columns.

A bit detailed description about the data

finding the unique values for each column

Code
nrow(bookings)
[1] 119390
Code
unique(bookings$hotel)
[1] "Resort Hotel" "City Hotel"  
Code
unique(bookings$arrival_date_year)
[1] 2015 2016 2017
Code
#unique(bookings$agent)
#unique(bookings$company)
unique(bookings$customer_type)
[1] "Transient"       "Contract"        "Transient-Party" "Group"          

Looking at the data, we can see that there are 119390 rows and 32 columns in the dataset. Moreover, taking a look at the unique values of some columns, we observe that this booking data is collected from 2 hotels, over a period of three years, across various distribution channels. The other columns have values of either character or numerical types and The dataset contains all kinds of booking information regarding the hotels, ranging right from the arrival dates to the type of guests, the booking channel, agent, etc. to the demands of the guests and parking spaces, etc.

Cleaning and mutating the data

By looking at the top values in the dataset, character type columns with values such as NULL can be noticed. Also, 0 valued numerical columns can be seen in the dataset as well. It will be great to find out the percentage of these values in the columns and remove these columns if the percentages are high.

Code
# get zero percentage in data
zero_percent <- (colSums(bookings == 0) / nrow(bookings)) * 100
# get null percent in data
null_percent <- sapply(bookings, function(x) sum(str_detect(x, "NULL")) / length(x))

aggregated_df <- data.frame(null_percent = null_percent, zero_percent = zero_percent)

arrange(aggregated_df, desc(null_percent), desc(zero_percent))

As we can see from the stats above, it is safe to remove columns company and babies due to the high percentage of insignificant "NULL" and 0 values. But before that, I will add the values in adults, children and babies columns to get the total number of guests and also calculate the total_stay_period for guests by adding stays_in_weekend_nights and stays_in_week_nights columns.

Code
bookings <- bookings %>% mutate(total_guests = adults + babies + children,
  total_stay_period = stays_in_weekend_nights + stays_in_week_nights)

bookings <- bookings %>% select(-company, -babies, -adults, -children, -stays_in_weekend_nights, -stays_in_week_nights)

Getting the summary of the dataset

The method dfSummary provides us with the complete summary of the dataset, including the mean and median values, the range and the most important/frequent values of the columns.

Code
print(
  dfSummary(
    bookings, 
    plain.ascii  = FALSE, 
    style        = "grid", 
    graph.magnif = 0.75, 
    valid.col    = FALSE),
  method= 'render', 
  table.classes= 'table-condensed'
  )

Data Frame Summary

bookings

Dimensions: 119390 x 28
Duplicates: 32030
No Variable Stats / Values Freqs (% of Valid) Graph Missing
1 hotel [character]
1. City Hotel
2. Resort Hotel
79330(66.4%)
40060(33.6%)
0 (0.0%)
2 is_canceled [numeric]
Min : 0
Mean : 0.4
Max : 1
0:75166(63.0%)
1:44224(37.0%)
0 (0.0%)
3 lead_time [numeric]
Mean (sd) : 104 (106.9)
min ≤ med ≤ max:
0 ≤ 69 ≤ 737
IQR (CV) : 142 (1)
479 distinct values 0 (0.0%)
4 arrival_date_year [numeric]
Mean (sd) : 2016.2 (0.7)
min ≤ med ≤ max:
2015 ≤ 2016 ≤ 2017
IQR (CV) : 1 (0)
2015:21996(18.4%)
2016:56707(47.5%)
2017:40687(34.1%)
0 (0.0%)
5 arrival_date_month [character]
1. August
2. July
3. May
4. October
5. April
6. June
7. September
8. March
9. February
10. November
[ 2 others ]
13877(11.6%)
12661(10.6%)
11791(9.9%)
11160(9.3%)
11089(9.3%)
10939(9.2%)
10508(8.8%)
9794(8.2%)
8068(6.8%)
6794(5.7%)
12709(10.6%)
0 (0.0%)
6 arrival_date_week_number [numeric]
Mean (sd) : 27.2 (13.6)
min ≤ med ≤ max:
1 ≤ 28 ≤ 53
IQR (CV) : 22 (0.5)
53 distinct values 0 (0.0%)
7 arrival_date_day_of_month [numeric]
Mean (sd) : 15.8 (8.8)
min ≤ med ≤ max:
1 ≤ 16 ≤ 31
IQR (CV) : 15 (0.6)
31 distinct values 0 (0.0%)
8 meal [character]
1. BB
2. FB
3. HB
4. SC
5. Undefined
92310(77.3%)
798(0.7%)
14463(12.1%)
10650(8.9%)
1169(1.0%)
0 (0.0%)
9 country [character]
1. PRT
2. GBR
3. FRA
4. ESP
5. DEU
6. ITA
7. IRL
8. BEL
9. BRA
10. NLD
[ 168 others ]
48590(40.7%)
12129(10.2%)
10415(8.7%)
8568(7.2%)
7287(6.1%)
3766(3.2%)
3375(2.8%)
2342(2.0%)
2224(1.9%)
2104(1.8%)
18590(15.6%)
0 (0.0%)
10 market_segment [character]
1. Aviation
2. Complementary
3. Corporate
4. Direct
5. Groups
6. Offline TA/TO
7. Online TA
8. Undefined
237(0.2%)
743(0.6%)
5295(4.4%)
12606(10.6%)
19811(16.6%)
24219(20.3%)
56477(47.3%)
2(0.0%)
0 (0.0%)
11 distribution_channel [character]
1. Corporate
2. Direct
3. GDS
4. TA/TO
5. Undefined
6677(5.6%)
14645(12.3%)
193(0.2%)
97870(82.0%)
5(0.0%)
0 (0.0%)
12 is_repeated_guest [numeric]
Min : 0
Mean : 0
Max : 1
0:115580(96.8%)
1:3810(3.2%)
0 (0.0%)
13 previous_cancellations [numeric]
Mean (sd) : 0.1 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 26
IQR (CV) : 0 (9.7)
15 distinct values 0 (0.0%)
14 previous_bookings_not_canceled [numeric]
Mean (sd) : 0.1 (1.5)
min ≤ med ≤ max:
0 ≤ 0 ≤ 72
IQR (CV) : 0 (10.9)
73 distinct values 0 (0.0%)
15 reserved_room_type [character]
1. A
2. B
3. C
4. D
5. E
6. F
7. G
8. H
9. L
10. P
85994(72.0%)
1118(0.9%)
932(0.8%)
19201(16.1%)
6535(5.5%)
2897(2.4%)
2094(1.8%)
601(0.5%)
6(0.0%)
12(0.0%)
0 (0.0%)
16 assigned_room_type [character]
1. A
2. D
3. E
4. F
5. G
6. C
7. B
8. H
9. I
10. K
[ 2 others ]
74053(62.0%)
25322(21.2%)
7806(6.5%)
3751(3.1%)
2553(2.1%)
2375(2.0%)
2163(1.8%)
712(0.6%)
363(0.3%)
279(0.2%)
13(0.0%)
0 (0.0%)
17 booking_changes [numeric]
Mean (sd) : 0.2 (0.7)
min ≤ med ≤ max:
0 ≤ 0 ≤ 21
IQR (CV) : 0 (2.9)
21 distinct values 0 (0.0%)
18 deposit_type [character]
1. No Deposit
2. Non Refund
3. Refundable
104641(87.6%)
14587(12.2%)
162(0.1%)
0 (0.0%)
19 agent [character]
1. 9
2. NULL
3. 240
4. 1
5. 14
6. 7
7. 6
8. 250
9. 241
10. 28
[ 324 others ]
31961(26.8%)
16340(13.7%)
13922(11.7%)
7191(6.0%)
3640(3.0%)
3539(3.0%)
3290(2.8%)
2870(2.4%)
1721(1.4%)
1666(1.4%)
33250(27.8%)
0 (0.0%)
20 days_in_waiting_list [numeric]
Mean (sd) : 2.3 (17.6)
min ≤ med ≤ max:
0 ≤ 0 ≤ 391
IQR (CV) : 0 (7.6)
128 distinct values 0 (0.0%)
21 customer_type [character]
1. Contract
2. Group
3. Transient
4. Transient-Party
4076(3.4%)
577(0.5%)
89613(75.1%)
25124(21.0%)
0 (0.0%)
22 adr [numeric]
Mean (sd) : 101.8 (50.5)
min ≤ med ≤ max:
-6.4 ≤ 94.6 ≤ 5400
IQR (CV) : 56.7 (0.5)
8879 distinct values 0 (0.0%)
23 required_car_parking_spaces [numeric]
Mean (sd) : 0.1 (0.2)
min ≤ med ≤ max:
0 ≤ 0 ≤ 8
IQR (CV) : 0 (3.9)
0:111974(93.8%)
1:7383(6.2%)
2:28(0.0%)
3:3(0.0%)
8:2(0.0%)
0 (0.0%)
24 total_of_special_requests [numeric]
Mean (sd) : 0.6 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 5
IQR (CV) : 1 (1.4)
0:70318(58.9%)
1:33226(27.8%)
2:12969(10.9%)
3:2497(2.1%)
4:340(0.3%)
5:40(0.0%)
0 (0.0%)
25 reservation_status [character]
1. Canceled
2. Check-Out
3. No-Show
43017(36.0%)
75166(63.0%)
1207(1.0%)
0 (0.0%)
26 reservation_status_date [Date]
min : 2014-10-17
med : 2016-08-07
max : 2017-09-14
range : 2y 10m 28d
926 distinct values 0 (0.0%)
27 total_guests [numeric]
Mean (sd) : 2 (0.7)
min ≤ med ≤ max:
0 ≤ 2 ≤ 55
IQR (CV) : 0 (0.4)
15 distinct values 4 (0.0%)
28 total_stay_period [numeric]
Mean (sd) : 3.4 (2.6)
min ≤ med ≤ max:
0 ≤ 3 ≤ 69
IQR (CV) : 2 (0.7)
45 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-02-09

Deep dive into the data

We get a superficial idea about the data using the summary statistics, but we need to deep dive into it to know more.

Code
head(bookings)

As the column adr represents Average Daily Rate, it cannot have -ve values, so I will drop them.

Code
bookings <- bookings %>% filter(adr > 0)

Finding the mean adr for both the hotels

Code
bookings %>% group_by(hotel) %>% summarise(mean_adr = mean(adr))

Now, we will look at the top 5 countries by the number of bookings

Code
bookings.top_countries <- bookings %>% group_by(country) %>% count %>% arrange(desc(n))
bookings.top_countries <- head(5)

Let us see which country has the highest cancellation percentage

Code
bookings.cancellation <- bookings %>% filter(country %in% c("PRT", "GBR", "ESP", "FRA", "DEU")) %>% select(country, is_canceled) %>% group_by(country) %>% summarise_if(is.numeric, mean, na.rm = TRUE) %>% arrange(desc(is_canceled))
bookings.cancellation

We can see that the maximum number of bookings are cancelled by customers from Portugal.

Let us see the number of successful bookings. i.e., the guests successfully checked out of the hotel.

Code
bookings.success <- bookings %>% 
  filter(reservation_status == "Check-Out") 

Let us find the hotels with most successful bookings

Code
bookings.success %>%  
  group_by(hotel) %>% 
  summarise(total_guests = sum(total_guests)) %>% 
  arrange(desc(total_guests)) %>% 
  head(n=10)

Now, let us find the countries with most successful bookings

Code
bookings.success %>%  
  group_by(country) %>% 
  summarise(total_guests = sum(total_guests)) %>% 
  arrange(desc(total_guests)) %>% 
  head(n=10)

From the above analysis, we can see that the City hotel is a bit expensive as compared to the Resort hotel and charges 10usd more per night. We also observe that City hotel had more successful bookings as compared to the Resourt hotel.

In terms of countries, Portugal is the most frequent source of origin for the customers as well as the most frequent source of cancellation as well.

Let us deep dive more into cancellations and see which types of customers are likely to cancel more.

Code
ggplot(bookings, aes(x = customer_type, y = is_canceled))+
  geom_col(fill = "Blue")+
  theme_classic() +
  labs(
    title = "Cancellations by Customer_type",
    x = "Customer Type",
    y = "No of cancellations"
  ) 

We can observe from the graph above that the transient customers are more likely to cancel and the groups being the least.

Code
bookings.changes_by_customer <- bookings %>% group_by(customer_type) %>% summarise(mean_changes = mean(booking_changes)) %>% arrange(desc(mean_changes)) %>% head(10)

bookings.changes_by_customer

We can again see that the transient-party and group customers make the most changes to their bookings

Code
ggplot(data = bookings.changes_by_customer, mapping = aes(x = mean_changes, y = reorder(customer_type, mean_changes))) +
  geom_col(mapping = aes(fill = mean_changes)) +
  labs(x = "Customer Type", y = "Mean Changes", title = "Booking changes by customer type")

We can observe from the graph that the most number of changes are done by people from Transient Parties and Groups whereas the least changes are done by groups.