Homework 3

hw3

hotel_bookings

Author

Siddharth Goel

Published

January 31, 2023

Importing the libraries

Code

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Code

library(ggplot2)
library(dplyr)
library(psych)


Attaching package: 'psych'

The following objects are masked from 'package:ggplot2':

    %+%, alpha

Code

library(readr)
library(summarytools)

Warning: no DISPLAY variable so Tk is not available

system might not have X11 capabilities; in case of errors when using dfSummary(), set st_options(use.x11 = FALSE)

Attaching package: 'summarytools'

The following object is masked from 'package:tibble':

    view

Code

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Reading the dataset

Code

bookings <- readr::read_csv("_data/hotel_bookings.csv")

Taking an initial look into the data

Code

# looking at the dimensions
dim(bookings)

[1] 119390     32

Code

# looking at the schema
spec(bookings)

cols(
  hotel = col_character(),
  is_canceled = col_double(),
  lead_time = col_double(),
  arrival_date_year = col_double(),
  arrival_date_month = col_character(),
  arrival_date_week_number = col_double(),
  arrival_date_day_of_month = col_double(),
  stays_in_weekend_nights = col_double(),
  stays_in_week_nights = col_double(),
  adults = col_double(),
  children = col_double(),
  babies = col_double(),
  meal = col_character(),
  country = col_character(),
  market_segment = col_character(),
  distribution_channel = col_character(),
  is_repeated_guest = col_double(),
  previous_cancellations = col_double(),
  previous_bookings_not_canceled = col_double(),
  reserved_room_type = col_character(),
  assigned_room_type = col_character(),
  booking_changes = col_double(),
  deposit_type = col_character(),
  agent = col_character(),
  company = col_character(),
  days_in_waiting_list = col_double(),
  customer_type = col_character(),
  adr = col_double(),
  required_car_parking_spaces = col_double(),
  total_of_special_requests = col_double(),
  reservation_status = col_character(),
  reservation_status_date = col_date(format = "")
)

Code

# looking at the data values
head(bookings)

As we can see from the spec and head command outputs, this is the hotel booking data for 2 hotels over 3 years. We can also see that the dataset has 119390 total rows and 32 columns.

A bit detailed description about the data

finding the unique values for each column

Code

nrow(bookings)

[1] 119390

Code

unique(bookings$hotel)

[1] "Resort Hotel" "City Hotel"

Code

unique(bookings$arrival_date_year)

[1] 2015 2016 2017

Code

#unique(bookings$agent)
#unique(bookings$company)
unique(bookings$customer_type)

[1] "Transient"       "Contract"        "Transient-Party" "Group"

Looking at the data, we can see that there are 119390 rows and 32 columns in the dataset. Moreover, taking a look at the unique values of some columns, we observe that this booking data is collected from 2 hotels, over a period of three years, across various distribution channels. The other columns have values of either character or numerical types and The dataset contains all kinds of booking information regarding the hotels, ranging right from the arrival dates to the type of guests, the booking channel, agent, etc. to the demands of the guests and parking spaces, etc.

Cleaning and mutating the data

By looking at the top values in the dataset, character type columns with values such as NULL can be noticed. Also, 0 valued numerical columns can be seen in the dataset as well. It will be great to find out the percentage of these values in the columns and remove these columns if the percentages are high.

Code

# get zero percentage in data
zero_percent <- (colSums(bookings == 0) / nrow(bookings)) * 100
# get null percent in data
null_percent <- sapply(bookings, function(x) sum(str_detect(x, "NULL")) / length(x))

aggregated_df <- data.frame(null_percent = null_percent, zero_percent = zero_percent)

arrange(aggregated_df, desc(null_percent), desc(zero_percent))

As we can see from the stats above, it is safe to remove columns company and babies due to the high percentage of insignificant "NULL" and 0 values. But before that, I will add the values in adults, children and babies columns to get the total number of guests and also calculate the total_stay_period for guests by adding stays_in_weekend_nights and stays_in_week_nights columns.

Code

bookings <- bookings %>% mutate(total_guests = adults + babies + children,
  total_stay_period = stays_in_weekend_nights + stays_in_week_nights)

bookings <- bookings %>% select(-company, -babies, -adults, -children, -stays_in_weekend_nights, -stays_in_week_nights)

Getting the summary of the dataset

The method dfSummary provides us with the complete summary of the dataset, including the mean and median values, the range and the most important/frequent values of the columns.

Code

print(
  dfSummary(
    bookings, 
    plain.ascii  = FALSE, 
    style        = "grid", 
    graph.magnif = 0.75, 
    valid.col    = FALSE),
  method= 'render', 
  table.classes= 'table-condensed'
  )

Data Frame Summary

bookings

Dimensions: 119390 x 28
Duplicates: 32030

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

hotel [character]

1. City Hotel

2. Resort Hotel

79330	(	66.4%	)
40060	(	33.6%	)

0 (0.0%)

is_canceled [numeric]

Min : 0

Mean : 0.4

Max : 1

0	:	75166	(	63.0%	)
1	:	44224	(	37.0%	)

0 (0.0%)

lead_time [numeric]

Mean (sd) : 104 (106.9)

min ≤ med ≤ max:

0 ≤ 69 ≤ 737

IQR (CV) : 142 (1)

479 distinct values

0 (0.0%)

arrival_date_year [numeric]

Mean (sd) : 2016.2 (0.7)

min ≤ med ≤ max:

2015 ≤ 2016 ≤ 2017

IQR (CV) : 1 (0)

2015	:	21996	(	18.4%	)
2016	:	56707	(	47.5%	)
2017	:	40687	(	34.1%	)

0 (0.0%)

arrival_date_month [character]

1. August

2. July

3. May

4. October

5. April

6. June

7. September

8. March

9. February

10. November

[ 2 others ]

13877	(	11.6%	)
12661	(	10.6%	)
11791	(	9.9%	)
11160	(	9.3%	)
11089	(	9.3%	)
10939	(	9.2%	)
10508	(	8.8%	)
9794	(	8.2%	)
8068	(	6.8%	)
6794	(	5.7%	)
12709	(	10.6%	)

0 (0.0%)

arrival_date_week_number [numeric]

Mean (sd) : 27.2 (13.6)

min ≤ med ≤ max:

1 ≤ 28 ≤ 53

IQR (CV) : 22 (0.5)

53 distinct values

0 (0.0%)

arrival_date_day_of_month [numeric]

Mean (sd) : 15.8 (8.8)

min ≤ med ≤ max:

1 ≤ 16 ≤ 31

IQR (CV) : 15 (0.6)

31 distinct values

0 (0.0%)

meal [character]

1. BB

2. FB

3. HB

4. SC

5. Undefined

92310	(	77.3%	)
798	(	0.7%	)
14463	(	12.1%	)
10650	(	8.9%	)
1169	(	1.0%	)

0 (0.0%)

country [character]

1. PRT

2. GBR

3. FRA

4. ESP

5. DEU

6. ITA

7. IRL

8. BEL

9. BRA

10. NLD

[ 168 others ]

48590	(	40.7%	)
12129	(	10.2%	)
10415	(	8.7%	)
8568	(	7.2%	)
7287	(	6.1%	)
3766	(	3.2%	)
3375	(	2.8%	)
2342	(	2.0%	)
2224	(	1.9%	)
2104	(	1.8%	)
18590	(	15.6%	)

0 (0.0%)

market_segment [character]

1. Aviation

2. Complementary

3. Corporate

4. Direct

5. Groups

6. Offline TA/TO

7. Online TA

8. Undefined

237	(	0.2%	)
743	(	0.6%	)
5295	(	4.4%	)
12606	(	10.6%	)
19811	(	16.6%	)
24219	(	20.3%	)
56477	(	47.3%	)
2	(	0.0%	)

0 (0.0%)

distribution_channel [character]

1. Corporate

2. Direct

3. GDS

4. TA/TO

5. Undefined

6677	(	5.6%	)
14645	(	12.3%	)
193	(	0.2%	)
97870	(	82.0%	)
5	(	0.0%	)

0 (0.0%)

is_repeated_guest [numeric]

Min : 0

Mean : 0

Max : 1

0	:	115580	(	96.8%	)
1	:	3810	(	3.2%	)

0 (0.0%)

previous_cancellations [numeric]

Mean (sd) : 0.1 (0.8)

min ≤ med ≤ max:

0 ≤ 0 ≤ 26

IQR (CV) : 0 (9.7)

15 distinct values

0 (0.0%)

previous_bookings_not_canceled [numeric]

Mean (sd) : 0.1 (1.5)

min ≤ med ≤ max:

0 ≤ 0 ≤ 72

IQR (CV) : 0 (10.9)

73 distinct values

0 (0.0%)

reserved_room_type [character]

1. A

2. B

3. C

4. D

5. E

6. F

7. G

8. H

9. L

10. P

85994	(	72.0%	)
1118	(	0.9%	)
932	(	0.8%	)
19201	(	16.1%	)
6535	(	5.5%	)
2897	(	2.4%	)
2094	(	1.8%	)
601	(	0.5%	)
6	(	0.0%	)
12	(	0.0%	)

0 (0.0%)

assigned_room_type [character]

1. A

2. D

3. E

4. F

5. G

6. C

7. B

8. H

9. I

10. K

[ 2 others ]

74053	(	62.0%	)
25322	(	21.2%	)
7806	(	6.5%	)
3751	(	3.1%	)
2553	(	2.1%	)
2375	(	2.0%	)
2163	(	1.8%	)
712	(	0.6%	)
363	(	0.3%	)
279	(	0.2%	)
13	(	0.0%	)

0 (0.0%)

booking_changes [numeric]

Mean (sd) : 0.2 (0.7)

min ≤ med ≤ max:

0 ≤ 0 ≤ 21

IQR (CV) : 0 (2.9)

21 distinct values

0 (0.0%)

deposit_type [character]

1. No Deposit

2. Non Refund

3. Refundable

104641	(	87.6%	)
14587	(	12.2%	)
162	(	0.1%	)

0 (0.0%)

agent [character]

1. 9
2. NULL
3. 240
4. 1
5. 14
6. 7
7. 6
8. 250
9. 241
10. 28
[ 324 others ]

31961	(	26.8%	)
16340	(	13.7%	)
13922	(	11.7%	)
7191	(	6.0%	)
3640	(	3.0%	)
3539	(	3.0%	)
3290	(	2.8%	)
2870	(	2.4%	)
1721	(	1.4%	)
1666	(	1.4%	)
33250	(	27.8%	)

0 (0.0%)

days_in_waiting_list [numeric]

Mean (sd) : 2.3 (17.6)

min ≤ med ≤ max:

0 ≤ 0 ≤ 391

IQR (CV) : 0 (7.6)

128 distinct values

0 (0.0%)

customer_type [character]

1. Contract

2. Group

3. Transient

4. Transient-Party

4076	(	3.4%	)
577	(	0.5%	)
89613	(	75.1%	)
25124	(	21.0%	)

0 (0.0%)

adr [numeric]

Mean (sd) : 101.8 (50.5)

min ≤ med ≤ max:

-6.4 ≤ 94.6 ≤ 5400

IQR (CV) : 56.7 (0.5)

8879 distinct values

0 (0.0%)

required_car_parking_spaces [numeric]

Mean (sd) : 0.1 (0.2)

min ≤ med ≤ max:

0 ≤ 0 ≤ 8

IQR (CV) : 0 (3.9)

0	:	111974	(	93.8%	)
1	:	7383	(	6.2%	)
2	:	28	(	0.0%	)
3	:	3	(	0.0%	)
8	:	2	(	0.0%	)

0 (0.0%)

total_of_special_requests [numeric]

Mean (sd) : 0.6 (0.8)

min ≤ med ≤ max:

0 ≤ 0 ≤ 5

IQR (CV) : 1 (1.4)

0	:	70318	(	58.9%	)
1	:	33226	(	27.8%	)
2	:	12969	(	10.9%	)
3	:	2497	(	2.1%	)
4	:	340	(	0.3%	)
5	:	40	(	0.0%	)

0 (0.0%)

reservation_status [character]

1. Canceled

2. Check-Out

3. No-Show

43017	(	36.0%	)
75166	(	63.0%	)
1207	(	1.0%	)

0 (0.0%)

reservation_status_date [Date]

min : 2014-10-17

med : 2016-08-07

max : 2017-09-14

range : 2y 10m 28d

926 distinct values

0 (0.0%)

total_guests [numeric]

Mean (sd) : 2 (0.7)

min ≤ med ≤ max:

0 ≤ 2 ≤ 55

IQR (CV) : 0 (0.4)

15 distinct values

4 (0.0%)

total_stay_period [numeric]

Mean (sd) : 3.4 (2.6)

min ≤ med ≤ max:

0 ≤ 3 ≤ 69

IQR (CV) : 2 (0.7)

45 distinct values

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-02-09

Deep dive into the data

We get a superficial idea about the data using the summary statistics, but we need to deep dive into it to know more.

Code

head(bookings)

As the column adr represents Average Daily Rate, it cannot have -ve values, so I will drop them.

Code

bookings <- bookings %>% filter(adr > 0)

Finding the mean adr for both the hotels

Code

bookings %>% group_by(hotel) %>% summarise(mean_adr = mean(adr))

Now, we will look at the top 5 countries by the number of bookings

Code

bookings.top_countries <- bookings %>% group_by(country) %>% count %>% arrange(desc(n))
bookings.top_countries <- head(5)

Let us see which country has the highest cancellation percentage

Code

bookings.cancellation <- bookings %>% filter(country %in% c("PRT", "GBR", "ESP", "FRA", "DEU")) %>% select(country, is_canceled) %>% group_by(country) %>% summarise_if(is.numeric, mean, na.rm = TRUE) %>% arrange(desc(is_canceled))
bookings.cancellation

We can see that the maximum number of bookings are cancelled by customers from Portugal.

Let us see the number of successful bookings. i.e., the guests successfully checked out of the hotel.

Code

bookings.success <- bookings %>% 
  filter(reservation_status == "Check-Out")

Let us find the hotels with most successful bookings

Code

bookings.success %>%  
  group_by(hotel) %>% 
  summarise(total_guests = sum(total_guests)) %>% 
  arrange(desc(total_guests)) %>% 
  head(n=10)

Now, let us find the countries with most successful bookings

Code

bookings.success %>%  
  group_by(country) %>% 
  summarise(total_guests = sum(total_guests)) %>% 
  arrange(desc(total_guests)) %>% 
  head(n=10)

From the above analysis, we can see that the City hotel is a bit expensive as compared to the Resort hotel and charges 10usd more per night. We also observe that City hotel had more successful bookings as compared to the Resourt hotel.

In terms of countries, Portugal is the most frequent source of origin for the customers as well as the most frequent source of cancellation as well.

Let us deep dive more into cancellations and see which types of customers are likely to cancel more.

Code

ggplot(bookings, aes(x = customer_type, y = is_canceled))+
  geom_col(fill = "Blue")+
  theme_classic() +
  labs(
    title = "Cancellations by Customer_type",
    x = "Customer Type",
    y = "No of cancellations"
  )

We can observe from the graph above that the transient customers are more likely to cancel and the groups being the least.

Code

bookings.changes_by_customer <- bookings %>% group_by(customer_type) %>% summarise(mean_changes = mean(booking_changes)) %>% arrange(desc(mean_changes)) %>% head(10)

bookings.changes_by_customer

We can again see that the transient-party and group customers make the most changes to their bookings

Code

ggplot(data = bookings.changes_by_customer, mapping = aes(x = mean_changes, y = reorder(customer_type, mean_changes))) +
  geom_col(mapping = aes(fill = mean_changes)) +
  labs(x = "Customer Type", y = "Mean Changes", title = "Booking changes by customer type")

We can observe from the graph that the most number of changes are done by people from Transient Parties and Groups whereas the least changes are done by groups.