Challenge 6

challenge_6

hotel_bookings

Visualizing Time and Relationships

Author

Steve O’Neill

Published

August 23, 2022

library(tidyverse)
library(ggplot2)
library(readxl)
library(lubridate)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

hotel_bookings ⭐⭐⭐⭐

The air_bnb dataset is interesting but really does not provide a good foundation for time analysis. Its only time information is the most recent date a review was left. A review doesn’t necessarily have any bearing on when a place was rented [users can wait to submit for 14 days], and then you still only get one data point.

Much more potent for time analysis is the hotel_bookings dataset.

hotel_bookings <- read_csv("_data/hotel_bookings.csv")
hotel_bookings

I covered this dataset in Challenge 2, so some of my analysis is already done.

The data was originally published in the journal Hospitality Management. It describes two hotels - one ‘city’ and one ‘resort’-style - in Portugal.

It is important to know that this dataset includes bookings that arrived and those which were canceled. Each row many observations about a single booking, including:

Customer information (Are they associated with a corporation or group? Do they bring kids?)
Booking information (How many times do they change their reservation?)
Rate information (What is the average daily rate, or ADR?)

…Along with much other information. The ADR is calculated by dividing the sum of all lodging transactions by the total number of staying nights:

$ADR$ = $\frac{Room Revenue}{Rooms Sold}$

Using summarytools, we can see right away that:

The data was collected between 2015 and 2017
August is the most popular month
Guests bringing children are uncommon - less than 1%

print(summarytools::dfSummary(hotel_bookings,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

hotel_bookings

Dimensions: 119390 x 32
Duplicates: 31994

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

hotel [character]

1. City Hotel

2. Resort Hotel

79330	(	66.4%	)
40060	(	33.6%	)

0 (0.0%)

is_canceled [numeric]

Min : 0

Mean : 0.4

Max : 1

0	:	75166	(	63.0%	)
1	:	44224	(	37.0%	)

0 (0.0%)

lead_time [numeric]

Mean (sd) : 104 (106.9)

min ≤ med ≤ max:

0 ≤ 69 ≤ 737

IQR (CV) : 142 (1)

479 distinct values

0 (0.0%)

arrival_date_year [numeric]

Mean (sd) : 2016.2 (0.7)

min ≤ med ≤ max:

2015 ≤ 2016 ≤ 2017

IQR (CV) : 1 (0)

2015	:	21996	(	18.4%	)
2016	:	56707	(	47.5%	)
2017	:	40687	(	34.1%	)

0 (0.0%)

arrival_date_month [character]

1. August

2. July

3. May

4. October

5. April

6. June

7. September

8. March

9. February

10. November

[ 2 others ]

13877	(	11.6%	)
12661	(	10.6%	)
11791	(	9.9%	)
11160	(	9.3%	)
11089	(	9.3%	)
10939	(	9.2%	)
10508	(	8.8%	)
9794	(	8.2%	)
8068	(	6.8%	)
6794	(	5.7%	)
12709	(	10.6%	)

0 (0.0%)

arrival_date_week_number [numeric]

Mean (sd) : 27.2 (13.6)

min ≤ med ≤ max:

1 ≤ 28 ≤ 53

IQR (CV) : 22 (0.5)

53 distinct values

0 (0.0%)

arrival_date_day_of_month [numeric]

Mean (sd) : 15.8 (8.8)

min ≤ med ≤ max:

1 ≤ 16 ≤ 31

IQR (CV) : 15 (0.6)

31 distinct values

0 (0.0%)

stays_in_weekend_nights [numeric]

Mean (sd) : 0.9 (1)

min ≤ med ≤ max:

0 ≤ 1 ≤ 19

IQR (CV) : 2 (1.1)

17 distinct values

0 (0.0%)

stays_in_week_nights [numeric]

Mean (sd) : 2.5 (1.9)

min ≤ med ≤ max:

0 ≤ 2 ≤ 50

IQR (CV) : 2 (0.8)

35 distinct values

0 (0.0%)

adults [numeric]

Mean (sd) : 1.9 (0.6)

min ≤ med ≤ max:

0 ≤ 2 ≤ 55

IQR (CV) : 0 (0.3)

14 distinct values

0 (0.0%)

children [numeric]

Mean (sd) : 0.1 (0.4)

min ≤ med ≤ max:

0 ≤ 0 ≤ 10

IQR (CV) : 0 (3.8)

0	:	110796	(	92.8%	)
1	:	4861	(	4.1%	)
2	:	3652	(	3.1%	)
3	:	76	(	0.1%	)
10	:	1	(	0.0%	)

4 (0.0%)

babies [numeric]

Mean (sd) : 0 (0.1)

min ≤ med ≤ max:

0 ≤ 0 ≤ 10

IQR (CV) : 0 (12.3)

0	:	118473	(	99.2%	)
1	:	900	(	0.8%	)
2	:	15	(	0.0%	)
9	:	1	(	0.0%	)
10	:	1	(	0.0%	)

0 (0.0%)

meal [character]

1. BB

2. FB

3. HB

4. SC

5. Undefined

92310	(	77.3%	)
798	(	0.7%	)
14463	(	12.1%	)
10650	(	8.9%	)
1169	(	1.0%	)

0 (0.0%)

country [character]

1. PRT

2. GBR

3. FRA

4. ESP

5. DEU

6. ITA

7. IRL

8. BEL

9. BRA

10. NLD

[ 168 others ]

48590	(	40.7%	)
12129	(	10.2%	)
10415	(	8.7%	)
8568	(	7.2%	)
7287	(	6.1%	)
3766	(	3.2%	)
3375	(	2.8%	)
2342	(	2.0%	)
2224	(	1.9%	)
2104	(	1.8%	)
18590	(	15.6%	)

0 (0.0%)

market_segment [character]

1. Aviation

2. Complementary

3. Corporate

4. Direct

5. Groups

6. Offline TA/TO

7. Online TA

8. Undefined

237	(	0.2%	)
743	(	0.6%	)
5295	(	4.4%	)
12606	(	10.6%	)
19811	(	16.6%	)
24219	(	20.3%	)
56477	(	47.3%	)
2	(	0.0%	)

0 (0.0%)

distribution_channel [character]

1. Corporate

2. Direct

3. GDS

4. TA/TO

5. Undefined

6677	(	5.6%	)
14645	(	12.3%	)
193	(	0.2%	)
97870	(	82.0%	)
5	(	0.0%	)

0 (0.0%)

is_repeated_guest [numeric]

Min : 0

Mean : 0

Max : 1

0	:	115580	(	96.8%	)
1	:	3810	(	3.2%	)

0 (0.0%)

previous_cancellations [numeric]

Mean (sd) : 0.1 (0.8)

min ≤ med ≤ max:

0 ≤ 0 ≤ 26

IQR (CV) : 0 (9.7)

15 distinct values

0 (0.0%)

previous_bookings_not_canceled [numeric]

Mean (sd) : 0.1 (1.5)

min ≤ med ≤ max:

0 ≤ 0 ≤ 72

IQR (CV) : 0 (10.9)

73 distinct values

0 (0.0%)

reserved_room_type [character]

1. A

2. B

3. C

4. D

5. E

6. F

7. G

8. H

9. L

10. P

85994	(	72.0%	)
1118	(	0.9%	)
932	(	0.8%	)
19201	(	16.1%	)
6535	(	5.5%	)
2897	(	2.4%	)
2094	(	1.8%	)
601	(	0.5%	)
6	(	0.0%	)
12	(	0.0%	)

0 (0.0%)

assigned_room_type [character]

1. A

2. D

3. E

4. F

5. G

6. C

7. B

8. H

9. I

10. K

[ 2 others ]

74053	(	62.0%	)
25322	(	21.2%	)
7806	(	6.5%	)
3751	(	3.1%	)
2553	(	2.1%	)
2375	(	2.0%	)
2163	(	1.8%	)
712	(	0.6%	)
363	(	0.3%	)
279	(	0.2%	)
13	(	0.0%	)

0 (0.0%)

booking_changes [numeric]

Mean (sd) : 0.2 (0.7)

min ≤ med ≤ max:

0 ≤ 0 ≤ 21

IQR (CV) : 0 (2.9)

21 distinct values

0 (0.0%)

deposit_type [character]

1. No Deposit

2. Non Refund

3. Refundable

104641	(	87.6%	)
14587	(	12.2%	)
162	(	0.1%	)

0 (0.0%)

agent [character]

1. 9
2. NULL
3. 240
4. 1
5. 14
6. 7
7. 6
8. 250
9. 241
10. 28
[ 324 others ]

31961	(	26.8%	)
16340	(	13.7%	)
13922	(	11.7%	)
7191	(	6.0%	)
3640	(	3.0%	)
3539	(	3.0%	)
3290	(	2.8%	)
2870	(	2.4%	)
1721	(	1.4%	)
1666	(	1.4%	)
33250	(	27.8%	)

0 (0.0%)

company [character]

1. NULL
2. 40
3. 223
4. 67
5. 45
6. 153
7. 174
8. 219
9. 281
10. 154
[ 343 others ]

112593	(	94.3%	)
927	(	0.8%	)
784	(	0.7%	)
267	(	0.2%	)
250	(	0.2%	)
215	(	0.2%	)
149	(	0.1%	)
141	(	0.1%	)
138	(	0.1%	)
133	(	0.1%	)
3793	(	3.2%	)

0 (0.0%)

days_in_waiting_list [numeric]

Mean (sd) : 2.3 (17.6)

min ≤ med ≤ max:

0 ≤ 0 ≤ 391

IQR (CV) : 0 (7.6)

128 distinct values

0 (0.0%)

customer_type [character]

1. Contract

2. Group

3. Transient

4. Transient-Party

4076	(	3.4%	)
577	(	0.5%	)
89613	(	75.1%	)
25124	(	21.0%	)

0 (0.0%)

adr [numeric]

Mean (sd) : 101.8 (50.5)

min ≤ med ≤ max:

-6.4 ≤ 94.6 ≤ 5400

IQR (CV) : 56.7 (0.5)

8879 distinct values

0 (0.0%)

required_car_parking_spaces [numeric]

Mean (sd) : 0.1 (0.2)

min ≤ med ≤ max:

0 ≤ 0 ≤ 8

IQR (CV) : 0 (3.9)

0	:	111974	(	93.8%	)
1	:	7383	(	6.2%	)
2	:	28	(	0.0%	)
3	:	3	(	0.0%	)
8	:	2	(	0.0%	)

0 (0.0%)

total_of_special_requests [numeric]

Mean (sd) : 0.6 (0.8)

min ≤ med ≤ max:

0 ≤ 0 ≤ 5

IQR (CV) : 1 (1.4)

0	:	70318	(	58.9%	)
1	:	33226	(	27.8%	)
2	:	12969	(	10.9%	)
3	:	2497	(	2.1%	)
4	:	340	(	0.3%	)
5	:	40	(	0.0%	)

0 (0.0%)

reservation_status [character]

1. Canceled

2. Check-Out

3. No-Show

43017	(	36.0%	)
75166	(	63.0%	)
1207	(	1.0%	)

0 (0.0%)

reservation_status_date [Date]

min : 2014-10-17

med : 2016-08-07

max : 2017-09-14

range : 2y 10m 28d

926 distinct values

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-28

Trade-specific terms

There are some terms used that must be clarified for legibility:

Term	Meaning
TA	Travel Agent
TO	Tour Operator (distribution channel)
HB	Half Board (breakfast + other meal)
FB	Full Board (3 meals a day)
SC	Self-catering

Sanity checks

One thing I noticed is that arrival date is separated into three columns, even though reservation_status_date is already formatted as yyyy-mm-dd.

This is resolved with Lubridate and the built-in variable month-name, used to get a number from the month name (e.g. “7” from July).

hotel_bookings <- hotel_bookings %>% mutate(arrival_date = make_date(arrival_date_year,match(arrival_date_month, month.name),arrival_date_day_of_month))
hotel_bookings

With two date variables pulled out, I can now create intervals based on them. Sadly, time intervals are not well-supported in ggplot (or I’m just missing something.)

hotel_bookings <- hotel_bookings %>% mutate(booking_interval = interval(arrival_date, reservation_status_date))
hotel_bookings

Time Dependent Visualization

Here is the most straightforward graph possible, plotting average daily rate against time. But it doesn’t show things responsibly:

hotel_bookings %>% ggplot(aes(x=arrival_date, y=adr)) + geom_line()

Obviously this graph is misleading. The hotel is filling tons of rooms at once. The graph looks to be plotting lines on top of each other, so we only get to see the largest single booking for any given day. That’s why the “one crazy night” outlier - the $4500 stay - is so prominently featured.

We also need to filter out everything except real “Check-Outs” to avoid looking at canceled reservations:

hotel_bookings %>% group_by(arrival_date) %>% filter(str_detect(reservation_status, "Check-Out")) %>% summarise(adr = sum(adr))

I want to compare the earnings of these two hotels, actually:

resort_hotel_adr <- hotel_bookings %>% filter(str_detect(hotel, "Resort Hotel")) %>% group_by(arrival_date) %>% filter(str_detect(reservation_status, "Check-Out")) %>% summarise(adr = sum(adr))

city_hotel_adr <- hotel_bookings %>% filter(str_detect(hotel, "City Hotel")) %>% group_by(arrival_date) %>% filter(str_detect(reservation_status, "Check-Out")) %>% summarise(adr = sum(adr))

I’ll graph one. Because there are a few outliers, a trend line helps the eye:

resort_hotel_adr %>% ggplot(aes(x=arrival_date, y=adr)) + 
  geom_line(color = "indianred3", 
            size=.7) + 
  geom_smooth() +
  labs(title = "Total daily ADR (Average Daily Rate) for Anonymous Resort Hotel",
       subtitle = "2015 to 2017",
       x = "",
       y = "Total ADR")

Here I’ll join those two dataframes, keeping everything from each (full join)

adr_combined <- full_join(city_hotel_adr, resort_hotel_adr, by = "arrival_date", suffix = c("_city", "_resort"))

And now I will plot them both. This time I am adding a dollar format to the y-axis.

adr_combined %>% ggplot(aes(x=arrival_date)) + 
  geom_line(aes(y=adr_city, colour = "City Hotel")) +
  geom_line(aes(y=adr_resort, colour = "Resort Hotel")) +
  geom_smooth(aes(y=adr_city, colour = "City Hotel")) +
  geom_smooth(aes(y=adr_resort, colour = "Resort Hotel")) +
  labs(title = "Total daily ADR (Average Daily Rate) for Two Anonymous Hotels",
       subtitle = "2015 to 2017",
       x = "",
       y = "Total ADR",
       color = "Hotel") +
  scale_y_continuous(labels=scales::dollar_format())

Visualizing Part-Whole Relationships

Pie charts are controversial for their tendency to misrepresent larger proportions to the human eye.

Viable alternatives are donut charts, which cut out most of the middle area, But these are not always ideal for data with lots of variables.

To prepare for the chart, I am going to make a new dataframe called by_country

by_country <- hotel_bookings %>% filter(str_detect(hotel, "Resort Hotel")) %>% group_by(country)
by_country

Now, I can make a 100% stacked bar graph plot that accurately represents portions of a whole.

by_country %>% ggplot(aes(fill=meal, y=adults, x=market_segment)) +
  geom_bar(position="fill", stat="identity") +
  labs(title = "Hotel Meal Types by Market Segment",
       subtitle = "2015 to 2017: Data From Two Anonymous Portuguese Hotels",
       x = "Market Segment",
       y = "Percentage of Guests") +
  scale_y_continuous(labels = scales::percent)

At a glance, this graph suggests that:

Corporate customers are mostly only getting Bed & Breakfast, not the other inclusive meal packages other market segments get. Stingy, or smart?
The “groups” market segment gets all-inclusive meals (Full Board - FB) most often. Maybe they are using the hotel as a conference space and are therefore inside it all day.
Offline travel agents seem to be more likely to bundle a hotel meal in with their itinerary.

One criticism

This plot isn’t much better than a pie chart in my opinion, because it makes the “Complementary” and “Corporate” market segments look the same [by area]. People are used to seeing regular bar graphs, not 100% stacked bar graphs, so it isn’t immediately obvious that this is a percentage bar