Challenge 6

challenge_6
hotel_bookings
Visualizing Time and Relationships
Author

Steve O’Neill

Published

August 23, 2022

library(tidyverse)
library(ggplot2)
library(readxl)
library(lubridate)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

The air_bnb dataset is interesting but really does not provide a good foundation for time analysis. Its only time information is the most recent date a review was left. A review doesn’t necessarily have any bearing on when a place was rented [users can wait to submit for 14 days], and then you still only get one data point.

Much more potent for time analysis is the hotel_bookings dataset.

hotel_bookings <- read_csv("_data/hotel_bookings.csv")
hotel_bookings

I covered this dataset in Challenge 2, so some of my analysis is already done.

The data was originally published in the journal Hospitality Management. It describes two hotels - one ‘city’ and one ‘resort’-style - in Portugal.

It is important to know that this dataset includes bookings that arrived and those which were canceled. Each row many observations about a single booking, including:

…Along with much other information. The ADR is calculated by dividing the sum of all lodging transactions by the total number of staying nights:

\(ADR\) = \(\frac{Room Revenue}{Rooms Sold}\)

Using summarytools, we can see right away that:

print(summarytools::dfSummary(hotel_bookings,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

hotel_bookings

Dimensions: 119390 x 32
Duplicates: 31994
Variable Stats / Values Freqs (% of Valid) Graph Missing
hotel [character]
1. City Hotel
2. Resort Hotel
79330(66.4%)
40060(33.6%)
0 (0.0%)
is_canceled [numeric]
Min : 0
Mean : 0.4
Max : 1
0:75166(63.0%)
1:44224(37.0%)
0 (0.0%)
lead_time [numeric]
Mean (sd) : 104 (106.9)
min ≤ med ≤ max:
0 ≤ 69 ≤ 737
IQR (CV) : 142 (1)
479 distinct values 0 (0.0%)
arrival_date_year [numeric]
Mean (sd) : 2016.2 (0.7)
min ≤ med ≤ max:
2015 ≤ 2016 ≤ 2017
IQR (CV) : 1 (0)
2015:21996(18.4%)
2016:56707(47.5%)
2017:40687(34.1%)
0 (0.0%)
arrival_date_month [character]
1. August
2. July
3. May
4. October
5. April
6. June
7. September
8. March
9. February
10. November
[ 2 others ]
13877(11.6%)
12661(10.6%)
11791(9.9%)
11160(9.3%)
11089(9.3%)
10939(9.2%)
10508(8.8%)
9794(8.2%)
8068(6.8%)
6794(5.7%)
12709(10.6%)
0 (0.0%)
arrival_date_week_number [numeric]
Mean (sd) : 27.2 (13.6)
min ≤ med ≤ max:
1 ≤ 28 ≤ 53
IQR (CV) : 22 (0.5)
53 distinct values 0 (0.0%)
arrival_date_day_of_month [numeric]
Mean (sd) : 15.8 (8.8)
min ≤ med ≤ max:
1 ≤ 16 ≤ 31
IQR (CV) : 15 (0.6)
31 distinct values 0 (0.0%)
stays_in_weekend_nights [numeric]
Mean (sd) : 0.9 (1)
min ≤ med ≤ max:
0 ≤ 1 ≤ 19
IQR (CV) : 2 (1.1)
17 distinct values 0 (0.0%)
stays_in_week_nights [numeric]
Mean (sd) : 2.5 (1.9)
min ≤ med ≤ max:
0 ≤ 2 ≤ 50
IQR (CV) : 2 (0.8)
35 distinct values 0 (0.0%)
adults [numeric]
Mean (sd) : 1.9 (0.6)
min ≤ med ≤ max:
0 ≤ 2 ≤ 55
IQR (CV) : 0 (0.3)
14 distinct values 0 (0.0%)
children [numeric]
Mean (sd) : 0.1 (0.4)
min ≤ med ≤ max:
0 ≤ 0 ≤ 10
IQR (CV) : 0 (3.8)
0:110796(92.8%)
1:4861(4.1%)
2:3652(3.1%)
3:76(0.1%)
10:1(0.0%)
4 (0.0%)
babies [numeric]
Mean (sd) : 0 (0.1)
min ≤ med ≤ max:
0 ≤ 0 ≤ 10
IQR (CV) : 0 (12.3)
0:118473(99.2%)
1:900(0.8%)
2:15(0.0%)
9:1(0.0%)
10:1(0.0%)
0 (0.0%)
meal [character]
1. BB
2. FB
3. HB
4. SC
5. Undefined
92310(77.3%)
798(0.7%)
14463(12.1%)
10650(8.9%)
1169(1.0%)
0 (0.0%)
country [character]
1. PRT
2. GBR
3. FRA
4. ESP
5. DEU
6. ITA
7. IRL
8. BEL
9. BRA
10. NLD
[ 168 others ]
48590(40.7%)
12129(10.2%)
10415(8.7%)
8568(7.2%)
7287(6.1%)
3766(3.2%)
3375(2.8%)
2342(2.0%)
2224(1.9%)
2104(1.8%)
18590(15.6%)
0 (0.0%)
market_segment [character]
1. Aviation
2. Complementary
3. Corporate
4. Direct
5. Groups
6. Offline TA/TO
7. Online TA
8. Undefined
237(0.2%)
743(0.6%)
5295(4.4%)
12606(10.6%)
19811(16.6%)
24219(20.3%)
56477(47.3%)
2(0.0%)
0 (0.0%)
distribution_channel [character]
1. Corporate
2. Direct
3. GDS
4. TA/TO
5. Undefined
6677(5.6%)
14645(12.3%)
193(0.2%)
97870(82.0%)
5(0.0%)
0 (0.0%)
is_repeated_guest [numeric]
Min : 0
Mean : 0
Max : 1
0:115580(96.8%)
1:3810(3.2%)
0 (0.0%)
previous_cancellations [numeric]
Mean (sd) : 0.1 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 26
IQR (CV) : 0 (9.7)
15 distinct values 0 (0.0%)
previous_bookings_not_canceled [numeric]
Mean (sd) : 0.1 (1.5)
min ≤ med ≤ max:
0 ≤ 0 ≤ 72
IQR (CV) : 0 (10.9)
73 distinct values 0 (0.0%)
reserved_room_type [character]
1. A
2. B
3. C
4. D
5. E
6. F
7. G
8. H
9. L
10. P
85994(72.0%)
1118(0.9%)
932(0.8%)
19201(16.1%)
6535(5.5%)
2897(2.4%)
2094(1.8%)
601(0.5%)
6(0.0%)
12(0.0%)
0 (0.0%)
assigned_room_type [character]
1. A
2. D
3. E
4. F
5. G
6. C
7. B
8. H
9. I
10. K
[ 2 others ]
74053(62.0%)
25322(21.2%)
7806(6.5%)
3751(3.1%)
2553(2.1%)
2375(2.0%)
2163(1.8%)
712(0.6%)
363(0.3%)
279(0.2%)
13(0.0%)
0 (0.0%)
booking_changes [numeric]
Mean (sd) : 0.2 (0.7)
min ≤ med ≤ max:
0 ≤ 0 ≤ 21
IQR (CV) : 0 (2.9)
21 distinct values 0 (0.0%)
deposit_type [character]
1. No Deposit
2. Non Refund
3. Refundable
104641(87.6%)
14587(12.2%)
162(0.1%)
0 (0.0%)
agent [character]
1. 9
2. NULL
3. 240
4. 1
5. 14
6. 7
7. 6
8. 250
9. 241
10. 28
[ 324 others ]
31961(26.8%)
16340(13.7%)
13922(11.7%)
7191(6.0%)
3640(3.0%)
3539(3.0%)
3290(2.8%)
2870(2.4%)
1721(1.4%)
1666(1.4%)
33250(27.8%)
0 (0.0%)
company [character]
1. NULL
2. 40
3. 223
4. 67
5. 45
6. 153
7. 174
8. 219
9. 281
10. 154
[ 343 others ]
112593(94.3%)
927(0.8%)
784(0.7%)
267(0.2%)
250(0.2%)
215(0.2%)
149(0.1%)
141(0.1%)
138(0.1%)
133(0.1%)
3793(3.2%)
0 (0.0%)
days_in_waiting_list [numeric]
Mean (sd) : 2.3 (17.6)
min ≤ med ≤ max:
0 ≤ 0 ≤ 391
IQR (CV) : 0 (7.6)
128 distinct values 0 (0.0%)
customer_type [character]
1. Contract
2. Group
3. Transient
4. Transient-Party
4076(3.4%)
577(0.5%)
89613(75.1%)
25124(21.0%)
0 (0.0%)
adr [numeric]
Mean (sd) : 101.8 (50.5)
min ≤ med ≤ max:
-6.4 ≤ 94.6 ≤ 5400
IQR (CV) : 56.7 (0.5)
8879 distinct values 0 (0.0%)
required_car_parking_spaces [numeric]
Mean (sd) : 0.1 (0.2)
min ≤ med ≤ max:
0 ≤ 0 ≤ 8
IQR (CV) : 0 (3.9)
0:111974(93.8%)
1:7383(6.2%)
2:28(0.0%)
3:3(0.0%)
8:2(0.0%)
0 (0.0%)
total_of_special_requests [numeric]
Mean (sd) : 0.6 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 5
IQR (CV) : 1 (1.4)
0:70318(58.9%)
1:33226(27.8%)
2:12969(10.9%)
3:2497(2.1%)
4:340(0.3%)
5:40(0.0%)
0 (0.0%)
reservation_status [character]
1. Canceled
2. Check-Out
3. No-Show
43017(36.0%)
75166(63.0%)
1207(1.0%)
0 (0.0%)
reservation_status_date [Date]
min : 2014-10-17
med : 2016-08-07
max : 2017-09-14
range : 2y 10m 28d
926 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-28

Trade-specific terms

There are some terms used that must be clarified for legibility:

Term Meaning
TA Travel Agent
TO Tour Operator (distribution channel)
HB Half Board (breakfast + other meal)
FB Full Board (3 meals a day)
SC Self-catering

Sanity checks

One thing I noticed is that arrival date is separated into three columns, even though reservation_status_date is already formatted as yyyy-mm-dd.

This is resolved with Lubridate and the built-in variable month-name, used to get a number from the month name (e.g. “7” from July).

hotel_bookings <- hotel_bookings %>% mutate(arrival_date = make_date(arrival_date_year,match(arrival_date_month, month.name),arrival_date_day_of_month))
hotel_bookings

With two date variables pulled out, I can now create intervals based on them. Sadly, time intervals are not well-supported in ggplot (or I’m just missing something.)

hotel_bookings <- hotel_bookings %>% mutate(booking_interval = interval(arrival_date, reservation_status_date))
hotel_bookings

Time Dependent Visualization

Here is the most straightforward graph possible, plotting average daily rate against time. But it doesn’t show things responsibly:

hotel_bookings %>% ggplot(aes(x=arrival_date, y=adr)) + geom_line()

Obviously this graph is misleading. The hotel is filling tons of rooms at once. The graph looks to be plotting lines on top of each other, so we only get to see the largest single booking for any given day. That’s why the “one crazy night” outlier - the $4500 stay - is so prominently featured.

We also need to filter out everything except real “Check-Outs” to avoid looking at canceled reservations:

hotel_bookings %>% group_by(arrival_date) %>% filter(str_detect(reservation_status, "Check-Out")) %>% summarise(adr = sum(adr))

I want to compare the earnings of these two hotels, actually:

resort_hotel_adr <- hotel_bookings %>% filter(str_detect(hotel, "Resort Hotel")) %>% group_by(arrival_date) %>% filter(str_detect(reservation_status, "Check-Out")) %>% summarise(adr = sum(adr))

city_hotel_adr <- hotel_bookings %>% filter(str_detect(hotel, "City Hotel")) %>% group_by(arrival_date) %>% filter(str_detect(reservation_status, "Check-Out")) %>% summarise(adr = sum(adr))

I’ll graph one. Because there are a few outliers, a trend line helps the eye:

resort_hotel_adr %>% ggplot(aes(x=arrival_date, y=adr)) + 
  geom_line(color = "indianred3", 
            size=.7) + 
  geom_smooth() +
  labs(title = "Total daily ADR (Average Daily Rate) for Anonymous Resort Hotel",
       subtitle = "2015 to 2017",
       x = "",
       y = "Total ADR")

Here I’ll join those two dataframes, keeping everything from each (full join)

adr_combined <- full_join(city_hotel_adr, resort_hotel_adr, by = "arrival_date", suffix = c("_city", "_resort"))

And now I will plot them both. This time I am adding a dollar format to the y-axis.

adr_combined %>% ggplot(aes(x=arrival_date)) + 
  geom_line(aes(y=adr_city, colour = "City Hotel")) +
  geom_line(aes(y=adr_resort, colour = "Resort Hotel")) +
  geom_smooth(aes(y=adr_city, colour = "City Hotel")) +
  geom_smooth(aes(y=adr_resort, colour = "Resort Hotel")) +
  labs(title = "Total daily ADR (Average Daily Rate) for Two Anonymous Hotels",
       subtitle = "2015 to 2017",
       x = "",
       y = "Total ADR",
       color = "Hotel") +
  scale_y_continuous(labels=scales::dollar_format())

Visualizing Part-Whole Relationships

Pie charts are controversial for their tendency to misrepresent larger proportions to the human eye.

Viable alternatives are donut charts, which cut out most of the middle area, But these are not always ideal for data with lots of variables.

To prepare for the chart, I am going to make a new dataframe called by_country

by_country <- hotel_bookings %>% filter(str_detect(hotel, "Resort Hotel")) %>% group_by(country)
by_country

Now, I can make a 100% stacked bar graph plot that accurately represents portions of a whole.

by_country %>% ggplot(aes(fill=meal, y=adults, x=market_segment)) +
  geom_bar(position="fill", stat="identity") +
  labs(title = "Hotel Meal Types by Market Segment",
       subtitle = "2015 to 2017: Data From Two Anonymous Portuguese Hotels",
       x = "Market Segment",
       y = "Percentage of Guests") +
  scale_y_continuous(labels = scales::percent)

At a glance, this graph suggests that:

  • Corporate customers are mostly only getting Bed & Breakfast, not the other inclusive meal packages other market segments get. Stingy, or smart?
  • The “groups” market segment gets all-inclusive meals (Full Board - FB) most often. Maybe they are using the hotel as a conference space and are therefore inside it all day.
  • Offline travel agents seem to be more likely to bundle a hotel meal in with their itinerary.

One criticism

This plot isn’t much better than a pie chart in my opinion, because it makes the “Complementary” and “Corporate” market segments look the same [by area]. People are used to seeing regular bar graphs, not 100% stacked bar graphs, so it isn’t immediately obvious that this is a percentage bar