Challenge 6 - Hotel Bookings

challenge_6
Megan Galarneau
hotel_bookings
Visualizing Time and Relationships
Author

Megan Galarneau

Published

April 7, 2023

Code
library(tidyverse)
library(ggplot2)
library(dplyr)
library(lubridate)
library(patchwork)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. Read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. Tidy data (as needed, including sanity checks)
  3. Mutate variables as needed (including sanity checks)
  4. Create at least one graph including time (evolution)
  • try to make them “publication” ready (optional)
  • Explain why you choose the specific graph type
  1. Create at least one graph depicting part-whole or flow relationships
  • try to make them “publication” ready (optional)
  • Explain why you choose the specific graph type

Read in data

Code
#read in the data set, raw
library(readr)
raw_hotel <- read_csv("_data/hotel_bookings.csv")
raw_hotel

Briefly describe the data

This data set contains information of over 119k city & resort hotel bookings from July 2015 to August 2017 in 178 different countries. Detailed information about arrival date/times, length of stay, cancellations, room type, reservation status, # of people (adults, childern, or babies), market segment & more are also included in this data set. At first glance, there are more reported city hotel bookings (66.4%) with almost half of the bookings from 2015 (47.5%).

Code
#summary of data set statistics
print(summarytools::dfSummary(raw_hotel,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

raw_hotel

Dimensions: 119390 x 32
Duplicates: 31994
Variable Stats / Values Freqs (% of Valid) Graph Missing
hotel [character]
1. City Hotel
2. Resort Hotel
79330 ( 66.4% )
40060 ( 33.6% )
0 (0.0%)
is_canceled [numeric]
Min : 0
Mean : 0.4
Max : 1
0 : 75166 ( 63.0% )
1 : 44224 ( 37.0% )
0 (0.0%)
lead_time [numeric]
Mean (sd) : 104 (106.9)
min ≤ med ≤ max:
0 ≤ 69 ≤ 737
IQR (CV) : 142 (1)
479 distinct values 0 (0.0%)
arrival_date_year [numeric]
Mean (sd) : 2016.2 (0.7)
min ≤ med ≤ max:
2015 ≤ 2016 ≤ 2017
IQR (CV) : 1 (0)
2015 : 21996 ( 18.4% )
2016 : 56707 ( 47.5% )
2017 : 40687 ( 34.1% )
0 (0.0%)
arrival_date_month [character]
1. August
2. July
3. May
4. October
5. April
6. June
7. September
8. March
9. February
10. November
[ 2 others ]
13877 ( 11.6% )
12661 ( 10.6% )
11791 ( 9.9% )
11160 ( 9.3% )
11089 ( 9.3% )
10939 ( 9.2% )
10508 ( 8.8% )
9794 ( 8.2% )
8068 ( 6.8% )
6794 ( 5.7% )
12709 ( 10.6% )
0 (0.0%)
arrival_date_week_number [numeric]
Mean (sd) : 27.2 (13.6)
min ≤ med ≤ max:
1 ≤ 28 ≤ 53
IQR (CV) : 22 (0.5)
53 distinct values 0 (0.0%)
arrival_date_day_of_month [numeric]
Mean (sd) : 15.8 (8.8)
min ≤ med ≤ max:
1 ≤ 16 ≤ 31
IQR (CV) : 15 (0.6)
31 distinct values 0 (0.0%)
stays_in_weekend_nights [numeric]
Mean (sd) : 0.9 (1)
min ≤ med ≤ max:
0 ≤ 1 ≤ 19
IQR (CV) : 2 (1.1)
17 distinct values 0 (0.0%)
stays_in_week_nights [numeric]
Mean (sd) : 2.5 (1.9)
min ≤ med ≤ max:
0 ≤ 2 ≤ 50
IQR (CV) : 2 (0.8)
35 distinct values 0 (0.0%)
adults [numeric]
Mean (sd) : 1.9 (0.6)
min ≤ med ≤ max:
0 ≤ 2 ≤ 55
IQR (CV) : 0 (0.3)
14 distinct values 0 (0.0%)
children [numeric]
Mean (sd) : 0.1 (0.4)
min ≤ med ≤ max:
0 ≤ 0 ≤ 10
IQR (CV) : 0 (3.8)
0 : 110796 ( 92.8% )
1 : 4861 ( 4.1% )
2 : 3652 ( 3.1% )
3 : 76 ( 0.1% )
10 : 1 ( 0.0% )
4 (0.0%)
babies [numeric]
Mean (sd) : 0 (0.1)
min ≤ med ≤ max:
0 ≤ 0 ≤ 10
IQR (CV) : 0 (12.3)
0 : 118473 ( 99.2% )
1 : 900 ( 0.8% )
2 : 15 ( 0.0% )
9 : 1 ( 0.0% )
10 : 1 ( 0.0% )
0 (0.0%)
meal [character]
1. BB
2. FB
3. HB
4. SC
5. Undefined
92310 ( 77.3% )
798 ( 0.7% )
14463 ( 12.1% )
10650 ( 8.9% )
1169 ( 1.0% )
0 (0.0%)
country [character]
1. PRT
2. GBR
3. FRA
4. ESP
5. DEU
6. ITA
7. IRL
8. BEL
9. BRA
10. NLD
[ 168 others ]
48590 ( 40.7% )
12129 ( 10.2% )
10415 ( 8.7% )
8568 ( 7.2% )
7287 ( 6.1% )
3766 ( 3.2% )
3375 ( 2.8% )
2342 ( 2.0% )
2224 ( 1.9% )
2104 ( 1.8% )
18590 ( 15.6% )
0 (0.0%)
market_segment [character]
1. Aviation
2. Complementary
3. Corporate
4. Direct
5. Groups
6. Offline TA/TO
7. Online TA
8. Undefined
237 ( 0.2% )
743 ( 0.6% )
5295 ( 4.4% )
12606 ( 10.6% )
19811 ( 16.6% )
24219 ( 20.3% )
56477 ( 47.3% )
2 ( 0.0% )
0 (0.0%)
distribution_channel [character]
1. Corporate
2. Direct
3. GDS
4. TA/TO
5. Undefined
6677 ( 5.6% )
14645 ( 12.3% )
193 ( 0.2% )
97870 ( 82.0% )
5 ( 0.0% )
0 (0.0%)
is_repeated_guest [numeric]
Min : 0
Mean : 0
Max : 1
0 : 115580 ( 96.8% )
1 : 3810 ( 3.2% )
0 (0.0%)
previous_cancellations [numeric]
Mean (sd) : 0.1 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 26
IQR (CV) : 0 (9.7)
15 distinct values 0 (0.0%)
previous_bookings_not_canceled [numeric]
Mean (sd) : 0.1 (1.5)
min ≤ med ≤ max:
0 ≤ 0 ≤ 72
IQR (CV) : 0 (10.9)
73 distinct values 0 (0.0%)
reserved_room_type [character]
1. A
2. B
3. C
4. D
5. E
6. F
7. G
8. H
9. L
10. P
85994 ( 72.0% )
1118 ( 0.9% )
932 ( 0.8% )
19201 ( 16.1% )
6535 ( 5.5% )
2897 ( 2.4% )
2094 ( 1.8% )
601 ( 0.5% )
6 ( 0.0% )
12 ( 0.0% )
0 (0.0%)
assigned_room_type [character]
1. A
2. D
3. E
4. F
5. G
6. C
7. B
8. H
9. I
10. K
[ 2 others ]
74053 ( 62.0% )
25322 ( 21.2% )
7806 ( 6.5% )
3751 ( 3.1% )
2553 ( 2.1% )
2375 ( 2.0% )
2163 ( 1.8% )
712 ( 0.6% )
363 ( 0.3% )
279 ( 0.2% )
13 ( 0.0% )
0 (0.0%)
booking_changes [numeric]
Mean (sd) : 0.2 (0.7)
min ≤ med ≤ max:
0 ≤ 0 ≤ 21
IQR (CV) : 0 (2.9)
21 distinct values 0 (0.0%)
deposit_type [character]
1. No Deposit
2. Non Refund
3. Refundable
104641 ( 87.6% )
14587 ( 12.2% )
162 ( 0.1% )
0 (0.0%)
agent [character]
1. 9
2. NULL
3. 240
4. 1
5. 14
6. 7
7. 6
8. 250
9. 241
10. 28
[ 324 others ]
31961 ( 26.8% )
16340 ( 13.7% )
13922 ( 11.7% )
7191 ( 6.0% )
3640 ( 3.0% )
3539 ( 3.0% )
3290 ( 2.8% )
2870 ( 2.4% )
1721 ( 1.4% )
1666 ( 1.4% )
33250 ( 27.8% )
0 (0.0%)
company [character]
1. NULL
2. 40
3. 223
4. 67
5. 45
6. 153
7. 174
8. 219
9. 281
10. 154
[ 343 others ]
112593 ( 94.3% )
927 ( 0.8% )
784 ( 0.7% )
267 ( 0.2% )
250 ( 0.2% )
215 ( 0.2% )
149 ( 0.1% )
141 ( 0.1% )
138 ( 0.1% )
133 ( 0.1% )
3793 ( 3.2% )
0 (0.0%)
days_in_waiting_list [numeric]
Mean (sd) : 2.3 (17.6)
min ≤ med ≤ max:
0 ≤ 0 ≤ 391
IQR (CV) : 0 (7.6)
128 distinct values 0 (0.0%)
customer_type [character]
1. Contract
2. Group
3. Transient
4. Transient-Party
4076 ( 3.4% )
577 ( 0.5% )
89613 ( 75.1% )
25124 ( 21.0% )
0 (0.0%)
adr [numeric]
Mean (sd) : 101.8 (50.5)
min ≤ med ≤ max:
-6.4 ≤ 94.6 ≤ 5400
IQR (CV) : 56.7 (0.5)
8879 distinct values 0 (0.0%)
required_car_parking_spaces [numeric]
Mean (sd) : 0.1 (0.2)
min ≤ med ≤ max:
0 ≤ 0 ≤ 8
IQR (CV) : 0 (3.9)
0 : 111974 ( 93.8% )
1 : 7383 ( 6.2% )
2 : 28 ( 0.0% )
3 : 3 ( 0.0% )
8 : 2 ( 0.0% )
0 (0.0%)
total_of_special_requests [numeric]
Mean (sd) : 0.6 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 5
IQR (CV) : 1 (1.4)
0 : 70318 ( 58.9% )
1 : 33226 ( 27.8% )
2 : 12969 ( 10.9% )
3 : 2497 ( 2.1% )
4 : 340 ( 0.3% )
5 : 40 ( 0.0% )
0 (0.0%)
reservation_status [character]
1. Canceled
2. Check-Out
3. No-Show
43017 ( 36.0% )
75166 ( 63.0% )
1207 ( 1.0% )
0 (0.0%)
reservation_status_date [Date]
min : 2014-10-17
med : 2016-08-07
max : 2017-09-14
range : 2y 10m 28d
926 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-04-08

Tidy Data

Immediately, I noticed that arrival dates are separated into three columns (year, month, day). Let’s mutate these columns in order to graph it later on.

Code
#recode month characters to numeric
month_hotel<-raw_hotel%>%
  mutate(tidy_arrival_date_month = case_when(
  arrival_date_month == "January" ~ 1,
  arrival_date_month == "February" ~ 2,
  arrival_date_month == "March" ~ 3,
  arrival_date_month == "April" ~ 4,
  arrival_date_month == "May" ~ 5,
  arrival_date_month == "June" ~ 6,
  arrival_date_month == "July" ~ 7,
  arrival_date_month == "August" ~ 8,
  arrival_date_month == "September" ~ 9,
  arrival_date_month == "October" ~ 10,
  arrival_date_month == "November" ~ 11,
  arrival_date_month == "December" ~ 12)
)

#create new arrival date column
arrival_date_hotel<-month_hotel%>%
  mutate(arrival_date = make_date(arrival_date_year, tidy_arrival_date_month, arrival_date_day_of_month))

#create new total bookings per arrival date column
tidy_time_hotel<-arrival_date_hotel%>%
  mutate(month=floor_date(arrival_date,unit="month")) %>%
  group_by(arrival_date, hotel) %>%
  summarise(n=n()) %>%
  ungroup()

#pivot_longer adults, child, babies

people_hotel <-pivot_longer(raw_hotel, col = c("adults", "children", "babies"),
                 names_to="Person",
                 values_to = "Person Number",
                 values_drop_na = TRUE)

#arrival date sanity check
tidy_time_hotel
Code
#adult, child, baby sanity check
people_hotel

Time Dependent Visualization

Now that our data is tidy, I chose to graph the number of bookings per hotel type over time according to arrival date. I choose a line graph because it gives a granular perspective of each booking and date.

Code
time_hotel <- tidy_time_hotel %>%
  ggplot(aes(arrival_date,n,col=hotel))+
  geom_line()+
  scale_x_date(NULL, date_labels = "%b %y",breaks="2 months")+
  scale_y_continuous(limits=c(0,500))+
  labs(x="Arrival Date",y="# of bookings", title = "Hotel Booking Arrival Dates over Time")+
  theme(axis.text.x=element_text(angle=90))

time_hotel

Visualizing Part-Whole Relationships

I wanted to graph the breakdown of adult, child & babies for city vs. resort bookings below. I don’t think the quantities are accurately represented though.

Code
bar_hotel <- people_hotel %>%
ggplot(mapping = aes(x = hotel, fill = Person), position = "fill") + 
  geom_bar() +
   labs(x="Hotel Type",y="# of Bookings", title = "Occupants of Hotel Rooms by Type")

bar_hotel