library(tidyverse)
library(lubridate)
library(ggplot2)
library(treemap)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Challenge 6 Solutions
Challenge Overview
Working with the Hotel Bookings Dataset
Read in data
# Reading in the CSV data
<- read_csv("_data/hotel_bookings.csv", show_col_types = FALSE)
h_book
h_book
# A tibble: 119,390 × 32
hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Resor… 0 342 2015 July 27 1 0 0 2
2 Resor… 0 737 2015 July 27 1 0 0 2
3 Resor… 0 7 2015 July 27 1 0 1 1
4 Resor… 0 13 2015 July 27 1 0 1 1
5 Resor… 0 14 2015 July 27 1 0 2 2
6 Resor… 0 14 2015 July 27 1 0 2 2
7 Resor… 0 0 2015 July 27 1 0 2 2
8 Resor… 0 9 2015 July 27 1 0 2 2
9 Resor… 1 85 2015 July 27 1 0 3 2
10 Resor… 1 75 2015 July 27 1 0 3 2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
# meal <chr>, country <chr>, market_segment <chr>,
# distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
Briefly describe the data
The following dataset has 119390 observations and 32 different fields. This is basically the data from two hotels - City Hotel & Resort Hotel. Each entry corresponds to a booking made by a customer. There are various information that are recorded with each booking. It includes the arrival date, number of days of stay, meal, type of room reserved, customer type, reservation status, number of adult/children and many more. The data consists of entries from countries all over the world.
Tidy Data (as needed)
# To check the number of unique values in all the columns
rapply(h_book,function(x)length(unique(x)))
hotel is_canceled
2 2
lead_time arrival_date_year
479 3
arrival_date_month arrival_date_week_number
12 53
arrival_date_day_of_month stays_in_weekend_nights
31 17
stays_in_week_nights adults
35 14
children babies
6 5
meal country
5 178
market_segment distribution_channel
8 5
is_repeated_guest previous_cancellations
2 15
previous_bookings_not_canceled reserved_room_type
73 10
assigned_room_type booking_changes
12 21
deposit_type agent
3 334
company days_in_waiting_list
353 128
customer_type adr
4 8879
required_car_parking_spaces total_of_special_requests
5 6
reservation_status reservation_status_date
3 926
# To check the unique values of hotel
unique(h_book$hotel)
[1] "Resort Hotel" "City Hotel"
# in country we can see that there is a field call NULL that can be removed
table(h_book$country)
ABW AGO AIA ALB AND ARE ARG ARM ASM ATA ATF AUS AUT
2 362 1 12 7 51 214 8 1 2 1 426 1263
AZE BDI BEL BEN BFA BGD BGR BHR BHS BIH BLR BOL BRA
17 1 2342 3 1 12 75 5 1 13 26 10 2224
BRB BWA CAF CHE CHL CHN CIV CMR CN COL COM CPV CRI
4 1 5 1730 65 999 6 10 1279 71 2 24 19
CUB CYM CYP CZE DEU DJI DMA DNK DOM DZA ECU EGY ESP
8 1 51 171 7287 1 1 435 14 103 27 32 8568
EST ETH FIN FJI FRA FRO GAB GBR GEO GGY GHA GIB GLP
83 3 447 1 10415 5 4 12129 22 3 4 18 2
GNB GRC GTM GUY HKG HND HRV HUN IDN IMN IND IRL IRN
9 128 4 1 29 1 100 230 35 2 152 3375 83
IRQ ISL ISR ITA JAM JEY JOR JPN KAZ KEN KHM KIR KNA
14 57 669 3766 6 8 21 197 19 6 2 1 2
KOR KWT LAO LBN LBY LCA LIE LKA LTU LUX LVA MAC MAR
133 16 2 31 8 1 3 7 81 287 55 16 259
MCO MDG MDV MEX MKD MLI MLT MMR MNE MOZ MRT MUS MWI
4 1 12 85 10 1 18 1 5 67 1 7 2
MYS MYT NAM NCL NGA NIC NLD NOR NPL NULL NZL OMN PAK
28 2 1 1 34 1 2104 607 1 488 74 18 14
PAN PER PHL PLW POL PRI PRT PRY PYF QAT ROU RUS RWA
9 29 40 1 919 12 48590 4 1 15 500 632 2
SAU SDN SEN SGP SLE SLV SMR SRB STP SUR SVK SVN SWE
48 1 11 39 1 2 1 101 2 5 65 57 1024
SYC SYR TGO THA TJK TMP TUN TUR TWN TZA UGA UKR UMI
2 3 2 59 9 3 39 248 51 5 2 68 1
URY USA UZB VEN VGB VNM ZAF ZMB ZWE
32 2097 4 26 1 8 80 2 4
<- h_book %>%
h_book filter(!(country == "NULL"))
In country, there are entries called NULL, which can be removed as it may not be useful for the analysis.
# to check the different types of the fields
head(h_book)
# A tibble: 6 × 32
hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Resort… 0 342 2015 July 27 1 0 0 2
2 Resort… 0 737 2015 July 27 1 0 0 2
3 Resort… 0 7 2015 July 27 1 0 1 1
4 Resort… 0 13 2015 July 27 1 0 1 1
5 Resort… 0 14 2015 July 27 1 0 2 2
6 Resort… 0 14 2015 July 27 1 0 2 2
# … with 22 more variables: children <dbl>, babies <dbl>, meal <chr>,
# country <chr>, market_segment <chr>, distribution_channel <chr>,
# is_repeated_guest <dbl>, previous_cancellations <dbl>,
# previous_bookings_not_canceled <dbl>, reserved_room_type <chr>,
# assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>,
# agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
sapply(h_book, class)
hotel is_canceled
"character" "numeric"
lead_time arrival_date_year
"numeric" "numeric"
arrival_date_month arrival_date_week_number
"character" "numeric"
arrival_date_day_of_month stays_in_weekend_nights
"numeric" "numeric"
stays_in_week_nights adults
"numeric" "numeric"
children babies
"numeric" "numeric"
meal country
"character" "character"
market_segment distribution_channel
"character" "character"
is_repeated_guest previous_cancellations
"numeric" "numeric"
previous_bookings_not_canceled reserved_room_type
"numeric" "character"
assigned_room_type booking_changes
"character" "numeric"
deposit_type agent
"character" "character"
company days_in_waiting_list
"character" "numeric"
customer_type adr
"character" "numeric"
required_car_parking_spaces total_of_special_requests
"numeric" "numeric"
reservation_status reservation_status_date
"character" "Date"
From the above analysis, we can see that two fields, Agents and Company have numerical values in them, but have the datatype marked as character. These NULL entries can be changed to NA and the datatype can be changed to numeric. Also, the arrival date in year, month and date can be combined into a single field called as arrival date.
# combining the arrival date into a single field
# to find the total number of guests in the hotel - add adults, children and babies
<- h_book %>%
h_book_mut mutate(arrival_date = str_c(arrival_date_day_of_month,
arrival_date_month,sep="/"),
arrival_date_year, arrival_date = dmy(arrival_date),
total_guests = adults + children + babies) %>%
select(-c(arrival_date_day_of_month,arrival_date_month,arrival_date_year))
h_book_mut
# A tibble: 118,902 × 31
hotel is_ca…¹ lead_…² arriv…³ stays…⁴ stays…⁵ adults child…⁶ babies meal
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Resort H… 0 342 27 0 0 2 0 0 BB
2 Resort H… 0 737 27 0 0 2 0 0 BB
3 Resort H… 0 7 27 0 1 1 0 0 BB
4 Resort H… 0 13 27 0 1 1 0 0 BB
5 Resort H… 0 14 27 0 2 2 0 0 BB
6 Resort H… 0 14 27 0 2 2 0 0 BB
7 Resort H… 0 0 27 0 2 2 0 0 BB
8 Resort H… 0 9 27 0 2 2 0 0 FB
9 Resort H… 1 85 27 0 3 2 0 0 BB
10 Resort H… 1 75 27 0 3 2 0 0 HB
# … with 118,892 more rows, 21 more variables: country <chr>,
# market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
# total_of_special_requests <dbl>, reservation_status <chr>, …
Changing the datatype of agent and company from char to numeric
# mutating the datatype of the agent and company field from char to numeric
<- h_book_mut %>%
h_book_mut mutate(across(c(agent, company),~ replace(.,str_detect(., "NULL"), NA))) %>% mutate_at(vars(agent, company),as.numeric)
is.numeric(h_book_mut$agent)
[1] TRUE
is.numeric(h_book_mut$company)
[1] TRUE
Time Dependent Visualization
For the time dependent visualization I want to see how many guests are arriving over a period of time on a daily basis.
<- h_book_mut %>% select(total_guests, arrival_date) %>%
plot1_data group_by(arrival_date) %>%
summarise(net_guests = sum(total_guests, na.rm=TRUE))
plot1_data
# A tibble: 793 × 2
arrival_date net_guests
<date> <dbl>
1 2015-07-01 220
2 2015-07-02 190
3 2015-07-03 116
4 2015-07-04 181
5 2015-07-05 113
6 2015-07-06 152
7 2015-07-07 114
8 2015-07-08 139
9 2015-07-09 166
10 2015-07-10 118
# … with 783 more rows
# finding the date ranges of the arrival date of this data
summary(plot1_data$arrival_date)
Min. 1st Qu. Median Mean 3rd Qu. Max.
"2015-07-01" "2016-01-15" "2016-07-31" "2016-07-31" "2017-02-14" "2017-08-31"
From above, we can see that the arrival dates lie between July 2015 - August 2017.
<- ggplot(plot1_data, aes(x=arrival_date, y=net_guests)) + ggtitle("Guest Count Vs. Arrival Date Trend: Jul 2015 - Jun 2016") +
p geom_line(color="maroon") +
xlab("") + theme(axis.text.x=element_text(angle=60, hjust=1)) +
scale_x_date(date_labels = "%Y %b %d", date_minor_breaks = "1 month", limit=c(as.Date("2015-07-01"),as.Date("2016-07-01")))
p
Here, I am visualizing the trends for a period of one year from Jul 2015 - Jul 2016 by using a line graph as it can precisely be used to see trends for time series data. In the above visualization we can see the total guests who have arrived on a particular day. Furthermore, we can see the monthly breaks as well. The influx of guests peaked during the first week of December 2015 and dipped sharply the following week. During the months of summer i.e. April, May, June, the number of incoming guests seem to be consistent. This can be because of the vacation that many families might take during the summer break.
Visualizing Part-Whole Relationships
For visualizing part-whole relationship, I want to see the proportion of guests who have chosen different categories for meal. Firstly, I want need to group-by to get the count of each meal category.
unique(h_book_mut$meal)
[1] "BB" "FB" "HB" "SC" "Undefined"
There are 5 meal categories here. BB - ‘Bed & Breakfast’ FB - ‘Full Board’ - Breakfast, Lunch, Dinner HB - ‘Half Board’ - Breakfast & Dinner SC - ‘Self-Catering’ - No meals included Undefined - which can be ignored
I want to gather arrival date data of one year from Jul 2015 to Jun 2016 and exclude reservation status which says canceled.
<- h_book_mut %>%
plot2_data filter(reservation_status != 'Canceled', arrival_date >= as.Date("2015-07-01") & arrival_date < as.Date("2016-07-01")) %>%
select(meal) %>%
group_by(meal) %>%
summarise(total_count = n(), .groups = 'drop') %>%
filter(meal != "Undefined")
plot2_data
# A tibble: 4 × 2
meal total_count
<chr> <int>
1 BB 25353
2 FB 210
3 HB 4170
4 SC 1827
Plotting the above data using a tree map.
treemap(plot2_data,
index="meal",
vSize="total_count",
type="index",
title = "Meal Cateogries :Jul 2015 - Jun 2016",
palette = "Set3",
border.lwds = 1.5
)
The most preferred meal scheme was BB, followed by HB, SC and finally FB. The reason for choosing a treemap is to give an easy overview of which category is the most preferred in the meal type. This can help with the initial analysis and we can further dive deep into why BB is preferred most, in which countries and by which demography. Maybe guests who have children tend to choose that option or guests who are just staying only for a night, etc.