library(tidyverse)
library(ggplot2)
library(ggrepel)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Challenge 7 : Visualizing multiple dimensions in Hotel Bookings
Challenge Overview
Today’s challenge is to:
- read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data (as needed, including sanity checks)
- mutate variables as needed (including sanity checks)
- Recreate at least two graphs from previous exercises, but introduce at least one additional dimension that you omitted before using ggplot functionality (color, shape, line, facet, etc) The goal is not to create unneeded chart ink (Tufte), but to concisely capture variation in additional dimensions that were collapsed in your earlier 2 or 3 dimensional graphs.
- Explain why you choose the specific graph type
- If you haven’t tried in previous weeks, work this week to make your graphs “publication” ready with titles, captions, and pretty axis labels and other viewer-friendly features
R Graph Gallery is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code. And anyone not familiar with Edward Tufte should check out his fantastic books and courses on data visualizaton.
(be sure to only include the category tags for the data you use!)
Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- hotel_bookings ⭐⭐⭐
<- read.csv("_data/hotel_bookings.csv")
hotel_data hotel_data
Briefly describe the data
As seen in challenge 6. (Code directly used from Challenge 6)
%>% colnames() hotel_data
[1] "hotel" "is_canceled"
[3] "lead_time" "arrival_date_year"
[5] "arrival_date_month" "arrival_date_week_number"
[7] "arrival_date_day_of_month" "stays_in_weekend_nights"
[9] "stays_in_week_nights" "adults"
[11] "children" "babies"
[13] "meal" "country"
[15] "market_segment" "distribution_channel"
[17] "is_repeated_guest" "previous_cancellations"
[19] "previous_bookings_not_canceled" "reserved_room_type"
[21] "assigned_room_type" "booking_changes"
[23] "deposit_type" "agent"
[25] "company" "days_in_waiting_list"
[27] "customer_type" "adr"
[29] "required_car_parking_spaces" "total_of_special_requests"
[31] "reservation_status" "reservation_status_date"
We see that there are 32 variables in the dataset. Lets have a look at the summary for each variable of the dataset below.
%>% summary() hotel_data
hotel is_canceled lead_time arrival_date_year
Length:119390 Min. :0.0000 Min. : 0 Min. :2015
Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
Mode :character Median :0.0000 Median : 69 Median :2016
Mean :0.3704 Mean :104 Mean :2016
3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
Max. :1.0000 Max. :737 Max. :2017
arrival_date_month arrival_date_week_number arrival_date_day_of_month
Length:119390 Min. : 1.00 Min. : 1.0
Class :character 1st Qu.:16.00 1st Qu.: 8.0
Mode :character Median :28.00 Median :16.0
Mean :27.17 Mean :15.8
3rd Qu.:38.00 3rd Qu.:23.0
Max. :53.00 Max. :31.0
stays_in_weekend_nights stays_in_week_nights adults
Min. : 0.0000 Min. : 0.0 Min. : 0.000
1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
Median : 1.0000 Median : 2.0 Median : 2.000
Mean : 0.9276 Mean : 2.5 Mean : 1.856
3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
Max. :19.0000 Max. :50.0 Max. :55.000
children babies meal country
Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
Median : 0.0000 Median : 0.000000 Mode :character Mode :character
Mean : 0.1039 Mean : 0.007949
3rd Qu.: 0.0000 3rd Qu.: 0.000000
Max. :10.0000 Max. :10.000000
NA's :4
market_segment distribution_channel is_repeated_guest
Length:119390 Length:119390 Min. :0.00000
Class :character Class :character 1st Qu.:0.00000
Mode :character Mode :character Median :0.00000
Mean :0.03191
3rd Qu.:0.00000
Max. :1.00000
previous_cancellations previous_bookings_not_canceled reserved_room_type
Min. : 0.00000 Min. : 0.0000 Length:119390
1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
Median : 0.00000 Median : 0.0000 Mode :character
Mean : 0.08712 Mean : 0.1371
3rd Qu.: 0.00000 3rd Qu.: 0.0000
Max. :26.00000 Max. :72.0000
assigned_room_type booking_changes deposit_type agent
Length:119390 Min. : 0.0000 Length:119390 Length:119390
Class :character 1st Qu.: 0.0000 Class :character Class :character
Mode :character Median : 0.0000 Mode :character Mode :character
Mean : 0.2211
3rd Qu.: 0.0000
Max. :21.0000
company days_in_waiting_list customer_type adr
Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
Mode :character Median : 0.000 Mode :character Median : 94.58
Mean : 2.321 Mean : 101.83
3rd Qu.: 0.000 3rd Qu.: 126.00
Max. :391.000 Max. :5400.00
required_car_parking_spaces total_of_special_requests reservation_status
Min. :0.00000 Min. :0.0000 Length:119390
1st Qu.:0.00000 1st Qu.:0.0000 Class :character
Median :0.00000 Median :0.0000 Mode :character
Mean :0.06252 Mean :0.5714
3rd Qu.:0.00000 3rd Qu.:1.0000
Max. :8.00000 Max. :5.0000
reservation_status_date
Length:119390
Class :character
Mode :character
Lets see the number of unique hotels and countries
$hotel %>% unique() hotel_data
[1] "Resort Hotel" "City Hotel"
There are two unique types of hotels
$country %>% unique() hotel_data
[1] "PRT" "GBR" "USA" "ESP" "IRL" "FRA" "NULL" "ROU" "NOR" "OMN"
[11] "ARG" "POL" "DEU" "BEL" "CHE" "CN" "GRC" "ITA" "NLD" "DNK"
[21] "RUS" "SWE" "AUS" "EST" "CZE" "BRA" "FIN" "MOZ" "BWA" "LUX"
[31] "SVN" "ALB" "IND" "CHN" "MEX" "MAR" "UKR" "SMR" "LVA" "PRI"
[41] "SRB" "CHL" "AUT" "BLR" "LTU" "TUR" "ZAF" "AGO" "ISR" "CYM"
[51] "ZMB" "CPV" "ZWE" "DZA" "KOR" "CRI" "HUN" "ARE" "TUN" "JAM"
[61] "HRV" "HKG" "IRN" "GEO" "AND" "GIB" "URY" "JEY" "CAF" "CYP"
[71] "COL" "GGY" "KWT" "NGA" "MDV" "VEN" "SVK" "FJI" "KAZ" "PAK"
[81] "IDN" "LBN" "PHL" "SEN" "SYC" "AZE" "BHR" "NZL" "THA" "DOM"
[91] "MKD" "MYS" "ARM" "JPN" "LKA" "CUB" "CMR" "BIH" "MUS" "COM"
[101] "SUR" "UGA" "BGR" "CIV" "JOR" "SYR" "SGP" "BDI" "SAU" "VNM"
[111] "PLW" "QAT" "EGY" "PER" "MLT" "MWI" "ECU" "MDG" "ISL" "UZB"
[121] "NPL" "BHS" "MAC" "TGO" "TWN" "DJI" "STP" "KNA" "ETH" "IRQ"
[131] "HND" "RWA" "KHM" "MCO" "BGD" "IMN" "TJK" "NIC" "BEN" "VGB"
[141] "TZA" "GAB" "GHA" "TMP" "GLP" "KEN" "LIE" "GNB" "MNE" "UMI"
[151] "MYT" "FRO" "MMR" "PAN" "BFA" "LBY" "MLI" "NAM" "BOL" "PRY"
[161] "BRB" "ABW" "AIA" "SLV" "DMA" "PYF" "GUY" "LCA" "ATA" "GTM"
[171] "ASM" "MRT" "NCL" "KIR" "SDN" "ATF" "SLE" "LAO"
And the data describes values from 178 countries
Tidy Data (as needed)
Like challenge 6, we start by combine all start date related fields to create a singular date field. (Code directly used from Challenge 6)
<-hotel_data%>%
hotel_mutatedmutate(date_arrival = str_c(arrival_date_day_of_month,
arrival_date_month,sep="/"),
arrival_date_year, date_arrival = dmy(date_arrival))%>%
select(-c("arrival_date_day_of_month", "arrival_date_year", "arrival_date_week_number"))
hotel_mutated
Visualization with Multiple Dimensions
In challenge 6, we visulaized the number of bookings made in each month. Here we try to find the count of babies that arrive each month, while introducing variables like
<- hotel_data %>%
hotels_babybookings_bymonth_pre group_by(arrival_date_month, arrival_date_year, hotel) %>%
summarise(sum_babies = sum(babies))
<- hotels_babybookings_bymonth_pre %>% mutate(
hotels_babybookings_bymonth arrival_date_month = factor(arrival_date_month, levels = month.name)
%>%
) arrange(arrival_date_month)
hotels_babybookings_bymonth
We first visualize the nu,ber of babies arriving in a given month. We use a histogram as it is an excellent way of depicting count. We fill the histogram based on the hotel type hence introducing another dimension in our visualization:
ggplot(hotels_babybookings_bymonth) + geom_bar(aes(x=arrival_date_month, y=sum_babies, fill=(hotel)), stat="identity") + scale_x_discrete(breaks = c('January', 'April', 'July', 'October')) + labs(y= "Total Babies", x = "Month")
Next, we try to introduce one more dimension, which is arrival_date_year. We use a line plot here. As we can see, we have two plots, for the two hotels. For each month, we see multiple datat points that correspond to the different years
ggplot(hotels_babybookings_bymonth, aes(arrival_date_month, sum_babies, col=hotel)) + facet_wrap(vars(hotel)) + scale_x_discrete(breaks = c('January', 'April', 'July', 'October')) + geom_line() + geom_point() + geom_text_repel(size=1.5, aes(label = arrival_date_year)) + labs(y= "Total Babies", x = "Month")
Next we try to visualize a different set of dimensions: total number of bookings, total number of adults and total number of children for a given month. This type of visualization is useful when a user wants to compare the three “totals” we mentioned earlier. We use a line to represent each of the three “totals” and also divide the data by hotel type.
<- hotel_data %>%
hotels_babyadultratio_bymonth_pre group_by(arrival_date_month, hotel) %>%
summarise(total_babies = sum(babies), total_adults = sum(adults), bookings = n(), .groups = 'drop')
<- hotels_babyadultratio_bymonth_pre %>% mutate(
hotels_babyadultratio_bymonth arrival_date_month = factor(arrival_date_month, levels = month.name)
%>%
) arrange(arrival_date_month)
<- hotels_babyadultratio_bymonth %>% pivot_longer(cols=c(total_babies, total_adults, bookings), names_to = 'sum', values_to = 'value')
hotels_babyadultratio_bymonth
hotels_babyadultratio_bymonth
ggplot(hotels_babyadultratio_bymonth, aes(arrival_date_month, value, color = sum, group=sum)) +
facet_wrap(vars(hotel)) +
geom_point() +
geom_line() +
scale_x_discrete(breaks = c('January', 'April', 'July', 'October')) +
labs(y= "Value", x = "Month", color ="Totals")