Challenge_7

challenge_7
hotel_bookings
Visualizing Multiple Dimensions
Author

Xinyang Mao

Published

May 10, 2023

library(tidyverse)
library(ggplot2)
library(lubridate)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. mutate variables as needed (including sanity checks)
  4. Recreate at least two graphs from previous exercises, but introduce at least one additional dimension that you omitted before using ggplot functionality (color, shape, line, facet, etc) The goal is not to create unneeded chart ink (Tufte), but to concisely capture variation in additional dimensions that were collapsed in your earlier 2 or 3 dimensional graphs.
  • Explain why you choose the specific graph type
  1. If you haven’t tried in previous weeks, work this week to make your graphs “publication” ready with titles, captions, and pretty axis labels and other viewer-friendly features

R Graph Gallery is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code. And anyone not familiar with Edward Tufte should check out his fantastic books and courses on data visualizaton.

(be sure to only include the category tags for the data you use!)

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • eggs ⭐
  • abc_poll ⭐⭐
  • australian_marriage ⭐⭐
  • hotel_bookings ⭐⭐⭐
  • air_bnb ⭐⭐⭐
  • us_hh ⭐⭐⭐⭐
  • faostat ⭐⭐⭐⭐⭐
hotel_raw <- read.csv("_data/hotel_bookings.csv")
print(summarytools::dfSummary(hotel_raw,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

hotel_raw

Dimensions: 119390 x 32
Duplicates: 31994
Variable Stats / Values Freqs (% of Valid) Graph Missing
hotel [character]
1. City Hotel
2. Resort Hotel
79330 ( 66.4% )
40060 ( 33.6% )
0 (0.0%)
is_canceled [integer]
Min : 0
Mean : 0.4
Max : 1
0 : 75166 ( 63.0% )
1 : 44224 ( 37.0% )
0 (0.0%)
lead_time [integer]
Mean (sd) : 104 (106.9)
min ≤ med ≤ max:
0 ≤ 69 ≤ 737
IQR (CV) : 142 (1)
479 distinct values 0 (0.0%)
arrival_date_year [integer]
Mean (sd) : 2016.2 (0.7)
min ≤ med ≤ max:
2015 ≤ 2016 ≤ 2017
IQR (CV) : 1 (0)
2015 : 21996 ( 18.4% )
2016 : 56707 ( 47.5% )
2017 : 40687 ( 34.1% )
0 (0.0%)
arrival_date_month [character]
1. August
2. July
3. May
4. October
5. April
6. June
7. September
8. March
9. February
10. November
[ 2 others ]
13877 ( 11.6% )
12661 ( 10.6% )
11791 ( 9.9% )
11160 ( 9.3% )
11089 ( 9.3% )
10939 ( 9.2% )
10508 ( 8.8% )
9794 ( 8.2% )
8068 ( 6.8% )
6794 ( 5.7% )
12709 ( 10.6% )
0 (0.0%)
arrival_date_week_number [integer]
Mean (sd) : 27.2 (13.6)
min ≤ med ≤ max:
1 ≤ 28 ≤ 53
IQR (CV) : 22 (0.5)
53 distinct values 0 (0.0%)
arrival_date_day_of_month [integer]
Mean (sd) : 15.8 (8.8)
min ≤ med ≤ max:
1 ≤ 16 ≤ 31
IQR (CV) : 15 (0.6)
31 distinct values 0 (0.0%)
stays_in_weekend_nights [integer]
Mean (sd) : 0.9 (1)
min ≤ med ≤ max:
0 ≤ 1 ≤ 19
IQR (CV) : 2 (1.1)
17 distinct values 0 (0.0%)
stays_in_week_nights [integer]
Mean (sd) : 2.5 (1.9)
min ≤ med ≤ max:
0 ≤ 2 ≤ 50
IQR (CV) : 2 (0.8)
35 distinct values 0 (0.0%)
adults [integer]
Mean (sd) : 1.9 (0.6)
min ≤ med ≤ max:
0 ≤ 2 ≤ 55
IQR (CV) : 0 (0.3)
14 distinct values 0 (0.0%)
children [integer]
Mean (sd) : 0.1 (0.4)
min ≤ med ≤ max:
0 ≤ 0 ≤ 10
IQR (CV) : 0 (3.8)
0 : 110796 ( 92.8% )
1 : 4861 ( 4.1% )
2 : 3652 ( 3.1% )
3 : 76 ( 0.1% )
10 : 1 ( 0.0% )
4 (0.0%)
babies [integer]
Mean (sd) : 0 (0.1)
min ≤ med ≤ max:
0 ≤ 0 ≤ 10
IQR (CV) : 0 (12.3)
0 : 118473 ( 99.2% )
1 : 900 ( 0.8% )
2 : 15 ( 0.0% )
9 : 1 ( 0.0% )
10 : 1 ( 0.0% )
0 (0.0%)
meal [character]
1. BB
2. FB
3. HB
4. SC
5. Undefined
92310 ( 77.3% )
798 ( 0.7% )
14463 ( 12.1% )
10650 ( 8.9% )
1169 ( 1.0% )
0 (0.0%)
country [character]
1. PRT
2. GBR
3. FRA
4. ESP
5. DEU
6. ITA
7. IRL
8. BEL
9. BRA
10. NLD
[ 168 others ]
48590 ( 40.7% )
12129 ( 10.2% )
10415 ( 8.7% )
8568 ( 7.2% )
7287 ( 6.1% )
3766 ( 3.2% )
3375 ( 2.8% )
2342 ( 2.0% )
2224 ( 1.9% )
2104 ( 1.8% )
18590 ( 15.6% )
0 (0.0%)
market_segment [character]
1. Aviation
2. Complementary
3. Corporate
4. Direct
5. Groups
6. Offline TA/TO
7. Online TA
8. Undefined
237 ( 0.2% )
743 ( 0.6% )
5295 ( 4.4% )
12606 ( 10.6% )
19811 ( 16.6% )
24219 ( 20.3% )
56477 ( 47.3% )
2 ( 0.0% )
0 (0.0%)
distribution_channel [character]
1. Corporate
2. Direct
3. GDS
4. TA/TO
5. Undefined
6677 ( 5.6% )
14645 ( 12.3% )
193 ( 0.2% )
97870 ( 82.0% )
5 ( 0.0% )
0 (0.0%)
is_repeated_guest [integer]
Min : 0
Mean : 0
Max : 1
0 : 115580 ( 96.8% )
1 : 3810 ( 3.2% )
0 (0.0%)
previous_cancellations [integer]
Mean (sd) : 0.1 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 26
IQR (CV) : 0 (9.7)
15 distinct values 0 (0.0%)
previous_bookings_not_canceled [integer]
Mean (sd) : 0.1 (1.5)
min ≤ med ≤ max:
0 ≤ 0 ≤ 72
IQR (CV) : 0 (10.9)
73 distinct values 0 (0.0%)
reserved_room_type [character]
1. A
2. B
3. C
4. D
5. E
6. F
7. G
8. H
9. L
10. P
85994 ( 72.0% )
1118 ( 0.9% )
932 ( 0.8% )
19201 ( 16.1% )
6535 ( 5.5% )
2897 ( 2.4% )
2094 ( 1.8% )
601 ( 0.5% )
6 ( 0.0% )
12 ( 0.0% )
0 (0.0%)
assigned_room_type [character]
1. A
2. D
3. E
4. F
5. G
6. C
7. B
8. H
9. I
10. K
[ 2 others ]
74053 ( 62.0% )
25322 ( 21.2% )
7806 ( 6.5% )
3751 ( 3.1% )
2553 ( 2.1% )
2375 ( 2.0% )
2163 ( 1.8% )
712 ( 0.6% )
363 ( 0.3% )
279 ( 0.2% )
13 ( 0.0% )
0 (0.0%)
booking_changes [integer]
Mean (sd) : 0.2 (0.7)
min ≤ med ≤ max:
0 ≤ 0 ≤ 21
IQR (CV) : 0 (2.9)
21 distinct values 0 (0.0%)
deposit_type [character]
1. No Deposit
2. Non Refund
3. Refundable
104641 ( 87.6% )
14587 ( 12.2% )
162 ( 0.1% )
0 (0.0%)
agent [character]
1. 9
2. NULL
3. 240
4. 1
5. 14
6. 7
7. 6
8. 250
9. 241
10. 28
[ 324 others ]
31961 ( 26.8% )
16340 ( 13.7% )
13922 ( 11.7% )
7191 ( 6.0% )
3640 ( 3.0% )
3539 ( 3.0% )
3290 ( 2.8% )
2870 ( 2.4% )
1721 ( 1.4% )
1666 ( 1.4% )
33250 ( 27.8% )
0 (0.0%)
company [character]
1. NULL
2. 40
3. 223
4. 67
5. 45
6. 153
7. 174
8. 219
9. 281
10. 154
[ 343 others ]
112593 ( 94.3% )
927 ( 0.8% )
784 ( 0.7% )
267 ( 0.2% )
250 ( 0.2% )
215 ( 0.2% )
149 ( 0.1% )
141 ( 0.1% )
138 ( 0.1% )
133 ( 0.1% )
3793 ( 3.2% )
0 (0.0%)
days_in_waiting_list [integer]
Mean (sd) : 2.3 (17.6)
min ≤ med ≤ max:
0 ≤ 0 ≤ 391
IQR (CV) : 0 (7.6)
128 distinct values 0 (0.0%)
customer_type [character]
1. Contract
2. Group
3. Transient
4. Transient-Party
4076 ( 3.4% )
577 ( 0.5% )
89613 ( 75.1% )
25124 ( 21.0% )
0 (0.0%)
adr [numeric]
Mean (sd) : 101.8 (50.5)
min ≤ med ≤ max:
-6.4 ≤ 94.6 ≤ 5400
IQR (CV) : 56.7 (0.5)
8879 distinct values 0 (0.0%)
required_car_parking_spaces [integer]
Mean (sd) : 0.1 (0.2)
min ≤ med ≤ max:
0 ≤ 0 ≤ 8
IQR (CV) : 0 (3.9)
0 : 111974 ( 93.8% )
1 : 7383 ( 6.2% )
2 : 28 ( 0.0% )
3 : 3 ( 0.0% )
8 : 2 ( 0.0% )
0 (0.0%)
total_of_special_requests [integer]
Mean (sd) : 0.6 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 5
IQR (CV) : 1 (1.4)
0 : 70318 ( 58.9% )
1 : 33226 ( 27.8% )
2 : 12969 ( 10.9% )
3 : 2497 ( 2.1% )
4 : 340 ( 0.3% )
5 : 40 ( 0.0% )
0 (0.0%)
reservation_status [character]
1. Canceled
2. Check-Out
3. No-Show
43017 ( 36.0% )
75166 ( 63.0% )
1207 ( 1.0% )
0 (0.0%)
reservation_status_date [character]
1. 2015-10-21
2. 2015-07-06
3. 2016-11-25
4. 2015-01-01
5. 2016-01-18
6. 2015-07-02
7. 2016-12-07
8. 2015-12-18
9. 2016-02-09
10. 2016-04-04
[ 916 others ]
1461 ( 1.2% )
805 ( 0.7% )
790 ( 0.7% )
763 ( 0.6% )
625 ( 0.5% )
469 ( 0.4% )
450 ( 0.4% )
423 ( 0.4% )
412 ( 0.3% )
382 ( 0.3% )
112810 ( 94.5% )
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-05-22

Briefly describe the data

We can see there are lots of attributes in these dataset.There are 32 attributes and 119390 rows.For challenge 6 I mainly focus on date and different kind of customers.

str(hotel_raw)
'data.frame':   119390 obs. of  32 variables:
 $ hotel                         : chr  "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
 $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
 $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
 $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
 $ arrival_date_month            : chr  "July" "July" "July" "July" ...
 $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
 $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
 $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
 $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
 $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ meal                          : chr  "BB" "BB" "BB" "BB" ...
 $ country                       : chr  "PRT" "PRT" "GBR" "GBR" ...
 $ market_segment                : chr  "Direct" "Direct" "Direct" "Corporate" ...
 $ distribution_channel          : chr  "Direct" "Direct" "Direct" "Corporate" ...
 $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
 $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
 $ reserved_room_type            : chr  "C" "C" "A" "A" ...
 $ assigned_room_type            : chr  "C" "C" "C" "A" ...
 $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
 $ deposit_type                  : chr  "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
 $ agent                         : chr  "NULL" "NULL" "NULL" "304" ...
 $ company                       : chr  "NULL" "NULL" "NULL" "NULL" ...
 $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
 $ customer_type                 : chr  "Transient" "Transient" "Transient" "Transient" ...
 $ adr                           : num  0 0 75 75 98 ...
 $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
 $ reservation_status            : chr  "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
 $ reservation_status_date       : chr  "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...

Tidy Data

Firstly,I transfer string type of month to int for combine year and month,thus more convenient to observe the relationship between the total number of guests of each month and time.

hotel_raw<- na.omit(hotel_raw)
hotel <- hotel_raw%>%
  mutate(Month = match(arrival_date_month,month.name))
head(hotel$Month)
[1] 7 7 7 7 7 7

Combine year and month used to display the graph later.

hotel1 <- hotel%>%
  mutate(date = str_c(arrival_date_year,Month,sep = "-"))

check the result,it’s what I want!

head(hotel1$date)
[1] "2015-7" "2015-7" "2015-7" "2015-7" "2015-7" "2015-7"

Calculate the total number of guests for each day.

hotel_month <- hotel1%>%
  mutate(total_guests = rowSums(select(.,adults,children,babies),na.rm = TRUE))
summary(hotel_month$total_guests)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   2.000   2.000   1.968   2.000  55.000 

Time Dependent Visualization

According to the date column I processed previously,I grouped the rows by month and use summarise() to calculate the total guests number of each month from 2015 to 2017.

hotel_month<-hotel_month%>%
  group_by(date) %>%
  summarise(month_total = sum(total_guests)) %>%
  ungroup() %>%
  mutate(month_total = as.integer(month_total))
hotel_month
# A tibble: 26 × 2
   date    month_total
   <chr>         <int>
 1 2015-10        9084
 2 2015-11        3807
 3 2015-12        5470
 4 2015-7         5715
 5 2015-8         8031
 6 2015-9         9726
 7 2016-1         3835
 8 2016-10       11960
 9 2016-11        8005
10 2016-12        7726
# … with 16 more rows

Then draw a gragh for observing,we can see it was floated a lot,and the least crowded months are November 2015,Jan of both 2016 and 2017.

ggplot(hotel_month, aes(x = date, y = month_total,group = 1)) +
  geom_line() +
  theme(axis.text.x = element_text(size = 5 ))

Visualizing Part-Whole Relationships

In challenge 6, I explored the relationship between months and different type of guests. Firstly I calculated the total monthly amount of different kinds of guests separately.

hotel_type<-hotel1%>%
  group_by(Month) %>%
  summarise(month_child = sum(children),month_babies = sum(babies),month_adults = sum(adults)) %>%
  ungroup() %>%
  mutate(month_child = as.integer(month_child),month_babies = as.integer(month_babies),month_adults = as.integer(month_adults),Month = month(Month))


hotel_type
# A tibble: 12 × 4
   Month month_child month_babies month_adults
   <int>       <int>        <int>        <int>
 1     1         452           48        10024
 2     2         790           59        14450
 3     3         700           57        17675
 4     4        1141           53        20806
 5     5         845           72        21539
 6     6        1057           71        20353
 7     7        2322          121        25164
 8     8        2780          196        27786
 9     9         598           87        19681
10    10         703           62        20279
11    11         279           45        11488
12    12         736           78        12382

Then I drawed lines for each kind of guests:adults,child and babies. I find that there are many more adult guests than children and babies!And almost all year round at a very high level except from November to January it may means people will travel less in winter. For children,although the number of children is very low,but what is interesting is that during the summer holidays (June to September) there is a significant increase.It can be seen that children usually go out in the summer holidays. As for babies, the number of hotel reservations has remained at a very low number.

ggplot(hotel_type, aes(Month,month_child,group = 1,col = "child")) +
  geom_line() +
  geom_line(aes(Month,month_babies, group = 1,col = "babies")) +
  geom_line(aes(Month,month_adults, group = 1,col ="adults")) +
  scale_x_continuous(breaks = seq(1,12,1)) +
  scale_color_manual(values = c("red","blue","green"))

  theme(axis.text.x = element_text(size = 2 ))
List of 1
 $ axis.text.x:List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : num 2
  ..$ hjust        : NULL
  ..$ vjust        : NULL
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : NULL
  ..$ debug        : NULL
  ..$ inherit.blank: logi FALSE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 - attr(*, "class")= chr [1:2] "theme" "gg"
 - attr(*, "complete")= logi FALSE
 - attr(*, "validate")= logi TRUE

Visualization with Multiple Dimensions

In challenge7,I’ll split the data base on distribution channel attribute. Firstly,to make the drawing easier,I’ll change the date type from chr to date.

hotel_date <- hotel1
hotel_date$date <- as.Date(paste0(hotel_date$date,'-1'),format='%Y-%m-%d')
head(hotel_date$date)
[1] "2015-07-01" "2015-07-01" "2015-07-01" "2015-07-01" "2015-07-01"
[6] "2015-07-01"

Count the amount of each distribution channel type.

count(hotel_date,distribution_channel)
  distribution_channel     n
1            Corporate  6677
2               Direct 14645
3                  GDS   193
4                TA/TO 97870
5            Undefined     1

We can see there are 5 types of distribution channel,I’d like to recode them as a new attribute channel.

bookings_channel <- hotel_date %>%
  mutate(channel=recode(distribution_channel,
                     Cor="Corporate",
                     Dir="Direct",
                     GDS="GDS",
                     TT="TA/TO"),
                     Un="Undefined",
         across(c(hotel, channel),as.factor)) %>%
  count(date, hotel, channel,.drop=F)
head(bookings_channel)
        date        hotel   channel    n
1 2015-07-01   City Hotel Corporate    0
2 2015-07-01   City Hotel    Direct   27
3 2015-07-01   City Hotel       GDS    0
4 2015-07-01   City Hotel     TA/TO 1371
5 2015-07-01   City Hotel Undefined    0
6 2015-07-01 Resort Hotel Corporate   22

Now we can see the comparison of these 5 different channels in these 2 different hotel types.

ggplot(bookings_channel,aes(date, n, col=channel))+
  geom_line()+
  facet_wrap(vars(hotel))+
  scale_x_date()+
  theme(axis.text.x=element_text(angle=90))