Challenge_6

challenge_6
hotel_bookings
Visualizing Time and Relationships
Author

Xinyang Mao

Published

April 19, 2023

library(tidyverse)
library(ggplot2)
library(lubridate)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. mutate variables as needed (including sanity checks)
  4. create at least one graph including time (evolution)
  • try to make them “publication” ready (optional)
  • Explain why you choose the specific graph type
  1. Create at least one graph depicting part-whole or flow relationships
  • try to make them “publication” ready (optional)
  • Explain why you choose the specific graph type

R Graph Gallery is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code.

(be sure to only include the category tags for the data you use!)

Read in data

Read in one (or more) of the following datasets, using the correct R package and command. - hotel_bookings ⭐⭐⭐⭐

hotel_raw <- read.csv("_data/hotel_bookings.csv")
print(summarytools::dfSummary(hotel_raw,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

hotel_raw

Dimensions: 119390 x 32
Duplicates: 31994
Variable Stats / Values Freqs (% of Valid) Graph Missing
hotel [character]
1. City Hotel
2. Resort Hotel
79330 ( 66.4% )
40060 ( 33.6% )
0 (0.0%)
is_canceled [integer]
Min : 0
Mean : 0.4
Max : 1
0 : 75166 ( 63.0% )
1 : 44224 ( 37.0% )
0 (0.0%)
lead_time [integer]
Mean (sd) : 104 (106.9)
min ≤ med ≤ max:
0 ≤ 69 ≤ 737
IQR (CV) : 142 (1)
479 distinct values 0 (0.0%)
arrival_date_year [integer]
Mean (sd) : 2016.2 (0.7)
min ≤ med ≤ max:
2015 ≤ 2016 ≤ 2017
IQR (CV) : 1 (0)
2015 : 21996 ( 18.4% )
2016 : 56707 ( 47.5% )
2017 : 40687 ( 34.1% )
0 (0.0%)
arrival_date_month [character]
1. August
2. July
3. May
4. October
5. April
6. June
7. September
8. March
9. February
10. November
[ 2 others ]
13877 ( 11.6% )
12661 ( 10.6% )
11791 ( 9.9% )
11160 ( 9.3% )
11089 ( 9.3% )
10939 ( 9.2% )
10508 ( 8.8% )
9794 ( 8.2% )
8068 ( 6.8% )
6794 ( 5.7% )
12709 ( 10.6% )
0 (0.0%)
arrival_date_week_number [integer]
Mean (sd) : 27.2 (13.6)
min ≤ med ≤ max:
1 ≤ 28 ≤ 53
IQR (CV) : 22 (0.5)
53 distinct values 0 (0.0%)
arrival_date_day_of_month [integer]
Mean (sd) : 15.8 (8.8)
min ≤ med ≤ max:
1 ≤ 16 ≤ 31
IQR (CV) : 15 (0.6)
31 distinct values 0 (0.0%)
stays_in_weekend_nights [integer]
Mean (sd) : 0.9 (1)
min ≤ med ≤ max:
0 ≤ 1 ≤ 19
IQR (CV) : 2 (1.1)
17 distinct values 0 (0.0%)
stays_in_week_nights [integer]
Mean (sd) : 2.5 (1.9)
min ≤ med ≤ max:
0 ≤ 2 ≤ 50
IQR (CV) : 2 (0.8)
35 distinct values 0 (0.0%)
adults [integer]
Mean (sd) : 1.9 (0.6)
min ≤ med ≤ max:
0 ≤ 2 ≤ 55
IQR (CV) : 0 (0.3)
14 distinct values 0 (0.0%)
children [integer]
Mean (sd) : 0.1 (0.4)
min ≤ med ≤ max:
0 ≤ 0 ≤ 10
IQR (CV) : 0 (3.8)
0 : 110796 ( 92.8% )
1 : 4861 ( 4.1% )
2 : 3652 ( 3.1% )
3 : 76 ( 0.1% )
10 : 1 ( 0.0% )
4 (0.0%)
babies [integer]
Mean (sd) : 0 (0.1)
min ≤ med ≤ max:
0 ≤ 0 ≤ 10
IQR (CV) : 0 (12.3)
0 : 118473 ( 99.2% )
1 : 900 ( 0.8% )
2 : 15 ( 0.0% )
9 : 1 ( 0.0% )
10 : 1 ( 0.0% )
0 (0.0%)
meal [character]
1. BB
2. FB
3. HB
4. SC
5. Undefined
92310 ( 77.3% )
798 ( 0.7% )
14463 ( 12.1% )
10650 ( 8.9% )
1169 ( 1.0% )
0 (0.0%)
country [character]
1. PRT
2. GBR
3. FRA
4. ESP
5. DEU
6. ITA
7. IRL
8. BEL
9. BRA
10. NLD
[ 168 others ]
48590 ( 40.7% )
12129 ( 10.2% )
10415 ( 8.7% )
8568 ( 7.2% )
7287 ( 6.1% )
3766 ( 3.2% )
3375 ( 2.8% )
2342 ( 2.0% )
2224 ( 1.9% )
2104 ( 1.8% )
18590 ( 15.6% )
0 (0.0%)
market_segment [character]
1. Aviation
2. Complementary
3. Corporate
4. Direct
5. Groups
6. Offline TA/TO
7. Online TA
8. Undefined
237 ( 0.2% )
743 ( 0.6% )
5295 ( 4.4% )
12606 ( 10.6% )
19811 ( 16.6% )
24219 ( 20.3% )
56477 ( 47.3% )
2 ( 0.0% )
0 (0.0%)
distribution_channel [character]
1. Corporate
2. Direct
3. GDS
4. TA/TO
5. Undefined
6677 ( 5.6% )
14645 ( 12.3% )
193 ( 0.2% )
97870 ( 82.0% )
5 ( 0.0% )
0 (0.0%)
is_repeated_guest [integer]
Min : 0
Mean : 0
Max : 1
0 : 115580 ( 96.8% )
1 : 3810 ( 3.2% )
0 (0.0%)
previous_cancellations [integer]
Mean (sd) : 0.1 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 26
IQR (CV) : 0 (9.7)
15 distinct values 0 (0.0%)
previous_bookings_not_canceled [integer]
Mean (sd) : 0.1 (1.5)
min ≤ med ≤ max:
0 ≤ 0 ≤ 72
IQR (CV) : 0 (10.9)
73 distinct values 0 (0.0%)
reserved_room_type [character]
1. A
2. B
3. C
4. D
5. E
6. F
7. G
8. H
9. L
10. P
85994 ( 72.0% )
1118 ( 0.9% )
932 ( 0.8% )
19201 ( 16.1% )
6535 ( 5.5% )
2897 ( 2.4% )
2094 ( 1.8% )
601 ( 0.5% )
6 ( 0.0% )
12 ( 0.0% )
0 (0.0%)
assigned_room_type [character]
1. A
2. D
3. E
4. F
5. G
6. C
7. B
8. H
9. I
10. K
[ 2 others ]
74053 ( 62.0% )
25322 ( 21.2% )
7806 ( 6.5% )
3751 ( 3.1% )
2553 ( 2.1% )
2375 ( 2.0% )
2163 ( 1.8% )
712 ( 0.6% )
363 ( 0.3% )
279 ( 0.2% )
13 ( 0.0% )
0 (0.0%)
booking_changes [integer]
Mean (sd) : 0.2 (0.7)
min ≤ med ≤ max:
0 ≤ 0 ≤ 21
IQR (CV) : 0 (2.9)
21 distinct values 0 (0.0%)
deposit_type [character]
1. No Deposit
2. Non Refund
3. Refundable
104641 ( 87.6% )
14587 ( 12.2% )
162 ( 0.1% )
0 (0.0%)
agent [character]
1. 9
2. NULL
3. 240
4. 1
5. 14
6. 7
7. 6
8. 250
9. 241
10. 28
[ 324 others ]
31961 ( 26.8% )
16340 ( 13.7% )
13922 ( 11.7% )
7191 ( 6.0% )
3640 ( 3.0% )
3539 ( 3.0% )
3290 ( 2.8% )
2870 ( 2.4% )
1721 ( 1.4% )
1666 ( 1.4% )
33250 ( 27.8% )
0 (0.0%)
company [character]
1. NULL
2. 40
3. 223
4. 67
5. 45
6. 153
7. 174
8. 219
9. 281
10. 154
[ 343 others ]
112593 ( 94.3% )
927 ( 0.8% )
784 ( 0.7% )
267 ( 0.2% )
250 ( 0.2% )
215 ( 0.2% )
149 ( 0.1% )
141 ( 0.1% )
138 ( 0.1% )
133 ( 0.1% )
3793 ( 3.2% )
0 (0.0%)
days_in_waiting_list [integer]
Mean (sd) : 2.3 (17.6)
min ≤ med ≤ max:
0 ≤ 0 ≤ 391
IQR (CV) : 0 (7.6)
128 distinct values 0 (0.0%)
customer_type [character]
1. Contract
2. Group
3. Transient
4. Transient-Party
4076 ( 3.4% )
577 ( 0.5% )
89613 ( 75.1% )
25124 ( 21.0% )
0 (0.0%)
adr [numeric]
Mean (sd) : 101.8 (50.5)
min ≤ med ≤ max:
-6.4 ≤ 94.6 ≤ 5400
IQR (CV) : 56.7 (0.5)
8879 distinct values 0 (0.0%)
required_car_parking_spaces [integer]
Mean (sd) : 0.1 (0.2)
min ≤ med ≤ max:
0 ≤ 0 ≤ 8
IQR (CV) : 0 (3.9)
0 : 111974 ( 93.8% )
1 : 7383 ( 6.2% )
2 : 28 ( 0.0% )
3 : 3 ( 0.0% )
8 : 2 ( 0.0% )
0 (0.0%)
total_of_special_requests [integer]
Mean (sd) : 0.6 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 5
IQR (CV) : 1 (1.4)
0 : 70318 ( 58.9% )
1 : 33226 ( 27.8% )
2 : 12969 ( 10.9% )
3 : 2497 ( 2.1% )
4 : 340 ( 0.3% )
5 : 40 ( 0.0% )
0 (0.0%)
reservation_status [character]
1. Canceled
2. Check-Out
3. No-Show
43017 ( 36.0% )
75166 ( 63.0% )
1207 ( 1.0% )
0 (0.0%)
reservation_status_date [character]
1. 2015-10-21
2. 2015-07-06
3. 2016-11-25
4. 2015-01-01
5. 2016-01-18
6. 2015-07-02
7. 2016-12-07
8. 2015-12-18
9. 2016-02-09
10. 2016-04-04
[ 916 others ]
1461 ( 1.2% )
805 ( 0.7% )
790 ( 0.7% )
763 ( 0.6% )
625 ( 0.5% )
469 ( 0.4% )
450 ( 0.4% )
423 ( 0.4% )
412 ( 0.3% )
382 ( 0.3% )
112810 ( 94.5% )
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-04-20

We can see there are lots of attributes in these dataset. In this challenge I mainly focus on the number of guests and time.

str(hotel_raw)
'data.frame':   119390 obs. of  32 variables:
 $ hotel                         : chr  "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
 $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
 $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
 $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
 $ arrival_date_month            : chr  "July" "July" "July" "July" ...
 $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
 $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
 $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
 $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
 $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ meal                          : chr  "BB" "BB" "BB" "BB" ...
 $ country                       : chr  "PRT" "PRT" "GBR" "GBR" ...
 $ market_segment                : chr  "Direct" "Direct" "Direct" "Corporate" ...
 $ distribution_channel          : chr  "Direct" "Direct" "Direct" "Corporate" ...
 $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
 $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
 $ reserved_room_type            : chr  "C" "C" "A" "A" ...
 $ assigned_room_type            : chr  "C" "C" "C" "A" ...
 $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
 $ deposit_type                  : chr  "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
 $ agent                         : chr  "NULL" "NULL" "NULL" "304" ...
 $ company                       : chr  "NULL" "NULL" "NULL" "NULL" ...
 $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
 $ customer_type                 : chr  "Transient" "Transient" "Transient" "Transient" ...
 $ adr                           : num  0 0 75 75 98 ...
 $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
 $ reservation_status            : chr  "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
 $ reservation_status_date       : chr  "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...

Tidy Data

Firstly,I transfer string type of month to int for combine year and month,thus more convenient to observe the relationship between the total number of guests of each month and time.

hotel_raw<- na.omit(hotel_raw)
hotel <- hotel_raw%>%
  mutate(Month = match(arrival_date_month,month.name))
head(hotel$Month)
[1] 7 7 7 7 7 7

Combine year and month used to display the graph later.

hotel1 <- hotel%>%
  mutate(date = str_c(arrival_date_year,Month,sep = "-"))

check the result,it’s what I want!

head(hotel1$date)
[1] "2015-7" "2015-7" "2015-7" "2015-7" "2015-7" "2015-7"

Calculate the total number of guests for each day.

hotel_month <- hotel1%>%
  mutate(total_guests = rowSums(select(.,adults,children,babies),na.rm = TRUE))
summary(hotel_month$total_guests)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   2.000   2.000   1.968   2.000  55.000 

Time Dependent Visualization

According to the date column I processed previously,I grouped the rows by month and use summarise() to calculate the total guests number of each month from 2015 to 2017.

hotel_month<-hotel_month%>%
  group_by(date) %>%
  summarise(month_total = sum(total_guests)) %>%
  ungroup() %>%
  mutate(month_total = as.integer(month_total))
hotel_month
# A tibble: 26 × 2
   date    month_total
   <chr>         <int>
 1 2015-10        9084
 2 2015-11        3807
 3 2015-12        5470
 4 2015-7         5715
 5 2015-8         8031
 6 2015-9         9726
 7 2016-1         3835
 8 2016-10       11960
 9 2016-11        8005
10 2016-12        7726
# … with 16 more rows

Then draw a gragh for observing,we can see it was floated a lot,and the least crowded months are November 2015,Jan of both 2016 and 2017.

ggplot(hotel_month, aes(x = date, y = month_total,group = 1)) +
  geom_line() +
  theme(axis.text.x = element_text(size = 5 ))

Visualizing Part-Whole Relationships

In this part I’d like exploring the relationship between months and different type of guests. Firstly I I calculated the total monthly amount of different kinds of guests separately.

hotel_type<-hotel1%>%
  group_by(Month) %>%
  summarise(month_child = sum(children),month_babies = sum(babies),month_adults = sum(adults)) %>%
  ungroup() %>%
  mutate(month_child = as.integer(month_child),month_babies = as.integer(month_babies),month_adults = as.integer(month_adults),Month = month(Month))


hotel_type
# A tibble: 12 × 4
   Month month_child month_babies month_adults
   <int>       <int>        <int>        <int>
 1     1         452           48        10024
 2     2         790           59        14450
 3     3         700           57        17675
 4     4        1141           53        20806
 5     5         845           72        21539
 6     6        1057           71        20353
 7     7        2322          121        25164
 8     8        2780          196        27786
 9     9         598           87        19681
10    10         703           62        20279
11    11         279           45        11488
12    12         736           78        12382

Then I drawed lines for each kind of guests:adults,child and babies. I find that there are many more adult guests than children and babies!And almost all year round at a very high level except from November to January it may means people will travel less in winter. For children,although the number of children is very low,but what is interesting is that during the summer holidays (June to September) there is a significant increase.It can be seen that children usually go out in the summer holidays. As for babies, the number of hotel reservations has remained at a very low number.

ggplot(hotel_type, aes(Month,month_child,group = 1,col = "child")) +
  geom_line() +
  geom_line(aes(Month,month_babies, group = 1,col = "babies")) +
  geom_line(aes(Month,month_adults, group = 1,col ="adults")) +
  scale_x_continuous(breaks = seq(1,12,1)) +
  scale_color_manual(values = c("red","blue","green"))

  theme(axis.text.x = element_text(size = 2 ))
List of 1
 $ axis.text.x:List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : num 2
  ..$ hjust        : NULL
  ..$ vjust        : NULL
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : NULL
  ..$ debug        : NULL
  ..$ inherit.blank: logi FALSE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 - attr(*, "class")= chr [1:2] "theme" "gg"
 - attr(*, "complete")= logi FALSE
 - attr(*, "validate")= logi TRUE