Challenge 2 Instructions

hotel_bookings

Author

Shoshana Buck

Published

September 21, 2022

Code

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in the Data

Code

hotel_bookings <- read_csv("_data/hotel_bookings.csv")
hotel_bookings

The ’Hotel_bookings’ data set is comparing two types of hotels, City Hotel and Resort Hotel. The data has 119,390 observations with 32 variables, identifying important information when booking like arrival, length of stay, location of hotel and assigned room type. Data for ‘Hotel_bookings’ was collected data starting in August of 2015 and ending in April of 2017. The data is originally from Hotel booking demand data set by Nuno Antonio, Ana de Almeida, and Luis Nunes.

Describe the data

In order to get a great breakdown and visualization of the data set I used the R package, summary tools. Summary tools gives the variable, stats within that variable, the frequency, and a graph of each variable. From the summary it shows that the average year to stay in either a resort or city hotel was in 2016. Additionally, the top month to arrive/stay at the hotel was August and the two least popular month to stay in a hotel was December and January.

Code

colnames(hotel_bookings)

 [1] "hotel"                          "is_canceled"                   
 [3] "lead_time"                      "arrival_date_year"             
 [5] "arrival_date_month"             "arrival_date_week_number"      
 [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
 [9] "stays_in_week_nights"           "adults"                        
[11] "children"                       "babies"                        
[13] "meal"                           "country"                       
[15] "market_segment"                 "distribution_channel"          
[17] "is_repeated_guest"              "previous_cancellations"        
[19] "previous_bookings_not_canceled" "reserved_room_type"            
[21] "assigned_room_type"             "booking_changes"               
[23] "deposit_type"                   "agent"                         
[25] "company"                        "days_in_waiting_list"          
[27] "customer_type"                  "adr"                           
[29] "required_car_parking_spaces"    "total_of_special_requests"     
[31] "reservation_status"             "reservation_status_date"

Code

print(summarytools::dfSummary(hotel_bookings,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

hotel_bookings

Dimensions: 119390 x 32
Duplicates: 31994

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

hotel [character]

1. City Hotel

2. Resort Hotel

79330	(	66.4%	)
40060	(	33.6%	)

0 (0.0%)

is_canceled [numeric]

Min : 0

Mean : 0.4

Max : 1

0	:	75166	(	63.0%	)
1	:	44224	(	37.0%	)

0 (0.0%)

lead_time [numeric]

Mean (sd) : 104 (106.9)

min ≤ med ≤ max:

0 ≤ 69 ≤ 737

IQR (CV) : 142 (1)

479 distinct values

0 (0.0%)

arrival_date_year [numeric]

Mean (sd) : 2016.2 (0.7)

min ≤ med ≤ max:

2015 ≤ 2016 ≤ 2017

IQR (CV) : 1 (0)

2015	:	21996	(	18.4%	)
2016	:	56707	(	47.5%	)
2017	:	40687	(	34.1%	)

0 (0.0%)

arrival_date_month [character]

1. August

2. July

3. May

4. October

5. April

6. June

7. September

8. March

9. February

10. November

[ 2 others ]

13877	(	11.6%	)
12661	(	10.6%	)
11791	(	9.9%	)
11160	(	9.3%	)
11089	(	9.3%	)
10939	(	9.2%	)
10508	(	8.8%	)
9794	(	8.2%	)
8068	(	6.8%	)
6794	(	5.7%	)
12709	(	10.6%	)

0 (0.0%)

arrival_date_week_number [numeric]

Mean (sd) : 27.2 (13.6)

min ≤ med ≤ max:

1 ≤ 28 ≤ 53

IQR (CV) : 22 (0.5)

53 distinct values

0 (0.0%)

arrival_date_day_of_month [numeric]

Mean (sd) : 15.8 (8.8)

min ≤ med ≤ max:

1 ≤ 16 ≤ 31

IQR (CV) : 15 (0.6)

31 distinct values

0 (0.0%)

stays_in_weekend_nights [numeric]

Mean (sd) : 0.9 (1)

min ≤ med ≤ max:

0 ≤ 1 ≤ 19

IQR (CV) : 2 (1.1)

17 distinct values

0 (0.0%)

stays_in_week_nights [numeric]

Mean (sd) : 2.5 (1.9)

min ≤ med ≤ max:

0 ≤ 2 ≤ 50

IQR (CV) : 2 (0.8)

35 distinct values

0 (0.0%)

adults [numeric]

Mean (sd) : 1.9 (0.6)

min ≤ med ≤ max:

0 ≤ 2 ≤ 55

IQR (CV) : 0 (0.3)

14 distinct values

0 (0.0%)

children [numeric]

Mean (sd) : 0.1 (0.4)

min ≤ med ≤ max:

0 ≤ 0 ≤ 10

IQR (CV) : 0 (3.8)

0	:	110796	(	92.8%	)
1	:	4861	(	4.1%	)
2	:	3652	(	3.1%	)
3	:	76	(	0.1%	)
10	:	1	(	0.0%	)

4 (0.0%)

babies [numeric]

Mean (sd) : 0 (0.1)

min ≤ med ≤ max:

0 ≤ 0 ≤ 10

IQR (CV) : 0 (12.3)

0	:	118473	(	99.2%	)
1	:	900	(	0.8%	)
2	:	15	(	0.0%	)
9	:	1	(	0.0%	)
10	:	1	(	0.0%	)

0 (0.0%)

meal [character]

1. BB

2. FB

3. HB

4. SC

5. Undefined

92310	(	77.3%	)
798	(	0.7%	)
14463	(	12.1%	)
10650	(	8.9%	)
1169	(	1.0%	)

0 (0.0%)

country [character]

1. PRT

2. GBR

3. FRA

4. ESP

5. DEU

6. ITA

7. IRL

8. BEL

9. BRA

10. NLD

[ 168 others ]

48590	(	40.7%	)
12129	(	10.2%	)
10415	(	8.7%	)
8568	(	7.2%	)
7287	(	6.1%	)
3766	(	3.2%	)
3375	(	2.8%	)
2342	(	2.0%	)
2224	(	1.9%	)
2104	(	1.8%	)
18590	(	15.6%	)

0 (0.0%)

market_segment [character]

1. Aviation

2. Complementary

3. Corporate

4. Direct

5. Groups

6. Offline TA/TO

7. Online TA

8. Undefined

237	(	0.2%	)
743	(	0.6%	)
5295	(	4.4%	)
12606	(	10.6%	)
19811	(	16.6%	)
24219	(	20.3%	)
56477	(	47.3%	)
2	(	0.0%	)

0 (0.0%)

distribution_channel [character]

1. Corporate

2. Direct

3. GDS

4. TA/TO

5. Undefined

6677	(	5.6%	)
14645	(	12.3%	)
193	(	0.2%	)
97870	(	82.0%	)
5	(	0.0%	)

0 (0.0%)

is_repeated_guest [numeric]

Min : 0

Mean : 0

Max : 1

0	:	115580	(	96.8%	)
1	:	3810	(	3.2%	)

0 (0.0%)

previous_cancellations [numeric]

Mean (sd) : 0.1 (0.8)

min ≤ med ≤ max:

0 ≤ 0 ≤ 26

IQR (CV) : 0 (9.7)

15 distinct values

0 (0.0%)

previous_bookings_not_canceled [numeric]

Mean (sd) : 0.1 (1.5)

min ≤ med ≤ max:

0 ≤ 0 ≤ 72

IQR (CV) : 0 (10.9)

73 distinct values

0 (0.0%)

reserved_room_type [character]

1. A

2. B

3. C

4. D

5. E

6. F

7. G

8. H

9. L

10. P

85994	(	72.0%	)
1118	(	0.9%	)
932	(	0.8%	)
19201	(	16.1%	)
6535	(	5.5%	)
2897	(	2.4%	)
2094	(	1.8%	)
601	(	0.5%	)
6	(	0.0%	)
12	(	0.0%	)

0 (0.0%)

assigned_room_type [character]

1. A

2. D

3. E

4. F

5. G

6. C

7. B

8. H

9. I

10. K

[ 2 others ]

74053	(	62.0%	)
25322	(	21.2%	)
7806	(	6.5%	)
3751	(	3.1%	)
2553	(	2.1%	)
2375	(	2.0%	)
2163	(	1.8%	)
712	(	0.6%	)
363	(	0.3%	)
279	(	0.2%	)
13	(	0.0%	)

0 (0.0%)

booking_changes [numeric]

Mean (sd) : 0.2 (0.7)

min ≤ med ≤ max:

0 ≤ 0 ≤ 21

IQR (CV) : 0 (2.9)

21 distinct values

0 (0.0%)

deposit_type [character]

1. No Deposit

2. Non Refund

3. Refundable

104641	(	87.6%	)
14587	(	12.2%	)
162	(	0.1%	)

0 (0.0%)

agent [character]

1. 9
2. NULL
3. 240
4. 1
5. 14
6. 7
7. 6
8. 250
9. 241
10. 28
[ 324 others ]

31961	(	26.8%	)
16340	(	13.7%	)
13922	(	11.7%	)
7191	(	6.0%	)
3640	(	3.0%	)
3539	(	3.0%	)
3290	(	2.8%	)
2870	(	2.4%	)
1721	(	1.4%	)
1666	(	1.4%	)
33250	(	27.8%	)

0 (0.0%)

company [character]

1. NULL
2. 40
3. 223
4. 67
5. 45
6. 153
7. 174
8. 219
9. 281
10. 154
[ 343 others ]

112593	(	94.3%	)
927	(	0.8%	)
784	(	0.7%	)
267	(	0.2%	)
250	(	0.2%	)
215	(	0.2%	)
149	(	0.1%	)
141	(	0.1%	)
138	(	0.1%	)
133	(	0.1%	)
3793	(	3.2%	)

0 (0.0%)

days_in_waiting_list [numeric]

Mean (sd) : 2.3 (17.6)

min ≤ med ≤ max:

0 ≤ 0 ≤ 391

IQR (CV) : 0 (7.6)

128 distinct values

0 (0.0%)

customer_type [character]

1. Contract

2. Group

3. Transient

4. Transient-Party

4076	(	3.4%	)
577	(	0.5%	)
89613	(	75.1%	)
25124	(	21.0%	)

0 (0.0%)

adr [numeric]

Mean (sd) : 101.8 (50.5)

min ≤ med ≤ max:

-6.4 ≤ 94.6 ≤ 5400

IQR (CV) : 56.7 (0.5)

8879 distinct values

0 (0.0%)

required_car_parking_spaces [numeric]

Mean (sd) : 0.1 (0.2)

min ≤ med ≤ max:

0 ≤ 0 ≤ 8

IQR (CV) : 0 (3.9)

0	:	111974	(	93.8%	)
1	:	7383	(	6.2%	)
2	:	28	(	0.0%	)
3	:	3	(	0.0%	)
8	:	2	(	0.0%	)

0 (0.0%)

total_of_special_requests [numeric]

Mean (sd) : 0.6 (0.8)

min ≤ med ≤ max:

0 ≤ 0 ≤ 5

IQR (CV) : 1 (1.4)

0	:	70318	(	58.9%	)
1	:	33226	(	27.8%	)
2	:	12969	(	10.9%	)
3	:	2497	(	2.1%	)
4	:	340	(	0.3%	)
5	:	40	(	0.0%	)

0 (0.0%)

reservation_status [character]

1. Canceled

2. Check-Out

3. No-Show

43017	(	36.0%	)
75166	(	63.0%	)
1207	(	1.0%	)

0 (0.0%)

reservation_status_date [Date]

min : 2014-10-17

med : 2016-08-07

max : 2017-09-14

range : 2y 10m 28d

926 distinct values

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-21

Provide Grouped Summary Statistics

Code

hotel_bookings %>% 
  summarise(across(c(country,arrival_date_year,adr, stays_in_week_nights, stays_in_weekend_nights), n_distinct))

I used the function summarize() to look the variables across the data set. From the table we can see that there were 178 unique countries that came from the data. Additionally, on average a room was booked for 35 week nights and 17 weekend nights. Showing that people normally stayed in a hotel on week days rather than weekends.

I Know that the 3 for arrival year makes sense because the data was taken from 2015-2017. However, I am a little confused about the numerical value for the average daily rate. I don’t see 8879 reflected in the original data set when I put it in descending order.

Code

hotel_bookings %>% 
  group_by(hotel,country,adr, stays_in_week_nights, stays_in_weekend_nights,arrival_date_year) %>% 
  summarize(total_adults = sum(adults)) %>% 
  arrange(desc(adr))

For the second code chunk I used the function group_by() to focus in on variables that I thought were interesting. I think piped and used the function summarize() to find ‘total_adults.’ I then piped again to arrange my date set in descending order based off of the ’average daily rate.’ I thought it was crazy that the most expensive city hotel was in Portugal in 2016 that had 2 adults staying for one week night at a rate of $5400.

Since the chart is arranged in descending order based off the average daily rate, Portugal has seven out of the ten most expensive hotels with five being a resort style hotel. Other countries within the top ten are Italy, Spain, and Morocco.

Code

bookings<- hotel_bookings %>% 
  group_by(hotel,reserved_room_type) %>% 
  summarize(price = mean(adr),
            adults = mean(adults),
            children = mean(children),
            babies = mean(babies), na.rm = TRUE) %>% 
  arrange(reserved_room_type)
bookings

Explain and Interpret

For the third code chunk I thought it would be interesting to use the group_by() function of hotel and room_type to see if there was a correlation between reserved_room_types and the type of hotel. Though I do not know what the reserved_room_types mean it can be seen that a resort hotel is cheaper to stay in than a city hotel.