We can use summary tools to view the different components of the data better.
head(hotel.bookings)
hotel is_canceled lead_time arrival_date_year arrival_date_month
1 Resort Hotel 0 342 2015 July
2 Resort Hotel 0 737 2015 July
3 Resort Hotel 0 7 2015 July
4 Resort Hotel 0 13 2015 July
5 Resort Hotel 0 14 2015 July
6 Resort Hotel 0 14 2015 July
arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
1 27 1 0
2 27 1 0
3 27 1 0
4 27 1 0
5 27 1 0
6 27 1 0
stays_in_week_nights adults children babies meal country market_segment
1 0 2 0 0 BB PRT Direct
2 0 2 0 0 BB PRT Direct
3 1 1 0 0 BB GBR Direct
4 1 1 0 0 BB GBR Corporate
5 2 2 0 0 BB GBR Online TA
6 2 2 0 0 BB GBR Online TA
distribution_channel is_repeated_guest previous_cancellations
1 Direct 0 0
2 Direct 0 0
3 Direct 0 0
4 Corporate 0 0
5 TA/TO 0 0
6 TA/TO 0 0
previous_bookings_not_canceled reserved_room_type assigned_room_type
1 0 C C
2 0 C C
3 0 A C
4 0 A A
5 0 A A
6 0 A A
booking_changes deposit_type agent company days_in_waiting_list customer_type
1 3 No Deposit NULL NULL 0 Transient
2 4 No Deposit NULL NULL 0 Transient
3 0 No Deposit NULL NULL 0 Transient
4 0 No Deposit 304 NULL 0 Transient
5 0 No Deposit 240 NULL 0 Transient
6 0 No Deposit 240 NULL 0 Transient
adr required_car_parking_spaces total_of_special_requests reservation_status
1 0 0 0 Check-Out
2 0 0 0 Check-Out
3 75 0 0 Check-Out
4 75 0 0 Check-Out
5 98 0 1 Check-Out
6 98 0 1 Check-Out
reservation_status_date
1 2015-07-01
2 2015-07-01
3 2015-07-02
4 2015-07-02
5 2015-07-03
6 2015-07-03
tail(hotel.bookings)
hotel is_canceled lead_time arrival_date_year arrival_date_month
119385 City Hotel 0 21 2017 August
119386 City Hotel 0 23 2017 August
119387 City Hotel 0 102 2017 August
119388 City Hotel 0 34 2017 August
119389 City Hotel 0 109 2017 August
119390 City Hotel 0 205 2017 August
arrival_date_week_number arrival_date_day_of_month
119385 35 30
119386 35 30
119387 35 31
119388 35 31
119389 35 31
119390 35 29
stays_in_weekend_nights stays_in_week_nights adults children babies meal
119385 2 5 2 0 0 BB
119386 2 5 2 0 0 BB
119387 2 5 3 0 0 BB
119388 2 5 2 0 0 BB
119389 2 5 2 0 0 BB
119390 2 7 2 0 0 HB
country market_segment distribution_channel is_repeated_guest
119385 BEL Offline TA/TO TA/TO 0
119386 BEL Offline TA/TO TA/TO 0
119387 FRA Online TA TA/TO 0
119388 DEU Online TA TA/TO 0
119389 GBR Online TA TA/TO 0
119390 DEU Online TA TA/TO 0
previous_cancellations previous_bookings_not_canceled reserved_room_type
119385 0 0 A
119386 0 0 A
119387 0 0 E
119388 0 0 D
119389 0 0 A
119390 0 0 A
assigned_room_type booking_changes deposit_type agent company
119385 A 0 No Deposit 394 NULL
119386 A 0 No Deposit 394 NULL
119387 E 0 No Deposit 9 NULL
119388 D 0 No Deposit 9 NULL
119389 A 0 No Deposit 89 NULL
119390 A 0 No Deposit 9 NULL
days_in_waiting_list customer_type adr required_car_parking_spaces
119385 0 Transient 96.14 0
119386 0 Transient 96.14 0
119387 0 Transient 225.43 0
119388 0 Transient 157.71 0
119389 0 Transient 104.40 0
119390 0 Transient 151.20 0
total_of_special_requests reservation_status reservation_status_date
119385 2 Check-Out 2017-09-06
119386 0 Check-Out 2017-09-06
119387 2 Check-Out 2017-09-07
119388 4 Check-Out 2017-09-07
119389 0 Check-Out 2017-09-07
119390 2 Check-Out 2017-09-07
dim(hotel.bookings)
[1] 119390 32
In this dataset there are 32 variables (columns) and 119,390 observations (rows). Judging by the variable names, it consists of reservation data from some hotels.
There are two types of hotels in the dataset: Resort Hotel and City Hotel. Reservations were made for arrivals in 2015, 2016 and 2017. The data covers reservations from 178 countries, which suggests that it belongs to a large international hotel chain. Canceled and completed reservations are both stored in the dataset, as are no-shows. Each observation therefore describes one reservation: the type of hotel, the country, the number of visitors, the dates, the daily rate, the length of stay, and some categorical information about the customer and the reservation channel.
We do not need to pivot the data, as each column represents a variable and each row is an observation.
# adding two new variables
hotel.bookings <- mutate(hotel.bookings,
  number_of_guests = adults + children + babies,
  total_stay = stays_in_weekend_nights + stays_in_week_nights)
print(dfSummary(hotel.bookings, varnumbers = FALSE, plain.ascii = FALSE, style = "grid", graph.magnif = 0.80, valid.col = TRUE),
  method = 'render', table.classes = 'table-condensed')
Data Frame Summary
hotel.bookings
Dimensions: 119390 x 34
Duplicates: 31994
| Variable | Stats / Values | Freqs (% of Valid) | Valid | Missing |
|---|---|---|---|---|
| hotel [character] | 1. City Hotel; 2. Resort Hotel | 79330 (66.4%); 40060 (33.6%) | 119390 (100.0%) | 0 (0.0%) |
| is_canceled [integer] | Min: 0; Mean: 0.4; Max: 1 | 0: 75166 (63.0%); 1: 44224 (37.0%) | 119390 (100.0%) | 0 (0.0%) |
| lead_time [integer] | Mean (sd): 104 (106.9); min ≤ med ≤ max: 0 ≤ 69 ≤ 737; IQR (CV): 142 (1) | 479 distinct values | 119390 (100.0%) | 0 (0.0%) |
| arrival_date_year [integer] | Mean (sd): 2016.2 (0.7); min ≤ med ≤ max: 2015 ≤ 2016 ≤ 2017; IQR (CV): 1 (0) | 2015: 21996 (18.4%); 2016: 56707 (47.5%); 2017: 40687 (34.1%) | 119390 (100.0%) | 0 (0.0%) |
| arrival_date_month [character] | 1. August; 2. July; 3. May; 4. October; 5. April; 6. June; 7. September; 8. March; 9. February; 10. November; [ 2 others ] | 13877 (11.6%); 12661 (10.6%); 11791 (9.9%); 11160 (9.3%); 11089 (9.3%); 10939 (9.2%); 10508 (8.8%); 9794 (8.2%); 8068 (6.8%); 6794 (5.7%); 12709 (10.6%) | 119390 (100.0%) | 0 (0.0%) |
| arrival_date_week_number [integer] | Mean (sd): 27.2 (13.6); min ≤ med ≤ max: 1 ≤ 28 ≤ 53; IQR (CV): 22 (0.5) | 53 distinct values | 119390 (100.0%) | 0 (0.0%) |
| arrival_date_day_of_month [integer] | Mean (sd): 15.8 (8.8); min ≤ med ≤ max: 1 ≤ 16 ≤ 31; IQR (CV): 15 (0.6) | 31 distinct values | 119390 (100.0%) | 0 (0.0%) |
| stays_in_weekend_nights [integer] | Mean (sd): 0.9 (1); min ≤ med ≤ max: 0 ≤ 1 ≤ 19; IQR (CV): 2 (1.1) | 17 distinct values | 119390 (100.0%) | 0 (0.0%) |
| stays_in_week_nights [integer] | Mean (sd): 2.5 (1.9); min ≤ med ≤ max: 0 ≤ 2 ≤ 50; IQR (CV): 2 (0.8) | 35 distinct values | 119390 (100.0%) | 0 (0.0%) |
| adults [integer] | Mean (sd): 1.9 (0.6); min ≤ med ≤ max: 0 ≤ 2 ≤ 55; IQR (CV): 0 (0.3) | 14 distinct values | 119390 (100.0%) | 0 (0.0%) |
| children [integer] | Mean (sd): 0.1 (0.4); min ≤ med ≤ max: 0 ≤ 0 ≤ 10; IQR (CV): 0 (3.8) | 0: 110796 (92.8%); 1: 4861 (4.1%); 2: 3652 (3.1%); 3: 76 (0.1%); 10: 1 (0.0%) | 119386 (100.0%) | 4 (0.0%) |
| babies [integer] | Mean (sd): 0 (0.1); min ≤ med ≤ max: 0 ≤ 0 ≤ 10; IQR (CV): 0 (12.3) | 0: 118473 (99.2%); 1: 900 (0.8%); 2: 15 (0.0%); 9: 1 (0.0%); 10: 1 (0.0%) | 119390 (100.0%) | 0 (0.0%) |
| meal [character] | 1. BB; 2. FB; 3. HB; 4. SC; 5. Undefined | 92310 (77.3%); 798 (0.7%); 14463 (12.1%); 10650 (8.9%); 1169 (1.0%) | 119390 (100.0%) | 0 (0.0%) |
| country [character] | 1. PRT; 2. GBR; 3. FRA; 4. ESP; 5. DEU; 6. ITA; 7. IRL; 8. BEL; 9. BRA; 10. NLD; [ 168 others ] | 48590 (40.7%); 12129 (10.2%); 10415 (8.7%); 8568 (7.2%); 7287 (6.1%); 3766 (3.2%); 3375 (2.8%); 2342 (2.0%); 2224 (1.9%); 2104 (1.8%); 18590 (15.6%) | 119390 (100.0%) | 0 (0.0%) |
| market_segment [character] | 1. Aviation; 2. Complementary; 3. Corporate; 4. Direct; 5. Groups; 6. Offline TA/TO; 7. Online TA; 8. Undefined | 237 (0.2%); 743 (0.6%); 5295 (4.4%); 12606 (10.6%); 19811 (16.6%); 24219 (20.3%); 56477 (47.3%); 2 (0.0%) | 119390 (100.0%) | 0 (0.0%) |
| distribution_channel [character] | 1. Corporate; 2. Direct; 3. GDS; 4. TA/TO; 5. Undefined | 6677 (5.6%); 14645 (12.3%); 193 (0.2%); 97870 (82.0%); 5 (0.0%) | 119390 (100.0%) | 0 (0.0%) |
| is_repeated_guest [integer] | Min: 0; Mean: 0; Max: 1 | 0: 115580 (96.8%); 1: 3810 (3.2%) | 119390 (100.0%) | 0 (0.0%) |
| previous_cancellations [integer] | Mean (sd): 0.1 (0.8); min ≤ med ≤ max: 0 ≤ 0 ≤ 26; IQR (CV): 0 (9.7) | 15 distinct values | 119390 (100.0%) | 0 (0.0%) |
| previous_bookings_not_canceled [integer] | Mean (sd): 0.1 (1.5); min ≤ med ≤ max: 0 ≤ 0 ≤ 72; IQR (CV): 0 (10.9) | 73 distinct values | 119390 (100.0%) | 0 (0.0%) |
| reserved_room_type [character] | 1. A; 2. B; 3. C; 4. D; 5. E; 6. F; 7. G; 8. H; 9. L; 10. P | 85994 (72.0%); 1118 (0.9%); 932 (0.8%); 19201 (16.1%); 6535 (5.5%); 2897 (2.4%); 2094 (1.8%); 601 (0.5%); 6 (0.0%); 12 (0.0%) | 119390 (100.0%) | 0 (0.0%) |
| assigned_room_type [character] | 1. A; 2. D; 3. E; 4. F; 5. G; 6. C; 7. B; 8. H; 9. I; 10. K; [ 2 others ] | 74053 (62.0%); 25322 (21.2%); 7806 (6.5%); 3751 (3.1%); 2553 (2.1%); 2375 (2.0%); 2163 (1.8%); 712 (0.6%); 363 (0.3%); 279 (0.2%); 13 (0.0%) | 119390 (100.0%) | 0 (0.0%) |
| booking_changes [integer] | Mean (sd): 0.2 (0.7); min ≤ med ≤ max: 0 ≤ 0 ≤ 21; IQR (CV): 0 (2.9) | 21 distinct values | 119390 (100.0%) | 0 (0.0%) |
| deposit_type [character] | 1. No Deposit; 2. Non Refund; 3. Refundable | 104641 (87.6%); 14587 (12.2%); 162 (0.1%) | 119390 (100.0%) | 0 (0.0%) |
| agent [character] | 1. 9; 2. NULL; 3. 240; 4. 1; 5. 14; 6. 7; 7. 6; 8. 250; 9. 241; 10. 28; [ 324 others ] | 31961 (26.8%); 16340 (13.7%); 13922 (11.7%); 7191 (6.0%); 3640 (3.0%); 3539 (3.0%); 3290 (2.8%); 2870 (2.4%); 1721 (1.4%); 1666 (1.4%); 33250 (27.8%) | 119390 (100.0%) | 0 (0.0%) |
| company [character] | 1. NULL; 2. 40; 3. 223; 4. 67; 5. 45; 6. 153; 7. 174; 8. 219; 9. 281; 10. 154; [ 343 others ] | 112593 (94.3%); 927 (0.8%); 784 (0.7%); 267 (0.2%); 250 (0.2%); 215 (0.2%); 149 (0.1%); 141 (0.1%); 138 (0.1%); 133 (0.1%); 3793 (3.2%) | 119390 (100.0%) | 0 (0.0%) |
| days_in_waiting_list [integer] | Mean (sd): 2.3 (17.6); min ≤ med ≤ max: 0 ≤ 0 ≤ 391; IQR (CV): 0 (7.6) | 128 distinct values | 119390 (100.0%) | 0 (0.0%) |
| customer_type [character] | 1. Contract; 2. Group; 3. Transient; 4. Transient-Party | 4076 (3.4%); 577 (0.5%); 89613 (75.1%); 25124 (21.0%) | 119390 (100.0%) | 0 (0.0%) |
| average_daily_rate [numeric] | Mean (sd): 101.8 (50.5); min ≤ med ≤ max: -6.4 ≤ 94.6 ≤ 5400; IQR (CV): 56.7 (0.5) | 8879 distinct values | 119390 (100.0%) | 0 (0.0%) |
| required_car_parking_spaces [integer] | Mean (sd): 0.1 (0.2); min ≤ med ≤ max: 0 ≤ 0 ≤ 8; IQR (CV): 0 (3.9) | 0: 111974 (93.8%); 1: 7383 (6.2%); 2: 28 (0.0%); 3: 3 (0.0%); 8: 2 (0.0%) | 119390 (100.0%) | 0 (0.0%) |
| total_of_special_requests [integer] | Mean (sd): 0.6 (0.8); min ≤ med ≤ max: 0 ≤ 0 ≤ 5; IQR (CV): 1 (1.4) | 0: 70318 (58.9%); 1: 33226 (27.8%); 2: 12969 (10.9%); 3: 2497 (2.1%); 4: 340 (0.3%); 5: 40 (0.0%) | 119390 (100.0%) | 0 (0.0%) |
| reservation_status [character] | 1. Canceled; 2. Check-Out; 3. No-Show | 43017 (36.0%); 75166 (63.0%); 1207 (1.0%) | 119390 (100.0%) | 0 (0.0%) |
| reservation_status_date [character] | 1. 2015-10-21; 2. 2015-07-06; 3. 2016-11-25; 4. 2015-01-01; 5. 2016-01-18; 6. 2015-07-02; 7. 2016-12-07; 8. 2015-12-18; 9. 2016-02-09; 10. 2016-04-04; [ 916 others ] | 1461 (1.2%); 805 (0.7%); 790 (0.7%); 763 (0.6%); 625 (0.5%); 469 (0.4%); 450 (0.4%); 423 (0.4%); 412 (0.3%); 382 (0.3%); 112810 (94.5%) | 119390 (100.0%) | 0 (0.0%) |
| number_of_guests [integer] | Mean (sd): 2 (0.7); min ≤ med ≤ max: 0 ≤ 2 ≤ 55; IQR (CV): 0 (0.4) | 15 distinct values | 119386 (100.0%) | 4 (0.0%) |
| total_stay [integer] | Mean (sd): 3.4 (2.6); min ≤ med ≤ max: 0 ≤ 3 ≤ 69; IQR (CV): 2 (0.7) | 45 distinct values | 119390 (100.0%) | 0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.2.1) 2022-12-21
The summary above gives descriptive statistics for every variable in the data. About 37% of reservations were canceled. On average, reservations are made 104 days before the date of stay, and each reservation is made for 1.97 people. About 7% of reservations include at least one child, and fewer than 1% include a baby. The average daily rate is about $102 and the average stay is about 3.4 nights. A booking is changed 0.2 times on average.
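The mean of `booking_changes` is an average count per booking, which is not the same thing as the share of bookings that were changed at least once. A minimal base R sketch of the difference, using a made-up vector rather than the real column:

```r
# Hypothetical change counts for ten bookings (illustrative values only)
booking_changes <- c(0, 0, 0, 2, 0, 1, 0, 0, 0, 0)

# Average number of changes per booking
mean_changes <- mean(booking_changes)

# Share of bookings that were changed at least once
share_changed <- mean(booking_changes > 0)

print(c(mean_changes = mean_changes, share_changed = share_changed))
```

On the real data the same two expressions could be applied to `hotel.bookings$booking_changes` to report both figures side by side.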
Of the numeric variables, only children has missing values (4 rows), and these carry over to the derived number_of_guests. However, the summary table and the dataset itself show that some missing entries are stored as the string “NULL”. In the summary table, the agent and company variables show “NULL” values.
Checking each column shows that country also contains “NULL” entries: 0.41% of country values, 13.69% of agent values and 94.31% of company values are missing.
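One way to make these hidden missing values visible to `is.na()` is to convert the literal string "NULL" into a real `NA`. The sketch below uses base R on a small made-up data frame (`toy` and its values are assumptions for illustration, not rows from the dataset). It matches the string exactly, whereas a `grepl("NULL", x)` check would also match values that merely contain "NULL".

```r
# Toy data standing in for hotel.bookings (values are illustrative only)
toy <- data.frame(
  country = c("PRT", "NULL", "GBR", "PRT"),
  agent   = c("9", "NULL", "NULL", "240"),
  company = c("NULL", "NULL", "NULL", "40"),
  stringsAsFactors = FALSE
)

# Replace the literal string "NULL" with a real missing value in every column
toy_clean <- toy
toy_clean[toy_clean == "NULL"] <- NA

# Percentage of missing values per column
missing_pct <- round(100 * colMeans(is.na(toy_clean)), 2)
print(missing_pct)
```

After a conversion like this, `colSums(is.na(...))` would report the string-coded gaps together with the genuinely missing numeric values.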
Some analysis
According to the summary table, the daily rate of a city hotel goes as high as 5400 dollars, while the maximum for resort hotels is only 508 dollars. This seems suspicious.
hotel arrival_date_year country agent number_of_guests total_stay
1 Resort Hotel 2017 GBR 273 2 10
2 Resort Hotel 2015 PRT NULL 2 0
3 Resort Hotel 2015 PRT NULL 2 0
4 Resort Hotel 2015 PRT NULL 4 1
5 Resort Hotel 2015 PRT 240 2 0
6 Resort Hotel 2015 PRT 250 1 0
7 Resort Hotel 2015 PRT NULL 2 0
8 Resort Hotel 2015 PRT 240 2 0
9 Resort Hotel 2015 PRT 305 2 2
10 Resort Hotel 2015 PRT 305 1 2
reservation_status average_daily_rate
1 Check-Out -6.38
2 Check-Out 0.00
3 Check-Out 0.00
4 Check-Out 0.00
5 Check-Out 0.00
6 Check-Out 0.00
7 Check-Out 0.00
8 Check-Out 0.00
9 Canceled 0.00
10 Check-Out 0.00
The row with an average daily rate of 5,400 dollars appears to be a wrong entry. There is also a row with a negative average daily rate. We can remove both of them.
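The removal described above can be sketched as a simple range filter. This base R example uses a made-up data frame; the cap `max_rate` of 510 is an assumption chosen to match the largest rate that remains in the per-hotel summary after cleaning.

```r
# Toy rows standing in for hotel.bookings (illustrative values only)
toy <- data.frame(
  hotel = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  average_daily_rate = c(5400, -6.38, 99.9, 75)
)

# Assumed upper bound for a plausible daily rate
max_rate <- 510

# Keep only rows whose rate is non-negative and at most max_rate
kept <- subset(toy, average_daily_rate >= 0 & average_daily_rate <= max_rate)

# Number of rows dropped by the filter
removed <- nrow(toy) - nrow(kept)
print(kept)
print(removed)
```

Reporting `removed` alongside the filter makes it easy to confirm that only the intended rows were dropped.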
hotel.bookings %>%
  select(hotel, average_daily_rate) %>%
  group_by(hotel) %>%
  summarise_if(is.numeric, list(min = min, max = max, mean = mean, std_dev = sd, median = median), na.rm = TRUE)
# A tibble: 2 × 6
hotel min max mean std_dev median
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 City Hotel 0 510 105. 39.3 99.9
2 Resort Hotel 0 508 95.0 61.4 75
# A tibble: 10 × 2
country average_daily_rate
<chr> <dbl>
1 DJI 273
2 AIA 265
3 AND 203.
4 UMI 200
5 LAO 182.
6 MYT 178.
7 NCL 176.
8 GEO 169.
9 COM 165.
10 FRO 155.
table(hotel.bookings$arrival_date_month)
April August December February January July June March
11089 13877 6780 8068 5929 12661 10939 9792
May November October September
11791 6794 11160 10508
From the above tables we can form some observations.
The average daily rate for City Hotels is about 10 dollars higher than for Resort Hotels. However, the spread of prices (standard deviation) is greater for Resort Hotels than for City Hotels.
The 10 most popular countries in terms of total reservations are Portugal, Great Britain, France, Spain, Germany, Italy, Ireland, Belgium, Brazil and the Netherlands. However, 56% of the reservations made for Portugal hotels are actually canceled. This ratio is 35% for Italy and 25% for Spain. Among all of them, the country that hosts the highest number of guests is Portugal, with a total of 37,670 guests in 3 years.
Interestingly, in terms of average daily rate, the most expensive hotels are in Djibouti, Anguilla, Andorra, the United States Minor Outlying Islands, Laos and so on. It appears that hotels in small countries that host a small number of guests are much more expensive.
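A country average built from only a handful of bookings can be driven by a single expensive stay, so it helps to look at the booking count next to each mean. The base R sketch below uses made-up rows, and the threshold `min_n` is a hypothetical choice, not a value from the analysis.

```r
# Toy rows standing in for hotel.bookings (illustrative values only)
toy <- data.frame(
  country = c("DJI", "PRT", "PRT", "PRT", "GBR", "GBR"),
  average_daily_rate = c(273, 90, 100, 110, 120, 130)
)

# Mean rate and number of bookings per country
mean_rate  <- tapply(toy$average_daily_rate, toy$country, mean)
n_bookings <- tapply(toy$average_daily_rate, toy$country, length)
by_country <- data.frame(
  country   = names(mean_rate),
  mean_rate = as.vector(mean_rate),
  n         = as.vector(n_bookings)
)

# Hypothetical minimum number of bookings for a mean to be considered
min_n <- 2
reliable <- by_country[by_country$n >= min_n, ]
reliable <- reliable[order(-reliable$mean_rate), ]
print(reliable)
```

With a threshold like this, a country represented by a single booking drops out of the ranking instead of topping it.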
August, July and May, respectively, are the months when hotels are the busiest throughout the year.
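`table()` on a character column lists months alphabetically, which hides the seasonal pattern. A small base R sketch, assuming a made-up vector of month names, shows how the built-in `month.name` constant can put the counts in calendar order and pick out the busiest months:

```r
# Hypothetical arrival months (illustrative values only)
months <- c("August", "August", "August", "August",
            "July", "July", "July",
            "May", "May",
            "January")

# Convert to a factor whose levels follow the calendar
month_fct <- factor(months, levels = month.name)

# Counts now appear from January to December, including empty months
counts <- table(month_fct)
print(counts)

# Three busiest months, most frequent first
busiest <- names(sort(counts, decreasing = TRUE))[1:3]
print(busiest)
```

The same factor conversion applied to `arrival_date_month` would also keep plots and grouped summaries in calendar order.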
# A tibble: 6 × 2
# Groups: country [6]
country n
<chr> <int>
1 PRT 1550
2 ESP 79
3 GBR 73
4 FRA 55
5 DEU 41
6 NULL 21
We can observe that most of the zero values come from Portugal, so we need to look further into the accuracy of these entries.
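Because Portugal also accounts for the largest number of reservations overall, a raw count of zero-rate bookings is expected to be highest there. Comparing the share of zero-rate bookings within each country may be more informative. A minimal base R sketch on made-up rows:

```r
# Toy rows standing in for hotel.bookings (illustrative values only)
toy <- data.frame(
  country = c("PRT", "PRT", "PRT", "PRT", "PRT", "PRT", "ESP", "ESP"),
  average_daily_rate = c(0, 0, 80, 95, 100, 110, 0, 120)
)

# Flag bookings with a zero daily rate
is_zero <- toy$average_daily_rate == 0

# Count and share of zero-rate bookings per country
zero_count <- tapply(is_zero, toy$country, sum)
zero_share <- tapply(is_zero, toy$country, mean)
print(zero_count)
print(round(zero_share, 2))
```

In this toy example the country with the most zero-rate bookings is not the one with the highest share, which is the kind of difference worth checking on the real data.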
Potential research questions
How do hotel room rates change seasonally?
Do room rates change with the length of stay?
How do agents perform in terms of number of reservations and length of stay?
How do the preferences of families with children differ from those of other visitors?
Source Code
---
title: "Homework-2"
author: "Nikita Masanagi"
date: "10/20/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - hw2
  - hotel_bookings
---

```{r setup}
library(tidyverse)
library(psych)
library(summarytools)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```

## Read Data

The dataset I have chosen is the Hotel Bookings dataset.

```{r }
hotel.bookings <- read.csv("_data/hotel_bookings.csv")
```

## Describe Data

We can use summary tools to view the different components of the data better.

```{r}
head(hotel.bookings)
```

```{r}
tail(hotel.bookings)
```

```{r }
dim(hotel.bookings)
```

In this dataset there are 32 variables (columns) and 119,390 observations (rows). Judging by the variable names, it consists of reservation data from some hotels.

```{r}
colnames(hotel.bookings)
```

We can change the name of column 28 from adr to average_daily_rate to make it more readable.

```{r}
colnames(hotel.bookings)[28] <- "average_daily_rate"
```

We can see the number of distinct values in each column.

```{r}
sapply(hotel.bookings, function(x) n_distinct(x))
```

We can look at the unique values of some columns with repeated values.

```{r}
unique(hotel.bookings$hotel)
unique(hotel.bookings$arrival_date_year)
unique(hotel.bookings$reservation_status)
unique(hotel.bookings$distribution_channel)
unique(hotel.bookings$customer_type)
```

There are two types of hotels in the dataset: Resort Hotel and City Hotel. Reservations were made for arrivals in 2015, 2016 and 2017. The data covers reservations from 178 countries, which suggests that it belongs to a large international hotel chain. Canceled and completed reservations are both stored in the dataset, as are no-shows.

Each observation therefore describes one reservation: the type of hotel, the country, the number of visitors, the dates, the daily rate, the length of stay, and some categorical information about the customer and the reservation channel.

We do not need to pivot the data, as each column represents a variable and each row is an observation.

```{r }
# adding two new variables
hotel.bookings <- mutate(hotel.bookings,
  number_of_guests = adults + children + babies,
  total_stay = stays_in_weekend_nights + stays_in_week_nights)
print(dfSummary(hotel.bookings, varnumbers = FALSE, plain.ascii = FALSE, style = "grid", graph.magnif = 0.80, valid.col = TRUE),
  method = 'render', table.classes = 'table-condensed')
```

The summary above gives descriptive statistics for every variable in the data. About 37% of reservations were canceled. On average, reservations are made 104 days before the date of stay, and each reservation is made for 1.97 people. About 7% of reservations include at least one child, and fewer than 1% include a baby. The average daily rate is about $102 and the average stay is about 3.4 nights. A booking is changed 0.2 times on average.

```{r }
colSums(is.na(hotel.bookings))
```

Of the numeric variables, only `children` has missing values (4 rows), and these carry over to the derived `number_of_guests`. However, the summary table and the dataset itself show that some missing entries are stored as the string "NULL". In the summary table, the `agent` and `company` variables show "NULL" values. We can check the null values individually.

```{r}
nulls <- sapply(hotel.bookings, function(x) table(grepl("NULL", x)))
for (i in 1:32) {
  if (!is.na(nulls[[i]][2])) {
    print(nulls[i])
  }
}
```

So, actually 3 variables, `country`, `agent` and `company`, have "NULL" values.

```{r}
round(100 * prop.table(table(grepl("NULL", hotel.bookings$country))), 2)
round(100 * prop.table(table(grepl("NULL", hotel.bookings$agent))), 2)
round(100 * prop.table(table(grepl("NULL", hotel.bookings$company))), 2)
```

0.41% of `country` values, 13.69% of `agent` values and 94.31% of `company` values are missing.

## Some analysis

According to the summary table, the daily rate of a city hotel goes as high as 5400 dollars, while the maximum for resort hotels is only 508 dollars. This seems suspicious.

```{r}
hotel.bookings %>%
  arrange(desc(average_daily_rate)) %>%
  slice_head(n = 10) %>%
  select(hotel, arrival_date_year, country, agent, number_of_guests, total_stay, reservation_status, average_daily_rate)
```

```{r}
hotel.bookings %>%
  arrange(average_daily_rate) %>%
  slice_head(n = 10) %>%
  select(hotel, arrival_date_year, country, agent, number_of_guests, total_stay, reservation_status, average_daily_rate)
```

The row with an average daily rate of 5,400 dollars appears to be a wrong entry. There is also a row with a negative average daily rate. We can remove both of them.

```{r}
hotel.bookings <- hotel.bookings %>%
  filter(average_daily_rate >= 0 & average_daily_rate <= 510)
```

```{r}
hotel.bookings %>%
  select(hotel, average_daily_rate) %>%
  group_by(hotel) %>%
  summarise_if(is.numeric, list(min = min, max = max, mean = mean, std_dev = sd, median = median), na.rm = TRUE)
```

```{r}
hotel.bookings %>%
  select(country) %>%
  group_by(country) %>%
  count() %>%
  arrange(desc(n)) %>%
  head(n = 10)
```

```{r}
hotel.bookings %>%
  select(country, is_canceled) %>%
  group_by(country) %>%
  summarise_if(is.numeric, sum, na.rm = TRUE) %>%
  arrange(desc(is_canceled)) %>%
  head(n = 10)
```

```{r}
hotel.bookings %>%
  filter(country %in% c("PRT", "GBR", "ESP", "FRA", "ITA")) %>%
  select(country, is_canceled) %>%
  group_by(country) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE) %>%
  arrange(desc(is_canceled))
```

```{r}
hotel.bookings %>%
  filter(reservation_status == "Check-Out") %>%
  select(country, number_of_guests) %>%
  group_by(country) %>%
  summarise_if(is.numeric, sum, na.rm = TRUE) %>%
  arrange(desc(number_of_guests)) %>%
  head(n = 10)
```

```{r}
hotel.bookings %>%
  select(country, average_daily_rate) %>%
  group_by(country) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE) %>%
  arrange(desc(average_daily_rate)) %>%
  head(n = 10)
```

```{r}
table(hotel.bookings$arrival_date_month)
```

From the above tables we can form some observations.

The average daily rate for City Hotels is about 10 dollars higher than for Resort Hotels. However, the spread of prices (standard deviation) is greater for Resort Hotels than for City Hotels.

The 10 most popular countries in terms of total reservations are Portugal, Great Britain, France, Spain, Germany, Italy, Ireland, Belgium, Brazil and the Netherlands. However, 56% of the reservations made for Portugal hotels are actually canceled. This ratio is 35% for Italy and 25% for Spain. Among all of them, the country that hosts the highest number of guests is Portugal, with a total of 37,670 guests in 3 years.

Interestingly, in terms of average daily rate, the most expensive hotels are in Djibouti, Anguilla, Andorra, the United States Minor Outlying Islands, Laos and so on. It appears that hotels in small countries that host a small number of guests are much more expensive.

August, July and May, respectively, are the months when hotels are the busiest throughout the year.

We can check the bookings with zero daily rate.

```{r}
hotel.bookings %>%
  filter(average_daily_rate == 0) %>%
  count()
```

There are 1959 reservations with zero daily rate.

```{r}
hotel.bookings %>%
  filter(average_daily_rate == 0) %>%
  group_by(country) %>%
  count() %>%
  arrange(desc(n)) %>%
  head()
```

We can observe that most of the zero values come from Portugal, so we need to look further into the accuracy of these entries.

## Potential research questions

1. How do hotel room rates change seasonally?
2. Do room rates change with the length of stay?
3. How do agents perform in terms of number of reservations and length of stay?
4. How do the preferences of families with children differ from those of other visitors?