Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Vinitha Maheswaran
October 12, 2022
Today’s challenge is to
Read in one (or more) of the following data sets, available in the posts/_data
folder, using the correct R package and command.
For this challenge I will be working with the “hotel_bookings” data set. It is a publicly available data set containing booking transactions from a city hotel and a resort hotel in Portugal.
# A tibble: 119,390 × 32
hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Resor… 0 342 2015 July 27 1 0 0 2
2 Resor… 0 737 2015 July 27 1 0 0 2
3 Resor… 0 7 2015 July 27 1 0 1 1
4 Resor… 0 13 2015 July 27 1 0 1 1
5 Resor… 0 14 2015 July 27 1 0 2 2
6 Resor… 0 14 2015 July 27 1 0 2 2
7 Resor… 0 0 2015 July 27 1 0 2 2
8 Resor… 0 9 2015 July 27 1 0 2 2
9 Resor… 1 85 2015 July 27 1 0 3 2
10 Resor… 1 75 2015 July 27 1 0 3 2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
# meal <chr>, country <chr>, market_segment <chr>,
# distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
After reading the data using read_csv function, it is stored in a dataframe “hotel_data”. The hotel_bookings data set contains 119390 rows (observations) and 32 columns (attributes/variables) including information such as when the booking was made, length of stay, the number of adults, children, and/or babies, the number of available parking spaces, and many others for the respective hotels. Each row/observation represents a hotel booking. Since this is a public data set, all data elements pertaining to hotel or customer identification are not included in the data set. This data set can have an important role for research and educational purposes in hotel administration, revenue management, machine learning, data mining and other fields. This data set will be helpful for (i) Tourists (people who are booking hotels) to check and understand trends of hotel price over a period of time and plan their travel accordingly within budget, learn about hotel accommodations and features before booking (ii) Hotel Management to keep track of the relevant information about themselves as well as their competitors. Understand and analyze the seasonal trend of hotel booking and accommodate different types of visitors that they have (iii) Tourist / hospitality services - (e.g. travel agency, airlines / car rental companies) to observe the times when hotels in the region are in high demand, analyze the duration of typical stays, and use the information to help plan their own services (iv) Local Government / independent data analysts to observe the overall trend of tourist activities in the region and analyzing the different types of visitors in the hotels during different seasons.
Using the “dfSummary” function from “summarytools” package we find that there are 31994 duplicates in the data. The reason for the identified duplication is that there is no unique id for each booking. It is possible that the booking was made by different tourists and the values for each attribute was exactly the same. This confusion could have been avoided by adding a Booking ID which would be unique for each booking. We can also see that there are 4 NA’s in the data for the column “Children”. We can ignore this for now as 4 NA values out of 119390 values would not change the grouped summary statistics significantly. 66.4% of the data represents city hotel and the remaining 33.6% of the data represents resort hotel. 18.4%, 47.5% and 34.1% of the data correspondingly represents years 2015, 2016 and 2017. I wanted to check the unequal distribution in data for the 3 consecutive years. I further investigated the data and found that the data we have represents hotel bookings from the period July 2015 - August 2017. Hence, we cannot compare the summary statistics between 2015, 2016 and 2017 as we do not have data for the entire years of 2015 and 2017. I performed group summary statistics on “hotel”, arrival_date_year” and “arrival_date_month” combined.
Few interesting finds!! The maximum “lead_time” is 737 indicating that someone had made a booking 2 years in advance! 37% of the bookings were cancelled. A booking was made for a maximum of 55 “adults” once. A booking was made for a maximum of 10 “children” once and a maximum of 10 “babies” another time. About 77% of the bookings preferred BB (Bed & Breakfast) “meal” package. The “country” of origin was PRT (Portugal) for around 41% of the bookings. Most of the bookings are made by Offline TA/TO and Online TA “market segment”. The maximum “previous_cancellations” done by a person making the hotel booking is 26. Most of the tourists prefer “reserved_room_type” A which may be the best bang for your buck. A maximum of 21 “booking_changes” was made for a particular booking. A certain booking was on a maximum of 391 “days_in_waiting_list”. The maximum “adr” (Average Daily Rate) for a booking was 5400. The maximum “required_car_parking_spaces” was a maximum of 8 in two separate booking instances.
[1] "hotel" "is_canceled"
[3] "lead_time" "arrival_date_year"
[5] "arrival_date_month" "arrival_date_week_number"
[7] "arrival_date_day_of_month" "stays_in_weekend_nights"
[9] "stays_in_week_nights" "adults"
[11] "children" "babies"
[13] "meal" "country"
[15] "market_segment" "distribution_channel"
[17] "is_repeated_guest" "previous_cancellations"
[19] "previous_bookings_not_canceled" "reserved_room_type"
[21] "assigned_room_type" "booking_changes"
[23] "deposit_type" "agent"
[25] "company" "days_in_waiting_list"
[27] "customer_type" "adr"
[29] "required_car_parking_spaces" "total_of_special_requests"
[31] "reservation_status" "reservation_status_date"
spc_tbl_ [119,390 × 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ hotel : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
$ is_canceled : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
$ lead_time : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
$ arrival_date_year : num [1:119390] 2015 2015 2015 2015 2015 ...
$ arrival_date_month : chr [1:119390] "July" "July" "July" "July" ...
$ arrival_date_week_number : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
$ arrival_date_day_of_month : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
$ stays_in_weekend_nights : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
$ stays_in_week_nights : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
$ adults : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
$ children : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
$ babies : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
$ meal : chr [1:119390] "BB" "BB" "BB" "BB" ...
$ country : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
$ market_segment : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
$ distribution_channel : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
$ is_repeated_guest : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
$ previous_cancellations : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
$ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
$ reserved_room_type : chr [1:119390] "C" "C" "A" "A" ...
$ assigned_room_type : chr [1:119390] "C" "C" "C" "A" ...
$ booking_changes : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
$ deposit_type : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
$ agent : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
$ company : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
$ days_in_waiting_list : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
$ customer_type : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
$ adr : num [1:119390] 0 0 75 75 98 ...
$ required_car_parking_spaces : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
$ total_of_special_requests : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
$ reservation_status : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
$ reservation_status_date : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
- attr(*, "spec")=
.. cols(
.. hotel = col_character(),
.. is_canceled = col_double(),
.. lead_time = col_double(),
.. arrival_date_year = col_double(),
.. arrival_date_month = col_character(),
.. arrival_date_week_number = col_double(),
.. arrival_date_day_of_month = col_double(),
.. stays_in_weekend_nights = col_double(),
.. stays_in_week_nights = col_double(),
.. adults = col_double(),
.. children = col_double(),
.. babies = col_double(),
.. meal = col_character(),
.. country = col_character(),
.. market_segment = col_character(),
.. distribution_channel = col_character(),
.. is_repeated_guest = col_double(),
.. previous_cancellations = col_double(),
.. previous_bookings_not_canceled = col_double(),
.. reserved_room_type = col_character(),
.. assigned_room_type = col_character(),
.. booking_changes = col_double(),
.. deposit_type = col_character(),
.. agent = col_character(),
.. company = col_character(),
.. days_in_waiting_list = col_double(),
.. customer_type = col_character(),
.. adr = col_double(),
.. required_car_parking_spaces = col_double(),
.. total_of_special_requests = col_double(),
.. reservation_status = col_character(),
.. reservation_status_date = col_date(format = "")
.. )
- attr(*, "problems")=<externalptr>
hotel is_canceled lead_time arrival_date_year
Length:119390 Min. :0.0000 Min. : 0 Min. :2015
Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
Mode :character Median :0.0000 Median : 69 Median :2016
Mean :0.3704 Mean :104 Mean :2016
3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
Max. :1.0000 Max. :737 Max. :2017
arrival_date_month arrival_date_week_number arrival_date_day_of_month
Length:119390 Min. : 1.00 Min. : 1.0
Class :character 1st Qu.:16.00 1st Qu.: 8.0
Mode :character Median :28.00 Median :16.0
Mean :27.17 Mean :15.8
3rd Qu.:38.00 3rd Qu.:23.0
Max. :53.00 Max. :31.0
stays_in_weekend_nights stays_in_week_nights adults
Min. : 0.0000 Min. : 0.0 Min. : 0.000
1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
Median : 1.0000 Median : 2.0 Median : 2.000
Mean : 0.9276 Mean : 2.5 Mean : 1.856
3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
Max. :19.0000 Max. :50.0 Max. :55.000
children babies meal country
Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
Median : 0.0000 Median : 0.000000 Mode :character Mode :character
Mean : 0.1039 Mean : 0.007949
3rd Qu.: 0.0000 3rd Qu.: 0.000000
Max. :10.0000 Max. :10.000000
NA's :4
market_segment distribution_channel is_repeated_guest
Length:119390 Length:119390 Min. :0.00000
Class :character Class :character 1st Qu.:0.00000
Mode :character Mode :character Median :0.00000
Mean :0.03191
3rd Qu.:0.00000
Max. :1.00000
previous_cancellations previous_bookings_not_canceled reserved_room_type
Min. : 0.00000 Min. : 0.0000 Length:119390
1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
Median : 0.00000 Median : 0.0000 Mode :character
Mean : 0.08712 Mean : 0.1371
3rd Qu.: 0.00000 3rd Qu.: 0.0000
Max. :26.00000 Max. :72.0000
assigned_room_type booking_changes deposit_type agent
Length:119390 Min. : 0.0000 Length:119390 Length:119390
Class :character 1st Qu.: 0.0000 Class :character Class :character
Mode :character Median : 0.0000 Mode :character Mode :character
Mean : 0.2211
3rd Qu.: 0.0000
Max. :21.0000
company days_in_waiting_list customer_type adr
Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
Mode :character Median : 0.000 Mode :character Median : 94.58
Mean : 2.321 Mean : 101.83
3rd Qu.: 0.000 3rd Qu.: 126.00
Max. :391.000 Max. :5400.00
required_car_parking_spaces total_of_special_requests reservation_status
Min. :0.00000 Min. :0.0000 Length:119390
1st Qu.:0.00000 1st Qu.:0.0000 Class :character
Median :0.00000 Median :0.0000 Mode :character
Mean :0.06252 Mean :0.5714
3rd Qu.:0.00000 3rd Qu.:1.0000
Max. :8.00000 Max. :5.0000
reservation_status_date
Min. :2014-10-17
1st Qu.:2016-02-01
Median :2016-08-07
Mean :2016-07-30
3rd Qu.:2017-02-08
Max. :2017-09-14
Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
hotel [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
is_canceled [numeric] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
lead_time [numeric] |
|
479 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
arrival_date_year [numeric] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
arrival_date_month [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
arrival_date_week_number [numeric] |
|
53 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
arrival_date_day_of_month [numeric] |
|
31 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
stays_in_weekend_nights [numeric] |
|
17 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
stays_in_week_nights [numeric] |
|
35 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
adults [numeric] |
|
14 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
children [numeric] |
|
|
4 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
babies [numeric] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
meal [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
country [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
market_segment [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
distribution_channel [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
is_repeated_guest [numeric] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
previous_cancellations [numeric] |
|
15 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
previous_bookings_not_canceled [numeric] |
|
73 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
reserved_room_type [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
assigned_room_type [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
booking_changes [numeric] |
|
21 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
deposit_type [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
agent [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
company [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
days_in_waiting_list [numeric] |
|
128 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
customer_type [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
adr [numeric] |
|
8879 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
required_car_parking_spaces [numeric] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
total_of_special_requests [numeric] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
reservation_status [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
reservation_status_date [Date] |
|
926 distinct values | 0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-21
[1] 4
[1] 0
# A tibble: 52 × 3
hotel arrival_date_year arrival_date_month
<chr> <dbl> <chr>
1 Resort Hotel 2015 July
2 Resort Hotel 2015 August
3 Resort Hotel 2015 September
4 Resort Hotel 2015 October
5 Resort Hotel 2015 November
6 Resort Hotel 2015 December
7 Resort Hotel 2016 January
8 Resort Hotel 2016 February
9 Resort Hotel 2016 March
10 Resort Hotel 2016 April
# … with 42 more rows
Conduct some exploratory data analysis, using dplyr commands such as group_by()
, select()
, filter()
, and summarise()
. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
# Finding the total number of adults, children and babies who stayed in the hotels
hotel_data%>%
select(hotel, adults, children, babies)%>%
summarise(total = sum(adults+children+babies, na.rm = TRUE), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))
# A tibble: 1 × 4
total total_adults total_children total_babies
<dbl> <dbl> <dbl> <dbl>
1 234979 221636 12403 949
# Finding the total number of adults, children and babies for each hotel
hotel_data%>%
select(hotel, adults, children, babies)%>%
group_by(hotel)%>%
summarise(total = sum(adults+children+babies, na.rm = TRUE), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))
# A tibble: 2 × 5
hotel total total_adults total_children total_babies
<chr> <dbl> <dbl> <dbl> <dbl>
1 City Hotel 154469 146838 7248 392
2 Resort Hotel 80510 74798 5155 557
# Finding the total number of adults, children and babies for each hotel, year
hotel_data%>%
select(hotel, arrival_date_year, adults, children, babies)%>%
group_by(hotel, arrival_date_year)%>%
summarise(total = sum(adults+children+babies, na.rm = TRUE), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))
# A tibble: 6 × 6
# Groups: hotel [2]
hotel arrival_date_year total total_adults total_children total_babies
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 City Hotel 2015 24972 24385 512 84
2 City Hotel 2016 74577 70568 3814 195
3 City Hotel 2017 54920 51885 2922 113
4 Resort Hotel 2015 16861 15877 845 139
5 Resort Hotel 2016 36758 34396 2094 268
6 Resort Hotel 2017 26891 24525 2216 150
# Finding the total number of adults, children and babies for each hotel, year, month
hotel_data%>%
select(hotel, arrival_date_year, arrival_date_month, adults, children, babies)%>%
group_by(hotel, arrival_date_year, arrival_date_month)%>%
summarise(total = sum(adults+children+babies, na.rm = TRUE), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))
# A tibble: 52 × 7
# Groups: hotel, arrival_date_year [6]
hotel arrival_date_year arrival_date_month total total…¹ total…² total…³
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 City Hotel 2015 August 4869 4697 165 16
2 City Hotel 2015 December 3050 2907 120 23
3 City Hotel 2015 July 2687 2671 14 2
4 City Hotel 2015 November 1951 1918 25 8
5 City Hotel 2015 October 6029 5910 98 21
6 City Hotel 2015 September 6386 6282 90 14
7 City Hotel 2016 April 7021 6671 339 11
8 City Hotel 2016 August 7660 6904 730 26
9 City Hotel 2016 December 5095 4702 378 15
10 City Hotel 2016 February 4331 4124 194 13
# … with 42 more rows, and abbreviated variable names ¹total_adults,
# ²total_children, ³total_babies
# Finding the total number of adults, children and babies for each hotel, year for the July month
hotel_data%>%
select(hotel, arrival_date_year, arrival_date_month, adults, children, babies)%>%
filter(arrival_date_month=='July')%>%
group_by(hotel, arrival_date_year, arrival_date_month)%>%
summarise(total = sum(adults+children+babies, na.rm = TRUE), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))
# A tibble: 6 × 7
# Groups: hotel, arrival_date_year [6]
hotel arrival_date_year arrival_date_mo…¹ total total…² total…³ total…⁴
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 City Hotel 2015 July 2687 2671 14 2
2 City Hotel 2016 July 6857 6231 605 21
3 City Hotel 2017 July 7659 7073 574 12
4 Resort Hotel 2015 July 3028 2752 252 24
5 Resort Hotel 2016 July 3244 2887 319 38
6 Resort Hotel 2017 July 4132 3550 558 24
# … with abbreviated variable names ¹arrival_date_month, ²total_adults,
# ³total_children, ⁴total_babies
# Finding the total number of adults, children and babies for each hotel, year for the August month
hotel_data%>%
select(hotel, arrival_date_year, arrival_date_month, adults, children, babies)%>%
filter(arrival_date_month=='August')%>%
group_by(hotel, arrival_date_year, arrival_date_month)%>%
summarise(total = sum(adults+children+babies, na.rm = TRUE), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))
# A tibble: 6 × 7
# Groups: hotel, arrival_date_year [6]
hotel arrival_date_year arrival_date_mo…¹ total total…² total…³ total…⁴
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 City Hotel 2015 August 4869 4697 165 16
2 City Hotel 2016 August 7660 6904 730 26
3 City Hotel 2017 August 6863 6240 604 19
4 Resort Hotel 2015 August 3162 2848 263 51
5 Resort Hotel 2016 August 3940 3447 438 55
6 Resort Hotel 2017 August 4268 3659 580 29
# … with abbreviated variable names ¹arrival_date_month, ²total_adults,
# ³total_children, ⁴total_babies
# Finding the total number of adults, children and babies for year 2016 for each month
hotel_data%>%
select(hotel, arrival_date_year, arrival_date_month, adults, children, babies)%>%
filter(hotel=='City Hotel' & arrival_date_year==2016)%>%
group_by(arrival_date_year, arrival_date_month)%>%
summarise(total = sum(adults, children, babies), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))%>%
arrange(match(arrival_date_month,month.name))
# A tibble: 12 × 6
# Groups: arrival_date_year [1]
arrival_date_year arrival_date_month total total_adults total_child…¹ total…²
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 2016 January 2285 2200 69 16
2 2016 February 4331 4124 194 13
3 2016 March 5893 5599 279 15
4 2016 April 7021 6671 339 11
5 2016 May 6932 6671 254 7
6 2016 June 7251 7012 222 17
7 2016 July 6857 6231 605 21
8 2016 August 7660 6904 730 26
9 2016 September 7575 7287 261 27
10 2016 October 8104 7755 336 13
11 2016 November 5573 5412 147 14
12 2016 December 5095 4702 378 15
# … with abbreviated variable names ¹total_children, ²total_babies
From the above statistics, we can see that most people have stayed in Portugal hotels during Fall (September and October) and during August (summer break for children).
# Finding the mean of adr, adults, children and babies for each hotel
hotel_data%>%
select(hotel, adr, adults, children, babies)%>%
group_by(hotel)%>%
summarise(avg_adr = mean(adr), avg_total = mean(adults+children+babies, na.rm = TRUE), avg_adults = mean(adults), avg_children = mean(children, na.rm = TRUE), avg_babies = mean(babies))
# A tibble: 2 × 6
hotel avg_adr avg_total avg_adults avg_children avg_babies
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 City Hotel 105. 1.95 1.85 0.0914 0.00494
2 Resort Hotel 95.0 2.01 1.87 0.129 0.0139
# Finding the median of adr, adults, children and babies for each hotel
hotel_data%>%
select(hotel, adr, adults, children, babies)%>%
group_by(hotel)%>%
summarise(median_adr = median(adr), median_adults = median(adults), median_children = median(children, na.rm = TRUE), median_babies = median(babies))
# A tibble: 2 × 5
hotel median_adr median_adults median_children median_babies
<chr> <dbl> <dbl> <dbl> <dbl>
1 City Hotel 99.9 2 0 0
2 Resort Hotel 75 2 0 0
# A tibble: 2 × 5
hotel sd_adr sd_adults sd_children sd_babies
<chr> <dbl> <dbl> <dbl> <dbl>
1 City Hotel 43.6 0.509 0.372 0.0843
2 Resort Hotel 61.4 0.697 0.445 0.119
# Finding the mean of adr, adults, children and babies for each hotel, year
hotel_data%>%
select(hotel, adr, arrival_date_year, adults, children, babies)%>%
group_by(hotel, arrival_date_year)%>%
summarise(avg_adr = mean(adr), avg_adults = mean(adults), avg_children = mean(children, na.rm = TRUE), avg_babies = mean(babies))
# A tibble: 6 × 6
# Groups: hotel [2]
hotel arrival_date_year avg_adr avg_adults avg_children avg_babies
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 City Hotel 2015 85.9 1.78 0.0374 0.00614
2 City Hotel 2016 103. 1.85 0.1 0.00511
3 City Hotel 2017 118. 1.89 0.106 0.00411
4 Resort Hotel 2015 89.4 1.91 0.102 0.0167
5 Resort Hotel 2016 87.7 1.85 0.113 0.0144
6 Resort Hotel 2017 109. 1.86 0.168 0.0114
# Finding the median of adr, adults, children and babies for each hotel, year
hotel_data%>%
select(hotel, adr, arrival_date_year, adults, children, babies)%>%
group_by(hotel, arrival_date_year)%>%
summarise(median_adr = median(adr), median_adults = median(adults), median_children = median(children, na.rm = TRUE), median_babies = median(babies))
# A tibble: 6 × 6
# Groups: hotel [2]
hotel arrival_date_year median_adr median_adults median_child…¹ media…²
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 City Hotel 2015 80 2 0 0
2 City Hotel 2016 99 2 0 0
3 City Hotel 2017 110 2 0 0
4 Resort Hotel 2015 75 2 0 0
5 Resort Hotel 2016 70 2 0 0
6 Resort Hotel 2017 86.2 2 0 0
# … with abbreviated variable names ¹median_children, ²median_babies
# Finding the sd of adr, adults, children and babies for each hotel, year
hotel_data%>%
select(hotel, adr, arrival_date_year, adults, children, babies)%>%
group_by(hotel, arrival_date_year)%>%
summarise(sd_adr = sd(adr), sd_adults = sd(adults), sd_children = sd(children, na.rm = TRUE), sd_babies = sd(babies))
# A tibble: 6 × 6
# Groups: hotel [2]
hotel arrival_date_year sd_adr sd_adults sd_children sd_babies
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 City Hotel 2015 34.3 0.452 0.250 0.107
2 City Hotel 2016 45.2 0.520 0.387 0.0875
3 City Hotel 2017 41.6 0.518 0.399 0.0651
4 Resort Hotel 2015 53.8 1.25 0.406 0.130
5 Resort Hotel 2016 57.1 0.450 0.416 0.121
6 Resort Hotel 2017 69.1 0.448 0.502 0.108
# Finding the max of adr, adults, children and babies for each hotel, year
hotel_data%>%
select(hotel, adr, arrival_date_year, adults, children, babies)%>%
group_by(hotel, arrival_date_year)%>%
summarise(max_adr = max(adr), max_adults = max(adults), max_children = max(children, na.rm = TRUE), max_babies = max(babies))
# A tibble: 6 × 6
# Groups: hotel [2]
hotel arrival_date_year max_adr max_adults max_children max_babies
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 City Hotel 2015 290 3 3 9
2 City Hotel 2016 5400 4 3 10
3 City Hotel 2017 510 4 3 2
4 Resort Hotel 2015 508 55 10 2
5 Resort Hotel 2016 384 4 3 2
6 Resort Hotel 2017 450 4 3 2
# Finding the min of adr, adults, children and babies for each hotel, year
hotel_data%>%
select(hotel, adr, arrival_date_year, adults, children, babies)%>%
group_by(hotel, arrival_date_year)%>%
summarise(min_adr = min(adr), min_total = min(adults+children+babies, na.rm = TRUE), min_adults = min(adults), min_children = min(children, na.rm = TRUE), min_babies = min(babies))
# A tibble: 6 × 7
# Groups: hotel [2]
hotel arrival_date_year min_adr min_total min_adults min_chil…¹ min_b…²
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 City Hotel 2015 0 0 0 0 0
2 City Hotel 2016 0 0 0 0 0
3 City Hotel 2017 0 0 0 0 0
4 Resort Hotel 2015 0 0 0 0 0
5 Resort Hotel 2016 0 0 0 0 0
6 Resort Hotel 2017 -6.38 0 0 0 0
# … with abbreviated variable names ¹min_children, ²min_babies
# Finding the mean of adr, adults, children and babies for each hotel, year, month
hotel_data%>%
select(hotel, adr, arrival_date_year, arrival_date_month, adults, children, babies)%>%
group_by(hotel, arrival_date_year, arrival_date_month)%>%
summarise(avg_adr = mean(adr), avg_adults = mean(adults), avg_children = mean(children, na.rm = TRUE), avg_babies = mean(babies))
# A tibble: 52 × 7
# Groups: hotel, arrival_date_year [6]
hotel arrival_date_year arrival_date_m…¹ avg_adr avg_a…² avg_c…³ avg_b…⁴
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 City Hotel 2015 August 77.7 1.89 0.0666 0.00645
2 City Hotel 2015 December 81.1 1.76 0.0726 0.0139
3 City Hotel 2015 July 69.8 1.91 0.0100 0.00143
4 City Hotel 2015 November 73.5 1.55 0.0202 0.00648
5 City Hotel 2015 October 89.4 1.75 0.0289 0.00620
6 City Hotel 2015 September 101. 1.78 0.0255 0.00397
7 City Hotel 2016 April 99.5 1.87 0.0952 0.00309
8 City Hotel 2016 August 119. 2.04 0.216 0.00770
9 City Hotel 2016 December 94.0 1.90 0.153 0.00605
10 City Hotel 2016 February 80.0 1.74 0.0818 0.00548
# … with 42 more rows, and abbreviated variable names ¹arrival_date_month,
# ²avg_adults, ³avg_children, ⁴avg_babies
# Finding the median of adr, adults, children and babies for each hotel, year, month
hotel_data%>%
select(hotel, adr, arrival_date_year, arrival_date_month, adults, children, babies)%>%
group_by(hotel, arrival_date_year, arrival_date_month)%>%
summarise(median_adr = median(adr), median_adults = median(adults), median_children = median(children, na.rm = TRUE), median_babies = median(babies))
# A tibble: 52 × 7
# Groups: hotel, arrival_date_year [6]
hotel arrival_date_year arrival_date_m…¹ media…² media…³ media…⁴ media…⁵
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 City Hotel 2015 August 75 2 0 0
2 City Hotel 2015 December 77.4 2 0 0
3 City Hotel 2015 July 62.8 2 0 0
4 City Hotel 2015 November 72.2 2 0 0
5 City Hotel 2015 October 87.8 2 0 0
6 City Hotel 2015 September 95 2 0 0
7 City Hotel 2016 April 96.3 2 0 0
8 City Hotel 2016 August 115. 2 0 0
9 City Hotel 2016 December 88 2 0 0
10 City Hotel 2016 February 79 2 0 0
# … with 42 more rows, and abbreviated variable names ¹arrival_date_month,
# ²median_adr, ³median_adults, ⁴median_children, ⁵median_babies
# Finding the sd of adr, adults, children and babies for each hotel, year, month
hotel_data%>%
select(hotel, adr, arrival_date_year, arrival_date_month, adults, children, babies)%>%
group_by(hotel, arrival_date_year, arrival_date_month)%>%
summarise(sd_adr = sd(adr), sd_adults = sd(adults), sd_children = sd(children, na.rm = TRUE), sd_babies = sd(babies))
# A tibble: 52 × 7
# Groups: hotel, arrival_date_year [6]
hotel arrival_date_year arrival_date_mo…¹ sd_adr sd_ad…² sd_ch…³ sd_ba…⁴
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 City Hotel 2015 August 26.9 0.365 0.340 0.0801
2 City Hotel 2015 December 32.1 0.504 0.340 0.117
3 City Hotel 2015 July 19.4 0.365 0.107 0.0378
4 City Hotel 2015 November 20.7 0.512 0.177 0.0803
5 City Hotel 2015 October 30.7 0.463 0.221 0.166
6 City Hotel 2015 September 43.7 0.440 0.209 0.0629
7 City Hotel 2016 April 30.5 0.478 0.381 0.0555
8 City Hotel 2016 August 39.8 0.499 0.544 0.0874
9 City Hotel 2016 December 36.0 0.538 0.466 0.0776
10 City Hotel 2016 February 24.7 0.535 0.355 0.0845
# … with 42 more rows, and abbreviated variable names ¹arrival_date_month,
# ²sd_adults, ³sd_children, ⁴sd_babies
# Finding the max of adr, adults, children and babies for each hotel, year, month
hotel_data%>%
select(hotel, adr, arrival_date_year, arrival_date_month, adults, children, babies)%>%
group_by(hotel, arrival_date_year, arrival_date_month)%>%
summarise(max_adr = max(adr), max_adults = max(adults), max_children = max(children, na.rm = TRUE), max_babies = max(babies))%>%
arrange(desc(max_adr))
# A tibble: 52 × 7
# Groups: hotel, arrival_date_year [6]
hotel arrival_date_year arrival_date…¹ max_adr max_a…² max_c…³ max_b…⁴
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 City Hotel 2016 March 5400 3 2 1
2 City Hotel 2017 May 510 4 3 1
3 Resort Hotel 2015 July 508 4 10 2
4 City Hotel 2016 December 452. 4 3 1
5 Resort Hotel 2017 August 450 4 3 2
6 Resort Hotel 2017 July 397. 4 2 1
7 Resort Hotel 2016 December 384 3 3 2
8 City Hotel 2017 August 372. 4 3 1
9 Resort Hotel 2016 August 369 4 3 1
10 City Hotel 2016 July 365 4 3 2
# … with 42 more rows, and abbreviated variable names ¹arrival_date_month,
# ²max_adults, ³max_children, ⁴max_babies
# Finding the min of adr, adults, children and babies for each hotel, year, month
hotel_data%>%
select(hotel, adr, arrival_date_year, arrival_date_month, adults, children, babies)%>%
group_by(hotel, arrival_date_year, arrival_date_month)%>%
summarise(min_adr = min(adr), min_total = min(adults+children+babies, na.rm = TRUE), min_adults = min(adults), min_children = min(children, na.rm = TRUE), min_babies = min(babies))
# A tibble: 52 × 8
# Groups: hotel, arrival_date_year [6]
hotel arrival_date_year arriva…¹ min_adr min_t…² min_a…³ min_c…⁴ min_b…⁵
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 City Hotel 2015 August 0 0 0 0 0
2 City Hotel 2015 December 0 0 0 0 0
3 City Hotel 2015 July 0 0 0 0 0
4 City Hotel 2015 November 0 0 0 0 0
5 City Hotel 2015 October 0 0 0 0 0
6 City Hotel 2015 Septemb… 0 0 0 0 0
7 City Hotel 2016 April 0 0 0 0 0
8 City Hotel 2016 August 0 0 0 0 0
9 City Hotel 2016 December 0 0 0 0 0
10 City Hotel 2016 February 0 0 0 0 0
# … with 42 more rows, and abbreviated variable names ¹arrival_date_month,
# ²min_total, ³min_adults, ⁴min_children, ⁵min_babies
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
A total of 221636 adults, 12403 children and 949 babies stayed in both hotels from July 2015 - August 2017. I grouped by “hotel” to find the total number of “adults”, “children” and “babies” that stayed in the (i) city hotel as 146838, 7248, 392 and (ii) resort hotel as 74798, 5155, 557 respectively. Looking at the group by “hotel”, we can find that families with babies prefer to stay at resort hotel compared to city hotel. This may be due to the fact that resort hotels are generally closer to the beach or has swimming pool and fun outdoor activities for babies. City hotels are preferred more by adults traveling alone without kids. City hotel may be more suitable for people traveling for business purposes. I also find the mean, median, and sd for the “adr”, “adults”, “children” and “babies” grouped by “hotel”. The difference between the mean and median values are almost similar indicating that the data is not skewed. The average “adr” for City Hotel is more than the Resort Hotel which makes sense as city hotels are usually located in central locations making transportation easier.
Next, I grouped by “hotel”, “arrival_date_year” to find the total number of “adults”, “children” and “babies” that stayed in the city hotel and resort hotel for years 2015, 2016 and 2017. I also find the mean, median, sd, max, and min for the “adr”, “adults”, “children” and “babies” grouped by “hotel” and “arrival_date_year”. We do not interpret anything new from the mean, median and sd. From the max, we identify few interesting bookings where the maximum number of “adults” is 55 / max “children” is 10 in resort hotel for the year 2015 and max “babies” is 9 / max “babies” is 10 for city hotel in years 2015 and 2016 respectively. The min values are mostly 0s except for the year 2017 in resort hotel where the min “adr” is -6.38. Having a negative “adr” is unusual and needs to be investigated further.
Finally, I grouped by “hotel”, “arrival_date_year”, ”arrival_date_month” to find the total number of “adults”, “children” and “babies” that stayed in the city hotel and resort hotel for July 2015 - August 2017. For the city hotel, the months September and October had the highest number of people as this is the best time to visit Portugal during Fall when the weather is warm and the crowd is relatively less compared to Summer. For the resort hotel, August month had the highest number of people (children and babies in particular) as this is the best time to stay in resort during Summer and enjoy outdoor activities. Hence, hotel management can price the rooms accordingly during peak season and off-season. I also find the mean, median, sd, max, and min for the “adr”, “adults”, “children” and “babies” grouped by “hotel”, “arrival_date_year”, and ”arrival_date_month”. We do not interpret anything new from the mean, median and sd. The max “adr” for City Hotel in March 2016 is the highest with 5400. The max “adults” for resort hotel in September and October 2015 is 55 and 50 respectively. The min values are mostly 0s or 1s.
I also grouped by “hotel”, “arrival_date_year”, ”arrival_date_month” to find the total number of “adults”, “children” and “babies” that stayed in the city hotel and resort hotel and filtered for the months July and August. We have data available for July and August for all the three years. We see an increasing trend in the number of people staying in hotels from year 2015 to year 2017.
---
title: "Challenge 2 Solutions"
author: "Vinitha Maheswaran"
desription: "Data wrangling: using group() and summarise()"
date: "10/12/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- railroads
- faostat
- hotel_bookings
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.
- railroad\*.csv or StateCounty2012.xls ⭐
- FAOstat\*.csv or birds.csv ⭐⭐⭐
- hotel_bookings.csv ⭐⭐⭐⭐
For this challenge I will be working with the "hotel_bookings" data set. It is a publicly available data set containing booking transactions from a city hotel and a resort hotel in Portugal.
```{r}
# Reading the hotel_bookings.csv data set and storing in a data frame
hotel_data <- read_csv("_data/hotel_bookings.csv")
print(hotel_data)
```
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
## Describe the data
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
After reading the data using read_csv function, it is stored in a dataframe "hotel_data". The hotel_bookings data set contains 119390 rows (observations) and 32 columns (attributes/variables) including information such as when the booking was made, length of stay, the number of adults, children, and/or babies, the number of available parking spaces, and many others for the respective hotels. Each row/observation represents a hotel booking. Since this is a public data set, all data elements pertaining to hotel or customer identification are not included in the data set. This data set can have an important role for research and educational purposes in hotel administration, revenue management, machine learning, data mining and other fields. This data set will be helpful for (i) Tourists (people who are booking hotels) to check and understand trends of hotel price over a period of time and plan their travel accordingly within budget, learn about hotel accommodations and features before booking (ii) Hotel Management to keep track of the relevant information about themselves as well as their competitors. Understand and analyze the seasonal trend of hotel booking and accommodate different types of visitors that they have (iii) Tourist / hospitality services - (e.g. travel agency, airlines / car rental companies) to observe the times when hotels in the region are in high demand, analyze the duration of typical stays, and use the information to help plan their own services (iv) Local Government / independent data analysts to observe the overall trend of tourist activities in the region and analyzing the different types of visitors in the hotels during different seasons.
Using the "dfSummary" function from "summarytools" package we find that there are 31994 duplicates in the data. The reason for the identified duplication is that there is no unique id for each booking. It is possible that the booking was made by different tourists and the values for each attribute was exactly the same. This confusion could have been avoided by adding a Booking ID which would be unique for each booking. We can also see that there are 4 NA's in the data for the column "Children". We can ignore this for now as 4 NA values out of 119390 values would not change the grouped summary statistics significantly. 66.4% of the data represents city hotel and the remaining 33.6% of the data represents resort hotel. 18.4%, 47.5% and 34.1% of the data correspondingly represents years 2015, 2016 and 2017. I wanted to check the unequal distribution in data for the 3 consecutive years. I further investigated the data and found that the data we have represents hotel bookings from the period July 2015 - August 2017. Hence, we cannot compare the summary statistics between 2015, 2016 and 2017 as we do not have data for the entire years of 2015 and 2017. I performed group summary statistics on "hotel", arrival_date_year" and "arrival_date_month" combined.
Few interesting finds!!
The maximum "lead_time" is 737 indicating that someone had made a booking 2 years in advance! 37% of the bookings were cancelled. A booking was made for a maximum of 55 "adults" once. A booking was made for a maximum of 10 "children" once and a maximum of 10 "babies" another time. About 77% of the bookings preferred BB (Bed & Breakfast) "meal" package. The "country" of origin was PRT (Portugal) for around 41% of the bookings. Most of the bookings are made by Offline TA/TO and Online TA "market segment". The maximum "previous_cancellations" done by a person making the hotel booking is 26. Most of the tourists prefer "reserved_room_type" A which may be the best bang for your buck. A maximum of 21 "booking_changes" was made for a particular booking. A certain booking was on a maximum of 391 "days_in_waiting_list". The maximum "adr" (Average Daily Rate) for a booking was 5400. The maximum "required_car_parking_spaces" was a maximum of 8 in two separate booking instances.
```{r}
#Finding dimension of the data set
dim(hotel_data)
```
```{r}
#Finding column names
colnames(hotel_data)
```
```{r}
#Structure of hotel_data
str(hotel_data)
```
```{r}
#Summary of hotel_data
summary(hotel_data)
```
```{r}
#Summary of hotel_data
library(summarytools)
print(summarytools::dfSummary(hotel_data,
varnumbers = FALSE,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.60,
valid.col = FALSE),
method = 'render',
table.classes = 'table-condensed')
```
```{r}
#Check for missing/null data in the hotel_data
sum(is.na(hotel_data))
sum(is.null(hotel_data))
```
```{r}
#Check for the period of hotel booking dates
hotel_data%>%
select(hotel, arrival_date_year, arrival_date_month)%>%distinct()
```
## Provide Grouped Summary Statistics
Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
```{r}
# Finding the total number of adults, children and babies who stayed in the hotels
hotel_data%>%
select(hotel, adults, children, babies)%>%
summarise(total = sum(adults+children+babies, na.rm = TRUE), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))
```
```{r}
# Finding the total number of adults, children and babies for each hotel
hotel_data%>%
select(hotel, adults, children, babies)%>%
group_by(hotel)%>%
summarise(total = sum(adults+children+babies, na.rm = TRUE), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))
```
```{r}
# Finding the total number of adults, children and babies for each hotel, year
hotel_data%>%
select(hotel, arrival_date_year, adults, children, babies)%>%
group_by(hotel, arrival_date_year)%>%
summarise(total = sum(adults+children+babies, na.rm = TRUE), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))
```
```{r}
# Finding the total number of adults, children and babies for each hotel, year, month
hotel_data%>%
select(hotel, arrival_date_year, arrival_date_month, adults, children, babies)%>%
group_by(hotel, arrival_date_year, arrival_date_month)%>%
summarise(total = sum(adults+children+babies, na.rm = TRUE), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))
```
```{r}
# Finding the total number of adults, children and babies for each hotel, year for the July month
hotel_data%>%
select(hotel, arrival_date_year, arrival_date_month, adults, children, babies)%>%
filter(arrival_date_month=='July')%>%
group_by(hotel, arrival_date_year, arrival_date_month)%>%
summarise(total = sum(adults+children+babies, na.rm = TRUE), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))
```
```{r}
# Finding the total number of adults, children and babies for each hotel, year for the August month
hotel_data%>%
select(hotel, arrival_date_year, arrival_date_month, adults, children, babies)%>%
filter(arrival_date_month=='August')%>%
group_by(hotel, arrival_date_year, arrival_date_month)%>%
summarise(total = sum(adults+children+babies, na.rm = TRUE), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))
```
```{r}
# Finding the total number of adults, children and babies for year 2016 for each month
hotel_data%>%
select(hotel, arrival_date_year, arrival_date_month, adults, children, babies)%>%
filter(hotel=='City Hotel' & arrival_date_year==2016)%>%
group_by(arrival_date_year, arrival_date_month)%>%
summarise(total = sum(adults, children, babies), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))%>%
arrange(match(arrival_date_month,month.name))
```
From the above statistics, we can see that most people have stayed in Portugal hotels during Fall (September and October) and during August (summer break for children).
```{r}
# Finding the mean of adr, adults, children and babies for each hotel
hotel_data%>%
select(hotel, adr, adults, children, babies)%>%
group_by(hotel)%>%
summarise(avg_adr = mean(adr), avg_total = mean(adults+children+babies, na.rm = TRUE), avg_adults = mean(adults), avg_children = mean(children, na.rm = TRUE), avg_babies = mean(babies))
```
```{r}
# Finding the median of adr, adults, children and babies for each hotel
hotel_data%>%
select(hotel, adr, adults, children, babies)%>%
group_by(hotel)%>%
summarise(median_adr = median(adr), median_adults = median(adults), median_children = median(children, na.rm = TRUE), median_babies = median(babies))
```
```{r}
# Finding the sd of adr, adults, children and babies for each hotel
hotel_data%>%
select(hotel, adr, adults, children, babies)%>%
group_by(hotel)%>%
summarise(sd_adr = sd(adr), sd_adults = sd(adults), sd_children = sd(children, na.rm = TRUE), sd_babies = sd(babies))
```
```{r}
# Finding the mean of adr, adults, children and babies for each hotel, year
hotel_data%>%
select(hotel, adr, arrival_date_year, adults, children, babies)%>%
group_by(hotel, arrival_date_year)%>%
summarise(avg_adr = mean(adr), avg_adults = mean(adults), avg_children = mean(children, na.rm = TRUE), avg_babies = mean(babies))
```
```{r}
# Finding the median of adr, adults, children and babies for each hotel, year
hotel_data%>%
select(hotel, adr, arrival_date_year, adults, children, babies)%>%
group_by(hotel, arrival_date_year)%>%
summarise(median_adr = median(adr), median_adults = median(adults), median_children = median(children, na.rm = TRUE), median_babies = median(babies))
```
```{r}
# Finding the sd of adr, adults, children and babies for each hotel, year
hotel_data%>%
select(hotel, adr, arrival_date_year, adults, children, babies)%>%
group_by(hotel, arrival_date_year)%>%
summarise(sd_adr = sd(adr), sd_adults = sd(adults), sd_children = sd(children, na.rm = TRUE), sd_babies = sd(babies))
```
```{r}
# Finding the max of adr, adults, children and babies for each hotel, year
hotel_data%>%
select(hotel, adr, arrival_date_year, adults, children, babies)%>%
group_by(hotel, arrival_date_year)%>%
summarise(max_adr = max(adr), max_adults = max(adults), max_children = max(children, na.rm = TRUE), max_babies = max(babies))
```
```{r}
# Finding the min of adr, adults, children and babies for each hotel, year
hotel_data%>%
select(hotel, adr, arrival_date_year, adults, children, babies)%>%
group_by(hotel, arrival_date_year)%>%
summarise(min_adr = min(adr), min_total = min(adults+children+babies, na.rm = TRUE), min_adults = min(adults), min_children = min(children, na.rm = TRUE), min_babies = min(babies))
```
```{r}
# Finding the mean of adr, adults, children and babies for each hotel, year, month
hotel_data%>%
select(hotel, adr, arrival_date_year, arrival_date_month, adults, children, babies)%>%
group_by(hotel, arrival_date_year, arrival_date_month)%>%
summarise(avg_adr = mean(adr), avg_adults = mean(adults), avg_children = mean(children, na.rm = TRUE), avg_babies = mean(babies))
```
```{r}
# Finding the median of adr, adults, children and babies for each hotel, year, month
hotel_data%>%
select(hotel, adr, arrival_date_year, arrival_date_month, adults, children, babies)%>%
group_by(hotel, arrival_date_year, arrival_date_month)%>%
summarise(median_adr = median(adr), median_adults = median(adults), median_children = median(children, na.rm = TRUE), median_babies = median(babies))
```
```{r}
# Finding the sd of adr, adults, children and babies for each hotel, year, month
hotel_data%>%
select(hotel, adr, arrival_date_year, arrival_date_month, adults, children, babies)%>%
group_by(hotel, arrival_date_year, arrival_date_month)%>%
summarise(sd_adr = sd(adr), sd_adults = sd(adults), sd_children = sd(children, na.rm = TRUE), sd_babies = sd(babies))
```
```{r}
# Finding the max of adr, adults, children and babies for each hotel, year, month
hotel_data%>%
select(hotel, adr, arrival_date_year, arrival_date_month, adults, children, babies)%>%
group_by(hotel, arrival_date_year, arrival_date_month)%>%
summarise(max_adr = max(adr), max_adults = max(adults), max_children = max(children, na.rm = TRUE), max_babies = max(babies))%>%
arrange(desc(max_adr))
```
```{r}
# Finding the min of adr, adults, children and babies for each hotel, year, month
hotel_data%>%
select(hotel, adr, arrival_date_year, arrival_date_month, adults, children, babies)%>%
group_by(hotel, arrival_date_year, arrival_date_month)%>%
summarise(min_adr = min(adr), min_total = min(adults+children+babies, na.rm = TRUE), min_adults = min(adults), min_children = min(children, na.rm = TRUE), min_babies = min(babies))
```
### Explain and Interpret
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
A total of 221636 adults, 12403 children and 949 babies stayed in both hotels from July 2015 - August 2017. I grouped by “hotel” to find the total number of “adults”, “children” and “babies” that stayed in the (i) city hotel as 146838, 7248, 392 and (ii) resort hotel as 74798, 5155, 557 respectively. Looking at the group by “hotel”, we can find that families with babies prefer to stay at resort hotel compared to city hotel. This may be due to the fact that resort hotels are generally closer to the beach or has swimming pool and fun outdoor activities for babies. City hotels are preferred more by adults traveling alone without kids. City hotel may be more suitable for people traveling for business purposes. I also find the mean, median, and sd for the “adr”, “adults”, “children” and “babies” grouped by “hotel”. The difference between the mean and median values are almost similar indicating that the data is not skewed. The average “adr” for City Hotel is more than the Resort Hotel which makes sense as city hotels are usually located in central locations making transportation easier.
Next, I grouped by “hotel”, “arrival_date_year” to find the total number of “adults”, “children” and “babies” that stayed in the city hotel and resort hotel for years 2015, 2016 and 2017. I also find the mean, median, sd, max, and min for the “adr”, “adults”, “children” and “babies” grouped by “hotel” and “arrival_date_year”. We do not interpret anything new from the mean, median and sd. From the max, we identify few interesting bookings where the maximum number of “adults” is 55 / max “children” is 10 in resort hotel for the year 2015 and max “babies” is 9 / max “babies” is 10 for city hotel in years 2015 and 2016 respectively. The min values are mostly 0s except for the year 2017 in resort hotel where the min “adr” is -6.38. Having a negative “adr” is unusual and needs to be investigated further.
Finally, I grouped by “hotel”, “arrival_date_year”, ”arrival_date_month” to find the total number of “adults”, “children” and “babies” that stayed in the city hotel and resort hotel for July 2015 - August 2017. For the city hotel, the months September and October had the highest number of people as this is the best time to visit Portugal during Fall when the weather is warm and the crowd is relatively less compared to Summer. For the resort hotel, August month had the highest number of people (children and babies in particular) as this is the best time to stay in resort during Summer and enjoy outdoor activities. Hence, hotel management can price the rooms accordingly during peak season and off-season. I also find the mean, median, sd, max, and min for the “adr”, “adults”, “children” and “babies” grouped by “hotel”, “arrival_date_year”, and ”arrival_date_month”. We do not interpret anything new from the mean, median and sd. The max “adr” for City Hotel in March 2016 is the highest with 5400. The max “adults” for resort hotel in September and October 2015 is 55 and 50 respectively. The min values are mostly 0s or 1s.
I also grouped by “hotel”, “arrival_date_year”, ”arrival_date_month” to find the total number of “adults”, “children” and “babies” that stayed in the city hotel and resort hotel and filtered for the months July and August. We have data available for July and August for all the three years. We see an increasing trend in the number of people staying in hotels from year 2015 to year 2017.