DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 2 Solutions

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Challenge Overview
  • Read in the Data
  • Describe the data
  • Provide Grouped Summary Statistics
    • Explain and Interpret

Challenge 2 Solutions

  • Show All Code
  • Hide All Code

  • View Source
challenge_2
railroads
faostat
hotel_bookings
Author

Vinitha Maheswaran

Published

October 12, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

  • railroad*.csv or StateCounty2012.xls ⭐
  • FAOstat*.csv or birds.csv ⭐⭐⭐
  • hotel_bookings.csv ⭐⭐⭐⭐

For this challenge I will be working with the “hotel_bookings” data set. It is a publicly available data set containing booking transactions from a city hotel and a resort hotel in Portugal.

Code
# Reading the hotel_bookings.csv data set and storing in a data frame
hotel_data <- read_csv("_data/hotel_bookings.csv")
print(hotel_data)
# A tibble: 119,390 × 32
   hotel  is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
   <chr>    <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 Resor…       0     342    2015 July         27       1       0       0      2
 2 Resor…       0     737    2015 July         27       1       0       0      2
 3 Resor…       0       7    2015 July         27       1       0       1      1
 4 Resor…       0      13    2015 July         27       1       0       1      1
 5 Resor…       0      14    2015 July         27       1       0       2      2
 6 Resor…       0      14    2015 July         27       1       0       2      2
 7 Resor…       0       0    2015 July         27       1       0       2      2
 8 Resor…       0       9    2015 July         27       1       0       2      2
 9 Resor…       1      85    2015 July         27       1       0       3      2
10 Resor…       1      75    2015 July         27       1       0       3      2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

After reading the data using read_csv function, it is stored in a dataframe “hotel_data”. The hotel_bookings data set contains 119390 rows (observations) and 32 columns (attributes/variables) including information such as when the booking was made, length of stay, the number of adults, children, and/or babies, the number of available parking spaces, and many others for the respective hotels. Each row/observation represents a hotel booking. Since this is a public data set, all data elements pertaining to hotel or customer identification are not included in the data set. This data set can have an important role for research and educational purposes in hotel administration, revenue management, machine learning, data mining and other fields. This data set will be helpful for (i) Tourists (people who are booking hotels) to check and understand trends of hotel price over a period of time and plan their travel accordingly within budget, learn about hotel accommodations and features before booking (ii) Hotel Management to keep track of the relevant information about themselves as well as their competitors. Understand and analyze the seasonal trend of hotel booking and accommodate different types of visitors that they have (iii) Tourist / hospitality services - (e.g. travel agency, airlines / car rental companies) to observe the times when hotels in the region are in high demand, analyze the duration of typical stays, and use the information to help plan their own services (iv) Local Government / independent data analysts to observe the overall trend of tourist activities in the region and analyzing the different types of visitors in the hotels during different seasons.

Using the “dfSummary” function from “summarytools” package we find that there are 31994 duplicates in the data. The reason for the identified duplication is that there is no unique id for each booking. It is possible that the booking was made by different tourists and the values for each attribute was exactly the same. This confusion could have been avoided by adding a Booking ID which would be unique for each booking. We can also see that there are 4 NA’s in the data for the column “Children”. We can ignore this for now as 4 NA values out of 119390 values would not change the grouped summary statistics significantly. 66.4% of the data represents city hotel and the remaining 33.6% of the data represents resort hotel. 18.4%, 47.5% and 34.1% of the data correspondingly represents years 2015, 2016 and 2017. I wanted to check the unequal distribution in data for the 3 consecutive years. I further investigated the data and found that the data we have represents hotel bookings from the period July 2015 - August 2017. Hence, we cannot compare the summary statistics between 2015, 2016 and 2017 as we do not have data for the entire years of 2015 and 2017. I performed group summary statistics on “hotel”, arrival_date_year” and “arrival_date_month” combined.

Few interesting finds!! The maximum “lead_time” is 737 indicating that someone had made a booking 2 years in advance! 37% of the bookings were cancelled. A booking was made for a maximum of 55 “adults” once. A booking was made for a maximum of 10 “children” once and a maximum of 10 “babies” another time. About 77% of the bookings preferred BB (Bed & Breakfast) “meal” package. The “country” of origin was PRT (Portugal) for around 41% of the bookings. Most of the bookings are made by Offline TA/TO and Online TA “market segment”. The maximum “previous_cancellations” done by a person making the hotel booking is 26. Most of the tourists prefer “reserved_room_type” A which may be the best bang for your buck. A maximum of 21 “booking_changes” was made for a particular booking. A certain booking was on a maximum of 391 “days_in_waiting_list”. The maximum “adr” (Average Daily Rate) for a booking was 5400. The maximum “required_car_parking_spaces” was a maximum of 8 in two separate booking instances.

Code
#Finding dimension of the data set
dim(hotel_data)
[1] 119390     32
Code
#Finding column names
colnames(hotel_data)
 [1] "hotel"                          "is_canceled"                   
 [3] "lead_time"                      "arrival_date_year"             
 [5] "arrival_date_month"             "arrival_date_week_number"      
 [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
 [9] "stays_in_week_nights"           "adults"                        
[11] "children"                       "babies"                        
[13] "meal"                           "country"                       
[15] "market_segment"                 "distribution_channel"          
[17] "is_repeated_guest"              "previous_cancellations"        
[19] "previous_bookings_not_canceled" "reserved_room_type"            
[21] "assigned_room_type"             "booking_changes"               
[23] "deposit_type"                   "agent"                         
[25] "company"                        "days_in_waiting_list"          
[27] "customer_type"                  "adr"                           
[29] "required_car_parking_spaces"    "total_of_special_requests"     
[31] "reservation_status"             "reservation_status_date"       
Code
#Structure of hotel_data
str(hotel_data)
spc_tbl_ [119,390 × 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ hotel                         : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
 $ is_canceled                   : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
 $ lead_time                     : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
 $ arrival_date_year             : num [1:119390] 2015 2015 2015 2015 2015 ...
 $ arrival_date_month            : chr [1:119390] "July" "July" "July" "July" ...
 $ arrival_date_week_number      : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
 $ arrival_date_day_of_month     : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
 $ stays_in_weekend_nights       : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ stays_in_week_nights          : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
 $ adults                        : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
 $ children                      : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ babies                        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ meal                          : chr [1:119390] "BB" "BB" "BB" "BB" ...
 $ country                       : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
 $ market_segment                : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
 $ distribution_channel          : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
 $ is_repeated_guest             : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ previous_cancellations        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ reserved_room_type            : chr [1:119390] "C" "C" "A" "A" ...
 $ assigned_room_type            : chr [1:119390] "C" "C" "C" "A" ...
 $ booking_changes               : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
 $ deposit_type                  : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
 $ agent                         : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
 $ company                       : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
 $ days_in_waiting_list          : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ customer_type                 : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
 $ adr                           : num [1:119390] 0 0 75 75 98 ...
 $ required_car_parking_spaces   : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ total_of_special_requests     : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
 $ reservation_status            : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
 $ reservation_status_date       : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
 - attr(*, "spec")=
  .. cols(
  ..   hotel = col_character(),
  ..   is_canceled = col_double(),
  ..   lead_time = col_double(),
  ..   arrival_date_year = col_double(),
  ..   arrival_date_month = col_character(),
  ..   arrival_date_week_number = col_double(),
  ..   arrival_date_day_of_month = col_double(),
  ..   stays_in_weekend_nights = col_double(),
  ..   stays_in_week_nights = col_double(),
  ..   adults = col_double(),
  ..   children = col_double(),
  ..   babies = col_double(),
  ..   meal = col_character(),
  ..   country = col_character(),
  ..   market_segment = col_character(),
  ..   distribution_channel = col_character(),
  ..   is_repeated_guest = col_double(),
  ..   previous_cancellations = col_double(),
  ..   previous_bookings_not_canceled = col_double(),
  ..   reserved_room_type = col_character(),
  ..   assigned_room_type = col_character(),
  ..   booking_changes = col_double(),
  ..   deposit_type = col_character(),
  ..   agent = col_character(),
  ..   company = col_character(),
  ..   days_in_waiting_list = col_double(),
  ..   customer_type = col_character(),
  ..   adr = col_double(),
  ..   required_car_parking_spaces = col_double(),
  ..   total_of_special_requests = col_double(),
  ..   reservation_status = col_character(),
  ..   reservation_status_date = col_date(format = "")
  .. )
 - attr(*, "problems")=<externalptr> 
Code
#Summary of hotel_data
summary(hotel_data)
    hotel            is_canceled       lead_time   arrival_date_year
 Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
 Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
 Mode  :character   Median :0.0000   Median : 69   Median :2016     
                    Mean   :0.3704   Mean   :104   Mean   :2016     
                    3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
                    Max.   :1.0000   Max.   :737   Max.   :2017     
                                                                    
 arrival_date_month arrival_date_week_number arrival_date_day_of_month
 Length:119390      Min.   : 1.00            Min.   : 1.0             
 Class :character   1st Qu.:16.00            1st Qu.: 8.0             
 Mode  :character   Median :28.00            Median :16.0             
                    Mean   :27.17            Mean   :15.8             
                    3rd Qu.:38.00            3rd Qu.:23.0             
                    Max.   :53.00            Max.   :31.0             
                                                                      
 stays_in_weekend_nights stays_in_week_nights     adults      
 Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
 1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
 Median : 1.0000         Median : 2.0         Median : 2.000  
 Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
 3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
 Max.   :19.0000         Max.   :50.0         Max.   :55.000  
                                                              
    children           babies              meal             country         
 Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
 1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
 Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
 Mean   : 0.1039   Mean   : 0.007949                                        
 3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
 Max.   :10.0000   Max.   :10.000000                                        
 NA's   :4                                                                  
 market_segment     distribution_channel is_repeated_guest
 Length:119390      Length:119390        Min.   :0.00000  
 Class :character   Class :character     1st Qu.:0.00000  
 Mode  :character   Mode  :character     Median :0.00000  
                                         Mean   :0.03191  
                                         3rd Qu.:0.00000  
                                         Max.   :1.00000  
                                                          
 previous_cancellations previous_bookings_not_canceled reserved_room_type
 Min.   : 0.00000       Min.   : 0.0000                Length:119390     
 1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
 Median : 0.00000       Median : 0.0000                Mode  :character  
 Mean   : 0.08712       Mean   : 0.1371                                  
 3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
 Max.   :26.00000       Max.   :72.0000                                  
                                                                         
 assigned_room_type booking_changes   deposit_type          agent          
 Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
 Class :character   1st Qu.: 0.0000   Class :character   Class :character  
 Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
                    Mean   : 0.2211                                        
                    3rd Qu.: 0.0000                                        
                    Max.   :21.0000                                        
                                                                           
   company          days_in_waiting_list customer_type           adr         
 Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
 Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
 Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
                    Mean   :  2.321                         Mean   : 101.83  
                    3rd Qu.:  0.000                         3rd Qu.: 126.00  
                    Max.   :391.000                         Max.   :5400.00  
                                                                             
 required_car_parking_spaces total_of_special_requests reservation_status
 Min.   :0.00000             Min.   :0.0000            Length:119390     
 1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
 Median :0.00000             Median :0.0000            Mode  :character  
 Mean   :0.06252             Mean   :0.5714                              
 3rd Qu.:0.00000             3rd Qu.:1.0000                              
 Max.   :8.00000             Max.   :5.0000                              
                                                                         
 reservation_status_date
 Min.   :2014-10-17     
 1st Qu.:2016-02-01     
 Median :2016-08-07     
 Mean   :2016-07-30     
 3rd Qu.:2017-02-08     
 Max.   :2017-09-14     
                        
Code
#Summary of hotel_data
library(summarytools)
print(summarytools::dfSummary(hotel_data,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.60, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

hotel_data

Dimensions: 119390 x 32
Duplicates: 31994
Variable Stats / Values Freqs (% of Valid) Graph Missing
hotel [character]
1. City Hotel
2. Resort Hotel
79330(66.4%)
40060(33.6%)
0 (0.0%)
is_canceled [numeric]
Min : 0
Mean : 0.4
Max : 1
0:75166(63.0%)
1:44224(37.0%)
0 (0.0%)
lead_time [numeric]
Mean (sd) : 104 (106.9)
min ≤ med ≤ max:
0 ≤ 69 ≤ 737
IQR (CV) : 142 (1)
479 distinct values 0 (0.0%)
arrival_date_year [numeric]
Mean (sd) : 2016.2 (0.7)
min ≤ med ≤ max:
2015 ≤ 2016 ≤ 2017
IQR (CV) : 1 (0)
2015:21996(18.4%)
2016:56707(47.5%)
2017:40687(34.1%)
0 (0.0%)
arrival_date_month [character]
1. August
2. July
3. May
4. October
5. April
6. June
7. September
8. March
9. February
10. November
[ 2 others ]
13877(11.6%)
12661(10.6%)
11791(9.9%)
11160(9.3%)
11089(9.3%)
10939(9.2%)
10508(8.8%)
9794(8.2%)
8068(6.8%)
6794(5.7%)
12709(10.6%)
0 (0.0%)
arrival_date_week_number [numeric]
Mean (sd) : 27.2 (13.6)
min ≤ med ≤ max:
1 ≤ 28 ≤ 53
IQR (CV) : 22 (0.5)
53 distinct values 0 (0.0%)
arrival_date_day_of_month [numeric]
Mean (sd) : 15.8 (8.8)
min ≤ med ≤ max:
1 ≤ 16 ≤ 31
IQR (CV) : 15 (0.6)
31 distinct values 0 (0.0%)
stays_in_weekend_nights [numeric]
Mean (sd) : 0.9 (1)
min ≤ med ≤ max:
0 ≤ 1 ≤ 19
IQR (CV) : 2 (1.1)
17 distinct values 0 (0.0%)
stays_in_week_nights [numeric]
Mean (sd) : 2.5 (1.9)
min ≤ med ≤ max:
0 ≤ 2 ≤ 50
IQR (CV) : 2 (0.8)
35 distinct values 0 (0.0%)
adults [numeric]
Mean (sd) : 1.9 (0.6)
min ≤ med ≤ max:
0 ≤ 2 ≤ 55
IQR (CV) : 0 (0.3)
14 distinct values 0 (0.0%)
children [numeric]
Mean (sd) : 0.1 (0.4)
min ≤ med ≤ max:
0 ≤ 0 ≤ 10
IQR (CV) : 0 (3.8)
0:110796(92.8%)
1:4861(4.1%)
2:3652(3.1%)
3:76(0.1%)
10:1(0.0%)
4 (0.0%)
babies [numeric]
Mean (sd) : 0 (0.1)
min ≤ med ≤ max:
0 ≤ 0 ≤ 10
IQR (CV) : 0 (12.3)
0:118473(99.2%)
1:900(0.8%)
2:15(0.0%)
9:1(0.0%)
10:1(0.0%)
0 (0.0%)
meal [character]
1. BB
2. FB
3. HB
4. SC
5. Undefined
92310(77.3%)
798(0.7%)
14463(12.1%)
10650(8.9%)
1169(1.0%)
0 (0.0%)
country [character]
1. PRT
2. GBR
3. FRA
4. ESP
5. DEU
6. ITA
7. IRL
8. BEL
9. BRA
10. NLD
[ 168 others ]
48590(40.7%)
12129(10.2%)
10415(8.7%)
8568(7.2%)
7287(6.1%)
3766(3.2%)
3375(2.8%)
2342(2.0%)
2224(1.9%)
2104(1.8%)
18590(15.6%)
0 (0.0%)
market_segment [character]
1. Aviation
2. Complementary
3. Corporate
4. Direct
5. Groups
6. Offline TA/TO
7. Online TA
8. Undefined
237(0.2%)
743(0.6%)
5295(4.4%)
12606(10.6%)
19811(16.6%)
24219(20.3%)
56477(47.3%)
2(0.0%)
0 (0.0%)
distribution_channel [character]
1. Corporate
2. Direct
3. GDS
4. TA/TO
5. Undefined
6677(5.6%)
14645(12.3%)
193(0.2%)
97870(82.0%)
5(0.0%)
0 (0.0%)
is_repeated_guest [numeric]
Min : 0
Mean : 0
Max : 1
0:115580(96.8%)
1:3810(3.2%)
0 (0.0%)
previous_cancellations [numeric]
Mean (sd) : 0.1 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 26
IQR (CV) : 0 (9.7)
15 distinct values 0 (0.0%)
previous_bookings_not_canceled [numeric]
Mean (sd) : 0.1 (1.5)
min ≤ med ≤ max:
0 ≤ 0 ≤ 72
IQR (CV) : 0 (10.9)
73 distinct values 0 (0.0%)
reserved_room_type [character]
1. A
2. B
3. C
4. D
5. E
6. F
7. G
8. H
9. L
10. P
85994(72.0%)
1118(0.9%)
932(0.8%)
19201(16.1%)
6535(5.5%)
2897(2.4%)
2094(1.8%)
601(0.5%)
6(0.0%)
12(0.0%)
0 (0.0%)
assigned_room_type [character]
1. A
2. D
3. E
4. F
5. G
6. C
7. B
8. H
9. I
10. K
[ 2 others ]
74053(62.0%)
25322(21.2%)
7806(6.5%)
3751(3.1%)
2553(2.1%)
2375(2.0%)
2163(1.8%)
712(0.6%)
363(0.3%)
279(0.2%)
13(0.0%)
0 (0.0%)
booking_changes [numeric]
Mean (sd) : 0.2 (0.7)
min ≤ med ≤ max:
0 ≤ 0 ≤ 21
IQR (CV) : 0 (2.9)
21 distinct values 0 (0.0%)
deposit_type [character]
1. No Deposit
2. Non Refund
3. Refundable
104641(87.6%)
14587(12.2%)
162(0.1%)
0 (0.0%)
agent [character]
1. 9
2. NULL
3. 240
4. 1
5. 14
6. 7
7. 6
8. 250
9. 241
10. 28
[ 324 others ]
31961(26.8%)
16340(13.7%)
13922(11.7%)
7191(6.0%)
3640(3.0%)
3539(3.0%)
3290(2.8%)
2870(2.4%)
1721(1.4%)
1666(1.4%)
33250(27.8%)
0 (0.0%)
company [character]
1. NULL
2. 40
3. 223
4. 67
5. 45
6. 153
7. 174
8. 219
9. 281
10. 154
[ 343 others ]
112593(94.3%)
927(0.8%)
784(0.7%)
267(0.2%)
250(0.2%)
215(0.2%)
149(0.1%)
141(0.1%)
138(0.1%)
133(0.1%)
3793(3.2%)
0 (0.0%)
days_in_waiting_list [numeric]
Mean (sd) : 2.3 (17.6)
min ≤ med ≤ max:
0 ≤ 0 ≤ 391
IQR (CV) : 0 (7.6)
128 distinct values 0 (0.0%)
customer_type [character]
1. Contract
2. Group
3. Transient
4. Transient-Party
4076(3.4%)
577(0.5%)
89613(75.1%)
25124(21.0%)
0 (0.0%)
adr [numeric]
Mean (sd) : 101.8 (50.5)
min ≤ med ≤ max:
-6.4 ≤ 94.6 ≤ 5400
IQR (CV) : 56.7 (0.5)
8879 distinct values 0 (0.0%)
required_car_parking_spaces [numeric]
Mean (sd) : 0.1 (0.2)
min ≤ med ≤ max:
0 ≤ 0 ≤ 8
IQR (CV) : 0 (3.9)
0:111974(93.8%)
1:7383(6.2%)
2:28(0.0%)
3:3(0.0%)
8:2(0.0%)
0 (0.0%)
total_of_special_requests [numeric]
Mean (sd) : 0.6 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 5
IQR (CV) : 1 (1.4)
0:70318(58.9%)
1:33226(27.8%)
2:12969(10.9%)
3:2497(2.1%)
4:340(0.3%)
5:40(0.0%)
0 (0.0%)
reservation_status [character]
1. Canceled
2. Check-Out
3. No-Show
43017(36.0%)
75166(63.0%)
1207(1.0%)
0 (0.0%)
reservation_status_date [Date]
min : 2014-10-17
med : 2016-08-07
max : 2017-09-14
range : 2y 10m 28d
926 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-21

Code
#Check for missing/null data in the hotel_data
sum(is.na(hotel_data))
[1] 4
Code
sum(is.null(hotel_data))
[1] 0
Code
#Check for the period of hotel booking dates
hotel_data%>%
  select(hotel, arrival_date_year, arrival_date_month)%>%distinct()
# A tibble: 52 × 3
   hotel        arrival_date_year arrival_date_month
   <chr>                    <dbl> <chr>             
 1 Resort Hotel              2015 July              
 2 Resort Hotel              2015 August            
 3 Resort Hotel              2015 September         
 4 Resort Hotel              2015 October           
 5 Resort Hotel              2015 November          
 6 Resort Hotel              2015 December          
 7 Resort Hotel              2016 January           
 8 Resort Hotel              2016 February          
 9 Resort Hotel              2016 March             
10 Resort Hotel              2016 April             
# … with 42 more rows

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

Code
# Finding the total number of adults, children and babies who stayed in the hotels

hotel_data%>%
  select(hotel, adults, children, babies)%>%
  summarise(total = sum(adults+children+babies, na.rm = TRUE), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))
# A tibble: 1 × 4
   total total_adults total_children total_babies
   <dbl>        <dbl>          <dbl>        <dbl>
1 234979       221636          12403          949
Code
# Finding the total number of adults, children and babies for each hotel

hotel_data%>%
  select(hotel, adults, children, babies)%>%
  group_by(hotel)%>%
  summarise(total = sum(adults+children+babies, na.rm = TRUE), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))
# A tibble: 2 × 5
  hotel         total total_adults total_children total_babies
  <chr>         <dbl>        <dbl>          <dbl>        <dbl>
1 City Hotel   154469       146838           7248          392
2 Resort Hotel  80510        74798           5155          557
Code
# Finding the total number of adults, children and babies for each hotel, year

hotel_data%>%
  select(hotel, arrival_date_year, adults, children, babies)%>%
  group_by(hotel, arrival_date_year)%>%
  summarise(total = sum(adults+children+babies, na.rm = TRUE), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))
# A tibble: 6 × 6
# Groups:   hotel [2]
  hotel        arrival_date_year total total_adults total_children total_babies
  <chr>                    <dbl> <dbl>        <dbl>          <dbl>        <dbl>
1 City Hotel                2015 24972        24385            512           84
2 City Hotel                2016 74577        70568           3814          195
3 City Hotel                2017 54920        51885           2922          113
4 Resort Hotel              2015 16861        15877            845          139
5 Resort Hotel              2016 36758        34396           2094          268
6 Resort Hotel              2017 26891        24525           2216          150
Code
# Finding the total number of adults, children and babies for each hotel, year, month

hotel_data%>%
  select(hotel, arrival_date_year, arrival_date_month, adults, children, babies)%>%
  group_by(hotel, arrival_date_year, arrival_date_month)%>%
  summarise(total = sum(adults+children+babies, na.rm = TRUE), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))
# A tibble: 52 × 7
# Groups:   hotel, arrival_date_year [6]
   hotel      arrival_date_year arrival_date_month total total…¹ total…² total…³
   <chr>                  <dbl> <chr>              <dbl>   <dbl>   <dbl>   <dbl>
 1 City Hotel              2015 August              4869    4697     165      16
 2 City Hotel              2015 December            3050    2907     120      23
 3 City Hotel              2015 July                2687    2671      14       2
 4 City Hotel              2015 November            1951    1918      25       8
 5 City Hotel              2015 October             6029    5910      98      21
 6 City Hotel              2015 September           6386    6282      90      14
 7 City Hotel              2016 April               7021    6671     339      11
 8 City Hotel              2016 August              7660    6904     730      26
 9 City Hotel              2016 December            5095    4702     378      15
10 City Hotel              2016 February            4331    4124     194      13
# … with 42 more rows, and abbreviated variable names ¹​total_adults,
#   ²​total_children, ³​total_babies
Code
# Finding the total number of adults, children and babies for each hotel, year for the July month

hotel_data%>%
  select(hotel, arrival_date_year, arrival_date_month, adults, children, babies)%>%
  filter(arrival_date_month=='July')%>%
  group_by(hotel, arrival_date_year, arrival_date_month)%>%
  summarise(total = sum(adults+children+babies, na.rm = TRUE), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))
# A tibble: 6 × 7
# Groups:   hotel, arrival_date_year [6]
  hotel        arrival_date_year arrival_date_mo…¹ total total…² total…³ total…⁴
  <chr>                    <dbl> <chr>             <dbl>   <dbl>   <dbl>   <dbl>
1 City Hotel                2015 July               2687    2671      14       2
2 City Hotel                2016 July               6857    6231     605      21
3 City Hotel                2017 July               7659    7073     574      12
4 Resort Hotel              2015 July               3028    2752     252      24
5 Resort Hotel              2016 July               3244    2887     319      38
6 Resort Hotel              2017 July               4132    3550     558      24
# … with abbreviated variable names ¹​arrival_date_month, ²​total_adults,
#   ³​total_children, ⁴​total_babies
Code
# Finding the total number of adults, children and babies for each hotel, year for the August month

hotel_data%>%
  select(hotel, arrival_date_year, arrival_date_month, adults, children, babies)%>%
  filter(arrival_date_month=='August')%>%
  group_by(hotel, arrival_date_year, arrival_date_month)%>%
  summarise(total = sum(adults+children+babies, na.rm = TRUE), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))
# A tibble: 6 × 7
# Groups:   hotel, arrival_date_year [6]
  hotel        arrival_date_year arrival_date_mo…¹ total total…² total…³ total…⁴
  <chr>                    <dbl> <chr>             <dbl>   <dbl>   <dbl>   <dbl>
1 City Hotel                2015 August             4869    4697     165      16
2 City Hotel                2016 August             7660    6904     730      26
3 City Hotel                2017 August             6863    6240     604      19
4 Resort Hotel              2015 August             3162    2848     263      51
5 Resort Hotel              2016 August             3940    3447     438      55
6 Resort Hotel              2017 August             4268    3659     580      29
# … with abbreviated variable names ¹​arrival_date_month, ²​total_adults,
#   ³​total_children, ⁴​total_babies
Code
# Finding the total number of adults, children and babies for year 2016 for each month

hotel_data%>%
  select(hotel, arrival_date_year, arrival_date_month, adults, children, babies)%>%
  filter(hotel=='City Hotel' & arrival_date_year==2016)%>%
  group_by(arrival_date_year, arrival_date_month)%>%
  summarise(total = sum(adults, children, babies), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))%>%
  arrange(match(arrival_date_month,month.name))
# A tibble: 12 × 6
# Groups:   arrival_date_year [1]
   arrival_date_year arrival_date_month total total_adults total_child…¹ total…²
               <dbl> <chr>              <dbl>        <dbl>         <dbl>   <dbl>
 1              2016 January             2285         2200            69      16
 2              2016 February            4331         4124           194      13
 3              2016 March               5893         5599           279      15
 4              2016 April               7021         6671           339      11
 5              2016 May                 6932         6671           254       7
 6              2016 June                7251         7012           222      17
 7              2016 July                6857         6231           605      21
 8              2016 August              7660         6904           730      26
 9              2016 September           7575         7287           261      27
10              2016 October             8104         7755           336      13
11              2016 November            5573         5412           147      14
12              2016 December            5095         4702           378      15
# … with abbreviated variable names ¹​total_children, ²​total_babies

From the above statistics, we can see that most people have stayed in Portugal hotels during Fall (September and October) and during August (summer break for children).

Code
# Finding the mean of adr, adults, children and babies for each hotel

hotel_data%>%
  select(hotel, adr, adults, children, babies)%>%
  group_by(hotel)%>%
  summarise(avg_adr = mean(adr), avg_total = mean(adults+children+babies, na.rm = TRUE), avg_adults = mean(adults), avg_children = mean(children, na.rm = TRUE), avg_babies = mean(babies))
# A tibble: 2 × 6
  hotel        avg_adr avg_total avg_adults avg_children avg_babies
  <chr>          <dbl>     <dbl>      <dbl>        <dbl>      <dbl>
1 City Hotel     105.       1.95       1.85       0.0914    0.00494
2 Resort Hotel    95.0      2.01       1.87       0.129     0.0139 
Code
# Finding the median of adr, adults, children and babies for each hotel

hotel_data%>%
  select(hotel, adr, adults, children, babies)%>%
  group_by(hotel)%>%
  summarise(median_adr = median(adr), median_adults = median(adults), median_children = median(children, na.rm = TRUE), median_babies = median(babies))
# A tibble: 2 × 5
  hotel        median_adr median_adults median_children median_babies
  <chr>             <dbl>         <dbl>           <dbl>         <dbl>
1 City Hotel         99.9             2               0             0
2 Resort Hotel       75               2               0             0
Code
# Finding the sd of adr, adults, children and babies for each hotel

hotel_data%>%
  select(hotel, adr, adults, children, babies)%>%
  group_by(hotel)%>%
  summarise(sd_adr = sd(adr), sd_adults = sd(adults), sd_children = sd(children, na.rm = TRUE), sd_babies = sd(babies))
# A tibble: 2 × 5
  hotel        sd_adr sd_adults sd_children sd_babies
  <chr>         <dbl>     <dbl>       <dbl>     <dbl>
1 City Hotel     43.6     0.509       0.372    0.0843
2 Resort Hotel   61.4     0.697       0.445    0.119 
Code
# Finding the mean of adr, adults, children and babies for each hotel, year

hotel_data%>%
  select(hotel, adr, arrival_date_year, adults, children, babies)%>%
  group_by(hotel, arrival_date_year)%>%
  summarise(avg_adr = mean(adr), avg_adults = mean(adults), avg_children = mean(children, na.rm = TRUE), avg_babies = mean(babies))
# A tibble: 6 × 6
# Groups:   hotel [2]
  hotel        arrival_date_year avg_adr avg_adults avg_children avg_babies
  <chr>                    <dbl>   <dbl>      <dbl>        <dbl>      <dbl>
1 City Hotel                2015    85.9       1.78       0.0374    0.00614
2 City Hotel                2016   103.        1.85       0.1       0.00511
3 City Hotel                2017   118.        1.89       0.106     0.00411
4 Resort Hotel              2015    89.4       1.91       0.102     0.0167 
5 Resort Hotel              2016    87.7       1.85       0.113     0.0144 
6 Resort Hotel              2017   109.        1.86       0.168     0.0114 
Code
# Finding the median of adr, adults, children and babies for each hotel, year

hotel_data%>%
  select(hotel, adr, arrival_date_year, adults, children, babies)%>%
  group_by(hotel, arrival_date_year)%>%
  summarise(median_adr = median(adr),  median_adults = median(adults), median_children = median(children, na.rm = TRUE), median_babies = median(babies))
# A tibble: 6 × 6
# Groups:   hotel [2]
  hotel        arrival_date_year median_adr median_adults median_child…¹ media…²
  <chr>                    <dbl>      <dbl>         <dbl>          <dbl>   <dbl>
1 City Hotel                2015       80               2              0       0
2 City Hotel                2016       99               2              0       0
3 City Hotel                2017      110               2              0       0
4 Resort Hotel              2015       75               2              0       0
5 Resort Hotel              2016       70               2              0       0
6 Resort Hotel              2017       86.2             2              0       0
# … with abbreviated variable names ¹​median_children, ²​median_babies
Code
# Finding the sd of adr, adults, children and babies for each hotel, year

hotel_data%>%
  select(hotel, adr, arrival_date_year, adults, children, babies)%>%
  group_by(hotel, arrival_date_year)%>%
  summarise(sd_adr = sd(adr), sd_adults = sd(adults), sd_children = sd(children, na.rm = TRUE), sd_babies = sd(babies))
# A tibble: 6 × 6
# Groups:   hotel [2]
  hotel        arrival_date_year sd_adr sd_adults sd_children sd_babies
  <chr>                    <dbl>  <dbl>     <dbl>       <dbl>     <dbl>
1 City Hotel                2015   34.3     0.452       0.250    0.107 
2 City Hotel                2016   45.2     0.520       0.387    0.0875
3 City Hotel                2017   41.6     0.518       0.399    0.0651
4 Resort Hotel              2015   53.8     1.25        0.406    0.130 
5 Resort Hotel              2016   57.1     0.450       0.416    0.121 
6 Resort Hotel              2017   69.1     0.448       0.502    0.108 
Code
# Finding the max of adr, adults, children and babies for each hotel, year

hotel_data%>%
  select(hotel, adr, arrival_date_year, adults, children, babies)%>%
  group_by(hotel, arrival_date_year)%>%
  summarise(max_adr = max(adr), max_adults = max(adults), max_children = max(children, na.rm = TRUE), max_babies = max(babies))
# A tibble: 6 × 6
# Groups:   hotel [2]
  hotel        arrival_date_year max_adr max_adults max_children max_babies
  <chr>                    <dbl>   <dbl>      <dbl>        <dbl>      <dbl>
1 City Hotel                2015     290          3            3          9
2 City Hotel                2016    5400          4            3         10
3 City Hotel                2017     510          4            3          2
4 Resort Hotel              2015     508         55           10          2
5 Resort Hotel              2016     384          4            3          2
6 Resort Hotel              2017     450          4            3          2
Code
# Finding the min of adr, adults, children and babies for each hotel, year

hotel_data%>%
  select(hotel, adr, arrival_date_year, adults, children, babies)%>%
  group_by(hotel, arrival_date_year)%>%
  summarise(min_adr = min(adr), min_total = min(adults+children+babies, na.rm = TRUE), min_adults = min(adults), min_children = min(children, na.rm = TRUE), min_babies = min(babies))
# A tibble: 6 × 7
# Groups:   hotel [2]
  hotel        arrival_date_year min_adr min_total min_adults min_chil…¹ min_b…²
  <chr>                    <dbl>   <dbl>     <dbl>      <dbl>      <dbl>   <dbl>
1 City Hotel                2015    0            0          0          0       0
2 City Hotel                2016    0            0          0          0       0
3 City Hotel                2017    0            0          0          0       0
4 Resort Hotel              2015    0            0          0          0       0
5 Resort Hotel              2016    0            0          0          0       0
6 Resort Hotel              2017   -6.38         0          0          0       0
# … with abbreviated variable names ¹​min_children, ²​min_babies
Code
# Finding the mean of adr, adults, children and babies for each hotel, year, month

hotel_data%>%
  select(hotel, adr, arrival_date_year, arrival_date_month, adults, children, babies)%>%
  group_by(hotel, arrival_date_year, arrival_date_month)%>%
  summarise(avg_adr = mean(adr), avg_adults = mean(adults), avg_children = mean(children, na.rm = TRUE), avg_babies = mean(babies))
# A tibble: 52 × 7
# Groups:   hotel, arrival_date_year [6]
   hotel      arrival_date_year arrival_date_m…¹ avg_adr avg_a…² avg_c…³ avg_b…⁴
   <chr>                  <dbl> <chr>              <dbl>   <dbl>   <dbl>   <dbl>
 1 City Hotel              2015 August              77.7    1.89  0.0666 0.00645
 2 City Hotel              2015 December            81.1    1.76  0.0726 0.0139 
 3 City Hotel              2015 July                69.8    1.91  0.0100 0.00143
 4 City Hotel              2015 November            73.5    1.55  0.0202 0.00648
 5 City Hotel              2015 October             89.4    1.75  0.0289 0.00620
 6 City Hotel              2015 September          101.     1.78  0.0255 0.00397
 7 City Hotel              2016 April               99.5    1.87  0.0952 0.00309
 8 City Hotel              2016 August             119.     2.04  0.216  0.00770
 9 City Hotel              2016 December            94.0    1.90  0.153  0.00605
10 City Hotel              2016 February            80.0    1.74  0.0818 0.00548
# … with 42 more rows, and abbreviated variable names ¹​arrival_date_month,
#   ²​avg_adults, ³​avg_children, ⁴​avg_babies
Code
# Finding the median of adr, adults, children and babies for each hotel, year, month

hotel_data%>%
  select(hotel, adr, arrival_date_year, arrival_date_month, adults, children, babies)%>%
  group_by(hotel, arrival_date_year, arrival_date_month)%>%
  summarise(median_adr = median(adr),  median_adults = median(adults), median_children = median(children, na.rm = TRUE), median_babies = median(babies))
# A tibble: 52 × 7
# Groups:   hotel, arrival_date_year [6]
   hotel      arrival_date_year arrival_date_m…¹ media…² media…³ media…⁴ media…⁵
   <chr>                  <dbl> <chr>              <dbl>   <dbl>   <dbl>   <dbl>
 1 City Hotel              2015 August              75         2       0       0
 2 City Hotel              2015 December            77.4       2       0       0
 3 City Hotel              2015 July                62.8       2       0       0
 4 City Hotel              2015 November            72.2       2       0       0
 5 City Hotel              2015 October             87.8       2       0       0
 6 City Hotel              2015 September           95         2       0       0
 7 City Hotel              2016 April               96.3       2       0       0
 8 City Hotel              2016 August             115.        2       0       0
 9 City Hotel              2016 December            88         2       0       0
10 City Hotel              2016 February            79         2       0       0
# … with 42 more rows, and abbreviated variable names ¹​arrival_date_month,
#   ²​median_adr, ³​median_adults, ⁴​median_children, ⁵​median_babies
Code
# Finding the sd of adr, adults, children and babies for each hotel, year, month

hotel_data%>%
  select(hotel, adr, arrival_date_year, arrival_date_month, adults, children, babies)%>%
  group_by(hotel, arrival_date_year, arrival_date_month)%>%
  summarise(sd_adr = sd(adr), sd_adults = sd(adults), sd_children = sd(children, na.rm = TRUE), sd_babies = sd(babies))
# A tibble: 52 × 7
# Groups:   hotel, arrival_date_year [6]
   hotel      arrival_date_year arrival_date_mo…¹ sd_adr sd_ad…² sd_ch…³ sd_ba…⁴
   <chr>                  <dbl> <chr>              <dbl>   <dbl>   <dbl>   <dbl>
 1 City Hotel              2015 August              26.9   0.365   0.340  0.0801
 2 City Hotel              2015 December            32.1   0.504   0.340  0.117 
 3 City Hotel              2015 July                19.4   0.365   0.107  0.0378
 4 City Hotel              2015 November            20.7   0.512   0.177  0.0803
 5 City Hotel              2015 October             30.7   0.463   0.221  0.166 
 6 City Hotel              2015 September           43.7   0.440   0.209  0.0629
 7 City Hotel              2016 April               30.5   0.478   0.381  0.0555
 8 City Hotel              2016 August              39.8   0.499   0.544  0.0874
 9 City Hotel              2016 December            36.0   0.538   0.466  0.0776
10 City Hotel              2016 February            24.7   0.535   0.355  0.0845
# … with 42 more rows, and abbreviated variable names ¹​arrival_date_month,
#   ²​sd_adults, ³​sd_children, ⁴​sd_babies
Code
# Finding the max of adr, adults, children and babies for each hotel, year, month

hotel_data%>%
  select(hotel, adr, arrival_date_year, arrival_date_month, adults, children, babies)%>%
  group_by(hotel, arrival_date_year, arrival_date_month)%>%
  summarise(max_adr = max(adr), max_adults = max(adults), max_children = max(children, na.rm = TRUE), max_babies = max(babies))%>%
  arrange(desc(max_adr))
# A tibble: 52 × 7
# Groups:   hotel, arrival_date_year [6]
   hotel        arrival_date_year arrival_date…¹ max_adr max_a…² max_c…³ max_b…⁴
   <chr>                    <dbl> <chr>            <dbl>   <dbl>   <dbl>   <dbl>
 1 City Hotel                2016 March            5400        3       2       1
 2 City Hotel                2017 May               510        4       3       1
 3 Resort Hotel              2015 July              508        4      10       2
 4 City Hotel                2016 December          452.       4       3       1
 5 Resort Hotel              2017 August            450        4       3       2
 6 Resort Hotel              2017 July              397.       4       2       1
 7 Resort Hotel              2016 December          384        3       3       2
 8 City Hotel                2017 August            372.       4       3       1
 9 Resort Hotel              2016 August            369        4       3       1
10 City Hotel                2016 July              365        4       3       2
# … with 42 more rows, and abbreviated variable names ¹​arrival_date_month,
#   ²​max_adults, ³​max_children, ⁴​max_babies
Code
# Finding the min of adr, adults, children and babies for each hotel, year, month

hotel_data%>%
  select(hotel, adr, arrival_date_year, arrival_date_month, adults, children, babies)%>%
  group_by(hotel, arrival_date_year, arrival_date_month)%>%
  summarise(min_adr = min(adr), min_total = min(adults+children+babies, na.rm = TRUE), min_adults = min(adults), min_children = min(children, na.rm = TRUE), min_babies = min(babies))
# A tibble: 52 × 8
# Groups:   hotel, arrival_date_year [6]
   hotel      arrival_date_year arriva…¹ min_adr min_t…² min_a…³ min_c…⁴ min_b…⁵
   <chr>                  <dbl> <chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 City Hotel              2015 August         0       0       0       0       0
 2 City Hotel              2015 December       0       0       0       0       0
 3 City Hotel              2015 July           0       0       0       0       0
 4 City Hotel              2015 November       0       0       0       0       0
 5 City Hotel              2015 October        0       0       0       0       0
 6 City Hotel              2015 Septemb…       0       0       0       0       0
 7 City Hotel              2016 April          0       0       0       0       0
 8 City Hotel              2016 August         0       0       0       0       0
 9 City Hotel              2016 December       0       0       0       0       0
10 City Hotel              2016 February       0       0       0       0       0
# … with 42 more rows, and abbreviated variable names ¹​arrival_date_month,
#   ²​min_total, ³​min_adults, ⁴​min_children, ⁵​min_babies

Explain and Interpret

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.

A total of 221636 adults, 12403 children and 949 babies stayed in both hotels from July 2015 - August 2017. I grouped by “hotel” to find the total number of “adults”, “children” and “babies” that stayed in the (i) city hotel as 146838, 7248, 392 and (ii) resort hotel as 74798, 5155, 557 respectively. Looking at the group by “hotel”, we can find that families with babies prefer to stay at resort hotel compared to city hotel. This may be due to the fact that resort hotels are generally closer to the beach or has swimming pool and fun outdoor activities for babies. City hotels are preferred more by adults traveling alone without kids. City hotel may be more suitable for people traveling for business purposes. I also find the mean, median, and sd for the “adr”, “adults”, “children” and “babies” grouped by “hotel”. The difference between the mean and median values are almost similar indicating that the data is not skewed. The average “adr” for City Hotel is more than the Resort Hotel which makes sense as city hotels are usually located in central locations making transportation easier.

Next, I grouped by “hotel”, “arrival_date_year” to find the total number of “adults”, “children” and “babies” that stayed in the city hotel and resort hotel for years 2015, 2016 and 2017. I also find the mean, median, sd, max, and min for the “adr”, “adults”, “children” and “babies” grouped by “hotel” and “arrival_date_year”. We do not interpret anything new from the mean, median and sd. From the max, we identify few interesting bookings where the maximum number of “adults” is 55 / max “children” is 10 in resort hotel for the year 2015 and max “babies” is 9 / max “babies” is 10 for city hotel in years 2015 and 2016 respectively. The min values are mostly 0s except for the year 2017 in resort hotel where the min “adr” is -6.38. Having a negative “adr” is unusual and needs to be investigated further.

Finally, I grouped by “hotel”, “arrival_date_year”, ”arrival_date_month” to find the total number of “adults”, “children” and “babies” that stayed in the city hotel and resort hotel for July 2015 - August 2017. For the city hotel, the months September and October had the highest number of people as this is the best time to visit Portugal during Fall when the weather is warm and the crowd is relatively less compared to Summer. For the resort hotel, August month had the highest number of people (children and babies in particular) as this is the best time to stay in resort during Summer and enjoy outdoor activities. Hence, hotel management can price the rooms accordingly during peak season and off-season. I also find the mean, median, sd, max, and min for the “adr”, “adults”, “children” and “babies” grouped by “hotel”, “arrival_date_year”, and ”arrival_date_month”. We do not interpret anything new from the mean, median and sd. The max “adr” for City Hotel in March 2016 is the highest with 5400. The max “adults” for resort hotel in September and October 2015 is 55 and 50 respectively. The min values are mostly 0s or 1s.

I also grouped by “hotel”, “arrival_date_year”, ”arrival_date_month” to find the total number of “adults”, “children” and “babies” that stayed in the city hotel and resort hotel and filtered for the months July and August. We have data available for July and August for all the three years. We see an increasing trend in the number of people staying in hotels from year 2015 to year 2017.

Source Code
---
title: "Challenge 2 Solutions"
author: "Vinitha Maheswaran"
desription: "Data wrangling: using group() and summarise()"
date: "10/12/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
  - railroads
  - faostat
  - hotel_bookings
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

Today's challenge is to

1)  read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2)  provide summary statistics for different interesting groups within the data, and interpret those statistics

## Read in the Data

Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.

-   railroad\*.csv or StateCounty2012.xls ⭐
-   FAOstat\*.csv or birds.csv ⭐⭐⭐
-   hotel_bookings.csv ⭐⭐⭐⭐


For this challenge I will be working with the "hotel_bookings" data set. It is a publicly available data set containing booking transactions from a city hotel and a resort hotel in Portugal.

```{r}
# Reading the hotel_bookings.csv data set and storing in a data frame
hotel_data <- read_csv("_data/hotel_bookings.csv")
print(hotel_data)
```

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

## Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).


After reading the data using read_csv function, it is stored in a dataframe "hotel_data". The hotel_bookings data set contains 119390 rows (observations) and 32 columns (attributes/variables) including information such as when the booking was made, length of stay, the number of adults, children, and/or babies, the number of available parking spaces, and many others for the respective hotels. Each row/observation represents a hotel booking. Since this is a public data set, all data elements pertaining to hotel or customer identification are not included in the data set. This data set can have an important role for research and educational purposes in hotel administration, revenue management, machine learning, data mining and other fields. This data set will be helpful for (i) Tourists (people who are booking hotels) to check and understand trends of hotel price over a period of time and plan their travel accordingly within budget, learn about hotel accommodations and features before booking (ii) Hotel Management to keep track of the relevant information about themselves as well as their competitors. Understand and analyze the seasonal trend of hotel booking and accommodate different types of visitors that they have (iii) Tourist / hospitality services - (e.g. travel agency, airlines / car rental companies) to observe the times when hotels in the region are in high demand, analyze the duration of typical stays, and use the information to help plan their own services (iv)  Local Government / independent data analysts to observe the overall trend of tourist activities in the region and analyzing the different types of visitors in the hotels during different seasons.

Using the "dfSummary" function from "summarytools" package we find that there are 31994 duplicates in the data. The reason for the identified duplication is that there is no unique id for each booking. It is possible that the booking was made by different tourists and the values for each attribute was exactly the same. This confusion could have been avoided by adding a Booking ID which would be unique for each booking. We can also see that there are 4 NA's in the data for the column "Children". We can ignore this for now as 4 NA values out of 119390 values would not change the grouped summary statistics significantly. 66.4% of the data represents city hotel and the remaining 33.6% of the data represents resort hotel. 18.4%, 47.5% and 34.1% of the data correspondingly represents years 2015, 2016 and 2017. I wanted to check the unequal distribution in data for the 3 consecutive years. I further investigated the data and found that the data we have represents hotel bookings from the period July 2015 - August 2017. Hence, we cannot compare the summary statistics between 2015, 2016 and 2017 as we do not have data for the entire years of 2015 and 2017. I performed group summary statistics on "hotel", arrival_date_year" and "arrival_date_month" combined.

Few interesting finds!!
The maximum "lead_time" is 737 indicating that someone had made a booking 2 years in advance! 37% of the bookings were cancelled. A booking was made for a maximum of 55 "adults" once. A booking was made for a maximum of 10 "children" once and a maximum of 10 "babies" another time. About 77% of the bookings preferred BB (Bed & Breakfast) "meal" package. The "country" of origin was PRT (Portugal) for around 41% of the bookings. Most of the bookings are made by Offline TA/TO and Online TA "market segment". The maximum "previous_cancellations" done by a person making the hotel booking is 26. Most of the tourists prefer "reserved_room_type" A which may be the best bang for your buck. A maximum of 21 "booking_changes" was made for a particular booking. A certain booking was on a maximum of 391 "days_in_waiting_list". The maximum "adr" (Average Daily Rate) for a booking was 5400. The maximum "required_car_parking_spaces" was a maximum of 8 in two separate booking instances.

```{r}
#Finding dimension of the data set
dim(hotel_data)
```

```{r}
#Finding column names
colnames(hotel_data)
```

```{r}
#Structure of hotel_data
str(hotel_data)
```

```{r}
#Summary of hotel_data
summary(hotel_data)
```

```{r}
#Summary of hotel_data
library(summarytools)
print(summarytools::dfSummary(hotel_data,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.60, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')
```

```{r}
#Check for missing/null data in the hotel_data
sum(is.na(hotel_data))
sum(is.null(hotel_data))
```

```{r}
#Check for the period of hotel booking dates
hotel_data%>%
  select(hotel, arrival_date_year, arrival_date_month)%>%distinct()
```



## Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

```{r}
# Finding the total number of adults, children and babies who stayed in the hotels

hotel_data%>%
  select(hotel, adults, children, babies)%>%
  summarise(total = sum(adults+children+babies, na.rm = TRUE), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))
```


```{r}
# Finding the total number of adults, children and babies for each hotel

hotel_data%>%
  select(hotel, adults, children, babies)%>%
  group_by(hotel)%>%
  summarise(total = sum(adults+children+babies, na.rm = TRUE), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))
```


```{r}
# Finding the total number of adults, children and babies for each hotel, year

hotel_data%>%
  select(hotel, arrival_date_year, adults, children, babies)%>%
  group_by(hotel, arrival_date_year)%>%
  summarise(total = sum(adults+children+babies, na.rm = TRUE), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))
```


```{r}
# Finding the total number of adults, children and babies for each hotel, year, month

hotel_data%>%
  select(hotel, arrival_date_year, arrival_date_month, adults, children, babies)%>%
  group_by(hotel, arrival_date_year, arrival_date_month)%>%
  summarise(total = sum(adults+children+babies, na.rm = TRUE), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))
```


```{r}
# Finding the total number of adults, children and babies for each hotel, year for the July month

hotel_data%>%
  select(hotel, arrival_date_year, arrival_date_month, adults, children, babies)%>%
  filter(arrival_date_month=='July')%>%
  group_by(hotel, arrival_date_year, arrival_date_month)%>%
  summarise(total = sum(adults+children+babies, na.rm = TRUE), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))
```


```{r}
# Finding the total number of adults, children and babies for each hotel, year for the August month

hotel_data%>%
  select(hotel, arrival_date_year, arrival_date_month, adults, children, babies)%>%
  filter(arrival_date_month=='August')%>%
  group_by(hotel, arrival_date_year, arrival_date_month)%>%
  summarise(total = sum(adults+children+babies, na.rm = TRUE), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))
```


```{r}
# Finding the total number of adults, children and babies for year 2016 for each month

hotel_data%>%
  select(hotel, arrival_date_year, arrival_date_month, adults, children, babies)%>%
  filter(hotel=='City Hotel' & arrival_date_year==2016)%>%
  group_by(arrival_date_year, arrival_date_month)%>%
  summarise(total = sum(adults, children, babies), total_adults = sum(adults), total_children = sum(children, na.rm = TRUE), total_babies = sum(babies))%>%
  arrange(match(arrival_date_month,month.name))
```

From the above statistics, we can see that most people have stayed in Portugal hotels during Fall (September and October) and during August (summer break for children).


```{r}
# Finding the mean of adr, adults, children and babies for each hotel

hotel_data%>%
  select(hotel, adr, adults, children, babies)%>%
  group_by(hotel)%>%
  summarise(avg_adr = mean(adr), avg_total = mean(adults+children+babies, na.rm = TRUE), avg_adults = mean(adults), avg_children = mean(children, na.rm = TRUE), avg_babies = mean(babies))
```


```{r}
# Finding the median of adr, adults, children and babies for each hotel

hotel_data%>%
  select(hotel, adr, adults, children, babies)%>%
  group_by(hotel)%>%
  summarise(median_adr = median(adr), median_adults = median(adults), median_children = median(children, na.rm = TRUE), median_babies = median(babies))
```


```{r}
# Finding the sd of adr, adults, children and babies for each hotel

hotel_data%>%
  select(hotel, adr, adults, children, babies)%>%
  group_by(hotel)%>%
  summarise(sd_adr = sd(adr), sd_adults = sd(adults), sd_children = sd(children, na.rm = TRUE), sd_babies = sd(babies))
```



```{r}
# Finding the mean of adr, adults, children and babies for each hotel, year

hotel_data%>%
  select(hotel, adr, arrival_date_year, adults, children, babies)%>%
  group_by(hotel, arrival_date_year)%>%
  summarise(avg_adr = mean(adr), avg_adults = mean(adults), avg_children = mean(children, na.rm = TRUE), avg_babies = mean(babies))
```

```{r}
# Finding the median of adr, adults, children and babies for each hotel, year

hotel_data%>%
  select(hotel, adr, arrival_date_year, adults, children, babies)%>%
  group_by(hotel, arrival_date_year)%>%
  summarise(median_adr = median(adr),  median_adults = median(adults), median_children = median(children, na.rm = TRUE), median_babies = median(babies))
```


```{r}
# Finding the sd of adr, adults, children and babies for each hotel, year

hotel_data%>%
  select(hotel, adr, arrival_date_year, adults, children, babies)%>%
  group_by(hotel, arrival_date_year)%>%
  summarise(sd_adr = sd(adr), sd_adults = sd(adults), sd_children = sd(children, na.rm = TRUE), sd_babies = sd(babies))
```


```{r}
# Finding the max of adr, adults, children and babies for each hotel, year

hotel_data%>%
  select(hotel, adr, arrival_date_year, adults, children, babies)%>%
  group_by(hotel, arrival_date_year)%>%
  summarise(max_adr = max(adr), max_adults = max(adults), max_children = max(children, na.rm = TRUE), max_babies = max(babies))
```


```{r}
# Finding the min of adr, adults, children and babies for each hotel, year

hotel_data%>%
  select(hotel, adr, arrival_date_year, adults, children, babies)%>%
  group_by(hotel, arrival_date_year)%>%
  summarise(min_adr = min(adr), min_total = min(adults+children+babies, na.rm = TRUE), min_adults = min(adults), min_children = min(children, na.rm = TRUE), min_babies = min(babies))
```



```{r}
# Finding the mean of adr, adults, children and babies for each hotel, year, month

hotel_data%>%
  select(hotel, adr, arrival_date_year, arrival_date_month, adults, children, babies)%>%
  group_by(hotel, arrival_date_year, arrival_date_month)%>%
  summarise(avg_adr = mean(adr), avg_adults = mean(adults), avg_children = mean(children, na.rm = TRUE), avg_babies = mean(babies))
```


```{r}
# Finding the median of adr, adults, children and babies for each hotel, year, month

hotel_data%>%
  select(hotel, adr, arrival_date_year, arrival_date_month, adults, children, babies)%>%
  group_by(hotel, arrival_date_year, arrival_date_month)%>%
  summarise(median_adr = median(adr),  median_adults = median(adults), median_children = median(children, na.rm = TRUE), median_babies = median(babies))
```


```{r}
# Finding the sd of adr, adults, children and babies for each hotel, year, month

hotel_data%>%
  select(hotel, adr, arrival_date_year, arrival_date_month, adults, children, babies)%>%
  group_by(hotel, arrival_date_year, arrival_date_month)%>%
  summarise(sd_adr = sd(adr), sd_adults = sd(adults), sd_children = sd(children, na.rm = TRUE), sd_babies = sd(babies))
```


```{r}
# Finding the max of adr, adults, children and babies for each hotel, year, month

hotel_data%>%
  select(hotel, adr, arrival_date_year, arrival_date_month, adults, children, babies)%>%
  group_by(hotel, arrival_date_year, arrival_date_month)%>%
  summarise(max_adr = max(adr), max_adults = max(adults), max_children = max(children, na.rm = TRUE), max_babies = max(babies))%>%
  arrange(desc(max_adr))
```


```{r}
# Finding the min of adr, adults, children and babies for each hotel, year, month

hotel_data%>%
  select(hotel, adr, arrival_date_year, arrival_date_month, adults, children, babies)%>%
  group_by(hotel, arrival_date_year, arrival_date_month)%>%
  summarise(min_adr = min(adr), min_total = min(adults+children+babies, na.rm = TRUE), min_adults = min(adults), min_children = min(children, na.rm = TRUE), min_babies = min(babies))
```




### Explain and Interpret

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.


A total of 221636 adults, 12403 children and 949 babies stayed in both hotels from July 2015 - August 2017. I grouped by “hotel” to find the total number of “adults”, “children” and “babies” that stayed in the (i) city hotel as 146838, 7248, 392 and (ii) resort hotel as 74798, 5155, 557 respectively. Looking at the group by “hotel”, we can find that families with babies prefer to stay at resort hotel compared to city hotel. This may be due to the fact that resort hotels are generally closer to the beach or has swimming pool and fun outdoor activities for babies. City hotels are preferred more by adults traveling alone without kids. City hotel may be more suitable for people traveling for business purposes. I also find the mean, median, and sd for the “adr”, “adults”, “children” and “babies” grouped by “hotel”. The difference between the mean and median values are almost similar indicating that the data is not skewed. The average “adr” for City Hotel is more than the Resort Hotel which makes sense as city hotels are usually located in central locations making transportation easier. 

Next, I grouped by “hotel”, “arrival_date_year” to find the total number of “adults”, “children” and “babies” that stayed in the city hotel and resort hotel for years 2015, 2016 and 2017. I also find the mean, median, sd, max, and min for the “adr”, “adults”, “children” and “babies” grouped by “hotel” and “arrival_date_year”. We do not interpret anything new from the mean, median and sd. From the max, we identify few interesting bookings where the maximum number of “adults” is 55 / max “children” is 10 in resort hotel for the year 2015 and max “babies” is 9 / max “babies” is 10 for city hotel in years 2015 and 2016 respectively. The min values are mostly 0s except for the year 2017 in resort hotel where the min “adr” is -6.38. Having a negative “adr” is unusual and needs to be investigated further. 

Finally, I grouped by “hotel”, “arrival_date_year”, ”arrival_date_month” to find the total number of “adults”, “children” and “babies” that stayed in the city hotel and resort hotel for July 2015 - August 2017. For the city hotel, the months September and October had the highest number of people as this is the best time to visit Portugal during Fall when the weather is warm and the crowd is relatively less compared to Summer. For the resort hotel, August month had the highest number of people (children and babies in particular) as this is the best time to stay in resort during Summer and enjoy outdoor activities. Hence, hotel management can price the rooms accordingly during peak season and off-season. I also find the mean, median, sd, max, and min for the “adr”, “adults”, “children” and “babies” grouped by “hotel”, “arrival_date_year”,  and ”arrival_date_month”. We do not interpret anything new from the mean, median and sd. The max “adr” for City Hotel in March 2016 is the highest with 5400. The max “adults” for resort hotel in September and October 2015 is 55 and 50 respectively. The min values are mostly 0s or 1s. 

I also grouped by “hotel”, “arrival_date_year”, ”arrival_date_month” to find the total number of “adults”, “children” and “babies” that stayed in the city hotel and resort hotel and filtered for the months July and August. We have data available for July and August for all the three years. We see an increasing trend in the number of people staying in hotels from year 2015 to year 2017.