challenge_2
Priyanka Perumalla
hotel_bookings
Data wrangling: using group_by() and summarise()
Author: Priyanka Perumalla

Published: May 15, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics
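
As a minimal sketch of the second task, grouped summary statistics can be computed with group_by() followed by summarise(). The tibble below (toy_bookings) is a made-up stand-in that only borrows three column names from the hotel bookings data; its values are illustrative and are not results from the real file.

```r
library(dplyr)

# Toy stand-in for hotel_data; the values are made up for illustration only.
toy_bookings <- tibble(
  hotel       = c("City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"),
  is_canceled = c(1, 0, 0, 0),
  adr         = c(100, 80, 60, 120)
)

# One row per hotel: number of bookings, share canceled, and mean daily rate.
by_hotel <- toy_bookings %>%
  group_by(hotel) %>%
  summarise(
    n_bookings  = n(),
    cancel_rate = mean(is_canceled),  # mean of a 0/1 column is a proportion
    mean_adr    = mean(adr),
    .groups     = "drop"
  )

print(by_hotel)
```

Swapping toy_bookings for hotel_data should give the same three statistics for the real data, assuming the columns keep the names and types shown in the output further down.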

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

  • railroad*.csv or StateCounty2012.xls ⭐
  • FAOstat*.csv or birds.csv ⭐⭐⭐
  • hotel_bookings.csv ⭐⭐⭐⭐
Code
hotel_data <- read_csv("_data/hotel_bookings.csv", show_col_types = FALSE)
print(hotel_data)
# A tibble: 119,390 × 32
   hotel        is_canceled lead_time arrival_date_year arrival_date_month
   <chr>              <dbl>     <dbl>             <dbl> <chr>             
 1 Resort Hotel           0       342              2015 July              
 2 Resort Hotel           0       737              2015 July              
 3 Resort Hotel           0         7              2015 July              
 4 Resort Hotel           0        13              2015 July              
 5 Resort Hotel           0        14              2015 July              
 6 Resort Hotel           0        14              2015 July              
 7 Resort Hotel           0         0              2015 July              
 8 Resort Hotel           0         9              2015 July              
 9 Resort Hotel           1        85              2015 July              
10 Resort Hotel           1        75              2015 July              
# ℹ 119,380 more rows
# ℹ 27 more variables: arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
#   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>, …

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
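
One detail worth documenting is that the agent and company columns store the literal string "NULL" for missing values, so read_csv() treats them as ordinary text and reports no missing entries. A hedged sketch of one way to handle this is to pass those strings through the na argument. The temporary file and its three rows below are made up so that the example runs without the real CSV.

```r
library(readr)

# Write a tiny made-up CSV to a temporary file so the example runs anywhere.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "hotel,agent,company",
  "Resort Hotel,NULL,NULL",
  "Resort Hotel,304,NULL",
  "City Hotel,9,40"
), tmp)

# Treat empty strings, "NA", and the literal "NULL" as missing values.
toy <- read_csv(tmp, na = c("", "NA", "NULL"), show_col_types = FALSE)

# Count missing values per column.
colSums(is.na(toy))
```

With "NULL" declared as missing, the remaining agent and company values are all digits, so readr is likely to guess a numeric type for them. If the IDs should stay as text, a col_types specification could be added; that choice is left open here.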

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

Code
head(hotel_data)
# A tibble: 6 × 32
  hotel        is_canceled lead_time arrival_date_year arrival_date_month
  <chr>              <dbl>     <dbl>             <dbl> <chr>             
1 Resort Hotel           0       342              2015 July              
2 Resort Hotel           0       737              2015 July              
3 Resort Hotel           0         7              2015 July              
4 Resort Hotel           0        13              2015 July              
5 Resort Hotel           0        14              2015 July              
6 Resort Hotel           0        14              2015 July              
# ℹ 27 more variables: arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
#   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, …
Code
str(hotel_data)
spc_tbl_ [119,390 × 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ hotel                         : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
 $ is_canceled                   : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
 $ lead_time                     : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
 $ arrival_date_year             : num [1:119390] 2015 2015 2015 2015 2015 ...
 $ arrival_date_month            : chr [1:119390] "July" "July" "July" "July" ...
 $ arrival_date_week_number      : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
 $ arrival_date_day_of_month     : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
 $ stays_in_weekend_nights       : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ stays_in_week_nights          : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
 $ adults                        : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
 $ children                      : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ babies                        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ meal                          : chr [1:119390] "BB" "BB" "BB" "BB" ...
 $ country                       : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
 $ market_segment                : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
 $ distribution_channel          : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
 $ is_repeated_guest             : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ previous_cancellations        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ reserved_room_type            : chr [1:119390] "C" "C" "A" "A" ...
 $ assigned_room_type            : chr [1:119390] "C" "C" "C" "A" ...
 $ booking_changes               : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
 $ deposit_type                  : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
 $ agent                         : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
 $ company                       : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
 $ days_in_waiting_list          : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ customer_type                 : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
 $ adr                           : num [1:119390] 0 0 75 75 98 ...
 $ required_car_parking_spaces   : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ total_of_special_requests     : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
 $ reservation_status            : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
 $ reservation_status_date       : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
 - attr(*, "spec")=
  .. cols(
  ..   hotel = col_character(),
  ..   is_canceled = col_double(),
  ..   lead_time = col_double(),
  ..   arrival_date_year = col_double(),
  ..   arrival_date_month = col_character(),
  ..   arrival_date_week_number = col_double(),
  ..   arrival_date_day_of_month = col_double(),
  ..   stays_in_weekend_nights = col_double(),
  ..   stays_in_week_nights = col_double(),
  ..   adults = col_double(),
  ..   children = col_double(),
  ..   babies = col_double(),
  ..   meal = col_character(),
  ..   country = col_character(),
  ..   market_segment = col_character(),
  ..   distribution_channel = col_character(),
  ..   is_repeated_guest = col_double(),
  ..   previous_cancellations = col_double(),
  ..   previous_bookings_not_canceled = col_double(),
  ..   reserved_room_type = col_character(),
  ..   assigned_room_type = col_character(),
  ..   booking_changes = col_double(),
  ..   deposit_type = col_character(),
  ..   agent = col_character(),
  ..   company = col_character(),
  ..   days_in_waiting_list = col_double(),
  ..   customer_type = col_character(),
  ..   adr = col_double(),
  ..   required_car_parking_spaces = col_double(),
  ..   total_of_special_requests = col_double(),
  ..   reservation_status = col_character(),
  ..   reservation_status_date = col_date(format = "")
  .. )
 - attr(*, "problems")=<externalptr> 
Code
library(summarytools)
dfSummary(hotel_data)
Data Frame Summary  
hotel_data  
Dimensions: 119390 x 32  
Duplicates: 31994  

-----------------------------------------------------------------------------------------------------------------------------------
No   Variable                         Stats / Values             Freqs (% of Valid)     Graph                  Valid      Missing  
---- -------------------------------- -------------------------- ---------------------- ---------------------- ---------- ---------
1    hotel                            1. City Hotel              79330 (66.4%)          IIIIIIIIIIIII          119390     0        
     [character]                      2. Resort Hotel            40060 (33.6%)          IIIIII                 (100.0%)   (0.0%)   

2    is_canceled                      Min  : 0                   0 : 75166 (63.0%)      IIIIIIIIIIII           119390     0        
     [numeric]                        Mean : 0.4                 1 : 44224 (37.0%)      IIIIIII                (100.0%)   (0.0%)   
                                      Max  : 1                                                                                     

3    lead_time                        Mean (sd) : 104 (106.9)    479 distinct values    :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 69 < 737                                      :                                          
                                      IQR (CV) : 142 (1)                                : : .                                      
                                                                                        : : : . .                                  

4    arrival_date_year                Mean (sd) : 2016.2 (0.7)   2015 : 21996 (18.4%)   III                    119390     0        
     [numeric]                        min < med < max:           2016 : 56707 (47.5%)   IIIIIIIII              (100.0%)   (0.0%)   
                                      2015 < 2016 < 2017         2017 : 40687 (34.1%)   IIIIII                                     
                                      IQR (CV) : 1 (0)                                                                             

5    arrival_date_month               1. August                  13877 (11.6%)          II                     119390     0        
     [character]                      2. July                    12661 (10.6%)          II                     (100.0%)   (0.0%)   
                                      3. May                     11791 ( 9.9%)          I                                          
                                      4. October                 11160 ( 9.3%)          I                                          
                                      5. April                   11089 ( 9.3%)          I                                          
                                      6. June                    10939 ( 9.2%)          I                                          
                                      7. September               10508 ( 8.8%)          I                                          
                                      8. March                    9794 ( 8.2%)          I                                          
                                      9. February                 8068 ( 6.8%)          I                                          
                                      10. November                6794 ( 5.7%)          I                                          
                                      [ 2 others ]               12709 (10.6%)          II                                         

6    arrival_date_week_number         Mean (sd) : 27.2 (13.6)    53 distinct values           . : . . .        119390     0        
     [numeric]                        min < med < max:                                    . : : : : : :        (100.0%)   (0.0%)   
                                      1 < 28 < 53                                       . : : : : : : : : :                        
                                      IQR (CV) : 22 (0.5)                               : : : : : : : : : :                        
                                                                                        : : : : : : : : : :                        

7    arrival_date_day_of_month        Mean (sd) : 15.8 (8.8)     31 distinct values     :                      119390     0        
     [numeric]                        min < med < max:                                  : : : . : : . : :      (100.0%)   (0.0%)   
                                      1 < 16 < 31                                       : : : : : : : : : :                        
                                      IQR (CV) : 15 (0.6)                               : : : : : : : : : :                        
                                                                                        : : : : : : : : : :                        

8    stays_in_weekend_nights          Mean (sd) : 0.9 (1)        17 distinct values     :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 1 < 19                                        :                                          
                                      IQR (CV) : 2 (1.1)                                : :                                        
                                                                                        : :                                        

9    stays_in_week_nights             Mean (sd) : 2.5 (1.9)      35 distinct values     :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 2 < 50                                        :                                          
                                      IQR (CV) : 2 (0.8)                                :                                          
                                                                                        :                                          

10   adults                           Mean (sd) : 1.9 (0.6)      14 distinct values     :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 2 < 55                                        :                                          
                                      IQR (CV) : 0 (0.3)                                :                                          
                                                                                        :                                          

11   children                         Mean (sd) : 0.1 (0.4)      0 : 110796 (92.8%)     IIIIIIIIIIIIIIIIII     119386     4        
     [numeric]                        min < med < max:           1 :   4861 ( 4.1%)                            (100.0%)   (0.0%)   
                                      0 < 0 < 10                 2 :   3652 ( 3.1%)                                                
                                      IQR (CV) : 0 (3.8)         3 :     76 ( 0.1%)                                                
                                                                 10 :      1 ( 0.0%)                                               

12   babies                           Mean (sd) : 0 (0.1)        0 : 118473 (99.2%)     IIIIIIIIIIIIIIIIIII    119390     0        
     [numeric]                        min < med < max:           1 :    900 ( 0.8%)                            (100.0%)   (0.0%)   
                                      0 < 0 < 10                 2 :     15 ( 0.0%)                                                
                                      IQR (CV) : 0 (12.3)        9 :      1 ( 0.0%)                                                
                                                                 10 :      1 ( 0.0%)                                               

13   meal                             1. BB                      92310 (77.3%)          IIIIIIIIIIIIIII        119390     0        
     [character]                      2. FB                        798 ( 0.7%)                                 (100.0%)   (0.0%)   
                                      3. HB                      14463 (12.1%)          II                                         
                                      4. SC                      10650 ( 8.9%)          I                                          
                                      5. Undefined                1169 ( 1.0%)                                                     

14   country                          1. PRT                     48590 (40.7%)          IIIIIIII               119390     0        
     [character]                      2. GBR                     12129 (10.2%)          II                     (100.0%)   (0.0%)   
                                      3. FRA                     10415 ( 8.7%)          I                                          
                                      4. ESP                      8568 ( 7.2%)          I                                          
                                      5. DEU                      7287 ( 6.1%)          I                                          
                                      6. ITA                      3766 ( 3.2%)                                                     
                                      7. IRL                      3375 ( 2.8%)                                                     
                                      8. BEL                      2342 ( 2.0%)                                                     
                                      9. BRA                      2224 ( 1.9%)                                                     
                                      10. NLD                     2104 ( 1.8%)                                                     
                                      [ 168 others ]             18590 (15.6%)          III                                        

15   market_segment                   1. Aviation                  237 ( 0.2%)                                 119390     0        
     [character]                      2. Complementary             743 ( 0.6%)                                 (100.0%)   (0.0%)   
                                      3. Corporate                5295 ( 4.4%)                                                     
                                      4. Direct                  12606 (10.6%)          II                                         
                                      5. Groups                  19811 (16.6%)          III                                        
                                      6. Offline TA/TO           24219 (20.3%)          IIII                                       
                                      7. Online TA               56477 (47.3%)          IIIIIIIII                                  
                                      8. Undefined                   2 ( 0.0%)                                                     

16   distribution_channel             1. Corporate                6677 ( 5.6%)          I                      119390     0        
     [character]                      2. Direct                  14645 (12.3%)          II                     (100.0%)   (0.0%)   
                                      3. GDS                       193 ( 0.2%)                                                     
                                      4. TA/TO                   97870 (82.0%)          IIIIIIIIIIIIIIII                           
                                      5. Undefined                   5 ( 0.0%)                                                     

17   is_repeated_guest                Min  : 0                   0 : 115580 (96.8%)     IIIIIIIIIIIIIIIIIII    119390     0        
     [numeric]                        Mean : 0                   1 :   3810 ( 3.2%)                            (100.0%)   (0.0%)   
                                      Max  : 1                                                                                     

18   previous_cancellations           Mean (sd) : 0.1 (0.8)      15 distinct values     :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 0 < 26                                        :                                          
                                      IQR (CV) : 0 (9.7)                                :                                          
                                                                                        :                                          

19   previous_bookings_not_canceled   Mean (sd) : 0.1 (1.5)      73 distinct values     :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 0 < 72                                        :                                          
                                      IQR (CV) : 0 (10.9)                               :                                          
                                                                                        :                                          

20   reserved_room_type               1. A                       85994 (72.0%)          IIIIIIIIIIIIII         119390     0        
     [character]                      2. B                        1118 ( 0.9%)                                 (100.0%)   (0.0%)   
                                      3. C                         932 ( 0.8%)                                                     
                                      4. D                       19201 (16.1%)          III                                        
                                      5. E                        6535 ( 5.5%)          I                                          
                                      6. F                        2897 ( 2.4%)                                                     
                                      7. G                        2094 ( 1.8%)                                                     
                                      8. H                         601 ( 0.5%)                                                     
                                      9. L                           6 ( 0.0%)                                                     
                                      10. P                         12 ( 0.0%)                                                     

21   assigned_room_type               1. A                       74053 (62.0%)          IIIIIIIIIIII           119390     0        
     [character]                      2. D                       25322 (21.2%)          IIII                   (100.0%)   (0.0%)   
                                      3. E                        7806 ( 6.5%)          I                                          
                                      4. F                        3751 ( 3.1%)                                                     
                                      5. G                        2553 ( 2.1%)                                                     
                                      6. C                        2375 ( 2.0%)                                                     
                                      7. B                        2163 ( 1.8%)                                                     
                                      8. H                         712 ( 0.6%)                                                     
                                      9. I                         363 ( 0.3%)                                                     
                                      10. K                        279 ( 0.2%)                                                     
                                      [ 2 others ]                  13 ( 0.0%)                                                     

22   booking_changes                  Mean (sd) : 0.2 (0.7)      21 distinct values     :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 0 < 21                                        :                                          
                                      IQR (CV) : 0 (2.9)                                :                                          
                                                                                        :                                          

23   deposit_type                     1. No Deposit              104641 (87.6%)         IIIIIIIIIIIIIIIII      119390     0        
     [character]                      2. Non Refund               14587 (12.2%)         II                     (100.0%)   (0.0%)   
                                      3. Refundable                 162 ( 0.1%)                                                    

24   agent                            1. 9                       31961 (26.8%)          IIIII                  119390     0        
     [character]                      2. NULL                    16340 (13.7%)          II                     (100.0%)   (0.0%)   
                                      3. 240                     13922 (11.7%)          II                                         
                                      4. 1                        7191 ( 6.0%)          I                                          
                                      5. 14                       3640 ( 3.0%)                                                     
                                      6. 7                        3539 ( 3.0%)                                                     
                                      7. 6                        3290 ( 2.8%)                                                     
                                      8. 250                      2870 ( 2.4%)                                                     
                                      9. 241                      1721 ( 1.4%)                                                     
                                      10. 28                      1666 ( 1.4%)                                                     
                                      [ 324 others ]             33250 (27.8%)          IIIII                                      

25   company                          1. NULL                    112593 (94.3%)         IIIIIIIIIIIIIIIIII     119390     0        
     [character]                      2. 40                         927 ( 0.8%)                                (100.0%)   (0.0%)   
                                      3. 223                        784 ( 0.7%)                                                    
                                      4. 67                         267 ( 0.2%)                                                    
                                      5. 45                         250 ( 0.2%)                                                    
                                      6. 153                        215 ( 0.2%)                                                    
                                      7. 174                        149 ( 0.1%)                                                    
                                      8. 219                        141 ( 0.1%)                                                    
                                      9. 281                        138 ( 0.1%)                                                    
                                      10. 154                       133 ( 0.1%)                                                    
                                      [ 343 others ]               3793 ( 3.2%)                                                    

26   days_in_waiting_list             Mean (sd) : 2.3 (17.6)     128 distinct values    :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 0 < 391                                       :                                          
                                      IQR (CV) : 0 (7.6)                                :                                          
                                                                                        :                                          

27   customer_type                    1. Contract                 4076 ( 3.4%)                                 119390     0        
     [character]                      2. Group                     577 ( 0.5%)                                 (100.0%)   (0.0%)   
                                      3. Transient               89613 (75.1%)          IIIIIIIIIIIIIII                            
                                      4. Transient-Party         25124 (21.0%)          IIII                                       

28   adr                              Mean (sd) : 101.8 (50.5)   8879 distinct values   :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      -6.4 < 94.6 < 5400                                :                                          
                                      IQR (CV) : 56.7 (0.5)                             :                                          
                                                                                        :                                          

29   required_car_parking_spaces      Mean (sd) : 0.1 (0.2)      0 : 111974 (93.8%)     IIIIIIIIIIIIIIIIII     119390     0        
     [numeric]                        min < med < max:           1 :   7383 ( 6.2%)     I                      (100.0%)   (0.0%)   
                                      0 < 0 < 8                  2 :     28 ( 0.0%)                                                
                                      IQR (CV) : 0 (3.9)         3 :      3 ( 0.0%)                                                
                                                                 8 :      2 ( 0.0%)                                                

30   total_of_special_requests        Mean (sd) : 0.6 (0.8)      0 : 70318 (58.9%)      IIIIIIIIIII            119390     0        
     [numeric]                        min < med < max:           1 : 33226 (27.8%)      IIIII                  (100.0%)   (0.0%)   
                                      0 < 0 < 5                  2 : 12969 (10.9%)      II                                         
                                      IQR (CV) : 1 (1.4)         3 :  2497 ( 2.1%)                                                 
                                                                 4 :   340 ( 0.3%)                                                 
                                                                 5 :    40 ( 0.0%)                                                 

31   reservation_status               1. Canceled                43017 (36.0%)          IIIIIII                119390     0        
     [character]                      2. Check-Out               75166 (63.0%)          IIIIIIIIIIII           (100.0%)   (0.0%)   
                                      3. No-Show                  1207 ( 1.0%)                                                     

32   reservation_status_date          min : 2014-10-17           926 distinct values            . : : : :      119390     0        
     [Date]                           med : 2016-08-07                                        : : : : : : .    (100.0%)   (0.0%)   
                                      max : 2017-09-14                                      . : : : : : : :                        
                                      range : 2y 10m 28d                                    : : : : : : : :                        
                                                                                        .   : : : : : : : :                        
-----------------------------------------------------------------------------------------------------------------------------------
Code
dim(hotel_data)
[1] 119390     32

The data set has 119,390 rows and 32 columns.

The column names of the hotel bookings data set:

Code
colnames(hotel_data)
 [1] "hotel"                          "is_canceled"                   
 [3] "lead_time"                      "arrival_date_year"             
 [5] "arrival_date_month"             "arrival_date_week_number"      
 [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
 [9] "stays_in_week_nights"           "adults"                        
[11] "children"                       "babies"                        
[13] "meal"                           "country"                       
[15] "market_segment"                 "distribution_channel"          
[17] "is_repeated_guest"              "previous_cancellations"        
[19] "previous_bookings_not_canceled" "reserved_room_type"            
[21] "assigned_room_type"             "booking_changes"               
[23] "deposit_type"                   "agent"                         
[25] "company"                        "days_in_waiting_list"          
[27] "customer_type"                  "adr"                           
[29] "required_car_parking_spaces"    "total_of_special_requests"     
[31] "reservation_status"             "reservation_status_date"       

Description: The data set records hotel bookings made at different hotels. Each row is one booking and records which hotel it was made for, the arrival date, the number of weekend and week nights booked, the country and type of the customer, who the booking was for (adults, children, babies), the market segment and distribution channel used, the average daily rate (adr), and the preferences entered with the booking, such as meal, room type, parking spaces, and special requests.

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.

Group by Arrival date month:

Code
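hotel_data %>% 
  group_by(arrival_date_month) %>% 
  # Note: the first argument overwrites stays_in_week_nights with the group mean,
  # so sd() then receives a single value per group and returns NA
  summarise(stays_in_week_nights = mean(stays_in_week_nights, na.rm=TRUE), StandardDeviation = sd(stays_in_week_nights, na.rm = TRUE))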
# A tibble: 12 × 3
   arrival_date_month stays_in_week_nights StandardDeviation
   <chr>                             <dbl>             <dbl>
 1 April                              2.42                NA
 2 August                             2.84                NA
 3 December                           2.36                NA
 4 February                           2.18                NA
 5 January                            2.19                NA
 6 July                               2.81                NA
 7 June                               2.66                NA
 8 March                              2.56                NA
 9 May                                2.41                NA
10 November                           2.40                NA
11 October                            2.23                NA
12 September                          2.52                NA
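
The StandardDeviation column above is NA because summarise() evaluates its arguments in order: once stays_in_week_nights has been replaced by the one-value group mean, sd() has only a single number to work with. A minimal sketch of one way around this gives the summaries new names. It uses a small made-up tibble (`toy`, `mean_week_nights`, and `sd_week_nights` are hypothetical names and values, not taken from hotel_bookings.csv):

```r
library(dplyr)

# Hypothetical toy data standing in for hotel_data
toy <- tibble(
  arrival_date_month   = c("July", "July", "July", "August", "August"),
  stays_in_week_nights = c(1, 2, 3, 2, 4)
)

monthly <- toy %>%
  group_by(arrival_date_month) %>%
  summarise(
    # New names leave the original column intact for the sd() call
    mean_week_nights = mean(stays_in_week_nights, na.rm = TRUE),
    sd_week_nights   = sd(stays_in_week_nights, na.rm = TRUE)
  )

print(monthly)
```

The same pattern should carry over to hotel_data by swapping `toy` for the real tibble.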
Code
# Keep only the bookings made by guests from Great Britain
df <- hotel_data %>%
  filter(country == "GBR")
print(df)
# A tibble: 12,129 × 32
   hotel        is_canceled lead_time arrival_date_year arrival_date_month
   <chr>              <dbl>     <dbl>             <dbl> <chr>             
 1 Resort Hotel           0         7              2015 July              
 2 Resort Hotel           0        13              2015 July              
 3 Resort Hotel           0        14              2015 July              
 4 Resort Hotel           0        14              2015 July              
 5 Resort Hotel           0         7              2015 July              
 6 Resort Hotel           0        37              2015 July              
 7 Resort Hotel           0       127              2015 July              
 8 Resort Hotel           0        95              2015 July              
 9 Resort Hotel           0        90              2015 July              
10 Resort Hotel           0       364              2015 July              
# ℹ 12,119 more rows
# ℹ 27 more variables: arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
#   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>, …
Code
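# Note: && compares single values, not whole columns, so this filter does not
# test each row; filter() expects the element-wise operator & (or a comma)
df<- hotel_data %>%
filter(babies > 0 && meal=="BB")
print(df)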
# A tibble: 0 × 32
# ℹ 32 variables: hotel <chr>, is_canceled <dbl>, lead_time <dbl>,
#   arrival_date_year <dbl>, arrival_date_month <chr>,
#   arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
#   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
#   children <dbl>, babies <dbl>, meal <chr>, country <chr>,
#   market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>, …
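
The empty result above comes from `&&`, which is meant for single TRUE/FALSE values. Depending on the R version it either looks only at the first element of each column or, from R 4.3.0 onward, raises an error for longer inputs. A minimal sketch of the row-by-row version, on a made-up tibble (`toy` and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data standing in for hotel_data
toy <- tibble(
  babies = c(0, 1, 2, 1),
  meal   = c("BB", "BB", "HB", "BB")
)

# Element-wise AND: each row is tested on both conditions
with_babies_bb <- toy %>%
  filter(babies > 0 & meal == "BB")

# Equivalent form: filter() combines comma-separated conditions with AND
same_result <- toy %>%
  filter(babies > 0, meal == "BB")

print(with_babies_bb)
```

Both forms keep only the rows where a baby is present and the meal plan is "BB".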
Code
#IQR for hotel bookings
hotel_data %>%
  summarize_all(IQR, na.rm = TRUE)
# A tibble: 1 × 32
  hotel is_canceled lead_time arrival_date_year arrival_date_month
  <dbl>       <dbl>     <dbl>             <dbl>              <dbl>
1    NA           1       142                 1                 NA
# ℹ 27 more variables: arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
#   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
#   meal <dbl>, country <dbl>, market_segment <dbl>,
#   distribution_channel <dbl>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <dbl>, assigned_room_type <dbl>, …
Code
#Mean for hotel bookings
hotel_data %>%
  summarize_all(mean, na.rm = TRUE)
# A tibble: 1 × 32
  hotel is_canceled lead_time arrival_date_year arrival_date_month
  <dbl>       <dbl>     <dbl>             <dbl>              <dbl>
1    NA       0.370      104.             2016.                 NA
# ℹ 27 more variables: arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
#   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
#   meal <dbl>, country <dbl>, market_segment <dbl>,
#   distribution_channel <dbl>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <dbl>, assigned_room_type <dbl>, …
Code
#Median for hotel bookings
hotel_data %>%
  summarize_all(median, na.rm = TRUE)
# A tibble: 1 × 32
  hotel is_canceled lead_time arrival_date_year arrival_date_month
  <dbl>       <dbl>     <dbl>             <dbl>              <dbl>
1    NA           0        69              2016                 NA
# ℹ 27 more variables: arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
#   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
#   meal <dbl>, country <dbl>, market_segment <dbl>,
#   distribution_channel <dbl>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <dbl>, assigned_room_type <dbl>, …
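
The NA values under hotel and arrival_date_month in the three outputs above appear because those are character columns, for which IQR, mean, and median are not meaningful. One possible alternative, sketched here on a made-up tibble (`toy` and its values are hypothetical), is to restrict the summary to numeric columns with across() and where():

```r
library(dplyr)

# Hypothetical toy data standing in for hotel_data
toy <- tibble(
  hotel     = c("City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"),
  lead_time = c(10, 30, 50, 70),
  adults    = c(2, 2, 1, 3)
)

numeric_summary <- toy %>%
  summarise(across(
    where(is.numeric),                      # skip character columns such as hotel
    list(
      mean   = ~ mean(.x, na.rm = TRUE),
      median = ~ median(.x, na.rm = TRUE),
      iqr    = ~ IQR(.x, na.rm = TRUE)
    )
  ))

print(numeric_summary)
```

Each numeric column yields three result columns named after the column and the statistic, for example lead_time_mean and lead_time_iqr.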

Total number of week nights booked (stays_in_week_nights) at each hotel:

Code
hotel_wise_ct = select(hotel_data, hotel,stays_in_week_nights)
hotel_wise_ct %>% 
  group_by(hotel) %>%
  summarize(stays_in_week_nights=sum(stays_in_week_nights))
# A tibble: 2 × 2
  hotel        stays_in_week_nights
  <chr>                       <dbl>
1 City Hotel                 173174
2 Resort Hotel               125337
Code
hotel_wise_ct
# A tibble: 119,390 × 2
   hotel        stays_in_week_nights
   <chr>                       <dbl>
 1 Resort Hotel                    0
 2 Resort Hotel                    0
 3 Resort Hotel                    1
 4 Resort Hotel                    1
 5 Resort Hotel                    2
 6 Resort Hotel                    2
 7 Resort Hotel                    2
 8 Resort Hotel                    2
 9 Resort Hotel                    3
10 Resort Hotel                    3
# ℹ 119,380 more rows

Mean, median, and standard deviation of stays_in_week_nights for each hotel:

Code
hotel_wise_ct = select(hotel_data, hotel, stays_in_week_nights)
hotel_wise_ct %>% 
  group_by(hotel) %>%
  summarize(meanDays=mean(stays_in_week_nights),medianDays=median(stays_in_week_nights),standardDeviation = sd(stays_in_week_nights))
# A tibble: 2 × 4
  hotel        meanDays medianDays standardDeviation
  <chr>           <dbl>      <dbl>             <dbl>
1 City Hotel       2.18          2              1.46
2 Resort Hotel     3.13          3              2.46

Total week nights booked per hotel, in descending order:

Code
hotel_wise_grouped_cts <- hotel_wise_ct %>% 
  group_by(hotel) %>%
  summarize(Sum = sum(stays_in_week_nights))
count_sorted <- hotel_wise_grouped_cts %>%
  arrange(desc(Sum))
count_sorted
# A tibble: 2 × 2
  hotel           Sum
  <chr>         <dbl>
1 City Hotel   173174
2 Resort Hotel 125337

Explain and Interpret

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.

I set out to see which hotel in the data set is the busiest. Counting bookings alone would be misleading, because bookings differ in length. A better measure is how many nights were booked, so I grouped the data by hotel and summed stays_in_week_nights for each one. I also computed the mean, median, and standard deviation of stays_in_week_nights per hotel.

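Overall, City Hotel appears busier than Resort Hotel, with 173,174 week nights booked against 125,337, even though the average stay is longer at Resort Hotel (3.13 versus 2.18 week nights). One plausible explanation, going only by the names, is that a city hotel is more accessible, whereas a resort hotel is mainly visited on holidays.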
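
The totals above count week nights only and include canceled bookings. A rough sketch of a fuller "nights booked" measure might add stays_in_weekend_nights and drop cancellations. It assumes that is_canceled == 0 marks a booking that was not canceled, and it runs on a made-up tibble (`toy`, `nights`, `nights_total`, and `nights_mean` are hypothetical names and values):

```r
library(dplyr)

# Hypothetical toy data standing in for hotel_data
toy <- tibble(
  hotel                   = c("City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"),
  is_canceled             = c(0, 1, 0, 0),
  stays_in_weekend_nights = c(1, 2, 2, 0),
  stays_in_week_nights    = c(2, 3, 5, 3)
)

occupied <- toy %>%
  filter(is_canceled == 0) %>%                                    # assumption: 0 = not canceled
  mutate(nights = stays_in_weekend_nights + stays_in_week_nights) %>%
  group_by(hotel) %>%
  summarise(
    bookings     = n(),
    nights_total = sum(nights),
    nights_mean  = mean(nights)
  ) %>%
  arrange(desc(nights_total))

print(occupied)
```

Whether this changes the ranking of the two hotels in the real data would need to be checked by running it on hotel_data.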