challenge_2
XinyangMao
hotel_bookings
Data wrangling: using group() and summarise()
Author

XinyangMao

Published

March 17, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

  • hotel_bookings.csv
Code
df <- read.csv("_data/hotel_bookings.csv")
data <- na.omit(df)

In this challenge,I chose hotel_bookings.csv to analysis.

Describe the data

We can see that there are 119390 rows and 32 columns in this dataset.

Code
dim(data)
[1] 119386     32

The structure of the data set also tells us the number of rows and columns, but it provides even more information. It tells us the column names, the class of each column (what kind of data is stored in it), and the first few observations of each variable.

Code
str(data)
'data.frame':   119386 obs. of  32 variables:
 $ hotel                         : chr  "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
 $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
 $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
 $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
 $ arrival_date_month            : chr  "July" "July" "July" "July" ...
 $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
 $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
 $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
 $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
 $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ meal                          : chr  "BB" "BB" "BB" "BB" ...
 $ country                       : chr  "PRT" "PRT" "GBR" "GBR" ...
 $ market_segment                : chr  "Direct" "Direct" "Direct" "Corporate" ...
 $ distribution_channel          : chr  "Direct" "Direct" "Direct" "Corporate" ...
 $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
 $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
 $ reserved_room_type            : chr  "C" "C" "A" "A" ...
 $ assigned_room_type            : chr  "C" "C" "C" "A" ...
 $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
 $ deposit_type                  : chr  "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
 $ agent                         : chr  "NULL" "NULL" "NULL" "304" ...
 $ company                       : chr  "NULL" "NULL" "NULL" "NULL" ...
 $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
 $ customer_type                 : chr  "Transient" "Transient" "Transient" "Transient" ...
 $ adr                           : num  0 0 75 75 98 ...
 $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
 $ reservation_status            : chr  "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
 $ reservation_status_date       : chr  "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
 - attr(*, "na.action")= 'omit' Named int [1:4] 40601 40668 40680 41161
  ..- attr(*, "names")= chr [1:4] "40601" "40668" "40680" "41161"

The summary provides descriptive statistics including the min, max, mean, median, and quartiles of each column. For example, we can see in this data set that the mean rate of is_canaceled is 37.04% and the maximum number of baby is 10.

Code
summary(data)
    hotel            is_canceled       lead_time   arrival_date_year
 Length:119386      Min.   :0.0000   Min.   :  0   Min.   :2015     
 Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
 Mode  :character   Median :0.0000   Median : 69   Median :2016     
                    Mean   :0.3704   Mean   :104   Mean   :2016     
                    3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
                    Max.   :1.0000   Max.   :737   Max.   :2017     
 arrival_date_month arrival_date_week_number arrival_date_day_of_month
 Length:119386      Min.   : 1.00            Min.   : 1.0             
 Class :character   1st Qu.:16.00            1st Qu.: 8.0             
 Mode  :character   Median :28.00            Median :16.0             
                    Mean   :27.16            Mean   :15.8             
                    3rd Qu.:38.00            3rd Qu.:23.0             
                    Max.   :53.00            Max.   :31.0             
 stays_in_weekend_nights stays_in_week_nights     adults      
 Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
 1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
 Median : 1.0000         Median : 2.0         Median : 2.000  
 Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
 3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
 Max.   :19.0000         Max.   :50.0         Max.   :55.000  
    children           babies              meal             country         
 Min.   : 0.0000   Min.   : 0.000000   Length:119386      Length:119386     
 1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
 Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
 Mean   : 0.1039   Mean   : 0.007949                                        
 3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
 Max.   :10.0000   Max.   :10.000000                                        
 market_segment     distribution_channel is_repeated_guest
 Length:119386      Length:119386        Min.   :0.00000  
 Class :character   Class :character     1st Qu.:0.00000  
 Mode  :character   Mode  :character     Median :0.00000  
                                         Mean   :0.03191  
                                         3rd Qu.:0.00000  
                                         Max.   :1.00000  
 previous_cancellations previous_bookings_not_canceled reserved_room_type
 Min.   : 0.00000       Min.   : 0.0000                Length:119386     
 1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
 Median : 0.00000       Median : 0.0000                Mode  :character  
 Mean   : 0.08712       Mean   : 0.1371                                  
 3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
 Max.   :26.00000       Max.   :72.0000                                  
 assigned_room_type booking_changes   deposit_type          agent          
 Length:119386      Min.   : 0.0000   Length:119386      Length:119386     
 Class :character   1st Qu.: 0.0000   Class :character   Class :character  
 Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
                    Mean   : 0.2211                                        
                    3rd Qu.: 0.0000                                        
                    Max.   :21.0000                                        
   company          days_in_waiting_list customer_type           adr         
 Length:119386      Min.   :  0.000      Length:119386      Min.   :  -6.38  
 Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
 Mode  :character   Median :  0.000      Mode  :character   Median :  94.59  
                    Mean   :  2.321                         Mean   : 101.83  
                    3rd Qu.:  0.000                         3rd Qu.: 126.00  
                    Max.   :391.000                         Max.   :5400.00  
 required_car_parking_spaces total_of_special_requests reservation_status
 Min.   :0.00000             Min.   :0.0000            Length:119386     
 1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
 Median :0.00000             Median :0.0000            Mode  :character  
 Mean   :0.06252             Mean   :0.5713                              
 3rd Qu.:0.00000             3rd Qu.:1.0000                              
 Max.   :8.00000             Max.   :5.0000                              
 reservation_status_date
 Length:119386          
 Class :character       
 Mode  :character       
                        
                        
                        

Provide Grouped Summary Statistics

We can see there are lots of variables now,we can use cor() function to show the relationships between these variables.Obviously,we can find that for arrival_date_year variables,the most related numeric variable is arrival_date_week_number.But all of the numeric variables seem don’t have strong relationships.

Code
library(dplyr)
data <- select(data,-where(is.character))
cor(data)
                                is_canceled     lead_time arrival_date_year
is_canceled                     1.000000000  0.2931773844      0.0167324934
lead_time                       0.293177384  1.0000000000      0.0400927219
arrival_date_year               0.016732493  0.0400927219      1.0000000000
arrival_date_week_number        0.008131991  0.1268848204     -0.5405663357
arrival_date_day_of_month      -0.006084131  0.0022337238     -0.0002790381
stays_in_weekend_nights        -0.001783382  0.0856671122      0.0214888460
stays_in_week_nights            0.024771434  0.1657994743      0.0308779571
adults                          0.059990266  0.1195443171      0.0296741970
children                        0.005047790 -0.0376216776      0.0546241639
babies                         -0.032488454 -0.0209179156     -0.0131971353
is_repeated_guest              -0.084787949 -0.1244175906      0.0103318406
previous_cancellations          0.110140471  0.0860398385     -0.1198331083
previous_bookings_not_canceled -0.057355365 -0.0735521797      0.0292143485
booking_changes                -0.144370572  0.0001380965      0.0308552231
days_in_waiting_list            0.054193146  0.1700826037     -0.0565063742
adr                             0.047622225 -0.0631259954      0.1975183834
required_car_parking_spaces    -0.195492464 -0.1164605134     -0.0136989991
total_of_special_requests      -0.234705902 -0.0956878275      0.1085851089
                               arrival_date_week_number
is_canceled                                 0.008131991
lead_time                                   0.126884820
arrival_date_year                          -0.540566336
arrival_date_week_number                    1.000000000
arrival_date_day_of_month                   0.066823684
stays_in_weekend_nights                     0.018209372
stays_in_week_nights                        0.015559322
adults                                      0.025901318
children                                    0.005518419
babies                                      0.010396527
is_repeated_guest                          -0.030128572
previous_cancellations                      0.035502291
previous_bookings_not_canceled             -0.020902457
booking_changes                             0.005511772
days_in_waiting_list                        0.022934739
adr                                         0.075811103
required_car_parking_spaces                 0.001923621
total_of_special_requests                   0.026139352
                               arrival_date_day_of_month
is_canceled                                -0.0060841312
lead_time                                   0.0022337238
arrival_date_year                          -0.0002790381
arrival_date_week_number                    0.0668236844
arrival_date_day_of_month                   1.0000000000
stays_in_weekend_nights                    -0.0163702417
stays_in_week_nights                       -0.0281924631
adults                                     -0.0015394616
children                                    0.0145438342
babies                                     -0.0002332642
is_repeated_guest                          -0.0061516010
previous_cancellations                     -0.0270150357
previous_bookings_not_canceled             -0.0003030418
booking_changes                             0.0106010824
days_in_waiting_list                        0.0227233609
adr                                         0.0301886003
required_car_parking_spaces                 0.0086746240
total_of_special_requests                   0.0030944208
                               stays_in_weekend_nights stays_in_week_nights
is_canceled                               -0.001783382          0.024771434
lead_time                                  0.085667112          0.165799474
arrival_date_year                          0.021488846          0.030877957
arrival_date_week_number                   0.018209372          0.015559322
arrival_date_day_of_month                 -0.016370242         -0.028192463
stays_in_weekend_nights                    1.000000000          0.498967239
stays_in_week_nights                       0.498967239          1.000000000
adults                                     0.091888610          0.092983373
children                                   0.045792937          0.044202919
babies                                     0.018482550          0.020190920
is_repeated_guest                         -0.087241563         -0.097247278
previous_cancellations                    -0.012775390         -0.013993096
previous_bookings_not_canceled            -0.042716302         -0.048743688
booking_changes                            0.063280182          0.096209580
days_in_waiting_list                      -0.054152569         -0.002020421
adr                                        0.049323279          0.065219447
required_car_parking_spaces               -0.018555572         -0.024860942
total_of_special_requests                  0.072685999          0.068199125
                                     adults     children        babies
is_canceled                     0.059990266  0.005047790 -0.0324884545
lead_time                       0.119544317 -0.037621678 -0.0209179156
arrival_date_year               0.029674197  0.054624164 -0.0131971353
arrival_date_week_number        0.025901318  0.005518419  0.0103965273
arrival_date_day_of_month      -0.001539462  0.014543834 -0.0002332642
stays_in_weekend_nights         0.091888610  0.045792937  0.0184825497
stays_in_week_nights            0.092983373  0.044202919  0.0201909200
adults                          1.000000000  0.030446849  0.0181478114
children                        0.030446849  1.000000000  0.0240295524
babies                          0.018147811  0.024029552  1.0000000000
is_repeated_guest              -0.146424566 -0.032859382 -0.0089431363
previous_cancellations         -0.006735864 -0.024730099 -0.0075012826
previous_bookings_not_canceled -0.107982949 -0.021072490 -0.0065507062
booking_changes                -0.051666039  0.048948888  0.0834390248
days_in_waiting_list           -0.008280488 -0.033272616 -0.0106209075
adr                             0.230690021  0.324853605  0.0291829276
required_car_parking_spaces     0.014790889  0.056252847  0.0373827041
total_of_special_requests       0.122854601  0.081744822  0.0978928336
                               is_repeated_guest previous_cancellations
is_canceled                         -0.084787949            0.110140471
lead_time                           -0.124417591            0.086039838
arrival_date_year                    0.010331841           -0.119833108
arrival_date_week_number            -0.030128572            0.035502291
arrival_date_day_of_month           -0.006151601           -0.027015036
stays_in_weekend_nights             -0.087241563           -0.012775390
stays_in_week_nights                -0.097247278           -0.013993096
adults                              -0.146424566           -0.006735864
children                            -0.032859382           -0.024730099
babies                              -0.008943136           -0.007501283
is_repeated_guest                    1.000000000            0.082292667
previous_cancellations               0.082292667            1.000000000
previous_bookings_not_canceled       0.418055728            0.152727847
booking_changes                      0.012089755           -0.026993891
days_in_waiting_list                -0.022235787            0.005928488
adr                                 -0.134328457           -0.065653153
required_car_parking_spaces          0.077088149           -0.018493155
total_of_special_requests            0.013055449           -0.048382004
                               previous_bookings_not_canceled booking_changes
is_canceled                                     -0.0573553652   -0.1443705725
lead_time                                       -0.0735521797    0.0001380965
arrival_date_year                                0.0292143485    0.0308552231
arrival_date_week_number                        -0.0209024574    0.0055117720
arrival_date_day_of_month                       -0.0003030418    0.0106010824
stays_in_weekend_nights                         -0.0427163025    0.0632801821
stays_in_week_nights                            -0.0487436875    0.0962095799
adults                                          -0.1079829493   -0.0516660388
children                                        -0.0210724903    0.0489488885
babies                                          -0.0065507062    0.0834390248
is_repeated_guest                                0.4180557276    0.0120897552
previous_cancellations                           0.1527278470   -0.0269938913
previous_bookings_not_canceled                   1.0000000000    0.0116065130
booking_changes                                  0.0116065130    1.0000000000
days_in_waiting_list                            -0.0093973866   -0.0116354687
adr                                             -0.0721514059    0.0196022487
required_car_parking_spaces                      0.0476523637    0.0656174943
total_of_special_requests                        0.0378270602    0.0528441726
                               days_in_waiting_list         adr
is_canceled                             0.054193146  0.04762222
lead_time                               0.170082604 -0.06312600
arrival_date_year                      -0.056506374  0.19751838
arrival_date_week_number                0.022934739  0.07581110
arrival_date_day_of_month               0.022723361  0.03018860
stays_in_weekend_nights                -0.054152569  0.04932328
stays_in_week_nights                   -0.002020421  0.06521945
adults                                 -0.008280488  0.23069002
children                               -0.033272616  0.32485360
babies                                 -0.010620908  0.02918293
is_repeated_guest                      -0.022235787 -0.13432846
previous_cancellations                  0.005928488 -0.06565315
previous_bookings_not_canceled         -0.009397387 -0.07215141
booking_changes                        -0.011635469  0.01960225
days_in_waiting_list                    1.000000000 -0.04076433
adr                                    -0.040764331  1.00000000
required_car_parking_spaces            -0.030601215  0.05661816
total_of_special_requests              -0.082727388  0.17223837
                               required_car_parking_spaces
is_canceled                                   -0.195492464
lead_time                                     -0.116460513
arrival_date_year                             -0.013698999
arrival_date_week_number                       0.001923621
arrival_date_day_of_month                      0.008674624
stays_in_weekend_nights                       -0.018555572
stays_in_week_nights                          -0.024860942
adults                                         0.014790889
children                                       0.056252847
babies                                         0.037382704
is_repeated_guest                              0.077088149
previous_cancellations                        -0.018493155
previous_bookings_not_canceled                 0.047652364
booking_changes                                0.065617494
days_in_waiting_list                          -0.030601215
adr                                            0.056618163
required_car_parking_spaces                    1.000000000
total_of_special_requests                      0.082635165
                               total_of_special_requests
is_canceled                                 -0.234705902
lead_time                                   -0.095687827
arrival_date_year                            0.108585109
arrival_date_week_number                     0.026139352
arrival_date_day_of_month                    0.003094421
stays_in_weekend_nights                      0.072685999
stays_in_week_nights                         0.068199125
adults                                       0.122854601
children                                     0.081744822
babies                                       0.097892834
is_repeated_guest                            0.013055449
previous_cancellations                      -0.048382004
previous_bookings_not_canceled               0.037827060
booking_changes                              0.052844173
days_in_waiting_list                        -0.082727388
adr                                          0.172238374
required_car_parking_spaces                  0.082635165
total_of_special_requests                    1.000000000

According to the dataset,we can notice that it provide the number of adults, children and baby.Let’s take a look at the relationship between the type of occupant and the month of occupancy.

Code
df %>%
  select(arrival_date_month,children,adults,babies)%>%
  group_by(arrival_date_month)
# A tibble: 119,390 × 4
# Groups:   arrival_date_month [12]
   arrival_date_month children adults babies
   <chr>                 <int>  <int>  <int>
 1 July                      0      2      0
 2 July                      0      2      0
 3 July                      0      1      0
 4 July                      0      1      0
 5 July                      0      2      0
 6 July                      0      2      0
 7 July                      0      2      0
 8 July                      0      2      0
 9 July                      0      2      0
10 July                      0      2      0
# … with 119,380 more rows

We can use arrange(),desc(),select(),group_by() and slice() to show which months have most children guests below. The result is the largest number of children be hosted in a hotel happened in July,the less time happened in September.

Code
df %>%
  arrange(desc(children))%>%
  select(arrival_date_month,children)%>%
  group_by(arrival_date_month)%>%
  slice(1)%>%
  arrange(desc(children))
# A tibble: 12 × 2
# Groups:   arrival_date_month [12]
   arrival_date_month children
   <chr>                 <int>
 1 July                     10
 2 April                     3
 3 August                    3
 4 December                  3
 5 February                  3
 6 January                   3
 7 June                      3
 8 March                     3
 9 May                       3
10 November                  3
11 October                   3
12 September                 2

Same,for adults,they’re October and January.

Code
df %>%
  arrange(desc(adults))%>%
  select(arrival_date_month,adults)%>%
  group_by(arrival_date_month)%>%
  slice(1)%>%
  arrange(desc(adults))
# A tibble: 12 × 2
# Groups:   arrival_date_month [12]
   arrival_date_month adults
   <chr>               <int>
 1 October                55
 2 September              50
 3 April                   4
 4 August                  4
 5 December                4
 6 February                4
 7 July                    4
 8 June                    4
 9 March                   4
10 May                     4
11 November                4
12 January                 3

As for babies,the maximum number happened in October too.

Code
df %>%
  arrange(desc(babies))%>%
  select(arrival_date_month,babies)%>%
  group_by(arrival_date_month)%>%
  slice(1)%>%
  arrange(desc(babies))
# A tibble: 12 × 2
# Groups:   arrival_date_month [12]
   arrival_date_month babies
   <chr>               <int>
 1 January                10
 2 October                 9
 3 August                  2
 4 December                2
 5 February                2
 6 July                    2
 7 June                    2
 8 March                   2
 9 November                2
10 September               2
11 April                   1
12 May                     1

Explain and Interpret

Some interesting findings are that hotel stays for adults and babies are relatively similar, with the most and least occurring in October and January, respectively, probably because October is a good month to travel and January is usually colder. For children, the highest number of children occurred in July, probably during the summer holidays.