Challenge 4 - Hotel Bookings

challenge_4

hotel_bookings

srujan_kagitala

More data wrangling: pivoting

Author

Srujan Kagitala

Published

June 26, 2023

Code

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
tidy data (as needed, including sanity checks)
identify variables that need to be mutated
mutate variables and sanity check all mutations

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

hotel_bookings.csv⭐⭐⭐⭐

Code

booking_data <- read.csv("_data/hotel_bookings.csv")
head(booking_data)

         hotel is_canceled lead_time arrival_date_year arrival_date_month
1 Resort Hotel           0       342              2015               July
2 Resort Hotel           0       737              2015               July
3 Resort Hotel           0         7              2015               July
4 Resort Hotel           0        13              2015               July
5 Resort Hotel           0        14              2015               July
6 Resort Hotel           0        14              2015               July
  arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
1                       27                         1                       0
2                       27                         1                       0
3                       27                         1                       0
4                       27                         1                       0
5                       27                         1                       0
6                       27                         1                       0
  stays_in_week_nights adults children babies meal country market_segment
1                    0      2        0      0   BB     PRT         Direct
2                    0      2        0      0   BB     PRT         Direct
3                    1      1        0      0   BB     GBR         Direct
4                    1      1        0      0   BB     GBR      Corporate
5                    2      2        0      0   BB     GBR      Online TA
6                    2      2        0      0   BB     GBR      Online TA
  distribution_channel is_repeated_guest previous_cancellations
1               Direct                 0                      0
2               Direct                 0                      0
3               Direct                 0                      0
4            Corporate                 0                      0
5                TA/TO                 0                      0
6                TA/TO                 0                      0
  previous_bookings_not_canceled reserved_room_type assigned_room_type
1                              0                  C                  C
2                              0                  C                  C
3                              0                  A                  C
4                              0                  A                  A
5                              0                  A                  A
6                              0                  A                  A
  booking_changes deposit_type agent company days_in_waiting_list customer_type
1               3   No Deposit  NULL    NULL                    0     Transient
2               4   No Deposit  NULL    NULL                    0     Transient
3               0   No Deposit  NULL    NULL                    0     Transient
4               0   No Deposit   304    NULL                    0     Transient
5               0   No Deposit   240    NULL                    0     Transient
6               0   No Deposit   240    NULL                    0     Transient
  adr required_car_parking_spaces total_of_special_requests reservation_status
1   0                           0                         0          Check-Out
2   0                           0                         0          Check-Out
3  75                           0                         0          Check-Out
4  75                           0                         0          Check-Out
5  98                           0                         1          Check-Out
6  98                           0                         1          Check-Out
  reservation_status_date
1              2015-07-01
2              2015-07-01
3              2015-07-02
4              2015-07-02
5              2015-07-03
6              2015-07-03

Briefly describe the data

Code

#Number of bookings per hotel.
no_bookings_by_hotel <- booking_data %>%
  group_by(hotel) %>%
  summarise(count = n())

This data set contains booking data of City Hotel, Resort Hotel. It contains 119390 bookings information. 79330 and 40060 observations are respectively captured from City Hotel and Resort Hotel. The information available per booking are hotel, is_canceled, lead_time, arrival_date_year, arrival_date_month, arrival_date_week_number, arrival_date_day_of_month, stays_in_weekend_nights, stays_in_week_nights, adults, children, babies, meal, country, market_segment, distribution_channel, is_repeated_guest, previous_cancellations, previous_bookings_not_canceled, reserved_room_type, assigned_room_type, booking_changes, deposit_type, agent, company, days_in_waiting_list, customer_type, adr, required_car_parking_spaces, total_of_special_requests, reservation_status, reservation_status_date.

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

Code

#Overall summary of data across variables before converting character to factor.
summary(booking_data)

    hotel            is_canceled       lead_time   arrival_date_year
 Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
 Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
 Mode  :character   Median :0.0000   Median : 69   Median :2016     
                    Mean   :0.3704   Mean   :104   Mean   :2016     
                    3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
                    Max.   :1.0000   Max.   :737   Max.   :2017     
                                                                    
 arrival_date_month arrival_date_week_number arrival_date_day_of_month
 Length:119390      Min.   : 1.00            Min.   : 1.0             
 Class :character   1st Qu.:16.00            1st Qu.: 8.0             
 Mode  :character   Median :28.00            Median :16.0             
                    Mean   :27.17            Mean   :15.8             
                    3rd Qu.:38.00            3rd Qu.:23.0             
                    Max.   :53.00            Max.   :31.0             
                                                                      
 stays_in_weekend_nights stays_in_week_nights     adults      
 Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
 1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
 Median : 1.0000         Median : 2.0         Median : 2.000  
 Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
 3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
 Max.   :19.0000         Max.   :50.0         Max.   :55.000  
                                                              
    children           babies              meal             country         
 Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
 1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
 Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
 Mean   : 0.1039   Mean   : 0.007949                                        
 3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
 Max.   :10.0000   Max.   :10.000000                                        
 NA's   :4                                                                  
 market_segment     distribution_channel is_repeated_guest
 Length:119390      Length:119390        Min.   :0.00000  
 Class :character   Class :character     1st Qu.:0.00000  
 Mode  :character   Mode  :character     Median :0.00000  
                                         Mean   :0.03191  
                                         3rd Qu.:0.00000  
                                         Max.   :1.00000  
                                                          
 previous_cancellations previous_bookings_not_canceled reserved_room_type
 Min.   : 0.00000       Min.   : 0.0000                Length:119390     
 1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
 Median : 0.00000       Median : 0.0000                Mode  :character  
 Mean   : 0.08712       Mean   : 0.1371                                  
 3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
 Max.   :26.00000       Max.   :72.0000                                  
                                                                         
 assigned_room_type booking_changes   deposit_type          agent          
 Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
 Class :character   1st Qu.: 0.0000   Class :character   Class :character  
 Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
                    Mean   : 0.2211                                        
                    3rd Qu.: 0.0000                                        
                    Max.   :21.0000                                        
                                                                           
   company          days_in_waiting_list customer_type           adr         
 Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
 Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
 Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
                    Mean   :  2.321                         Mean   : 101.83  
                    3rd Qu.:  0.000                         3rd Qu.: 126.00  
                    Max.   :391.000                         Max.   :5400.00  
                                                                             
 required_car_parking_spaces total_of_special_requests reservation_status
 Min.   :0.00000             Min.   :0.0000            Length:119390     
 1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
 Median :0.00000             Median :0.0000            Mode  :character  
 Mean   :0.06252             Mean   :0.5714                              
 3rd Qu.:0.00000             3rd Qu.:1.0000                              
 Max.   :8.00000             Max.   :5.0000                              
                                                                         
 reservation_status_date
 Length:119390          
 Class :character       
 Mode  :character

Code

names(booking_data)

 [1] "hotel"                          "is_canceled"                   
 [3] "lead_time"                      "arrival_date_year"             
 [5] "arrival_date_month"             "arrival_date_week_number"      
 [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
 [9] "stays_in_week_nights"           "adults"                        
[11] "children"                       "babies"                        
[13] "meal"                           "country"                       
[15] "market_segment"                 "distribution_channel"          
[17] "is_repeated_guest"              "previous_cancellations"        
[19] "previous_bookings_not_canceled" "reserved_room_type"            
[21] "assigned_room_type"             "booking_changes"               
[23] "deposit_type"                   "agent"                         
[25] "company"                        "days_in_waiting_list"          
[27] "customer_type"                  "adr"                           
[29] "required_car_parking_spaces"    "total_of_special_requests"     
[31] "reservation_status"             "reservation_status_date"

We can see from summary that all character variables don’t give any summary of the data. They have to be converted to factor to get valuable insights from summary. We can drop ‘arrival_date_week_number’ variable because this is redundant information. If we want to do some analysis to find effective marketing strategies,we need to drop rows with NA values(Undefined) for distribution_channel and market_segment variables. After all this pre processing, we will be left with observations less than or equal to 119390 and 31 columns.

Code

# Drop rows with NA values for market_segment, distribution_channel.
# Convert character variables to factor.
# Drop arrival_date_week_number variable.
clean_booking_data <- booking_data %>%
  filter(distribution_channel != "Undefined",
         market_segment != "Undefined") %>%
  mutate_if(is.character, as.factor) %>%
  select (-arrival_date_week_number)

summary(clean_booking_data)

          hotel        is_canceled       lead_time   arrival_date_year
 City Hotel  :79326   Min.   :0.0000   Min.   :  0   Min.   :2015     
 Resort Hotel:40059   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
                      Median :0.0000   Median : 69   Median :2016     
                      Mean   :0.3704   Mean   :104   Mean   :2016     
                      3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
                      Max.   :1.0000   Max.   :737   Max.   :2017     
                                                                      
 arrival_date_month arrival_date_day_of_month stays_in_weekend_nights
 August :13873      Min.   : 1.0              Min.   : 0.0000        
 July   :12660      1st Qu.: 8.0              1st Qu.: 0.0000        
 May    :11791      Median :16.0              Median : 1.0000        
 October:11160      Mean   :15.8              Mean   : 0.9276        
 April  :11089      3rd Qu.:23.0              3rd Qu.: 2.0000        
 June   :10939      Max.   :31.0              Max.   :19.0000        
 (Other):47873                                                       
 stays_in_week_nights     adults          children           babies         
 Min.   : 0.0         Min.   : 0.000   Min.   : 0.0000   Min.   : 0.000000  
 1st Qu.: 1.0         1st Qu.: 2.000   1st Qu.: 0.0000   1st Qu.: 0.000000  
 Median : 2.0         Median : 2.000   Median : 0.0000   Median : 0.000000  
 Mean   : 2.5         Mean   : 1.856   Mean   : 0.1039   Mean   : 0.007949  
 3rd Qu.: 3.0         3rd Qu.: 2.000   3rd Qu.: 0.0000   3rd Qu.: 0.000000  
 Max.   :50.0         Max.   :55.000   Max.   :10.0000   Max.   :10.000000  
                                                                            
        meal          country            market_segment  distribution_channel
 BB       :92306   PRT    :48585   Aviation     :  237   Corporate: 6677     
 FB       :  798   GBR    :12129   Complementary:  743   Direct   :14645     
 HB       :14462   FRA    :10415   Corporate    : 5295   GDS      :  193     
 SC       :10650   ESP    : 8568   Direct       :12604   TA/TO    :97870     
 Undefined: 1169   DEU    : 7287   Groups       :19811                       
                   ITA    : 3766   Offline TA/TO:24219                       
                   (Other):28635   Online TA    :56476                       
 is_repeated_guest previous_cancellations previous_bookings_not_canceled
 Min.   :0.00000   Min.   : 0.00000       Min.   : 0.0000               
 1st Qu.:0.00000   1st Qu.: 0.00000       1st Qu.: 0.0000               
 Median :0.00000   Median : 0.00000       Median : 0.0000               
 Mean   :0.03191   Mean   : 0.08712       Mean   : 0.1371               
 3rd Qu.:0.00000   3rd Qu.: 0.00000       3rd Qu.: 0.0000               
 Max.   :1.00000   Max.   :26.00000       Max.   :72.0000               
                                                                        
 reserved_room_type assigned_room_type booking_changes       deposit_type   
 A      :85993      A      :74052      Min.   : 0.0000   No Deposit:104636  
 D      :19201      D      :25322      1st Qu.: 0.0000   Non Refund: 14587  
 E      : 6535      E      : 7806      Median : 0.0000   Refundable:   162  
 F      : 2897      F      : 3751      Mean   : 0.2211                      
 G      : 2094      G      : 2553      3rd Qu.: 0.0000                      
 B      : 1114      C      : 2375      Max.   :21.0000                      
 (Other): 1551      (Other): 3526                                           
     agent          company       days_in_waiting_list         customer_type  
 9      :31960   NULL   :112588   Min.   :  0.000      Contract       : 4076  
 NULL   :16337   40     :   927   1st Qu.:  0.000      Group          :  577  
 240    :13922   223    :   784   Median :  0.000      Transient      :89612  
 1      : 7191   67     :   267   Mean   :  2.321      Transient-Party:25120  
 14     : 3639   45     :   250   3rd Qu.:  0.000                             
 7      : 3539   153    :   215   Max.   :391.000                             
 (Other):42797   (Other):  4354                                               
      adr          required_car_parking_spaces total_of_special_requests
 Min.   :  -6.38   Min.   :0.00000             Min.   :0.0000           
 1st Qu.:  69.29   1st Qu.:0.00000             1st Qu.:0.0000           
 Median :  94.59   Median :0.00000             Median :0.0000           
 Mean   : 101.83   Mean   :0.06251             Mean   :0.5713           
 3rd Qu.: 126.00   3rd Qu.:0.00000             3rd Qu.:1.0000           
 Max.   :5400.00   Max.   :8.00000             Max.   :5.0000           
                                                                        
 reservation_status reservation_status_date
 Canceled :43013    2015-10-21:  1461      
 Check-Out:75165    2015-07-06:   805      
 No-Show  : 1207    2016-11-25:   790      
                    2015-01-01:   763      
                    2016-01-18:   625      
                    2015-07-02:   469      
                    (Other)   :114472

So, we have 119385 observations and 31 columns after cleaning the data.

Identify variables that need to be mutated

Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

There are three independent variables for arrival_date_year, arrival_date_month, arrival_date_day_of_month which is unnecessary. Hence, they can be combined to one variable arrival_date. Also, variables like is_canceled, is_repeated_guest have to be converted to Boolean instead of integer for readability. We will be left with 29 columns after this step.

Code

# combine day, month and year columns to date column.
# Remove arrival_date_week_number_column.
booking_data_tidy <- clean_booking_data %>%
  mutate(arrival_date = as.Date(paste(arrival_date_month,
                                      arrival_date_day_of_month,
                                      arrival_date_year, sep = "-"),
                                format="%B-%d-%Y"),
         is_repeated_guest = as.logical(is_repeated_guest),
         is_canceled = as.logical(is_canceled)) %>%
  select(-arrival_date_month,
         -arrival_date_day_of_month,
         -arrival_date_year)

head(booking_data_tidy)

         hotel is_canceled lead_time stays_in_weekend_nights
1 Resort Hotel       FALSE       342                       0
2 Resort Hotel       FALSE       737                       0
3 Resort Hotel       FALSE         7                       0
4 Resort Hotel       FALSE        13                       0
5 Resort Hotel       FALSE        14                       0
6 Resort Hotel       FALSE        14                       0
  stays_in_week_nights adults children babies meal country market_segment
1                    0      2        0      0   BB     PRT         Direct
2                    0      2        0      0   BB     PRT         Direct
3                    1      1        0      0   BB     GBR         Direct
4                    1      1        0      0   BB     GBR      Corporate
5                    2      2        0      0   BB     GBR      Online TA
6                    2      2        0      0   BB     GBR      Online TA
  distribution_channel is_repeated_guest previous_cancellations
1               Direct             FALSE                      0
2               Direct             FALSE                      0
3               Direct             FALSE                      0
4            Corporate             FALSE                      0
5                TA/TO             FALSE                      0
6                TA/TO             FALSE                      0
  previous_bookings_not_canceled reserved_room_type assigned_room_type
1                              0                  C                  C
2                              0                  C                  C
3                              0                  A                  C
4                              0                  A                  A
5                              0                  A                  A
6                              0                  A                  A
  booking_changes deposit_type agent company days_in_waiting_list customer_type
1               3   No Deposit  NULL    NULL                    0     Transient
2               4   No Deposit  NULL    NULL                    0     Transient
3               0   No Deposit  NULL    NULL                    0     Transient
4               0   No Deposit   304    NULL                    0     Transient
5               0   No Deposit   240    NULL                    0     Transient
6               0   No Deposit   240    NULL                    0     Transient
  adr required_car_parking_spaces total_of_special_requests reservation_status
1   0                           0                         0          Check-Out
2   0                           0                         0          Check-Out
3  75                           0                         0          Check-Out
4  75                           0                         0          Check-Out
5  98                           0                         1          Check-Out
6  98                           0                         1          Check-Out
  reservation_status_date arrival_date
1              2015-07-01   2015-07-01
2              2015-07-01   2015-07-01
3              2015-07-02   2015-07-01
4              2015-07-02   2015-07-01
5              2015-07-03   2015-07-01
6              2015-07-03   2015-07-01

So, we have 119385 observations and 29 columns after combing all the date related variables to “arrival_date” variable.

Code

# Hotel wise frequency of distribution_channel
booking_data_tidy %>% group_by(hotel, distribution_channel) %>% summarise(count=n())

# A tibble: 7 × 3
# Groups:   hotel [2]
  hotel        distribution_channel count
  <fct>        <fct>                <int>
1 City Hotel   Corporate             3408
2 City Hotel   Direct                6780
3 City Hotel   GDS                    193
4 City Hotel   TA/TO                68945
5 Resort Hotel Corporate             3269
6 Resort Hotel Direct                7865
7 Resort Hotel TA/TO                28925

Code

# Hotel wise frequency of market_segment
booking_data_tidy %>% group_by(hotel, market_segment) %>% summarise(count=n())

# A tibble: 13 × 3
# Groups:   hotel [2]
   hotel        market_segment count
   <fct>        <fct>          <int>
 1 City Hotel   Aviation         237
 2 City Hotel   Complementary    542
 3 City Hotel   Corporate       2986
 4 City Hotel   Direct          6092
 5 City Hotel   Groups         13975
 6 City Hotel   Offline TA/TO  16747
 7 City Hotel   Online TA      38747
 8 Resort Hotel Complementary    201
 9 Resort Hotel Corporate       2309
10 Resort Hotel Direct          6512
11 Resort Hotel Groups          5836
12 Resort Hotel Offline TA/TO   7472
13 Resort Hotel Online TA      17729