challenge_2
Author

Pavan Datta Abbineni

Published

August 16, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in the Data

I’ve decided to use the hotel bookings dataset for this challenge.

Code
hotelBookingsCompleteData <- read_csv("_data/hotel_bookings.csv")

Describe the data

Let’s get a brief understanding of our dataset.

Code
head(hotelBookingsCompleteData)
# A tibble: 6 × 32
  hotel   is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
  <chr>     <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
1 Resort…       0     342    2015 July         27       1       0       0      2
2 Resort…       0     737    2015 July         27       1       0       0      2
3 Resort…       0       7    2015 July         27       1       0       1      1
4 Resort…       0      13    2015 July         27       1       0       1      1
5 Resort…       0      14    2015 July         27       1       0       2      2
6 Resort…       0      14    2015 July         27       1       0       2      2
# … with 22 more variables: children <dbl>, babies <dbl>, meal <chr>,
#   country <chr>, market_segment <chr>, distribution_channel <chr>,
#   is_repeated_guest <dbl>, previous_cancellations <dbl>,
#   previous_bookings_not_canceled <dbl>, reserved_room_type <chr>,
#   assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>,
#   agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
# ℹ Use `colnames()` to see all variable names
Code
tail(hotelBookingsCompleteData)
# A tibble: 6 × 32
  hotel   is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
  <chr>     <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
1 City H…       0      21    2017 August       35      30       2       5      2
2 City H…       0      23    2017 August       35      30       2       5      2
3 City H…       0     102    2017 August       35      31       2       5      3
4 City H…       0      34    2017 August       35      31       2       5      2
5 City H…       0     109    2017 August       35      31       2       5      2
6 City H…       0     205    2017 August       35      29       2       7      2
# … with 22 more variables: children <dbl>, babies <dbl>, meal <chr>,
#   country <chr>, market_segment <chr>, distribution_channel <chr>,
#   is_repeated_guest <dbl>, previous_cancellations <dbl>,
#   previous_bookings_not_canceled <dbl>, reserved_room_type <chr>,
#   assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>,
#   agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
# ℹ Use `colnames()` to see all variable names
Code
dim(hotelBookingsCompleteData)
[1] 119390     32

The dataset contains a total of 119390 observations with a total of 32 columns, lets see the column names next.

Code
colnames(hotelBookingsCompleteData)
 [1] "hotel"                          "is_canceled"                   
 [3] "lead_time"                      "arrival_date_year"             
 [5] "arrival_date_month"             "arrival_date_week_number"      
 [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
 [9] "stays_in_week_nights"           "adults"                        
[11] "children"                       "babies"                        
[13] "meal"                           "country"                       
[15] "market_segment"                 "distribution_channel"          
[17] "is_repeated_guest"              "previous_cancellations"        
[19] "previous_bookings_not_canceled" "reserved_room_type"            
[21] "assigned_room_type"             "booking_changes"               
[23] "deposit_type"                   "agent"                         
[25] "company"                        "days_in_waiting_list"          
[27] "customer_type"                  "adr"                           
[29] "required_car_parking_spaces"    "total_of_special_requests"     
[31] "reservation_status"             "reservation_status_date"       

To get the dispersion values of the data set let’s call the summary function.

Code
summary(hotelBookingsCompleteData)
    hotel            is_canceled       lead_time   arrival_date_year
 Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
 Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
 Mode  :character   Median :0.0000   Median : 69   Median :2016     
                    Mean   :0.3704   Mean   :104   Mean   :2016     
                    3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
                    Max.   :1.0000   Max.   :737   Max.   :2017     
                                                                    
 arrival_date_month arrival_date_week_number arrival_date_day_of_month
 Length:119390      Min.   : 1.00            Min.   : 1.0             
 Class :character   1st Qu.:16.00            1st Qu.: 8.0             
 Mode  :character   Median :28.00            Median :16.0             
                    Mean   :27.17            Mean   :15.8             
                    3rd Qu.:38.00            3rd Qu.:23.0             
                    Max.   :53.00            Max.   :31.0             
                                                                      
 stays_in_weekend_nights stays_in_week_nights     adults      
 Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
 1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
 Median : 1.0000         Median : 2.0         Median : 2.000  
 Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
 3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
 Max.   :19.0000         Max.   :50.0         Max.   :55.000  
                                                              
    children           babies              meal             country         
 Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
 1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
 Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
 Mean   : 0.1039   Mean   : 0.007949                                        
 3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
 Max.   :10.0000   Max.   :10.000000                                        
 NA's   :4                                                                  
 market_segment     distribution_channel is_repeated_guest
 Length:119390      Length:119390        Min.   :0.00000  
 Class :character   Class :character     1st Qu.:0.00000  
 Mode  :character   Mode  :character     Median :0.00000  
                                         Mean   :0.03191  
                                         3rd Qu.:0.00000  
                                         Max.   :1.00000  
                                                          
 previous_cancellations previous_bookings_not_canceled reserved_room_type
 Min.   : 0.00000       Min.   : 0.0000                Length:119390     
 1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
 Median : 0.00000       Median : 0.0000                Mode  :character  
 Mean   : 0.08712       Mean   : 0.1371                                  
 3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
 Max.   :26.00000       Max.   :72.0000                                  
                                                                         
 assigned_room_type booking_changes   deposit_type          agent          
 Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
 Class :character   1st Qu.: 0.0000   Class :character   Class :character  
 Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
                    Mean   : 0.2211                                        
                    3rd Qu.: 0.0000                                        
                    Max.   :21.0000                                        
                                                                           
   company          days_in_waiting_list customer_type           adr         
 Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
 Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
 Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
                    Mean   :  2.321                         Mean   : 101.83  
                    3rd Qu.:  0.000                         3rd Qu.: 126.00  
                    Max.   :391.000                         Max.   :5400.00  
                                                                             
 required_car_parking_spaces total_of_special_requests reservation_status
 Min.   :0.00000             Min.   :0.0000            Length:119390     
 1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
 Median :0.00000             Median :0.0000            Mode  :character  
 Mean   :0.06252             Mean   :0.5714                              
 3rd Qu.:0.00000             3rd Qu.:1.0000                              
 Max.   :8.00000             Max.   :5.0000                              
                                                                         
 reservation_status_date
 Min.   :2014-10-17     
 1st Qu.:2016-02-01     
 Median :2016-08-07     
 Mean   :2016-07-30     
 3rd Qu.:2017-02-08     
 Max.   :2017-09-14     
                        

Provide Grouped Summary Statistics

To find which hotels are accomodating more guests lets find the mean for both the stays_in_week_nights and stays_in_weekend_nights.

Code
hotelBookingsCompleteData %>%
  group_by(hotel) %>%
  summarise(mean = mean(stays_in_week_nights))
# A tibble: 2 × 2
  hotel         mean
  <chr>        <dbl>
1 City Hotel    2.18
2 Resort Hotel  3.13
Code
hotelBookingsCompleteData %>%
  group_by(hotel) %>%
  summarise(mean = mean(stays_in_weekend_nights))
# A tibble: 2 × 2
  hotel         mean
  <chr>        <dbl>
1 City Hotel   0.795
2 Resort Hotel 1.19 

We can see that the resort hotel is generally more popular with guests both during the week and on the weekends, in general both hotels are more popular during the week than the weekends.

Code
hotelBookingsCompleteData %>%       
  group_by(hotel) %>%
  summarise(totalCancellations = sum(is_canceled))
# A tibble: 2 × 2
  hotel        totalCancellations
  <chr>                     <dbl>
1 City Hotel                33102
2 Resort Hotel              11122

We can see that the City Hotel had more cancellations when compared to a resort hotel.

Let’s find out which countries are more popular

Code
countryTableData<-hotelBookingsCompleteData %>% group_by(country) %>% summarise(hotel = n())

countryTableData %>% group_by(country) %>% arrange(desc(hotel))
# A tibble: 178 × 2
# Groups:   country [178]
   country hotel
   <chr>   <int>
 1 PRT     48590
 2 GBR     12129
 3 FRA     10415
 4 ESP      8568
 5 DEU      7287
 6 ITA      3766
 7 IRL      3375
 8 BEL      2342
 9 BRA      2224
10 NLD      2104
# … with 168 more rows
# ℹ Use `print(n = ...)` to see more rows

From the above results we can find the countries with highest hotel bookings and the ones with the least hotel bookings.

But some of reservations are cancelled so let’s find the actual number of stays.

Code
hotelBookingsCompleteData %>%       
  group_by(country) %>%
  summarise(totalCancellations = sum(is_canceled))%>%
  arrange(desc(totalCancellations))
# A tibble: 178 × 2
   country totalCancellations
   <chr>                <dbl>
 1 PRT                  27519
 2 GBR                   2453
 3 ESP                   2177
 4 FRA                   1934
 5 ITA                   1333
 6 DEU                   1218
 7 IRL                    832
 8 BRA                    830
 9 USA                    501
10 BEL                    474
# … with 168 more rows
# ℹ Use `print(n = ...)` to see more rows

Explain and Interpret

Firstly I wanted to find which of the hotels is more popular ( City Hotel or the Resort Hotel ), from the data obtained it is very clear that the Resort Hotel on an average has more reservations both during the weekends and week-days, it also has less cancellations when compared to City Hotel.

I decided to find which country has a more successfull hotel business, it is clear that PRT has the highest number of hotel bookings.