Challenge 2_Solution

challenge_2

railroads

faostat

hotel_bookings

Data wrangling: using group() and summarise()

Author

Zhongyue Lin

Published

June 1, 2023

Code

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

hotel_bookings.csv ⭐⭐⭐⭐ In

In this challenge 2 I chose the hotel_bookings.csv dataset as the raw data for the analysis

Code

#Data Import
hotel_bookings <- read_csv("_data/hotel_bookings.csv")

Describe the data

Code

# Preview the first few rows of data
head(hotel_bookings)

# A tibble: 6 × 32
  hotel   is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
  <chr>     <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
1 Resort…       0     342    2015 July         27       1       0       0      2
2 Resort…       0     737    2015 July         27       1       0       0      2
3 Resort…       0       7    2015 July         27       1       0       1      1
4 Resort…       0      13    2015 July         27       1       0       1      1
5 Resort…       0      14    2015 July         27       1       0       2      2
6 Resort…       0      14    2015 July         27       1       0       2      2
# … with 22 more variables: children <dbl>, babies <dbl>, meal <chr>,
#   country <chr>, market_segment <chr>, distribution_channel <chr>,
#   is_repeated_guest <dbl>, previous_cancellations <dbl>,
#   previous_bookings_not_canceled <dbl>, reserved_room_type <chr>,
#   assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>,
#   agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …

Code

# Display the dimension of the data frame
print(paste("The dataset has", dim(hotel_bookings)[1], "rows and", dim(hotel_bookings)[2], "columns."))

[1] "The dataset has 119390 rows and 32 columns."

A quick preview of the data was performed after reading, and the head()function and dim() function were used to observe the underlying information of the data set (119,390 rows 32 columns).

Code

# Check the type of each variable
var_types <- sapply(hotel_bookings, class)

# Calculate the total number of variables
total_vars <- length(var_types)

# Calculate the number of numerical and categorical variables
num_vars <- sum(var_types == "numeric" | var_types == "integer")
cat_vars <- sum(var_types == "factor" | var_types == "character")

# Output results
cat("Total variables: ", total_vars, "\n")

Total variables:  32

Code

cat("Numeric variables: ", num_vars, "\n")

Numeric variables:  18

Code

cat("Categorical variables: ", cat_vars, "\n")

Categorical variables:  13

Code

str(hotel_bookings)

spc_tbl_ [119,390 × 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ hotel                         : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
 $ is_canceled                   : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
 $ lead_time                     : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
 $ arrival_date_year             : num [1:119390] 2015 2015 2015 2015 2015 ...
 $ arrival_date_month            : chr [1:119390] "July" "July" "July" "July" ...
 $ arrival_date_week_number      : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
 $ arrival_date_day_of_month     : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
 $ stays_in_weekend_nights       : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ stays_in_week_nights          : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
 $ adults                        : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
 $ children                      : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ babies                        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ meal                          : chr [1:119390] "BB" "BB" "BB" "BB" ...
 $ country                       : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
 $ market_segment                : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
 $ distribution_channel          : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
 $ is_repeated_guest             : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ previous_cancellations        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ reserved_room_type            : chr [1:119390] "C" "C" "A" "A" ...
 $ assigned_room_type            : chr [1:119390] "C" "C" "C" "A" ...
 $ booking_changes               : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
 $ deposit_type                  : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
 $ agent                         : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
 $ company                       : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
 $ days_in_waiting_list          : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ customer_type                 : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
 $ adr                           : num [1:119390] 0 0 75 75 98 ...
 $ required_car_parking_spaces   : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ total_of_special_requests     : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
 $ reservation_status            : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
 $ reservation_status_date       : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
 - attr(*, "spec")=
  .. cols(
  ..   hotel = col_character(),
  ..   is_canceled = col_double(),
  ..   lead_time = col_double(),
  ..   arrival_date_year = col_double(),
  ..   arrival_date_month = col_character(),
  ..   arrival_date_week_number = col_double(),
  ..   arrival_date_day_of_month = col_double(),
  ..   stays_in_weekend_nights = col_double(),
  ..   stays_in_week_nights = col_double(),
  ..   adults = col_double(),
  ..   children = col_double(),
  ..   babies = col_double(),
  ..   meal = col_character(),
  ..   country = col_character(),
  ..   market_segment = col_character(),
  ..   distribution_channel = col_character(),
  ..   is_repeated_guest = col_double(),
  ..   previous_cancellations = col_double(),
  ..   previous_bookings_not_canceled = col_double(),
  ..   reserved_room_type = col_character(),
  ..   assigned_room_type = col_character(),
  ..   booking_changes = col_double(),
  ..   deposit_type = col_character(),
  ..   agent = col_character(),
  ..   company = col_character(),
  ..   days_in_waiting_list = col_double(),
  ..   customer_type = col_character(),
  ..   adr = col_double(),
  ..   required_car_parking_spaces = col_double(),
  ..   total_of_special_requests = col_double(),
  ..   reservation_status = col_character(),
  ..   reservation_status_date = col_date(format = "")
  .. )
 - attr(*, "problems")=<externalptr>

However, since the dim() function cannot count the number of different kinds of variables in the dataset, I used the sapply() function to construct a code that classifies the variables in the dataset (numeric variables: 18,categorical variables 13). However, the total number of variables in the dataset is 32, and after using str()double check, I found that “reservation_status_date” is a date type variable.

Code

#Summary statistics on the data
summary(hotel_bookings)

    hotel            is_canceled       lead_time   arrival_date_year
 Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
 Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
 Mode  :character   Median :0.0000   Median : 69   Median :2016     
                    Mean   :0.3704   Mean   :104   Mean   :2016     
                    3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
                    Max.   :1.0000   Max.   :737   Max.   :2017     
                                                                    
 arrival_date_month arrival_date_week_number arrival_date_day_of_month
 Length:119390      Min.   : 1.00            Min.   : 1.0             
 Class :character   1st Qu.:16.00            1st Qu.: 8.0             
 Mode  :character   Median :28.00            Median :16.0             
                    Mean   :27.17            Mean   :15.8             
                    3rd Qu.:38.00            3rd Qu.:23.0             
                    Max.   :53.00            Max.   :31.0             
                                                                      
 stays_in_weekend_nights stays_in_week_nights     adults      
 Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
 1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
 Median : 1.0000         Median : 2.0         Median : 2.000  
 Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
 3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
 Max.   :19.0000         Max.   :50.0         Max.   :55.000  
                                                              
    children           babies              meal             country         
 Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
 1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
 Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
 Mean   : 0.1039   Mean   : 0.007949                                        
 3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
 Max.   :10.0000   Max.   :10.000000                                        
 NA's   :4                                                                  
 market_segment     distribution_channel is_repeated_guest
 Length:119390      Length:119390        Min.   :0.00000  
 Class :character   Class :character     1st Qu.:0.00000  
 Mode  :character   Mode  :character     Median :0.00000  
                                         Mean   :0.03191  
                                         3rd Qu.:0.00000  
                                         Max.   :1.00000  
                                                          
 previous_cancellations previous_bookings_not_canceled reserved_room_type
 Min.   : 0.00000       Min.   : 0.0000                Length:119390     
 1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
 Median : 0.00000       Median : 0.0000                Mode  :character  
 Mean   : 0.08712       Mean   : 0.1371                                  
 3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
 Max.   :26.00000       Max.   :72.0000                                  
                                                                         
 assigned_room_type booking_changes   deposit_type          agent          
 Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
 Class :character   1st Qu.: 0.0000   Class :character   Class :character  
 Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
                    Mean   : 0.2211                                        
                    3rd Qu.: 0.0000                                        
                    Max.   :21.0000                                        
                                                                           
   company          days_in_waiting_list customer_type           adr         
 Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
 Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
 Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
                    Mean   :  2.321                         Mean   : 101.83  
                    3rd Qu.:  0.000                         3rd Qu.: 126.00  
                    Max.   :391.000                         Max.   :5400.00  
                                                                             
 required_car_parking_spaces total_of_special_requests reservation_status
 Min.   :0.00000             Min.   :0.0000            Length:119390     
 1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
 Median :0.00000             Median :0.0000            Mode  :character  
 Mean   :0.06252             Mean   :0.5714                              
 3rd Qu.:0.00000             3rd Qu.:1.0000                              
 Max.   :8.00000             Max.   :5.0000                              
                                                                         
 reservation_status_date
 Min.   :2014-10-17     
 1st Qu.:2016-02-01     
 Median :2016-08-07     
 Mean   :2016-07-30     
 3rd Qu.:2017-02-08     
 Max.   :2017-09-14

Then use the summary() function to perform a statistical summary of the dataset, but the summary() function does not perform a statistical summary of the nominal data.

Code

#Identify non-numeric (categorical) variables in the dataframe
non_numeric_vars <- sapply(hotel_bookings, function(x) !is.numeric(x))

# Subset your dataframe to include only these variables
non_numeric_data <- hotel_bookings[, non_numeric_vars]

#Now can now operate on "non-numeric data". For example, the "hotel" variable
summary_list <- lapply(names(non_numeric_data), function(var_name) {
  freq_table <- table(non_numeric_data[[var_name]])
  freq_df <- as.data.frame(freq_table)
  colnames(freq_df) <- c("Level", "Frequency")
  
  return(freq_df)
})

names(summary_list) <- names(non_numeric_data)
print(summary_list$hotel)

         Level Frequency
1   City Hotel     79330
2 Resort Hotel     40060

To compensate for the statistical subsidy of the summary() function on non-numeric variables, I used lapply() for frequency statistics on non-numeric variables.

Provide Grouped Summary Statistics

If I were a data analyst for a hotel company, I would be very interested in the cancellation rate data, so I chose to explore the relationship between the cancellation rate and other characteristics data when I conducted further data exploration.

Code

# Grouping hotel types and calculating cancellation rates and average booking lead times
hotel_stats <- hotel_bookings %>%
  group_by(hotel) %>%
  summarise(
    count = n(),
    cancel_rate = mean(is_canceled),
    avg_lead_time = mean(lead_time),
  )

# Grouping of meal types and calculation of cancellation rates
meal_stats <- hotel_bookings %>%
  group_by(meal) %>%
  summarise(
    count = n(),
    cancel_rate = mean(is_canceled),
  )

# Group the number of special requests and calculate the cancellation rate
requests_stats <- hotel_bookings %>%
  group_by(total_of_special_requests) %>%
  summarise(
    count = n(),
    cancel_rate = mean(is_canceled),
  )

# Grouping customer types and calculating cancellation rates and average booking lead times
customer_stats <- hotel_bookings %>%
  group_by(customer_type) %>%
  summarise(
    count = n(),
    cancel_rate = mean(is_canceled),
    avg_lead_time = mean(lead_time),
  )

# Grouping booking intervals and calculating cancellation rates
waiting_stats <- hotel_bookings %>%
  filter(days_in_waiting_list > 0) %>%
  summarise(
    count = n(),
    cancel_rate = mean(is_canceled),
  )

# Print the corresponding statistics
print(hotel_stats)

# A tibble: 2 × 4
  hotel        count cancel_rate avg_lead_time
  <chr>        <int>       <dbl>         <dbl>
1 City Hotel   79330       0.417         110. 
2 Resort Hotel 40060       0.278          92.7

Code

print(meal_stats)

# A tibble: 5 × 3
  meal      count cancel_rate
  <chr>     <int>       <dbl>
1 BB        92310       0.374
2 FB          798       0.599
3 HB        14463       0.345
4 SC        10650       0.372
5 Undefined  1169       0.245

Code

print(requests_stats)

# A tibble: 6 × 3
  total_of_special_requests count cancel_rate
                      <dbl> <int>       <dbl>
1                         0 70318       0.477
2                         1 33226       0.220
3                         2 12969       0.221
4                         3  2497       0.179
5                         4   340       0.106
6                         5    40       0.05

Code

print(customer_stats)

# A tibble: 4 × 4
  customer_type   count cancel_rate avg_lead_time
  <chr>           <int>       <dbl>         <dbl>
1 Contract         4076       0.310         143. 
2 Group             577       0.102          55.1
3 Transient       89613       0.407          93.3
4 Transient-Party 25124       0.254         137.

Code

print(waiting_stats)

# A tibble: 1 × 2
  count cancel_rate
  <int>       <dbl>
1  3698       0.638

Explain and Interpret

In exploratory data analysis of hotel booking behavior, I focused on specific groupings such as hotel type, meal type, number of special requests, customer type, and booking interval. We observed that city hotels (79,330 bookings) had significantly higher bookings and cancellation rates (41.73%) than resort hotels (40,060 bookings, 27.76% cancellation rate), potentially due to their location and diverse clientele.

Regarding meal type, although Full Board (FB) had the fewest bookings (798), it had the highest cancellation rate (59.90%), which might be due to its high price. As the number of special requests increased, the cancellation rate tended to decrease, suggesting that customers with specific demands are more likely to stick to their reservations.

Transient customers made the most bookings (89,613), but also had a relatively high cancellation rate (40.75%). On the other hand, Contract and Transient-Party customers had lower cancellation rates, 30.96% and 25.43% respectively, possibly due to group booking nature reducing cancellation likelihood.

Regarding the booking interval, we initially filtered reservations that had been waiting in the list for at least a day. Because for those reservations with no waiting time (i.e., booking interval of 0), calculating the cancellation rate may be meaningless. We found that bookings waiting in the list (3,698) were fewer than other categories but had a high cancellation rate of 63.79%, suggesting that long waits might lead to cancellations.

It’s important to note the limitations of statistical information. For instance, calculating an average cancellation rate could mask data distribution. If the data distribution is skewed, the average may not be a good measure. In this case, we might need to use other statistical methods such as median or mode to better describe the data.

These exploratory data analysis results provide some insights into hotel booking behavior but require further research and validation for a comprehensive understanding of factors influencing hotel booking behavior, enabling more accurate predictions and decisions.