Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Zhongyue Lin
June 1, 2023
Today’s challenge is to
In this challenge 2 I chose the hotel_bookings.csv dataset as the raw data for the analysis
# A tibble: 6 × 32
hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Resort… 0 342 2015 July 27 1 0 0 2
2 Resort… 0 737 2015 July 27 1 0 0 2
3 Resort… 0 7 2015 July 27 1 0 1 1
4 Resort… 0 13 2015 July 27 1 0 1 1
5 Resort… 0 14 2015 July 27 1 0 2 2
6 Resort… 0 14 2015 July 27 1 0 2 2
# … with 22 more variables: children <dbl>, babies <dbl>, meal <chr>,
# country <chr>, market_segment <chr>, distribution_channel <chr>,
# is_repeated_guest <dbl>, previous_cancellations <dbl>,
# previous_bookings_not_canceled <dbl>, reserved_room_type <chr>,
# assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>,
# agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
[1] "The dataset has 119390 rows and 32 columns."
A quick preview of the data was performed after reading, and the head()
function and dim()
function were used to observe the underlying information of the data set (119,390 rows 32 columns).
# Check the type of each variable
var_types <- sapply(hotel_bookings, class)
# Calculate the total number of variables
total_vars <- length(var_types)
# Calculate the number of numerical and categorical variables
num_vars <- sum(var_types == "numeric" | var_types == "integer")
cat_vars <- sum(var_types == "factor" | var_types == "character")
# Output results
cat("Total variables: ", total_vars, "\n")
Total variables: 32
Numeric variables: 18
Categorical variables: 13
spc_tbl_ [119,390 × 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ hotel : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
$ is_canceled : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
$ lead_time : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
$ arrival_date_year : num [1:119390] 2015 2015 2015 2015 2015 ...
$ arrival_date_month : chr [1:119390] "July" "July" "July" "July" ...
$ arrival_date_week_number : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
$ arrival_date_day_of_month : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
$ stays_in_weekend_nights : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
$ stays_in_week_nights : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
$ adults : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
$ children : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
$ babies : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
$ meal : chr [1:119390] "BB" "BB" "BB" "BB" ...
$ country : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
$ market_segment : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
$ distribution_channel : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
$ is_repeated_guest : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
$ previous_cancellations : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
$ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
$ reserved_room_type : chr [1:119390] "C" "C" "A" "A" ...
$ assigned_room_type : chr [1:119390] "C" "C" "C" "A" ...
$ booking_changes : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
$ deposit_type : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
$ agent : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
$ company : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
$ days_in_waiting_list : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
$ customer_type : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
$ adr : num [1:119390] 0 0 75 75 98 ...
$ required_car_parking_spaces : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
$ total_of_special_requests : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
$ reservation_status : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
$ reservation_status_date : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
- attr(*, "spec")=
.. cols(
.. hotel = col_character(),
.. is_canceled = col_double(),
.. lead_time = col_double(),
.. arrival_date_year = col_double(),
.. arrival_date_month = col_character(),
.. arrival_date_week_number = col_double(),
.. arrival_date_day_of_month = col_double(),
.. stays_in_weekend_nights = col_double(),
.. stays_in_week_nights = col_double(),
.. adults = col_double(),
.. children = col_double(),
.. babies = col_double(),
.. meal = col_character(),
.. country = col_character(),
.. market_segment = col_character(),
.. distribution_channel = col_character(),
.. is_repeated_guest = col_double(),
.. previous_cancellations = col_double(),
.. previous_bookings_not_canceled = col_double(),
.. reserved_room_type = col_character(),
.. assigned_room_type = col_character(),
.. booking_changes = col_double(),
.. deposit_type = col_character(),
.. agent = col_character(),
.. company = col_character(),
.. days_in_waiting_list = col_double(),
.. customer_type = col_character(),
.. adr = col_double(),
.. required_car_parking_spaces = col_double(),
.. total_of_special_requests = col_double(),
.. reservation_status = col_character(),
.. reservation_status_date = col_date(format = "")
.. )
- attr(*, "problems")=<externalptr>
However, since the dim()
function cannot count the number of different kinds of variables in the dataset, I used the sapply()
function to construct a code that classifies the variables in the dataset (numeric variables: 18,categorical variables 13). However, the total number of variables in the dataset is 32, and after using str()
double check, I found that “reservation_status_date” is a date type variable.
hotel is_canceled lead_time arrival_date_year
Length:119390 Min. :0.0000 Min. : 0 Min. :2015
Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
Mode :character Median :0.0000 Median : 69 Median :2016
Mean :0.3704 Mean :104 Mean :2016
3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
Max. :1.0000 Max. :737 Max. :2017
arrival_date_month arrival_date_week_number arrival_date_day_of_month
Length:119390 Min. : 1.00 Min. : 1.0
Class :character 1st Qu.:16.00 1st Qu.: 8.0
Mode :character Median :28.00 Median :16.0
Mean :27.17 Mean :15.8
3rd Qu.:38.00 3rd Qu.:23.0
Max. :53.00 Max. :31.0
stays_in_weekend_nights stays_in_week_nights adults
Min. : 0.0000 Min. : 0.0 Min. : 0.000
1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
Median : 1.0000 Median : 2.0 Median : 2.000
Mean : 0.9276 Mean : 2.5 Mean : 1.856
3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
Max. :19.0000 Max. :50.0 Max. :55.000
children babies meal country
Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
Median : 0.0000 Median : 0.000000 Mode :character Mode :character
Mean : 0.1039 Mean : 0.007949
3rd Qu.: 0.0000 3rd Qu.: 0.000000
Max. :10.0000 Max. :10.000000
NA's :4
market_segment distribution_channel is_repeated_guest
Length:119390 Length:119390 Min. :0.00000
Class :character Class :character 1st Qu.:0.00000
Mode :character Mode :character Median :0.00000
Mean :0.03191
3rd Qu.:0.00000
Max. :1.00000
previous_cancellations previous_bookings_not_canceled reserved_room_type
Min. : 0.00000 Min. : 0.0000 Length:119390
1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
Median : 0.00000 Median : 0.0000 Mode :character
Mean : 0.08712 Mean : 0.1371
3rd Qu.: 0.00000 3rd Qu.: 0.0000
Max. :26.00000 Max. :72.0000
assigned_room_type booking_changes deposit_type agent
Length:119390 Min. : 0.0000 Length:119390 Length:119390
Class :character 1st Qu.: 0.0000 Class :character Class :character
Mode :character Median : 0.0000 Mode :character Mode :character
Mean : 0.2211
3rd Qu.: 0.0000
Max. :21.0000
company days_in_waiting_list customer_type adr
Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
Mode :character Median : 0.000 Mode :character Median : 94.58
Mean : 2.321 Mean : 101.83
3rd Qu.: 0.000 3rd Qu.: 126.00
Max. :391.000 Max. :5400.00
required_car_parking_spaces total_of_special_requests reservation_status
Min. :0.00000 Min. :0.0000 Length:119390
1st Qu.:0.00000 1st Qu.:0.0000 Class :character
Median :0.00000 Median :0.0000 Mode :character
Mean :0.06252 Mean :0.5714
3rd Qu.:0.00000 3rd Qu.:1.0000
Max. :8.00000 Max. :5.0000
reservation_status_date
Min. :2014-10-17
1st Qu.:2016-02-01
Median :2016-08-07
Mean :2016-07-30
3rd Qu.:2017-02-08
Max. :2017-09-14
Then use the summary()
function to perform a statistical summary of the dataset, but the summary()
function does not perform a statistical summary of the nominal data.
#Identify non-numeric (categorical) variables in the dataframe
non_numeric_vars <- sapply(hotel_bookings, function(x) !is.numeric(x))
# Subset your dataframe to include only these variables
non_numeric_data <- hotel_bookings[, non_numeric_vars]
#Now can now operate on "non-numeric data". For example, the "hotel" variable
summary_list <- lapply(names(non_numeric_data), function(var_name) {
freq_table <- table(non_numeric_data[[var_name]])
freq_df <- as.data.frame(freq_table)
colnames(freq_df) <- c("Level", "Frequency")
return(freq_df)
})
names(summary_list) <- names(non_numeric_data)
print(summary_list$hotel)
Level Frequency
1 City Hotel 79330
2 Resort Hotel 40060
To compensate for the statistical subsidy of the summary()
function on non-numeric variables, I used lapply()
for frequency statistics on non-numeric variables.
If I were a data analyst for a hotel company, I would be very interested in the cancellation rate data, so I chose to explore the relationship between the cancellation rate and other characteristics data when I conducted further data exploration.
# Grouping hotel types and calculating cancellation rates and average booking lead times
hotel_stats <- hotel_bookings %>%
group_by(hotel) %>%
summarise(
count = n(),
cancel_rate = mean(is_canceled),
avg_lead_time = mean(lead_time),
)
# Grouping of meal types and calculation of cancellation rates
meal_stats <- hotel_bookings %>%
group_by(meal) %>%
summarise(
count = n(),
cancel_rate = mean(is_canceled),
)
# Group the number of special requests and calculate the cancellation rate
requests_stats <- hotel_bookings %>%
group_by(total_of_special_requests) %>%
summarise(
count = n(),
cancel_rate = mean(is_canceled),
)
# Grouping customer types and calculating cancellation rates and average booking lead times
customer_stats <- hotel_bookings %>%
group_by(customer_type) %>%
summarise(
count = n(),
cancel_rate = mean(is_canceled),
avg_lead_time = mean(lead_time),
)
# Grouping booking intervals and calculating cancellation rates
waiting_stats <- hotel_bookings %>%
filter(days_in_waiting_list > 0) %>%
summarise(
count = n(),
cancel_rate = mean(is_canceled),
)
# Print the corresponding statistics
print(hotel_stats)
# A tibble: 2 × 4
hotel count cancel_rate avg_lead_time
<chr> <int> <dbl> <dbl>
1 City Hotel 79330 0.417 110.
2 Resort Hotel 40060 0.278 92.7
# A tibble: 5 × 3
meal count cancel_rate
<chr> <int> <dbl>
1 BB 92310 0.374
2 FB 798 0.599
3 HB 14463 0.345
4 SC 10650 0.372
5 Undefined 1169 0.245
# A tibble: 6 × 3
total_of_special_requests count cancel_rate
<dbl> <int> <dbl>
1 0 70318 0.477
2 1 33226 0.220
3 2 12969 0.221
4 3 2497 0.179
5 4 340 0.106
6 5 40 0.05
# A tibble: 4 × 4
customer_type count cancel_rate avg_lead_time
<chr> <int> <dbl> <dbl>
1 Contract 4076 0.310 143.
2 Group 577 0.102 55.1
3 Transient 89613 0.407 93.3
4 Transient-Party 25124 0.254 137.
# A tibble: 1 × 2
count cancel_rate
<int> <dbl>
1 3698 0.638
In exploratory data analysis of hotel booking behavior, I focused on specific groupings such as hotel type, meal type, number of special requests, customer type, and booking interval. We observed that city hotels (79,330 bookings) had significantly higher bookings and cancellation rates (41.73%) than resort hotels (40,060 bookings, 27.76% cancellation rate), potentially due to their location and diverse clientele.
Regarding meal type, although Full Board (FB) had the fewest bookings (798), it had the highest cancellation rate (59.90%), which might be due to its high price. As the number of special requests increased, the cancellation rate tended to decrease, suggesting that customers with specific demands are more likely to stick to their reservations.
Transient customers made the most bookings (89,613), but also had a relatively high cancellation rate (40.75%). On the other hand, Contract and Transient-Party customers had lower cancellation rates, 30.96% and 25.43% respectively, possibly due to group booking nature reducing cancellation likelihood.
Regarding the booking interval, we initially filtered reservations that had been waiting in the list for at least a day. Because for those reservations with no waiting time (i.e., booking interval of 0), calculating the cancellation rate may be meaningless. We found that bookings waiting in the list (3,698) were fewer than other categories but had a high cancellation rate of 63.79%, suggesting that long waits might lead to cancellations.
It’s important to note the limitations of statistical information. For instance, calculating an average cancellation rate could mask data distribution. If the data distribution is skewed, the average may not be a good measure. In this case, we might need to use other statistical methods such as median or mode to better describe the data.
These exploratory data analysis results provide some insights into hotel booking behavior but require further research and validation for a comprehensive understanding of factors influencing hotel booking behavior, enabling more accurate predictions and decisions.
---
title: "Challenge 2_Solutions "
author: "Zhongyue Lin"
description: "Data wrangling: using group() and summarise()"
date: "6/1/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- railroads
- faostat
- hotel_bookings
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
- hotel_bookings.csv ⭐⭐⭐⭐ In
In this challenge 2 I chose the **hotel_bookings.csv** dataset as the raw data for the analysis
```{r}
#Data Import
hotel_bookings <- read_csv("_data/hotel_bookings.csv")
```
## Describe the data
```{r}
#| label: summary
# Preview the first few rows of data
head(hotel_bookings)
# Display the dimension of the data frame
print(paste("The dataset has", dim(hotel_bookings)[1], "rows and", dim(hotel_bookings)[2], "columns."))
```
A quick preview of the data was performed after reading, and the `head()`function and `dim()` function were used to observe the underlying information of the data set (**119,390** rows **32** columns).
```{r}
# Check the type of each variable
var_types <- sapply(hotel_bookings, class)
# Calculate the total number of variables
total_vars <- length(var_types)
# Calculate the number of numerical and categorical variables
num_vars <- sum(var_types == "numeric" | var_types == "integer")
cat_vars <- sum(var_types == "factor" | var_types == "character")
# Output results
cat("Total variables: ", total_vars, "\n")
cat("Numeric variables: ", num_vars, "\n")
cat("Categorical variables: ", cat_vars, "\n")
str(hotel_bookings)
```
However, since the `dim()` function cannot count the number of different kinds of variables in the dataset, I used the `sapply()` function to construct a code that classifies the variables in the dataset (**numeric variables: 18**,**categorical variables 13**). However, the total number of variables in the dataset is **32**, and after using `str()`double check, I found that **"reservation_status_date"** is a date type variable.
```{r}
#Summary statistics on the data
summary(hotel_bookings)
```
Then use the `summary()` function to perform a statistical summary of the dataset, but the `summary()` function does not perform a statistical summary of the nominal data.
```{r}
#Identify non-numeric (categorical) variables in the dataframe
non_numeric_vars <- sapply(hotel_bookings, function(x) !is.numeric(x))
# Subset your dataframe to include only these variables
non_numeric_data <- hotel_bookings[, non_numeric_vars]
#Now can now operate on "non-numeric data". For example, the "hotel" variable
summary_list <- lapply(names(non_numeric_data), function(var_name) {
freq_table <- table(non_numeric_data[[var_name]])
freq_df <- as.data.frame(freq_table)
colnames(freq_df) <- c("Level", "Frequency")
return(freq_df)
})
names(summary_list) <- names(non_numeric_data)
print(summary_list$hotel)
```
To compensate for the statistical subsidy of the `summary()` function on non-numeric variables, I used `lapply()` for frequency statistics on non-numeric variables.
## Provide Grouped Summary Statistics
If I were a data analyst for a hotel company, I would be very interested in the **cancellation rate** data, so I chose to explore the relationship between the cancellation rate and other characteristics data when I conducted further data exploration.
```{r}
# Grouping hotel types and calculating cancellation rates and average booking lead times
hotel_stats <- hotel_bookings %>%
group_by(hotel) %>%
summarise(
count = n(),
cancel_rate = mean(is_canceled),
avg_lead_time = mean(lead_time),
)
# Grouping of meal types and calculation of cancellation rates
meal_stats <- hotel_bookings %>%
group_by(meal) %>%
summarise(
count = n(),
cancel_rate = mean(is_canceled),
)
# Group the number of special requests and calculate the cancellation rate
requests_stats <- hotel_bookings %>%
group_by(total_of_special_requests) %>%
summarise(
count = n(),
cancel_rate = mean(is_canceled),
)
# Grouping customer types and calculating cancellation rates and average booking lead times
customer_stats <- hotel_bookings %>%
group_by(customer_type) %>%
summarise(
count = n(),
cancel_rate = mean(is_canceled),
avg_lead_time = mean(lead_time),
)
# Grouping booking intervals and calculating cancellation rates
waiting_stats <- hotel_bookings %>%
filter(days_in_waiting_list > 0) %>%
summarise(
count = n(),
cancel_rate = mean(is_canceled),
)
# Print the corresponding statistics
print(hotel_stats)
print(meal_stats)
print(requests_stats)
print(customer_stats)
print(waiting_stats)
```
### Explain and Interpret
In exploratory data analysis of hotel booking behavior, I focused on specific groupings such as hotel type, meal type, number of special requests, customer type, and booking interval. We observed that city hotels (**79,330 bookings**) had significantly higher bookings and cancellation rates (**41.73%**) than resort hotels (**40,060 bookings**, **27.76% cancellation rate**), potentially due to their location and diverse clientele.
Regarding meal type, although Full Board (**FB**) had the fewest bookings (**798**), it had the highest cancellation rate (**59.90%**), which might be due to its high price. As the number of special requests increased, the cancellation rate tended to decrease, suggesting that customers with specific demands are more likely to stick to their reservations.
Transient customers made the most bookings (**89,613**), but also had a relatively high cancellation rate (**40.75%**). On the other hand, Contract and Transient-Party customers had lower cancellation rates, **30.96%** and **25.43%** respectively, possibly due to group booking nature reducing cancellation likelihood.
Regarding the booking interval, we initially filtered reservations that had been waiting in the list for at least a day. Because for those reservations with no waiting time (i.e., booking interval of 0), calculating the cancellation rate may be meaningless. We found that bookings waiting in the list (**3,698**) were fewer than other categories but had a high cancellation rate of **63.79%**, suggesting that long waits might lead to cancellations.
It's important to note the limitations of statistical information. For instance, calculating an average cancellation rate could mask data distribution. If the data distribution is skewed, the average may not be a good measure. In this case, we might need to use other statistical methods such as median or mode to better describe the data.
These exploratory data analysis results provide some insights into hotel booking behavior but require further research and validation for a comprehensive understanding of factors influencing hotel booking behavior, enabling more accurate predictions and decisions.