Code
library(tidyverse)
library(caret)
library(randomForest)
<- read_csv("_data/hotel_bookings.csv")
data ::opts_chunk$set(echo = TRUE) knitr
Xinpeng Liu
June 13, 2023
In this project, we will be working with a hotel booking dataset. This dataset includes detailed booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, duration of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things.
I first explored the hotel booking dataset from 2015 to 2017, then understood the significance of each variable, cleaned the data by removing irrelevant columns and handling missing values, and finally, converted the date fields to the correct data type; this helped me formulate key research questions about hotel booking behaviors, after which I prepared a comprehensible narrative for non-experts to understand the findings.
hotel is_canceled lead_time arrival_date_year
Length:119390 Min. :0.0000 Min. : 0 Min. :2015
Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
Mode :character Median :0.0000 Median : 69 Median :2016
Mean :0.3704 Mean :104 Mean :2016
3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
Max. :1.0000 Max. :737 Max. :2017
arrival_date_month arrival_date_week_number arrival_date_day_of_month
Length:119390 Min. : 1.00 Min. : 1.0
Class :character 1st Qu.:16.00 1st Qu.: 8.0
Mode :character Median :28.00 Median :16.0
Mean :27.17 Mean :15.8
3rd Qu.:38.00 3rd Qu.:23.0
Max. :53.00 Max. :31.0
stays_in_weekend_nights stays_in_week_nights adults
Min. : 0.0000 Min. : 0.0 Min. : 0.000
1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
Median : 1.0000 Median : 2.0 Median : 2.000
Mean : 0.9276 Mean : 2.5 Mean : 1.856
3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
Max. :19.0000 Max. :50.0 Max. :55.000
children babies meal country
Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
Median : 0.0000 Median : 0.000000 Mode :character Mode :character
Mean : 0.1039 Mean : 0.007949
3rd Qu.: 0.0000 3rd Qu.: 0.000000
Max. :10.0000 Max. :10.000000
NA's :4
market_segment distribution_channel is_repeated_guest
Length:119390 Length:119390 Min. :0.00000
Class :character Class :character 1st Qu.:0.00000
Mode :character Mode :character Median :0.00000
Mean :0.03191
3rd Qu.:0.00000
Max. :1.00000
previous_cancellations previous_bookings_not_canceled reserved_room_type
Min. : 0.00000 Min. : 0.0000 Length:119390
1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
Median : 0.00000 Median : 0.0000 Mode :character
Mean : 0.08712 Mean : 0.1371
3rd Qu.: 0.00000 3rd Qu.: 0.0000
Max. :26.00000 Max. :72.0000
assigned_room_type booking_changes deposit_type agent
Length:119390 Min. : 0.0000 Length:119390 Length:119390
Class :character 1st Qu.: 0.0000 Class :character Class :character
Mode :character Median : 0.0000 Mode :character Mode :character
Mean : 0.2211
3rd Qu.: 0.0000
Max. :21.0000
company days_in_waiting_list customer_type adr
Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
Mode :character Median : 0.000 Mode :character Median : 94.58
Mean : 2.321 Mean : 101.83
3rd Qu.: 0.000 3rd Qu.: 126.00
Max. :391.000 Max. :5400.00
required_car_parking_spaces total_of_special_requests reservation_status
Min. :0.00000 Min. :0.0000 Length:119390
1st Qu.:0.00000 1st Qu.:0.0000 Class :character
Median :0.00000 Median :0.0000 Mode :character
Mean :0.06252 Mean :0.5714
3rd Qu.:0.00000 3rd Qu.:1.0000
Max. :8.00000 Max. :5.0000
reservation_status_date
Min. :2014-10-17
1st Qu.:2016-02-01
Median :2016-08-07
Mean :2016-07-30
3rd Qu.:2017-02-08
Max. :2017-09-14
in this part we done three things
Now, we can see the new data set.
library(tidyr)
# 1. Remove irrelevant or redundant columns
data <- data %>%
select(-c(agent, company))
# 2. Handle missing values: replace the missing values of the list of values with the average value
data$children[is.na(data$children)] <- mean(data$children, na.rm = TRUE)
# Note: we assume the day of each observation to be the 1st
data <- data %>% mutate(arrival_date = ymd(paste(arrival_date_year, arrival_date_month, arrival_date_day_of_month, sep = "-")))
#drop the original Year and Month columns
data <- data %>% select(-arrival_date_year, -arrival_date_month, -arrival_date_day_of_month)
data
The dataset that we are analyzing is about hotel bookings. It contains 119,390 records, each representing a separate hotel booking. These bookings span from the year 2015 to 2017, covering various hotels, customers from different countries, and diverse market segments. It is a rich dataset providing various insights into hotel booking patterns, customer preferences, and booking cancellations.
This dataset comprises various information about hotel bookings. For each booking, the following attributes are provided:
---
title: "Homework 2: Hotel bookings"
author: "Xinpeng Liu"
description: "Study for Hotel bookings data set"
date: "06/13/2023"
format:
html:
df-print: paged
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- hw2
- Xinpeng Liu
- Hotel_bookings
---
```{r}
#| label: setup
#| warning: false
library(tidyverse)
library(caret)
library(randomForest)
data <- read_csv("_data/hotel_bookings.csv")
knitr::opts_chunk$set(echo = TRUE)
```
## Project Overview
In this project, we will be working with a hotel booking dataset. This dataset includes detailed booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, duration of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things.
I first explored the hotel booking dataset from 2015 to 2017, then understood the significance of each variable, cleaned the data by removing irrelevant columns and handling missing values, and finally, converted the date fields to the correct data type; this helped me formulate key research questions about hotel booking behaviors, after which I prepared a comprehensible narrative for non-experts to understand the findings.
```{r}
```
## Read Data
```{r}
data
summary(data)
```
## Mutate variable and Clean
in this part we done three things
- 1. Removal of irrelevant or redundant columns: The first line of code is using the 'select' function from the 'dplyr' package to remove the 'agent' and 'company' columns from the 'data' data frame. These columns are considered unnecessary for further analysis.
- 2. Handling missing values: The next few lines are dealing with missing values in the 'children' column. Any 'NA' values in this column are being replaced with the mean value of the 'children' column (calculated without considering the 'NA' values).
- 3. Data type conversion: The last few lines of code are creating a new 'arrival_date' column by combining the 'arrival_date_year', 'arrival_date_month', and 'arrival_date_day_of_month' columns. The 'paste' function is used to concatenate these three columns into a string in the format 'year-month-day'. The 'as.Date' function is then used to convert this string into a date object. Finally, the original 'arrival_date_year', 'arrival_date_month', and 'arrival_date_day_of_month' columns are removed from the 'data' data frame as they are now redundant.
Now, we can see the new data set.
```{r}
library(tidyr)
# 1. Remove irrelevant or redundant columns
data <- data %>%
select(-c(agent, company))
# 2. Handle missing values: replace the missing values of the list of values with the average value
data$children[is.na(data$children)] <- mean(data$children, na.rm = TRUE)
# Note: we assume the day of each observation to be the 1st
data <- data %>% mutate(arrival_date = ymd(paste(arrival_date_year, arrival_date_month, arrival_date_day_of_month, sep = "-")))
#drop the original Year and Month columns
data <- data %>% select(-arrival_date_year, -arrival_date_month, -arrival_date_day_of_month)
data
```
## The narrative about the data set
The dataset that we are analyzing is about hotel bookings. It contains 119,390 records, each representing a separate hotel booking. These bookings span from the year 2015 to 2017, covering various hotels, customers from different countries, and diverse market segments. It is a rich dataset providing various insights into hotel booking patterns, customer preferences, and booking cancellations.
This dataset comprises various information about hotel bookings. For each booking, the following attributes are provided:
- hotel: Hotel (Resort Hotel or City Hotel) where the booking was made.
- is_canceled: Value indicating if the booking was canceled (1) or not (0).
- lead_time: Number of days that elapsed between the entering date of the booking into the Property Management System and the arrival date.
- arrival_date_year: Year of arrival date.
- arrival_date_month: Month of arrival date.
- arrival_date_week_number: Week number of year for arrival date.
- arrival_date_day_of_month: Day of arrival date.
- stays_in_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel.
- stays_in_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel.
- adults: Number of adults.
- children: Number of children.
- babies: Number of babies.
- meal: Type of meal booked.
- country: Country of origin of the booking.
- market_segment: Market segment designation.
- distribution_channel: Booking distribution channel.
- is_repeated_guest: Value indicating if the booking name was from a repeated guest (1) or not (0).
- previous_cancellations: Number of previous bookings that were cancelled by the customer prior to the current booking.
- previous_bookings_not_canceled: Number of previous bookings not cancelled by the customer prior to the current booking.
- reserved_room_type: Code of room type reserved.
- assigned_room_type: Code for the type of room assigned to the booking.
- booking_changes: Number of changes/amendments made to the booking from the moment the booking was entered on - the Property Management System until the moment of check-in or cancellation.
- deposit_type: Indication on if the customer made a deposit to guarantee the booking.
- agent: ID of the travel agency that made the booking.
- company: ID of the company/entity that made the booking or responsible for paying the booking.
- days_in_waiting_list: Number of days the booking was in the waiting list before it was confirmed to the customer.
- customer_type: Type of booking.
- adr: Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights.
- required_car_parking_spaces: Number of car parking spaces required by the customer.
- total_of_special_requests: Number of special requests made by the customer.
- reservation_status: Last reservation status, assuming one of three categories: Canceled, Check-Out, No-Show.
- reservation_status_date: Date at which the last status was set.
The data are with very long content.
## research questions
- 1. Booking Patterns: What are the busiest months for hotels? Does the lead time of booking relate to the busy periods?
- 2. Guest Preferences: Are certain room types more popular than others? Does the choice of room type correlate with any other variables like the number of adults, children, or babies in the party?
- 3. Cancellations: Are cancellations more likely for certain types of bookings (e.g., long lead time, specific room type, repeated guest etc.)? Can we predict if a booking will be cancelled based on certain characteristics?
- 4. Special Requests: How often do customers make special requests? Does the presence of a special request impact the likelihood of cancellation?
- 5. Pricing Strategies: Does the price (ADR) vary by the time of year or by room type? What factors might be influencing these pricing strategies?