Final project for Erin Liu
Nowadays when people book hotels, there are many factors that could affect the daily price of the hotel rooms (known as ADR: The number represents the average rental income per paid occupied room in a given time period[1]). In this project, I will be using the advanced dataset “hotel_bookings.csv” from the “Sample Datasets” section on Google Classroom and I will be studying the relationship between ADR(average daily rate for hotel booking) and the one of the most important factors that play a role in people’s decision in terms of booking hotel: lead_time, the amount of time period people book the hotel ahead of, and try to find a conclusion of which situation will give the cheapest daily rate between two different hotel types: city hotel and resort hotel.
HW3_data<- read.csv('/Users/erinliu/Downloads/hotel_bookings.csv',TRUE,',')
Since I will be studying the influence of different factors on ADR, I will first need to clean the dataset to have all positive ADR. In addition, I will shorter the range of lead_time to be between 0 and 300 and ADR below 300 for further analysis.
mean, median, and standard deviation for numerical variables, and frequencies for categorical variables.
summarize(resort_hotel_data, mean.resort_lead_time = mean(`lead_time`),
mean.resort_ADR = mean(`adr`),
mean.resort_days_weekend_stay = mean(`stays_in_weekend_nights`),
mean.resort_days_weekday_stay = mean(`stays_in_week_nights`))
mean.resort_lead_time mean.resort_ADR mean.resort_days_weekend_stay
1 89.52453 97.55005 1.25622
mean.resort_days_weekday_stay
1 3.271196
summarize(resort_hotel_data, median.resort_lead_time = median(`lead_time`),
median.resort_ADR = median(`adr`),
median.resort_days_weekend_stay = median(`stays_in_weekend_nights`),
median.resort_days_weekday_stay = median(`stays_in_week_nights`))
median.resort_lead_time median.resort_ADR
1 65 79.135
median.resort_days_weekend_stay median.resort_days_weekday_stay
1 1 3
summarize(resort_hotel_data, sd.resort_lead_time = sd(`lead_time`),
sd.resort_ADR = sd(`adr`),
sd.resort_days_weekend_stay = sd(`stays_in_weekend_nights`),
sd.resort_days_weekday_stay = sd(`stays_in_week_nights`))
sd.resort_lead_time sd.resort_ADR sd.resort_days_weekend_stay
1 81.72569 58.94009 1.143435
sd.resort_days_weekday_stay
1 2.446155
summarize(city_hotel_data, mean.city_lead_time = mean(`lead_time`),
mean.city_ADR = mean(`adr`),
mean.city_days_weekend_stay = mean(`stays_in_weekend_nights`),
mean.city_days_weekday_stay = mean(`stays_in_week_nights`))
mean.city_lead_time mean.city_ADR mean.city_days_weekend_stay
1 91.16719 108.6206 0.8371763
mean.city_days_weekday_stay
1 2.251454
summarize(city_hotel_data, median.city_lead_time = median(`lead_time`),
median.city_ADR = median(`adr`),
median.city_days_weekend_stay = median(`stays_in_weekend_nights`),
median.city_days_weekday_stay = median(`stays_in_week_nights`))
median.city_lead_time median.city_ADR median.city_days_weekend_stay
1 69 101 1
median.city_days_weekday_stay
1 2
summarize(city_hotel_data,
sd.city_lead_time = sd(`lead_time`),
sd.city_ADR = sd(`adr`),
sd.city_days_weekend_stay = sd(`stays_in_weekend_nights`),
sd.city_days_weekday_stay = sd(`stays_in_week_nights`))
sd.city_lead_time sd.city_ADR sd.city_days_weekend_stay
1 78.32797 36.94334 0.8889769
sd.city_days_weekday_stay
1 1.465679
This is the visualization between the city_lead_time_category and the count in each category. It shows the distribution of lead_time data in different ranges. We can tell from this graph that most of the lead_time happen in the range of 0-60 and the number of count decreases as lead_time goes larger.
city_hotel_data %>%
group_by(city_lead_time_category=cut(lead_time, breaks= seq(0, 300, by = 30), right=F) ) %>%
summarise(lead_time_frequency= n()) %>%
arrange(as.numeric(city_lead_time_category)) %>%
ggplot(aes(city_lead_time_category, group = 1)) +
geom_line(aes(y = lead_time_frequency, colour = "days that people book the hotel prior")) +
labs(title = "Density for lead_time groups for city hotel")
This is the visualization about the relationship between ADR and lead_time for city hotel. We can see that although it shows quite a large density around the range of 0-60 (as align with the previous graph), the ADR scatters around the range of 60-200 while the lead_time falls into the range of 0-60, and we can see that the count of higher price of ADR (larger than 150) reduces significantly once the lead_time falls beyond 100 days. This shows that generally price will go down a bit as the lead_time is longer, but not by a significant amount.
ggplot(city_hotel_data, aes(lead_time, adr)) + geom_point(alpha=0.04) +
labs(title = "The relationship between ADR and lead_time for city hotel with lead_time in category", x="days that people book the hotel prior")
This is the visualization between the resort_lead_time_category and the count in each category. It shows the distribution of lead_time data in different ranges. We can tell from this graph that most of the lead_time happen in the range of 0-30 and the number of count decreases as lead_time goes larger. As compare to the city hotel, the data of resort hotel shows a shorter range of lead_time as to be the popular option.
resort_hotel_data %>%
group_by(resort_lead_time_category=cut(lead_time, breaks= seq(0, 300, by = 30), right=F) ) %>%
summarise(lead_time_frequency= n()) %>%
arrange(as.numeric(resort_lead_time_category)) %>%
ggplot(aes(resort_lead_time_category, group = 1)) +
geom_line(aes(y = lead_time_frequency, colour = "days that people book the hotel prior")) +
labs(title = "Density for lead_time groups for resort hotel")
This is the visualization about the relationship between ADR and lead_time for resort hotel. We can see that although it shows quite a large density around the range of 0-30 (as align with the previous graph), the ADR scatters around the range of 30-150 while the lead_time falls into the range of 0-30, and we can see that the count of higher price of ADR (larger than 100) reduces significantly once the lead_time falls beyond 20 days. This shows that resort hotel has a lower average price as compare to city hotel and prices drop significantly more than city hotel as lead_time becomes longer. So in order to achieve a lower price when booking a resort hotel, people can generally book ahead of time for just a bit more days (no need to be longer than 20 days) to watch for the price to drop. This will be more efficient than waiting for the price to drop for city hotel because city hotel generally takes longer lead_time (around 60 days) to have a big variance in the ADR according to our data.
ggplot(resort_hotel_data, aes(lead_time, adr)) + geom_point(alpha=0.05) +
labs(title = "The relationship between ADR and lead_time for resort hotel with lead_time in category", x="days that people book the hotel prior")
This is the visualization of ADR’s distribution for resort hotel, and it helps us to understand the data more besides the conclusion from mean/median and we can see the peak of frequency happens in around 50.
ggplot(resort_hotel_data, aes(adr)) +
geom_histogram(aes(y = ..density..), alpha = 0.5) +
labs(title = "ADR density for resort hotel data") +
geom_density(alpha = 0.2, fill="red")
This is the visualization of ADR’s distribution for city hotel, and it helps us to understand the data more besides the conclusion from mean/median and we can see the peak of frequency happens in around 100.
ggplot(city_hotel_data, aes(adr)) +
geom_histogram(aes(y = ..density..), alpha = 0.5) +
labs(title = "ADR density for city hotel data") +
geom_density(alpha = 0.2, fill="red")
As we can see from all the above visualizations, city hotel has a larger average ADR(around 100) than resort hotel(around 50) and its price has less dependency on the lead_time variables. The price of a city hotel will rarely go down based on the lead_time and if people really want to book the hotel ahead of time to save money then they should look for the lead_time longer than 100 days, which is a big planning period ahead of time. On the other hand, resort hotels have a larger flexibility based on the lead_time and we observe the price to go down shortly after 20 days of lead_time. So it will be more efficient to book ahead of time for a resort hotel but not a city hotel. Besides this trend that I observed from this dataset, according to SmarterTravel.com, “As a general rule of thumb, booking more than 21 days ahead of your arrival date is a no-no for the most popular destinations” and people should be observing the price prior to their start day for around 40 days for some hotels[2]. This will give people a more clear idea of when to book and if booking ahead of time really helps with saving money on hotel price.
[1] https://en.wikipedia.org/wiki/Average_daily_rate
[2] https://www.usatoday.com/story/travel/hotels/2014/08/14/hotel-booking-tips/14006883/
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Liu (2022, Jan. 25). Data Analytics and Computational Social Science: When is the best time for hotel booking for different hotel types?. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomerinliufinal/
BibTeX citation
@misc{liu2022when, author = {Liu, Erin}, title = {Data Analytics and Computational Social Science: When is the best time for hotel booking for different hotel types?}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomerinliufinal/}, year = {2022} }