HW6
Many factors can affect the daily price of a hotel room, known as the ADR (average daily rate): the average rental income per paid occupied room in a given time period [1]. In this project, I use the advanced dataset “hotel_bookings.csv” from the “Sample Datasets” section on Google Classroom to study the relationship between ADR and several common factors that play a role in people’s booking decisions: how far ahead of the stay the booking was made (lead_time), customer type, and room type. The goal is to find out which situations give the cheapest daily rate.
library(tidyverse)  # provides the dplyr verbs, %>% and ggplot2 used below
HW3_data <- read.csv('/Users/erinliu/Downloads/hotel_bookings.csv', header = TRUE, sep = ',')
Since I am studying the influence of different factors on ADR, I first clean the dataset so that every ADR is positive. In addition, I restrict lead_time to the range between 0 and 300 and ADR to below 300 for the rest of the analysis.
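The cleaning code itself is not shown in this post, so the following is only a minimal sketch of what that step might look like. It assumes that the `hotel` column holds the values "Resort Hotel" and "City Hotel" (an assumption about this dataset), reuses the thresholds from the `all_data` filter that appears later, and runs on a small made-up data frame (`toy_bookings`, hypothetical) instead of the real file.

```r
library(dplyr)

# Hypothetical toy rows standing in for hotel_bookings.csv
toy_bookings <- data.frame(
  hotel     = c("Resort Hotel", "City Hotel", "City Hotel", "Resort Hotel", "City Hotel"),
  adr       = c(80, 120, -5, 450, 95),
  lead_time = c(10, 45, 20, 30, 320)
)

# Keep rows with 0 < adr < 300 and 0 < lead_time < 300
clean_bookings <- filter(toy_bookings,
                         adr > 0, adr < 300,
                         lead_time > 0, lead_time < 300)

# Split the cleaned data by hotel type (assumed values of the hotel column)
resort_hotel_data <- filter(clean_bookings, hotel == "Resort Hotel")
city_hotel_data   <- filter(clean_bookings, hotel == "City Hotel")

nrow(resort_hotel_data)
nrow(city_hotel_data)
```

Note that `lead_time > 0` drops same-day bookings; using `lead_time >= 0` would keep them if the range is meant to include 0.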
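The summaries below report the mean, median, and standard deviation of the numerical variables for each hotel type; the categorical variables (room type and customer type) are examined through faceted histograms later on.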
summarize(resort_hotel_data,
          mean.resort_lead_time = mean(`lead_time`),
          mean.resort_ADR = mean(`adr`),
          mean.resort_days_weekend_stay = mean(`stays_in_weekend_nights`),
          mean.resort_days_weekday_stay = mean(`stays_in_week_nights`))
  mean.resort_lead_time mean.resort_ADR mean.resort_days_weekend_stay
1              89.52453        97.55005                       1.25622
  mean.resort_days_weekday_stay
1                      3.271196
summarize(resort_hotel_data,
          median.resort_lead_time = median(`lead_time`),
          median.resort_ADR = median(`adr`),
          median.resort_days_weekend_stay = median(`stays_in_weekend_nights`),
          median.resort_days_weekday_stay = median(`stays_in_week_nights`))
  median.resort_lead_time median.resort_ADR
1                      65            79.135
  median.resort_days_weekend_stay median.resort_days_weekday_stay
1                               1                               3
summarize(resort_hotel_data,
          sd.resort_lead_time = sd(`lead_time`),
          sd.resort_ADR = sd(`adr`),
          sd.resort_days_weekend_stay = sd(`stays_in_weekend_nights`),
          sd.resort_days_weekday_stay = sd(`stays_in_week_nights`))
  sd.resort_lead_time sd.resort_ADR sd.resort_days_weekend_stay
1            81.72569      58.94009                    1.143435
  sd.resort_days_weekday_stay
1                    2.446155
summarize(city_hotel_data,
          mean.city_lead_time = mean(`lead_time`),
          mean.city_ADR = mean(`adr`),
          mean.city_days_weekend_stay = mean(`stays_in_weekend_nights`),
          mean.city_days_weekday_stay = mean(`stays_in_week_nights`))
  mean.city_lead_time mean.city_ADR mean.city_days_weekend_stay
1            91.16719      108.6206                   0.8371763
  mean.city_days_weekday_stay
1                    2.251454
summarize(city_hotel_data,
          median.city_lead_time = median(`lead_time`),
          median.city_ADR = median(`adr`),
          median.city_days_weekend_stay = median(`stays_in_weekend_nights`),
          median.city_days_weekday_stay = median(`stays_in_week_nights`))
  median.city_lead_time median.city_ADR median.city_days_weekend_stay
1                    69             101                             1
  median.city_days_weekday_stay
1                             2
summarize(city_hotel_data,
          sd.city_lead_time = sd(`lead_time`),
          sd.city_ADR = sd(`adr`),
          sd.city_days_weekend_stay = sd(`stays_in_weekend_nights`),
          sd.city_days_weekday_stay = sd(`stays_in_week_nights`))
  sd.city_lead_time sd.city_ADR sd.city_days_weekend_stay
1          78.32797    36.94334                 0.8889769
  sd.city_days_weekday_stay
1                  1.465679
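The six `summarize()` calls above repeat the same pattern, and no frequencies are computed for the categorical variables. As a sketch only, the same statistics could be produced in one grouped call with `across()`, and category frequencies with `count()`. The data frame `toy_bookings` and its values are hypothetical; only the column names follow the dataset, and grouping on a `hotel` column is an assumption about how the two subsets were created.

```r
library(dplyr)

# Hypothetical toy rows; column names follow hotel_bookings.csv
toy_bookings <- data.frame(
  hotel         = c("City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"),
  lead_time     = c(10, 30, 20, 60),
  adr           = c(100, 120, 50, 70),
  customer_type = c("Transient", "Transient", "Group", "Transient")
)

# Mean, median and standard deviation of each numeric column, per hotel type
numeric_summary <- toy_bookings %>%
  group_by(hotel) %>%
  summarise(across(c(lead_time, adr),
                   list(mean = mean, median = median, sd = sd)))

# Frequencies of a categorical variable
customer_counts <- count(toy_bookings, customer_type)

numeric_summary
customer_counts
```

The resulting columns are named `<column>_<statistic>`, for example `adr_median`, so the output stays readable as more variables are added.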
This plot shows the number of city hotel bookings in each lead_time category (city_lead_time_category), that is, how lead_time is distributed across 30-day ranges. Most bookings have a lead_time in the range 0-60, and the count decreases as lead_time grows.
city_hotel_data %>%
  group_by(city_lead_time_category = cut(lead_time, breaks = seq(0, 300, by = 30), right = FALSE)) %>%
  summarise(lead_time_frequency = n()) %>%
  arrange(as.numeric(city_lead_time_category)) %>%
  ggplot(aes(city_lead_time_category, group = 1)) +
  geom_line(aes(y = lead_time_frequency, colour = "count for lead_time")) +
  labs(title = "Density for lead_time groups for city hotel")
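To make the binning step concrete, here is a small sketch of how `cut()` with `right = FALSE` assigns values to 30-day categories. The `lead_time` values are made up for illustration.

```r
# Hypothetical lead_time values (in days)
lead_time <- c(0, 29, 30, 59, 299)

# right = FALSE makes each bin closed on the left and open on the right,
# so 30 falls into [30,60) rather than [0,30)
bins <- cut(lead_time, breaks = seq(0, 300, by = 30), right = FALSE)

as.character(bins)
table(bins)
```

With these breaks, a value of exactly 300 would fall outside the last bin and become `NA`, which fits a dataset already restricted to lead_time below 300.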
This plot shows the relationship between ADR and lead_time for the city hotel. The points are densest where lead_time is in the range 0-60 (in line with the previous graph), and within that range ADR is scattered mostly between 60 and 200. The number of higher-priced bookings (ADR above 150) drops noticeably once lead_time goes beyond 100 days. This suggests that prices generally go down a bit as lead_time gets longer, but not by a significant amount.
ggplot(city_hotel_data, aes(lead_time, adr)) +
  geom_point(alpha = 0.04) +
  labs(title = "The relationship between ADR and lead_time for city hotel with lead_time in category")
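The trend in the scatter plot is read off by eye. One way to check it numerically, sketched here on made-up data, is to compare the median ADR across lead_time ranges. The data frame `toy_city` and the break points 60 and 100 are illustrative assumptions taken from the ranges discussed in the text, not results from the real dataset.

```r
library(dplyr)

# Hypothetical toy rows standing in for city_hotel_data
toy_city <- data.frame(
  lead_time = c(5, 20, 45, 70, 110, 150),
  adr       = c(150, 130, 120, 110, 90, 100)
)

# Number of bookings and median ADR in each lead_time range
adr_by_lead_time <- toy_city %>%
  mutate(lead_time_category = cut(lead_time,
                                  breaks = c(0, 60, 100, 300),
                                  right = FALSE)) %>%
  group_by(lead_time_category) %>%
  summarise(bookings = n(), median_adr = median(adr))

adr_by_lead_time
```

Applied to the real `city_hotel_data` or `resort_hotel_data`, a table like this would show how large the price difference between short and long lead times actually is.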
This plot shows the number of resort hotel bookings in each lead_time category (resort_lead_time_category), that is, how lead_time is distributed across 30-day ranges. Most bookings have a lead_time in the range 0-30, and the count decreases as lead_time grows. Compared with the city hotel, the most common lead_time range for the resort hotel is shorter.
resort_hotel_data %>%
  group_by(resort_lead_time_category = cut(lead_time, breaks = seq(0, 300, by = 30), right = FALSE)) %>%
  summarise(lead_time_frequency = n()) %>%
  arrange(as.numeric(resort_lead_time_category)) %>%
  ggplot(aes(resort_lead_time_category, group = 1)) +
  geom_line(aes(y = lead_time_frequency, colour = "count for lead_time")) +
  labs(title = "Density for lead_time groups for resort hotel")
This plot shows the relationship between ADR and lead_time for the resort hotel. The points are densest where lead_time is in the range 0-30 (in line with the previous graph), and within that range ADR is scattered mostly between 30 and 150. The number of higher-priced bookings (ADR above 100) drops noticeably once lead_time goes beyond 20 days. This suggests that the resort hotel has a lower average price than the city hotel and that its prices drop more sharply as lead_time gets longer. So, to get a lower price at a resort hotel, people generally only need to book a little further ahead (a lead_time of about 20 days is enough). This is more efficient than booking ahead for the city hotel, where, according to our data, it takes a longer lead_time (around 60 days) before ADR varies much.
ggplot(resort_hotel_data, aes(lead_time, adr)) +
  geom_point(alpha = 0.05) +
  labs(title = "The relationship between ADR and lead_time for resort hotel with lead_time in category")
This plot shows the distribution of ADR for the resort hotel. It adds detail beyond the mean and median: the frequency peaks at an ADR of around 50.
ggplot(resort_hotel_data, aes(adr)) +
  # after_stat(density) replaces the deprecated ..density.. notation
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5) +
  labs(title = "ADR density for resort hotel data") +
  geom_density(alpha = 0.2, fill = "red")
This plot shows the distribution of ADR for the city hotel. It adds detail beyond the mean and median: the frequency peaks at an ADR of around 100.
ggplot(city_hotel_data, aes(adr)) +
  # after_stat(density) replaces the deprecated ..density.. notation
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5) +
  labs(title = "ADR density for city hotel data") +
  geom_density(alpha = 0.2, fill = "red")
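The peaks of the two distributions are estimated visually from the plots. As a rough sketch, the location of a peak can also be computed from a kernel density estimate with base R's `density()`. The vector `toy_adr` is hypothetical, and the default bandwidth is assumed to be close enough to what the plot draws for a quick check.

```r
# Hypothetical ADR values with most bookings near 50 and a few expensive ones
toy_adr <- c(40, 45, 48, 50, 50, 52, 55, 60, 90, 140)

# Kernel density estimate with the default bandwidth
dens <- density(toy_adr)

# ADR value at which the estimated density is highest
peak_adr <- dens$x[which.max(dens$y)]

peak_adr
mean(toy_adr)
median(toy_adr)
```

Comparing the peak with the mean and median shows how a few high-priced bookings pull the mean above the most common price, which is why the peak and the mean of the same distribution can differ noticeably.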
This plot shows the distribution of ADR for each reserved room type. Room type A has the largest number of bookings, and for almost every type ADR falls mostly in the range 0-200, except for room type F, which seems to have a higher typical rate of around 200-300.
all_data <- filter(HW3_data, adr > 0 & lead_time > 0 & lead_time < 300 & adr < 300)
ggplot(all_data, aes(adr)) +
  geom_histogram(binwidth = 40) +
  labs(title = "ADR with different room type reservations") +
  theme_bw() +
  facet_wrap(vars(reserved_room_type))
This plot shows the distribution of ADR for each customer type. Transient is the most common customer type, and almost every type has a median ADR of around 100.
all_data <- filter(HW3_data, adr > 0 & lead_time > 0 & lead_time < 300 & adr < 300)
ggplot(all_data, aes(adr)) +
  geom_histogram(binwidth = 100) +
  labs(title = "ADR with different customer types") +
  theme_bw() +
  facet_wrap(vars(customer_type))
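A histogram with wide bins does not show the median directly, so a per-group summary can back up statements about the most common type and the median ADR. The sketch below uses a made-up data frame (`toy_bookings`, hypothetical values); only the column names `customer_type` and `adr` come from the dataset.

```r
library(dplyr)

# Hypothetical toy rows; column names follow hotel_bookings.csv
toy_bookings <- data.frame(
  customer_type = c("Transient", "Transient", "Transient",
                    "Group", "Contract", "Contract"),
  adr           = c(90, 100, 130, 80, 95, 105)
)

# Number of bookings and median ADR per customer type, most common type first
adr_by_customer <- toy_bookings %>%
  group_by(customer_type) %>%
  summarise(bookings = n(), median_adr = median(adr)) %>%
  arrange(desc(bookings))

adr_by_customer
```

Replacing `customer_type` with `reserved_room_type` would give the same kind of table for the room type comparison.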
As the visualizations above show, the city hotel has a higher typical ADR than the resort hotel (the distributions peak at around 100 and around 50, and the means are 108.6 and 97.6, respectively), and its price depends less on lead_time. The price of a city hotel rarely goes down with lead_time; people who want to save money by booking a city hotel early should look at lead times longer than 100 days, which means planning far ahead. Resort hotel prices, on the other hand, are more sensitive to lead_time: we observe prices going down once lead_time passes about 20 days. So booking ahead is more worthwhile for a resort hotel than for a city hotel. These results give people a clearer idea of when to book and whether booking ahead really saves money on hotel prices.
[1] https://en.wikipedia.org/wiki/Average_daily_rate
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Liu (2022, Jan. 20). Data Analytics and Computational Social Science: Erin Liu HW6. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomerinliuhw6/
BibTeX citation
@misc{liu2022erin, author = {Liu, Erin}, title = {Data Analytics and Computational Social Science: Erin Liu HW6}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomerinliuhw6/}, year = {2022} }