Data Analytics and Computational Social Science: Erin Liu HW5

Erin Liu

Dataset and project idea

I will be using the advanced dataset “hotel_bookings.csv” from the “Sample Datasets” section on Google Classroom for my final project. In this project, I will be studying the relationship between ADR(average daily rate for hotel booking) and several other common factors(such as lead_time, stays_in_weekend_nights, stays_in_week_nights, is_repeated_guest) that play a role in people’s decision in terms of booking hotel and try to find a conclusion of which situation will give the cheapest daily rate.

HW3_data<- read.csv('/Users/erinliu/Downloads/hotel_bookings.csv',TRUE,',')

Read in/clean the dataset

Since I will be studying the influence of different factors on ADR, I will first need to clean the dataset to have all positive ADR. In addition, I will shorter the range of lead_time to be between 0 and 300 for further analysis.

 resort_hotel_data <- filter(HW3_data, adr > 0 & hotel=="Resort Hotel" & lead_time>0 & lead_time<300)
 city_hotel_data <- filter(HW3_data, adr > 0 & hotel=="City Hotel" & lead_time>0 & lead_time<300)

Identify potential research questions that your dataset can help answer.

This dataset contains hotel booking information, especially for ADR (average daily rate) and other factors for two types of hotels: resort hotel or city hotel. This dataset can be used to identify how much can different factors influence the ADR.

For example, how can lead_time (Number of days that elapsed between the entering date of the booking into the PMS and the arrival date) affect the daily rate and what’s the best time to book a hotel ahead of time? What is the cheapest date to book a hotel in a month and how much will it vary from month to month?

Descriptive Statistics

mean, median, and standard deviation for numerical variables, and frequencies for categorical variables.

summarize(resort_hotel_data, mean.resort_lead_time = mean(`lead_time`), 
          mean.resort_ADR = mean(`adr`), 
          mean.resort_days_weekend_stay = mean(`stays_in_weekend_nights`), 
          mean.resort_days_weekday_stay = mean(`stays_in_week_nights`))

  mean.resort_lead_time mean.resort_ADR mean.resort_days_weekend_stay
1              89.35403         99.1029                      1.255716
  mean.resort_days_weekday_stay
1                       3.27007

summarize(resort_hotel_data, median.resort_lead_time = median(`lead_time`), 
          median.resort_ADR = median(`adr`),
          median.resort_days_weekend_stay = median(`stays_in_weekend_nights`), 
          median.resort_days_weekday_stay = median(`stays_in_week_nights`))

  median.resort_lead_time median.resort_ADR
1                      65             79.89
  median.resort_days_weekend_stay median.resort_days_weekday_stay
1                               1                               3

summarize(resort_hotel_data, sd.resort_lead_time = sd(`lead_time`), 
          sd.resort_ADR = sd(`adr`), 
          sd.resort_days_weekend_stay = sd(`stays_in_weekend_nights`), 
          sd.resort_days_weekday_stay = sd(`stays_in_week_nights`))

  sd.resort_lead_time sd.resort_ADR sd.resort_days_weekend_stay
1            81.60798      61.74735                    1.142275
  sd.resort_days_weekday_stay
1                    2.441784

summarize(city_hotel_data, mean.city_lead_time = mean(`lead_time`), 
          mean.city_ADR = mean(`adr`), 
          mean.city_days_weekend_stay = mean(`stays_in_weekend_nights`), 
          mean.city_days_weekday_stay = mean(`stays_in_week_nights`))

  mean.city_lead_time mean.city_ADR mean.city_days_weekend_stay
1            91.14922      108.8456                   0.8371182
  mean.city_days_weekday_stay
1                    2.251709

summarize(city_hotel_data, median.city_lead_time = median(`lead_time`), 
          median.city_ADR = median(`adr`), 
          median.city_days_weekend_stay = median(`stays_in_weekend_nights`), 
          median.city_days_weekday_stay = median(`stays_in_week_nights`))

  median.city_lead_time median.city_ADR median.city_days_weekend_stay
1                    69             101                             1
  median.city_days_weekday_stay
1                             2

summarize(city_hotel_data, 
          sd.city_lead_time = sd(`lead_time`), 
          sd.city_ADR = sd(`adr`), 
          sd.city_days_weekend_stay = sd(`stays_in_weekend_nights`), 
          sd.city_days_weekday_stay = sd(`stays_in_week_nights`))

  sd.city_lead_time sd.city_ADR sd.city_days_weekend_stay
1           78.3192    42.43589                 0.8889498
  sd.city_days_weekday_stay
1                  1.465614

# trying to make the lead_time data into categories 
# as grouped by range for further analysis.
summarize(resort_hotel_data, max(`lead_time`))

  max(lead_time)
1            299

#709 
# after the calculation, the max value seems very extreme 
# and not a good reference for breaking the data into category group
summarize(city_hotel_data, max(`lead_time`)) #629

  max(lead_time)
1            299

# trying to make the lead_time data into categories for further analysis.
# so I change the idea to take the mean value of lead_time as a reference
# for the range of category group(triple the mean value: approximately 300) 
# and set the interval to be 30. This can be modified later.
resort_hotel_data %>% 
  group_by(resort_lead_time_category=cut(lead_time, breaks= seq(0, 300, by = 30), right=F) ) %>% 
  summarise(lead_time_frequency = n()) %>%
  arrange(as.numeric(resort_lead_time_category))

# A tibble: 10 × 2
   resort_lead_time_category lead_time_frequency
   <fct>                                   <int>
 1 [0,30)                                  11548
 2 [30,60)                                  5085
 3 [60,90)                                  3681
 4 [90,120)                                 2989
 5 [120,150)                                2514
 6 [150,180)                                2668
 7 [180,210)                                2048
 8 [210,240)                                1827
 9 [240,270)                                1290
10 [270,300)                                 904

city_hotel_data %>% 
  group_by(city_lead_time_category=cut(lead_time, breaks= seq(0, 300, by = 30), right=F) ) %>% 
  summarise(lead_time_frequency= n()) %>%
  arrange(as.numeric(city_lead_time_category))

# A tibble: 10 × 2
   city_lead_time_category lead_time_frequency
   <fct>                                 <int>
 1 [0,30)                                19296
 2 [30,60)                               11949
 3 [60,90)                                8949
 4 [90,120)                               7435
 5 [120,150)                              5520
 6 [150,180)                              5205
 7 [180,210)                              3654
 8 [210,240)                              2518
 9 [240,270)                              2128
10 [270,300)                              2525

This is the visualization between the city_lead_time_category and the count in each category. It shows the distribution of lead_time data in different ranges. We can tell from this graph that most of the lead_time happen in the range of 0-60 and the number of count decreases as lead_time goes larger. However, there are a quite number of extreme numbers around 270-300 which could be affecting the overall conclusion for further analysis.

city_hotel_data %>% 
  group_by(city_lead_time_category=cut(lead_time, breaks= seq(0, 300, by = 30), right=F) ) %>% 
  summarise(lead_time_frequency= n()) %>%
  arrange(as.numeric(city_lead_time_category)) %>%
  ggplot(aes(city_lead_time_category, group = 1)) + 
  geom_line(aes(y = lead_time_frequency, colour = "var0")) +
  labs(title = "Density for lead_time groups for city hotel")

This is the visualization between the resort_lead_time_category and the count in each category. It shows the distribution of lead_time data in different ranges. We can tell from this graph that most of the lead_time happen in the range of 0-60 and the number of count decreases as lead_time goes larger. However, there are a quite number of extreme numbers(relatively less than the city one) around 270-300 which could be affecting the overall conclusion for further analysis.

resort_hotel_data %>% 
  group_by(resort_lead_time_category=cut(lead_time, breaks= seq(0, 300, by = 30), right=F) ) %>% 
  summarise(lead_time_frequency= n()) %>%
 arrange(as.numeric(resort_lead_time_category)) %>%
  ggplot(aes(resort_lead_time_category, group = 1)) + 
 geom_line(aes(y = lead_time_frequency, colour = "var0")) + 
  labs(title = "Density for lead_time groups for resort hotel")

This is the visualization of ADR’s distribution, and it helps us to understand the data more besides the conclusion from mean/median and we can see the peak of frequency happens in around 50.

ggplot(resort_hotel_data, aes(adr)) + 
  geom_histogram(aes(y = ..density..), alpha = 0.5) +
  labs(title = "ADR density for resort hotel data") + 
  geom_density(alpha = 0.2, fill="red")

This is the visualization for ADR and different room types. We can see that room type A has the largest number and almost every type has an average of daily rate around 0-200 except for room type F which seems to have a higher average around 200-300. This could be better scaled to show more information for the less popular types.

all_data <- filter(HW3_data, adr > 0 & lead_time>0 & lead_time<300 & adr <2000 )
ggplot(all_data, aes(adr)) + 
  geom_histogram(binwidth = 40) + 
  labs(title = "ADR with different room type reservations") + 
  theme_bw() +
  facet_wrap(vars(reserved_room_type))

This is the visualization between different customer types and ADR. We can see that type Transient is the most popular type and almost every type has a median of ADR around 100. This could be better scaled to show more information for the less popular types.

all_data <- filter(HW3_data, adr > 0 & lead_time>0 & lead_time<300 & adr <2000 )
ggplot(all_data, aes(adr)) + 
  geom_histogram(binwidth = 100) + 
  labs(title = "ADR with different customer types") + 
  theme_bw() +
  facet_wrap(vars(customer_type))

This is the visualization between adr and lead_time, which is the essential question of this project to study the relation between adr and lead_time: how do they affect each other and what’s their distributions, etc. We can see the data points gather around the bottom left corner and centralized around x=60. It also shows that people can usually get a good price with lead_time less than 20 days so planning ahead of time isn’t that necessary to get a good price.

ggplot(resort_hotel_data, aes(adr, lead_time)) + geom_point(alpha=0.05) + 
labs(title = "The relationship between ADR and lead_time for resort hotel")

For the next HW I will compare the trend in different hotel types(resort vs city) which has not been implemented so far. And some visualizations in this HW need to be scaled properly to show more information. I also want them to show numeric data point on the graph. I used to think that lead_time will influence the ADR largely under the impression that my parents always booked the hotel way ahead of time. But it turns out to be not the case and the current conclusion is that lead_time does not have a clear relationship with ADR. Room types and customer types will definitely influence the ADR and I will further study the relationship in it. I think for naive reader to fully understand my graphs they will need to look at the x and y value in the graph and the title and my explanation.

Comment on this article Share:

Erin Liu HW5

Dataset and project idea

Read in/clean the dataset

Identify potential research questions that your dataset can help answer.

Descriptive Statistics

Reuse

Citation