Data Analytics and Computational Social Science: Erin Liu HW4

Erin Liu

Identify the dataset for the final project

I will be using the advanced dataset “hotel_bookings.csv” from the “Sample Datasets” section on Google Classroom for my final project. Identify the variables in the dataset

HW3_data<- read.csv('/Users/erinliu/Downloads/hotel_bookings.csv',TRUE,',')

Read in/clean the dataset

Since I will be studying the influence of different factors on ADR, I will first need to clean the dataset to have all positive ADR. This will reduce the rows of data from 119,390 to 117,430. And also I want two separate tables for different hotel types. So resort_hotel_data will have size of 39,308 and city_hotel_data will have size of 78,122.

 resort_hotel_data <- filter(filter(select(HW3_data,everything()),adr>0), hotel=='Resort Hotel')
 city_hotel_data <- filter(filter(select(HW3_data,everything()),adr>0), hotel=='City Hotel')

Identify potential research questions that your dataset can help answer.

This dataset contains hotel booking information, especially for ADR (average daily rate) and other factors for two types of hotels: resort hotel or city hotel. This dataset can be used to identify how much can different factors influence the ADR.

For example, how can lead_time (Number of days that elapsed between the entering date of the booking into the PMS and the arrival date) affect the daily rate and what’s the best time to book a hotel ahead of time? What is the cheapest date to book a hotel in a month and how much will it vary from month to month?

Descriptive Statistics

mean, median, and standard deviation for numerical variables, and frequencies for categorical variables.

summarize(resort_hotel_data, mean.resort_lead_time = mean(`lead_time`), 
          mean.resort_ADR = mean(`adr`), 
          mean.resort_days_weekend_stay = mean(`stays_in_weekend_nights`), 
          mean.resort_days_weekday_stay = mean(`stays_in_week_nights`))

  mean.resort_lead_time mean.resort_ADR mean.resort_days_weekend_stay
1              93.42996        96.76963                      1.204818
  mean.resort_days_weekday_stay
1                      3.164954

summarize(resort_hotel_data, median.resort_lead_time = median(`lead_time`), 
          median.resort_ADR = median(`adr`),
          median.resort_days_weekend_stay = median(`stays_in_weekend_nights`), 
          median.resort_days_weekday_stay = median(`stays_in_week_nights`))

  median.resort_lead_time median.resort_ADR
1                      59              76.8
  median.resort_days_weekend_stay median.resort_days_weekday_stay
1                               1                               3

summarize(resort_hotel_data, sd.resort_lead_time = sd(`lead_time`), 
          sd.resort_ADR = sd(`adr`), 
          sd.resort_days_weekend_stay = sd(`stays_in_weekend_nights`), 
          sd.resort_days_weekday_stay = sd(`stays_in_week_nights`))

  sd.resort_lead_time sd.resort_ADR sd.resort_days_weekend_stay
1            97.22353       60.5935                    1.145506
  sd.resort_days_weekday_stay
1                     2.44944

summarize(city_hotel_data, mean.city_lead_time = mean(`lead_time`), 
          mean.city_ADR = mean(`adr`), 
          mean.city_days_weekend_stay = mean(`stays_in_weekend_nights`), 
          mean.city_days_weekday_stay = mean(`stays_in_week_nights`))

  mean.city_lead_time mean.city_ADR mean.city_days_weekend_stay
1            110.9484      106.9328                   0.8013748
  mean.city_days_weekday_stay
1                     2.19732

summarize(city_hotel_data, median.city_lead_time = median(`lead_time`), 
          median.city_ADR = median(`adr`), 
          median.city_days_weekend_stay = median(`stays_in_weekend_nights`), 
          median.city_days_weekday_stay = median(`stays_in_week_nights`))

  median.city_lead_time median.city_ADR median.city_days_weekend_stay
1                    75             100                             1
  median.city_days_weekday_stay
1                             2

summarize(city_hotel_data, 
          sd.city_lead_time = sd(`lead_time`), 
          sd.city_ADR = sd(`adr`), 
          sd.city_days_weekend_stay = sd(`stays_in_weekend_nights`), 
          sd.city_days_weekday_stay = sd(`stays_in_week_nights`))

  sd.city_lead_time sd.city_ADR sd.city_days_weekend_stay
1          111.0005    41.91054                 0.8807563
  sd.city_days_weekday_stay
1                  1.434753

# trying to make the lead_time data into categories 
# as grouped by range for further analysis.
summarize(resort_hotel_data, max(`lead_time`))

  max(lead_time)
1            709

#709 
# after the calculation, the max value seems very extreme 
# and not a good reference for breaking the data into category group
summarize(city_hotel_data, max(`lead_time`)) #629

  max(lead_time)
1            629

# trying to make the lead_time data into categories for further analysis.
# so I change the idea to take the mean value of lead_time as a reference
# for the range of category group(triple the mean value: approximately 300) 
# and set the interval to be 30. This can be modified later.
resort_hotel_data %>% 
  group_by(resort_lead_time_category=cut(lead_time, breaks= seq(0, 300, by = 30), right=F) ) %>% 
  summarise(lead_time_frequency = n()) %>%
  arrange(as.numeric(resort_lead_time_category))

# A tibble: 11 × 2
   resort_lead_time_category lead_time_frequency
   <fct>                                   <int>
 1 [0,30)                                  14634
 2 [30,60)                                  5085
 3 [60,90)                                  3681
 4 [90,120)                                 2989
 5 [120,150)                                2514
 6 [150,180)                                2668
 7 [180,210)                                2048
 8 [210,240)                                1827
 9 [240,270)                                1290
10 [270,300)                                 904
11 <NA>                                     1668

city_hotel_data %>% 
  group_by(city_lead_time_category=cut(lead_time, breaks= seq(0, 300, by = 30), right=F) ) %>% 
  summarise(lead_time_frequency= n()) %>%
  arrange(as.numeric(city_lead_time_category))

# A tibble: 11 × 2
   city_lead_time_category lead_time_frequency
   <fct>                                 <int>
 1 [0,30)                                21979
 2 [30,60)                               11949
 3 [60,90)                                8949
 4 [90,120)                               7435
 5 [120,150)                              5520
 6 [150,180)                              5205
 7 [180,210)                              3654
 8 [210,240)                              2518
 9 [240,270)                              2128
10 [270,300)                              2525
11 <NA>                                   6260

This is the visualization between the city_lead_time_category and the count in each category. It shows the distribution of lead_time data in different ranges. We can tell from this graph that most of the lead_time happen in the range of 0-60 and the number of count decreases as lead_time goes larger. However, there are a quite number of extreme numbers around 270-300 which could be affecting the overall conclusion for further analysis.

city_hotel_data %>% 
  group_by(city_lead_time_category=cut(lead_time, breaks= seq(0, 300, by = 30), right=F) ) %>% 
  summarise(lead_time_frequency= n()) %>%
 arrange(as.numeric(city_lead_time_category)) %>%
  ggplot(aes(city_lead_time_category, group = 1)) + 
 geom_line(aes(y = lead_time_frequency, colour = "var0"))

This is the visualization between the resort_lead_time_category and the count in each category. It shows the distribution of lead_time data in different ranges. We can tell from this graph that most of the lead_time happen in the range of 0-60 and the number of count decreases as lead_time goes larger. However, there are a quite number of extreme numbers(relatively less than the city one) around 270-300 which could be affecting the overall conclusion for further analysis.

resort_hotel_data %>% 
  group_by(resort_lead_time_category=cut(lead_time, breaks= seq(0, 300, by = 30), right=F) ) %>% 
  summarise(lead_time_frequency= n()) %>%
 arrange(as.numeric(resort_lead_time_category)) %>%
  ggplot(aes(resort_lead_time_category, group = 1)) + 
 geom_line(aes(y = lead_time_frequency, colour = "var0")) + 
   geom_line(aes(y = lead_time_frequency, colour = "var1"))

This is the visualization of ADR’s distribution, and it helps us to understand the data more besides the conclusion from mean/median and we can see the peak of frequency happens in around 50.

ggplot(resort_hotel_data, aes(adr)) + 
  geom_histogram(aes(y = ..density..), alpha = 0.5) +
  geom_density(alpha = 0.2, fill="red")

This is the visualization between adr and lead_time, which is the essential question of this project to study the relation between adr and lead_time: how do they affect each other and what’s their distributions, etc. We can see the data points gather around the bottom left corner which shows some relation between the two variables.

ggplot(resort_hotel_data, aes(adr, lead_time)) + geom_point()

However, these are not satisfying plots as it is very hard to tell the relation. For this HW I spent some time to try to group the lead_time into different groups based on the range they fall into because the actual numeric values are scattered around and it’s impossible to simply count for their occurances. However, after I grouped them together, it created a new data frame which is simply just the count of each group and I tried to combine it with the original data and failed to do so. Ideally I want to show the relation of ADR with categorical data of lead_time in visualizations, but I found that it needs to be something like a tag for each row in the original data to mark the category of lead_time in order to combine it with ADR. So far I am thinking of adding a column in the original data to show the category of lead_time (A for 0-30, B for 30-60, C for 60-90, etc.) but failed to find a way to do that smartly. Not sure if this is the correct way to think about that, and any suggestion will be very appreciated!

Comment on this article Share:

Erin Liu HW4

Identify the dataset for the final project

Read in/clean the dataset

Identify potential research questions that your dataset can help answer.

Descriptive Statistics

Reuse

Citation