HW3
I will be using the advanced dataset “hotel_bookings.csv” from the “Sample Datasets” section on Google Classroom for my final project.
Identify the variables in the dataset
dim(HW3_data)
[1] 119390 32
str(HW3_data)
'data.frame': 119390 obs. of 32 variables:
$ hotel : chr "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
$ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
$ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
$ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
$ arrival_date_month : chr "July" "July" "July" "July" ...
$ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
$ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
$ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
$ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
$ adults : int 2 2 1 1 2 2 2 2 2 2 ...
$ children : int 0 0 0 0 0 0 0 0 0 0 ...
$ babies : int 0 0 0 0 0 0 0 0 0 0 ...
$ meal : chr "BB" "BB" "BB" "BB" ...
$ country : chr "PRT" "PRT" "GBR" "GBR" ...
$ market_segment : chr "Direct" "Direct" "Direct" "Corporate" ...
$ distribution_channel : chr "Direct" "Direct" "Direct" "Corporate" ...
$ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
$ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
$ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
$ reserved_room_type : chr "C" "C" "A" "A" ...
$ assigned_room_type : chr "C" "C" "C" "A" ...
$ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
$ deposit_type : chr "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
$ agent : chr "NULL" "NULL" "NULL" "304" ...
$ company : chr "NULL" "NULL" "NULL" "NULL" ...
$ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
$ customer_type : chr "Transient" "Transient" "Transient" "Transient" ...
$ adr : num 0 0 75 75 98 ...
$ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
$ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
$ reservation_status : chr "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
$ reservation_status_date : chr "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
Since I will be studying how different factors influence ADR, I first need to clean the dataset so that it keeps only rows with a positive ADR. This reduces the data from 119,390 rows to 117,430. I also want a separate table for each hotel type: resort_hotel_data has 39,308 rows and city_hotel_data has 78,122 rows.
library(dplyr)
# Keep bookings with a positive ADR, one table per hotel type
resort_hotel_data <- filter(HW3_data, adr > 0, hotel == "Resort Hotel")
city_hotel_data <- filter(HW3_data, adr > 0, hotel == "City Hotel")
head(city_hotel_data)
hotel is_canceled lead_time arrival_date_year
1 City Hotel 1 88 2015
2 City Hotel 1 65 2015
3 City Hotel 1 92 2015
4 City Hotel 1 100 2015
5 City Hotel 1 79 2015
6 City Hotel 0 3 2015
arrival_date_month arrival_date_week_number
1 July 27
2 July 27
3 July 27
4 July 27
5 July 27
6 July 27
arrival_date_day_of_month stays_in_weekend_nights
1 1 0
2 1 0
3 1 2
4 2 0
5 2 0
6 2 0
stays_in_week_nights adults children babies meal country
1 4 2 0 0 BB PRT
2 4 1 0 0 BB PRT
3 4 2 0 0 BB PRT
4 2 2 0 0 BB PRT
5 3 2 0 0 BB PRT
6 3 1 0 0 HB PRT
market_segment distribution_channel is_repeated_guest
1 Online TA TA/TO 0
2 Online TA TA/TO 0
3 Online TA TA/TO 0
4 Online TA TA/TO 0
5 Online TA TA/TO 0
6 Groups TA/TO 0
previous_cancellations previous_bookings_not_canceled
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
reserved_room_type assigned_room_type booking_changes deposit_type
1 A A 0 No Deposit
2 A A 0 No Deposit
3 A A 0 No Deposit
4 A A 0 No Deposit
5 A A 0 No Deposit
6 A A 1 No Deposit
agent company days_in_waiting_list customer_type adr
1 9 NULL 0 Transient 76.50
2 9 NULL 0 Transient 68.00
3 9 NULL 0 Transient 76.50
4 9 NULL 0 Transient 76.50
5 9 NULL 0 Transient 76.50
6 1 NULL 0 Transient-Party 58.67
required_car_parking_spaces total_of_special_requests
1 0 1
2 0 1
3 0 2
4 0 1
5 0 1
6 0 0
reservation_status reservation_status_date
1 Canceled 2015-07-01
2 Canceled 2015-04-30
3 Canceled 2015-06-23
4 Canceled 2015-04-02
5 Canceled 2015-06-25
6 Check-Out 2015-07-05
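The positive-ADR filter and per-hotel split used to build resort_hotel_data and city_hotel_data can be sanity-checked by confirming that the two tables together account for every row with a positive ADR (39,308 + 78,122 = 117,430 in the counts quoted earlier). Below is a minimal sketch of that check, assuming dplyr is installed; `toy_bookings` is a small made-up stand-in for HW3_data, not real data.

```r
library(dplyr)

# Hypothetical toy stand-in for HW3_data (values are made up for illustration)
toy_bookings <- data.frame(
  hotel = c("Resort Hotel", "Resort Hotel", "City Hotel", "City Hotel", "City Hotel"),
  adr   = c(0, 75, 76.5, -5, 68),
  stringsAsFactors = FALSE
)

# Keep only rows with a positive ADR, then split by hotel type
positive_adr <- filter(toy_bookings, adr > 0)
toy_resort   <- filter(positive_adr, hotel == "Resort Hotel")
toy_city     <- filter(positive_adr, hotel == "City Hotel")

# The two subsets should add up to the filtered total
nrow(positive_adr)                                        # 3
nrow(toy_resort) + nrow(toy_city) == nrow(positive_adr)   # TRUE
```

Running the same comparison on the real tables, with nrow(resort_hotel_data) + nrow(city_hotel_data) against the number of rows where adr > 0, would flag any hotel label that is neither "Resort Hotel" nor "City Hotel".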
This dataset contains hotel booking information, in particular the ADR (average daily rate) and other factors, for two types of hotels: resort hotels and city hotels. It can be used to identify how much different factors influence the ADR.
For example, how does lead_time (the number of days between the date the booking was entered into the PMS and the arrival date) affect the daily rate, and how far ahead is it best to book a hotel? What is the cheapest date in a month to book a hotel, and how much does the rate vary from month to month?
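One possible starting point for these questions is to average ADR within lead-time bands and within arrival months. The sketch below is only an illustration under stated assumptions: `toy_city` is a small made-up data frame that reuses the dataset's column names, the lead-time break points (7, 30, and 90 days) are arbitrary choices, and a plain mean ignores cancellations and other confounding factors.

```r
library(dplyr)

# Hypothetical toy data with the same column names as the hotel dataset
toy_city <- data.frame(
  lead_time          = c(3, 5, 20, 45, 100, 200),
  arrival_date_month = c("July", "July", "August", "August", "July", "August"),
  adr                = c(120, 100, 90, 80, 70, 60),
  stringsAsFactors = FALSE
)

# Mean ADR by lead-time band (break points are an arbitrary assumption)
adr_by_lead <- toy_city %>%
  mutate(lead_band = cut(lead_time,
                         breaks = c(-Inf, 7, 30, 90, Inf),
                         labels = c("0-7", "8-30", "31-90", "91+"))) %>%
  group_by(lead_band) %>%
  summarise(mean_adr = mean(adr), n = n())

# Mean ADR by arrival month, with months kept in calendar order
adr_by_month <- toy_city %>%
  mutate(arrival_date_month = factor(arrival_date_month, levels = month.name)) %>%
  group_by(arrival_date_month) %>%
  summarise(mean_adr = mean(adr), n = n())

adr_by_lead
adr_by_month
```

The same group_by() and summarise() pattern could be applied to arrival_date_day_of_month to compare dates within a month, and to resort_hotel_data and city_hotel_data separately to compare the two hotel types.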
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Liu (2022, Jan. 3). Data Analytics and Computational Social Science: Erin Liu HW3. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomerinliuhw3/
BibTeX citation
@misc{liu2022erin,
  author = {Liu, Erin},
  title = {Data Analytics and Computational Social Science: Erin Liu HW3},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomerinliuhw3/},
  year = {2022}
}