hotel_bookings Dataset
This dataset is a single file of booking records for two types of hotel, a resort hotel and a city hotel, which makes it possible to compare booking information between them.
Importing the data and reading the first 6 rows
# A tibble: 6 × 32
hotel is_canceled lead_time arrival_date_ye… arrival_date_mo…
<chr> <dbl> <dbl> <dbl> <chr>
1 Resort Hotel 0 342 2015 July
2 Resort Hotel 0 737 2015 July
3 Resort Hotel 0 7 2015 July
4 Resort Hotel 0 13 2015 July
5 Resort Hotel 0 14 2015 July
6 Resort Hotel 0 14 2015 July
# … with 27 more variables: arrival_date_week_number <dbl>,
# arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
# stays_in_week_nights <dbl>, adults <dbl>, children <dbl>,
# babies <dbl>, meal <chr>, country <chr>, market_segment <chr>,
# distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>,
# previous_bookings_not_canceled <dbl>, reserved_room_type <chr>, …
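The import step itself is not shown above. A minimal sketch of how it might look, assuming the data is a CSV read with readr: the inline `csv_text` string is a hypothetical stand-in for the real file (its path is not given in this post) and holds only the first three rows and five columns printed above.

```r
library(readr)

# Hypothetical stand-in for the real CSV file; the actual file path is not shown in this post.
csv_text <- "hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month
Resort Hotel,0,342,2015,July
Resort Hotel,0,737,2015,July
Resort Hotel,0,7,2015,July"

# I() tells read_csv() to treat the string as literal data rather than a file path.
hotel_bookings <- read_csv(I(csv_text), show_col_types = FALSE)

# head() returns the first six rows by default (fewer if the data has fewer rows).
head(hotel_bookings)
```

With the real file, the inline string would be replaced by the path to the CSV. Because `head()` defaults to six rows, the tibble printed above has six rows rather than five.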
skim() returns summary statistics for the variables in data frames, tibbles, data tables, and vectors. It also works with grouped data frames (source: https://cran.r-project.org/web/packages/skimr/vignettes/skimr.html)
Name | hotel_bookings |
Number of rows | 119390 |
Number of columns | 32 |
_______________________ | |
Column type frequency: | |
character | 13 |
Date | 1 |
numeric | 18 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
hotel | 0 | 1 | 10 | 12 | 0 | 2 | 0 |
arrival_date_month | 0 | 1 | 3 | 9 | 0 | 12 | 0 |
meal | 0 | 1 | 2 | 9 | 0 | 5 | 0 |
country | 0 | 1 | 2 | 4 | 0 | 178 | 0 |
market_segment | 0 | 1 | 6 | 13 | 0 | 8 | 0 |
distribution_channel | 0 | 1 | 3 | 9 | 0 | 5 | 0 |
reserved_room_type | 0 | 1 | 1 | 1 | 0 | 10 | 0 |
assigned_room_type | 0 | 1 | 1 | 1 | 0 | 12 | 0 |
deposit_type | 0 | 1 | 10 | 10 | 0 | 3 | 0 |
agent | 0 | 1 | 1 | 4 | 0 | 334 | 0 |
company | 0 | 1 | 1 | 4 | 0 | 353 | 0 |
customer_type | 0 | 1 | 5 | 15 | 0 | 4 | 0 |
reservation_status | 0 | 1 | 7 | 9 | 0 | 3 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
reservation_status_date | 0 | 1 | 2014-10-17 | 2017-09-14 | 2016-08-07 | 926 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
is_canceled | 0 | 1 | 0.37 | 0.48 | 0.00 | 0.00 | 0.00 | 1 | 1 | ▇▁▁▁▅ |
lead_time | 0 | 1 | 104.01 | 106.86 | 0.00 | 18.00 | 69.00 | 160 | 737 | ▇▂▁▁▁ |
arrival_date_year | 0 | 1 | 2016.16 | 0.71 | 2015.00 | 2016.00 | 2016.00 | 2017 | 2017 | ▃▁▇▁▆ |
arrival_date_week_number | 0 | 1 | 27.17 | 13.61 | 1.00 | 16.00 | 28.00 | 38 | 53 | ▅▇▇▇▅ |
arrival_date_day_of_month | 0 | 1 | 15.80 | 8.78 | 1.00 | 8.00 | 16.00 | 23 | 31 | ▇▇▇▇▆ |
stays_in_weekend_nights | 0 | 1 | 0.93 | 1.00 | 0.00 | 0.00 | 1.00 | 2 | 19 | ▇▁▁▁▁ |
stays_in_week_nights | 0 | 1 | 2.50 | 1.91 | 0.00 | 1.00 | 2.00 | 3 | 50 | ▇▁▁▁▁ |
adults | 0 | 1 | 1.86 | 0.58 | 0.00 | 2.00 | 2.00 | 2 | 55 | ▇▁▁▁▁ |
children | 4 | 1 | 0.10 | 0.40 | 0.00 | 0.00 | 0.00 | 0 | 10 | ▇▁▁▁▁ |
babies | 0 | 1 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0 | 10 | ▇▁▁▁▁ |
is_repeated_guest | 0 | 1 | 0.03 | 0.18 | 0.00 | 0.00 | 0.00 | 0 | 1 | ▇▁▁▁▁ |
previous_cancellations | 0 | 1 | 0.09 | 0.84 | 0.00 | 0.00 | 0.00 | 0 | 26 | ▇▁▁▁▁ |
previous_bookings_not_canceled | 0 | 1 | 0.14 | 1.50 | 0.00 | 0.00 | 0.00 | 0 | 72 | ▇▁▁▁▁ |
booking_changes | 0 | 1 | 0.22 | 0.65 | 0.00 | 0.00 | 0.00 | 0 | 21 | ▇▁▁▁▁ |
days_in_waiting_list | 0 | 1 | 2.32 | 17.59 | 0.00 | 0.00 | 0.00 | 0 | 391 | ▇▁▁▁▁ |
adr | 0 | 1 | 101.83 | 50.54 | -6.38 | 69.29 | 94.58 | 126 | 5400 | ▇▁▁▁▁ |
required_car_parking_spaces | 0 | 1 | 0.06 | 0.25 | 0.00 | 0.00 | 0.00 | 0 | 8 | ▇▁▁▁▁ |
total_of_special_requests | 0 | 1 | 0.57 | 0.79 | 0.00 | 0.00 | 0.00 | 1 | 5 | ▇▁▁▁▁ |
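The tables above are the kind of output `skim()` prints. A minimal sketch of the call, run on a small toy tibble with made-up values (not the real data) so that it is self-contained; the second call shows the grouped use, on the assumption that one would group by `hotel`:

```r
library(dplyr)
library(skimr)

# Toy data (hypothetical values) with one missing entry in `children`.
toy <- tibble(
  hotel    = c("Resort Hotel", "Resort Hotel", "City Hotel", "City Hotel"),
  adults   = c(2, 2, 1, 2),
  children = c(0, NA, 1, 0)
)

# Ungrouped summary: one row per variable, including n_missing and complete_rate.
toy_summary <- skim(toy)
toy_summary

# Grouped summary: the statistics are computed separately for each hotel.
toy %>%
  group_by(hotel) %>%
  skim()
```

The `n_missing` column is where a gap such as the missing `children` values shows up.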
The summary statistics above show that the hotel_bookings dataset has 119390 rows and 32 columns: 13 character variables, 18 numeric variables, and 1 date variable. The only missing values are 4 in the children variable. For the analysis that follows, I will use hotel, market_segment, stays_in_weekend_nights, and stays_in_week_nights.
hotel: type of hotel booked (Resort Hotel or City Hotel)
market_segment: market segment designation. In the categories, “TA” means “Travel Agents” and “TO” means “Tour Operators”
stays_in_weekend_nights: number of weekend nights the guest stayed at the hotel
stays_in_week_nights: number of week nights the guest stayed at the hotel
I use select() from the dplyr package, which is part of the tidyverse, together with the pipe operator to select these columns.
# A tibble: 119,390 × 4
hotel stays_in_weekend_ni… stays_in_week_nig… market_segment
<chr> <dbl> <dbl> <chr>
1 Resort Hotel 0 0 Direct
2 Resort Hotel 0 0 Direct
3 Resort Hotel 0 1 Direct
4 Resort Hotel 0 1 Corporate
5 Resort Hotel 0 2 Online TA
6 Resort Hotel 0 2 Online TA
7 Resort Hotel 0 2 Direct
8 Resort Hotel 0 2 Direct
9 Resort Hotel 0 3 Online TA
10 Resort Hotel 0 3 Offline TA/TO
# … with 119,380 more rows
[1] "Resort Hotel" "City Hotel"
[1] "Direct" "Corporate" "Online TA" "Offline TA/TO"
[5] "Complementary" "Groups" "Undefined" "Aviation"
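The column selection and the lists of distinct values above could be reproduced along these lines. This is a sketch on a toy tibble with hypothetical rows, and it assumes the two character vectors were produced with `unique()`:

```r
library(dplyr)

# Toy data (hypothetical rows): the four columns of interest plus one extra column.
toy <- tibble(
  hotel                   = c("Resort Hotel", "Resort Hotel", "City Hotel"),
  lead_time               = c(342, 7, 88),
  stays_in_weekend_nights = c(0, 0, 2),
  stays_in_week_nights    = c(0, 1, 3),
  market_segment          = c("Direct", "Corporate", "Online TA")
)

# %>% passes `toy` as the first argument of select(), which keeps only the named columns.
stays <- toy %>%
  select(hotel, stays_in_weekend_nights, stays_in_week_nights, market_segment)

# Distinct values of a column, in order of first appearance.
unique(stays$hotel)
unique(stays$market_segment)
```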
Bookings in different market segments
City hotel
# A tibble: 8 × 2
market_segment n
<chr> <int>
1 Aviation 237
2 Complementary 542
3 Corporate 2986
4 Direct 6093
5 Groups 13975
6 Offline TA/TO 16747
7 Online TA 38748
8 Undefined 2
Resort hotel
# A tibble: 6 × 2
market_segment n
<chr> <int>
1 Complementary 201
2 Corporate 2309
3 Direct 6513
4 Groups 5836
5 Offline TA/TO 7472
6 Online TA 17729
# A tibble: 14 × 3
# Groups: hotel, market_segment [14]
hotel market_segment n
<chr> <chr> <int>
1 City Hotel Aviation 237
2 City Hotel Complementary 542
3 City Hotel Corporate 2986
4 City Hotel Direct 6093
5 City Hotel Groups 13975
6 City Hotel Offline TA/TO 16747
7 City Hotel Online TA 38748
8 City Hotel Undefined 2
9 Resort Hotel Complementary 201
10 Resort Hotel Corporate 2309
11 Resort Hotel Direct 6513
12 Resort Hotel Groups 5836
13 Resort Hotel Offline TA/TO 7472
14 Resort Hotel Online TA 17729
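Counts like the ones above can be obtained either one hotel at a time or for both hotels in a single grouped call. A minimal sketch on toy data with hypothetical rows; the `# Groups:` line in the last printout suggests `group_by()` followed by `count()`, but that is an assumption about the original code:

```r
library(dplyr)

# Toy data (hypothetical rows).
toy <- tibble(
  hotel = c("City Hotel", "City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"),
  market_segment = c("Online TA", "Online TA", "Groups", "Direct", "Online TA")
)

# One hotel at a time: filter the rows, then count bookings per market segment.
city_counts <- toy %>%
  filter(hotel == "City Hotel") %>%
  count(market_segment)

# Both hotels at once: group by hotel and segment, then count the rows in each group.
both_counts <- toy %>%
  group_by(hotel, market_segment) %>%
  count()

city_counts
both_counts
```

The grouped version only produces rows for combinations that occur in the data, which is why the Resort Hotel has no Aviation or Undefined rows in the table above.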
How many days do people stay in the hotel?
Resort hotel
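The output for this part is not shown here. One way to approach the question, as a sketch only: assume the length of a stay is the sum of the weekend and week nights, and count how often each length occurs. The column name `total_nights` is introduced here for illustration, and the toy rows are hypothetical.

```r
library(dplyr)

# Toy data (hypothetical rows).
toy <- tibble(
  hotel                   = c("Resort Hotel", "Resort Hotel", "Resort Hotel", "City Hotel"),
  stays_in_weekend_nights = c(0, 0, 2, 1),
  stays_in_week_nights    = c(1, 2, 5, 3)
)

# Assumption: total length of stay = weekend nights + week nights.
resort_stays <- toy %>%
  filter(hotel == "Resort Hotel") %>%
  mutate(total_nights = stays_in_weekend_nights + stays_in_week_nights) %>%
  count(total_nights)

resort_stays
```

Replacing the filter value with "City Hotel" would give the corresponding table for the city hotel.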
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
sathvik_thogaru (2021, Aug. 18). DACSS 601 August 2021: sathvik_thogaru_homework4. Retrieved from https://mrolfe.github.io/DACSS601August2021/posts/2021-08-18-sathvikthogaruhomework4/
BibTeX citation
@misc{sathvik_thogaru2021sathvik_thogaru_homework4,
  author = {sathvik_thogaru, },
  title = {DACSS 601 August 2021: sathvik_thogaru_homework4},
  url = {https://mrolfe.github.io/DACSS601August2021/posts/2021-08-18-sathvikthogaruhomework4/},
  year = {2021}
}