An introduction to the hotel dataset
I will be processing a dataset about hotel bookings from 2015 to 2017. First, I import the dataset, which requires the tidyverse package to be loaded.
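The import call itself is not shown in this post. As a minimal sketch, assuming the data is stored as a CSV file, read_csv() from readr (attached when the tidyverse is loaded) would read it into a tibble. The temporary file and the three stand-in rows below are hypothetical and exist only so the example runs on its own.

```r
library(readr)  # attached by library(tidyverse)

# Hypothetical stand-in file: write a tiny CSV so the sketch is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "hotel,is_canceled,lead_time",
  "Resort Hotel,0,342",
  "Resort Hotel,0,737",
  "City Hotel,1,7"
), csv_path)

# read_csv() returns a tibble and guesses each column's type.
toy_hotel_data <- read_csv(csv_path, show_col_types = FALSE)
head(toy_hotel_data)
```

With the real file, only the path passed to read_csv() would change.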
Next, I use head() to give an example of what the dataset looks like.
head(hotel_data)
# A tibble: 6 × 32
hotel is_canceled lead_time arrival_date_ye… arrival_date_mo…
<chr> <dbl> <dbl> <dbl> <chr>
1 Resort Hotel 0 342 2015 July
2 Resort Hotel 0 737 2015 July
3 Resort Hotel 0 7 2015 July
4 Resort Hotel 0 13 2015 July
5 Resort Hotel 0 14 2015 July
6 Resort Hotel 0 14 2015 July
# … with 27 more variables: arrival_date_week_number <dbl>,
# arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
# stays_in_week_nights <dbl>, adults <dbl>, children <dbl>,
# babies <dbl>, meal <chr>, country <chr>, market_segment <chr>,
# distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>,
# previous_bookings_not_canceled <dbl>, reserved_room_type <chr>, …
To find out how many rows are in the dataset, I use dim(). I also use colnames() to see which variables are recorded for each observation.
dim(hotel_data)
[1] 119390 32
colnames(hotel_data)
[1] "hotel" "is_canceled"
[3] "lead_time" "arrival_date_year"
[5] "arrival_date_month" "arrival_date_week_number"
[7] "arrival_date_day_of_month" "stays_in_weekend_nights"
[9] "stays_in_week_nights" "adults"
[11] "children" "babies"
[13] "meal" "country"
[15] "market_segment" "distribution_channel"
[17] "is_repeated_guest" "previous_cancellations"
[19] "previous_bookings_not_canceled" "reserved_room_type"
[21] "assigned_room_type" "booking_changes"
[23] "deposit_type" "agent"
[25] "company" "days_in_waiting_list"
[27] "customer_type" "adr"
[29] "required_car_parking_spaces" "total_of_special_requests"
[31] "reservation_status" "reservation_status_date"
The result from dim() means that there are 119,390 rows and 32 columns. This matters because the number of rows tells us how many hotel bookings are in the dataset, and the number of columns tells us how many variables are recorded for each booking.
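If only one of the two numbers is needed, nrow() and ncol() return them separately. The sketch below uses a small hypothetical tibble, toy_bookings, in place of hotel_data.

```r
library(dplyr)

# Hypothetical toy data standing in for hotel_data.
toy_bookings <- tibble(
  hotel = c("Resort Hotel", "City Hotel", "City Hotel"),
  adults = c(2, 1, 2),
  children = c(0, 0, 1)
)

dims <- dim(toy_bookings)          # c(rows, columns)
n_bookings <- nrow(toy_bookings)   # same as dims[1]
n_variables <- ncol(toy_bookings)  # same as dims[2]
```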
The colnames function is helpful for this next step, where I select columns. Here, I select the year of arrival and the month.
select(hotel_data, arrival_date_year, arrival_date_month)
# A tibble: 119,390 × 2
arrival_date_year arrival_date_month
<dbl> <chr>
1 2015 July
2 2015 July
3 2015 July
4 2015 July
5 2015 July
6 2015 July
7 2015 July
8 2015 July
9 2015 July
10 2015 July
# … with 119,380 more rows
I can do more with select(), such as selecting all columns that start with ‘arrival_date’ to get clearer information about when each booking takes place.
select(hotel_data, starts_with("arrival_date"))
# A tibble: 119,390 × 4
arrival_date_ye… arrival_date_mo… arrival_date_we… arrival_date_da…
<dbl> <chr> <dbl> <dbl>
1 2015 July 27 1
2 2015 July 27 1
3 2015 July 27 1
4 2015 July 27 1
5 2015 July 27 1
6 2015 July 27 1
7 2015 July 27 1
8 2015 July 27 1
9 2015 July 27 1
10 2015 July 27 1
# … with 119,380 more rows
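starts_with() is one of several selection helpers. As a minimal sketch on a hypothetical toy tibble (not the real hotel_data), contains() matches a substring anywhere in a column name, and a leading minus drops the matched columns instead of keeping them.

```r
library(dplyr)

# Hypothetical toy data with a few of the real column names.
toy_bookings <- tibble(
  hotel = c("Resort Hotel", "City Hotel"),
  arrival_date_year = c(2015, 2016),
  arrival_date_month = c("July", "April"),
  stays_in_weekend_nights = c(0, 2),
  stays_in_week_nights = c(3, 5)
)

# contains() keeps every column whose name includes "nights".
nights <- select(toy_bookings, contains("nights"))

# A leading minus removes the columns that start with "arrival_date".
no_dates <- select(toy_bookings, -starts_with("arrival_date"))
```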
If I want to look at all the bookings where there are no children, I use filter() as follows:
filter(hotel_data, children == 0)
# A tibble: 110,796 × 32
hotel is_canceled lead_time arrival_date_ye… arrival_date_mo…
<chr> <dbl> <dbl> <dbl> <chr>
1 Resort Hotel 0 342 2015 July
2 Resort Hotel 0 737 2015 July
3 Resort Hotel 0 7 2015 July
4 Resort Hotel 0 13 2015 July
5 Resort Hotel 0 14 2015 July
6 Resort Hotel 0 14 2015 July
7 Resort Hotel 0 0 2015 July
8 Resort Hotel 0 9 2015 July
9 Resort Hotel 1 85 2015 July
10 Resort Hotel 1 75 2015 July
# … with 110,786 more rows, and 27 more variables:
# arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
# stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>,
# adults <dbl>, children <dbl>, babies <dbl>, meal <chr>,
# country <chr>, market_segment <chr>, distribution_channel <chr>,
# is_repeated_guest <dbl>, previous_cancellations <dbl>,
# previous_bookings_not_canceled <dbl>, reserved_room_type <chr>, …
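Two details of filter() are worth keeping in mind here. Conditions separated by commas are combined with AND, and rows where a condition evaluates to NA are dropped rather than kept, so any bookings with a missing children value would not appear in the result above. The sketch below shows both on a hypothetical toy tibble; whether the real children column contains missing values is something to check, not something this example establishes.

```r
library(dplyr)

# Hypothetical toy data; the NA mimics a booking with no recorded child count.
toy_bookings <- tibble(
  hotel = c("Resort Hotel", "City Hotel", "City Hotel", "City Hotel"),
  adults = c(2, 2, 1, 2),
  children = c(0, 1, NA, 0)
)

# The row where `children == 0` is NA is dropped, not kept.
no_children <- filter(toy_bookings, children == 0)

# Comma-separated conditions must all be TRUE for a row to be kept.
resort_no_children <- filter(toy_bookings, children == 0, hotel == "Resort Hotel")

# Count the rows that a comparison on `children` can never match.
missing_children <- sum(is.na(toy_bookings$children))
```

Running sum(is.na(hotel_data$children)) on the real data would show how many bookings fall into that last category.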
To sort the rows by the values in a column, I use arrange(). In this example, I order the data by arrival month; because arrival_date_month is a character column, the months are sorted alphabetically rather than in calendar order.
arrange(hotel_data, arrival_date_month)
# A tibble: 119,390 × 32
hotel is_canceled lead_time arrival_date_ye… arrival_date_mo…
<chr> <dbl> <dbl> <dbl> <chr>
1 Resort Hotel 1 31 2016 April
2 Resort Hotel 0 0 2016 April
3 Resort Hotel 0 144 2016 April
4 Resort Hotel 0 144 2016 April
5 Resort Hotel 0 144 2016 April
6 Resort Hotel 0 163 2016 April
7 Resort Hotel 1 38 2016 April
8 Resort Hotel 0 175 2016 April
9 Resort Hotel 1 39 2016 April
10 Resort Hotel 1 32 2016 April
# … with 119,380 more rows, and 27 more variables:
# arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
# stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>,
# adults <dbl>, children <dbl>, babies <dbl>, meal <chr>,
# country <chr>, market_segment <chr>, distribution_channel <chr>,
# is_repeated_guest <dbl>, previous_cancellations <dbl>,
# previous_bookings_not_canceled <dbl>, reserved_room_type <chr>, …
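arrange() can also sort in descending order and by more than one column. A minimal sketch on a hypothetical toy tibble: desc() reverses the order of a column, and additional columns break ties in the ones before them.

```r
library(dplyr)

# Hypothetical toy data standing in for hotel_data.
toy_bookings <- tibble(
  arrival_date_year = c(2016, 2015, 2016, 2015),
  arrival_date_month = c("April", "July", "January", "August"),
  lead_time = c(31, 342, 109, 14)
)

# desc() puts the longest lead time first.
by_lead_time <- arrange(toy_bookings, desc(lead_time))

# Sort by year, then by month within each year (still alphabetical).
by_year_month <- arrange(toy_bookings, arrival_date_year, arrival_date_month)
```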
To put the months in calendar order instead, I create a numeric month variable with mutate() and case_when() and then arrange by that variable.
hotel_data <- hotel_data %>%
  mutate(arrival_date_month_num = case_when(
    arrival_date_month == "January" ~ 1,
    arrival_date_month == "February" ~ 2,
    arrival_date_month == "March" ~ 3,
    arrival_date_month == "April" ~ 4,
    arrival_date_month == "May" ~ 5,
    arrival_date_month == "June" ~ 6,
    arrival_date_month == "July" ~ 7,
    arrival_date_month == "August" ~ 8,
    arrival_date_month == "September" ~ 9,
    arrival_date_month == "October" ~ 10,
    arrival_date_month == "November" ~ 11,
    arrival_date_month == "December" ~ 12
  ))
arrange(hotel_data, arrival_date_month_num)
# A tibble: 119,390 × 33
hotel is_canceled lead_time arrival_date_ye… arrival_date_mo…
<chr> <dbl> <dbl> <dbl> <chr>
1 Resort Hotel 0 109 2016 January
2 Resort Hotel 0 109 2016 January
3 Resort Hotel 1 2 2016 January
4 Resort Hotel 0 88 2016 January
5 Resort Hotel 1 20 2016 January
6 Resort Hotel 1 76 2016 January
7 Resort Hotel 0 88 2016 January
8 Resort Hotel 1 113 2016 January
9 Resort Hotel 1 113 2016 January
10 Resort Hotel 1 113 2016 January
# … with 119,380 more rows, and 28 more variables:
# arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
# stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>,
# adults <dbl>, children <dbl>, babies <dbl>, meal <chr>,
# country <chr>, market_segment <chr>, distribution_channel <chr>,
# is_repeated_guest <dbl>, previous_cancellations <dbl>,
# previous_bookings_not_canceled <dbl>, reserved_room_type <chr>, …
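A shorter alternative to twelve case_when() branches, offered here as a sketch rather than as the approach used above, is to look each month up in R's built-in month.name vector with match(). The toy tibble is hypothetical; the column names follow the ones in hotel_data.

```r
library(dplyr)

# Hypothetical toy data standing in for hotel_data.
toy_bookings <- tibble(
  arrival_date_month = c("July", "April", "December", "January"),
  lead_time = c(342, 31, 20, 109)
)

# month.name is a built-in vector: "January", "February", ..., "December".
# match() returns each month's position in it, i.e. a number from 1 to 12.
toy_bookings <- toy_bookings %>%
  mutate(arrival_date_month_num = match(arrival_date_month, month.name))

in_calendar_order <- arrange(toy_bookings, arrival_date_month_num)
```

One difference to be aware of: match() returns integers, whereas the case_when() version produces doubles. Both return NA for a month name that is not spelled exactly as in the lookup.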
If I wanted to know the average number of adults per booking, I would use summarise() with mean() on the adults column.
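A minimal sketch of such a summarise() call, run on a hypothetical toy tibble rather than the real hotel_data. The output column name mean_adults and the na.rm = TRUE argument are my own choices for illustration, and the grouped version is an optional extension.

```r
library(dplyr)

# Hypothetical toy data standing in for hotel_data.
toy_bookings <- tibble(
  hotel = c("Resort Hotel", "Resort Hotel", "City Hotel", "City Hotel"),
  adults = c(2, 1, 2, 3)
)

# One row, one column: the mean of `adults` across all bookings.
overall <- summarise(toy_bookings, mean_adults = mean(adults, na.rm = TRUE))

# Grouping first gives one mean per hotel type.
per_hotel <- toy_bookings %>%
  group_by(hotel) %>%
  summarise(mean_adults = mean(adults, na.rm = TRUE))
```

On the real data, the first call would become summarise(hotel_data, mean_adults = mean(adults, na.rm = TRUE)).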
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Bean (2021, Aug. 15). DACSS 601 August 2021: Zoe Hotel Bookings Dataset. Retrieved from https://mrolfe.github.io/DACSS601August2021/posts/2021-08-15-zoe-hotel-bookings-dataset/
BibTeX citation
@misc{bean2021zoe,
  author = {Bean, Zoe},
  title = {DACSS 601 August 2021: Zoe Hotel Bookings Dataset},
  url = {https://mrolfe.github.io/DACSS601August2021/posts/2021-08-15-zoe-hotel-bookings-dataset/},
  year = {2021}
}