DACSS 601 August 2021: Zoe Hotel Bookings Dataset

Zoe Bean

Homework 2

I will be processing a dataset of hotel bookings from 2015 to 2017. First, I import the dataset with read_csv(), which requires the tidyverse package to be loaded.

library(tidyverse)
hotel_data <- read_csv("../../_data/hotel_bookings.csv")

Next, I use head() to preview the first six rows of the dataset.

head(hotel_data)

# A tibble: 6 × 32
  hotel        is_canceled lead_time arrival_date_ye… arrival_date_mo…
  <chr>              <dbl>     <dbl>            <dbl> <chr>           
1 Resort Hotel           0       342             2015 July            
2 Resort Hotel           0       737             2015 July            
3 Resort Hotel           0         7             2015 July            
4 Resort Hotel           0        13             2015 July            
5 Resort Hotel           0        14             2015 July            
6 Resort Hotel           0        14             2015 July            
# … with 27 more variables: arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
#   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>,
#   babies <dbl>, meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>,
#   previous_bookings_not_canceled <dbl>, reserved_room_type <chr>, …

To find out how many rows and columns the dataset has, I use dim(). I also use colnames() to see which variables are recorded for each observation.

dim(hotel_data)

[1] 119390     32

colnames(hotel_data)

 [1] "hotel"                          "is_canceled"                   
 [3] "lead_time"                      "arrival_date_year"             
 [5] "arrival_date_month"             "arrival_date_week_number"      
 [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
 [9] "stays_in_week_nights"           "adults"                        
[11] "children"                       "babies"                        
[13] "meal"                           "country"                       
[15] "market_segment"                 "distribution_channel"          
[17] "is_repeated_guest"              "previous_cancellations"        
[19] "previous_bookings_not_canceled" "reserved_room_type"            
[21] "assigned_room_type"             "booking_changes"               
[23] "deposit_type"                   "agent"                         
[25] "company"                        "days_in_waiting_list"          
[27] "customer_type"                  "adr"                           
[29] "required_car_parking_spaces"    "total_of_special_requests"     
[31] "reservation_status"             "reservation_status_date"

The result from dim() means that there are 119,390 rows and 32 columns. The number of rows tells us how many hotel bookings the dataset contains, and the number of columns tells us how many variables are recorded for each booking.
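
When only one of the two numbers is needed, nrow() and ncol() return them separately. The sketch below uses a small made-up tibble, toy_bookings (a hypothetical stand-in, not the real hotel data), so that it runs without the CSV file.

```r
library(dplyr)

# Hypothetical toy data standing in for hotel_data
toy_bookings <- tibble(
  hotel = c("Resort Hotel", "City Hotel", "City Hotel"),
  adults = c(2, 1, 2)
)

dims <- dim(toy_bookings)           # rows first, then columns
n_bookings <- nrow(toy_bookings)    # number of rows only
n_variables <- ncol(toy_bookings)   # number of columns only
```

On the real data, nrow(hotel_data) and ncol(hotel_data) should give the same two numbers that dim() reports.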

Homework 3

The colnames() output is helpful for the next step, where I use select() to pick columns. Here, I select the arrival year and the arrival month.

select(hotel_data, arrival_date_year, arrival_date_month)

# A tibble: 119,390 × 2
   arrival_date_year arrival_date_month
               <dbl> <chr>             
 1              2015 July              
 2              2015 July              
 3              2015 July              
 4              2015 July              
 5              2015 July              
 6              2015 July              
 7              2015 July              
 8              2015 July              
 9              2015 July              
10              2015 July              
# … with 119,380 more rows

I can do more with select(), such as selecting all columns that start with ‘arrival_date’ to get clearer information about when each booking is.

select(hotel_data, starts_with("arrival_date"))

# A tibble: 119,390 × 4
   arrival_date_ye… arrival_date_mo… arrival_date_we… arrival_date_da…
              <dbl> <chr>                       <dbl>            <dbl>
 1             2015 July                           27                1
 2             2015 July                           27                1
 3             2015 July                           27                1
 4             2015 July                           27                1
 5             2015 July                           27                1
 6             2015 July                           27                1
 7             2015 July                           27                1
 8             2015 July                           27                1
 9             2015 July                           27                1
10             2015 July                           27                1
# … with 119,380 more rows
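
starts_with() is one of several selection helpers; contains() matches a string anywhere in a column name. A minimal sketch, assuming a made-up tibble toy_bookings that borrows four column names from the dataset (the values are invented for illustration):

```r
library(dplyr)

# Hypothetical toy data; only the column names come from the dataset
toy_bookings <- tibble(
  arrival_date_year = c(2015, 2016),
  arrival_date_month = c("July", "April"),
  reservation_status_date = c("2015-07-01", "2016-04-03"),
  adults = c(2, 1)
)

# Columns whose names begin with "arrival_date"
arrival_cols <- select(toy_bookings, starts_with("arrival_date"))

# Columns whose names contain "date" anywhere
date_cols <- select(toy_bookings, contains("date"))

names(arrival_cols)
names(date_cols)
```

Here contains("date") also picks up reservation_status_date, which starts_with("arrival_date") leaves out.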

If I want to look at all the bookings where there are no children, I use filter() as follows:

filter(hotel_data, children == 0)

# A tibble: 110,796 × 32
   hotel        is_canceled lead_time arrival_date_ye… arrival_date_mo…
   <chr>              <dbl>     <dbl>            <dbl> <chr>           
 1 Resort Hotel           0       342             2015 July            
 2 Resort Hotel           0       737             2015 July            
 3 Resort Hotel           0         7             2015 July            
 4 Resort Hotel           0        13             2015 July            
 5 Resort Hotel           0        14             2015 July            
 6 Resort Hotel           0        14             2015 July            
 7 Resort Hotel           0         0             2015 July            
 8 Resort Hotel           0         9             2015 July            
 9 Resort Hotel           1        85             2015 July            
10 Resort Hotel           1        75             2015 July            
# … with 110,786 more rows, and 27 more variables:
#   arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
#   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>,
#   adults <dbl>, children <dbl>, babies <dbl>, meal <chr>,
#   country <chr>, market_segment <chr>, distribution_channel <chr>,
#   is_repeated_guest <dbl>, previous_cancellations <dbl>,
#   previous_bookings_not_canceled <dbl>, reserved_room_type <chr>, …
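
One detail worth keeping in mind: filter() keeps only rows where the condition is TRUE, so rows where children is missing (NA) are dropped along with rows that have children. The sketch below illustrates this on a made-up tibble, toy_bookings (hypothetical values, not taken from the dataset).

```r
library(dplyr)

# Hypothetical toy data with one missing value in children
toy_bookings <- tibble(
  adults = c(2, 2, 1, 2),
  children = c(0, 2, NA, 0)
)

# NA == 0 evaluates to NA, so the row with a missing value is dropped
no_children <- filter(toy_bookings, children == 0)

# Rows where children is missing, in case they need a separate look
missing_children <- filter(toy_bookings, is.na(children))

# Keep the missing rows explicitly alongside the zero rows
no_children_or_missing <- filter(toy_bookings, children == 0 | is.na(children))
```

Whether treating a missing value as "no children" is reasonable depends on how the data were collected, so the last variant is an option rather than a recommendation.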

To sort the rows by the values in a column, I use arrange(). In this example, I order the data by arrival month; because that column is a character vector, it is sorted alphabetically (April first) rather than in calendar order.

arrange(hotel_data, arrival_date_month)

# A tibble: 119,390 × 32
   hotel        is_canceled lead_time arrival_date_ye… arrival_date_mo…
   <chr>              <dbl>     <dbl>            <dbl> <chr>           
 1 Resort Hotel           1        31             2016 April           
 2 Resort Hotel           0         0             2016 April           
 3 Resort Hotel           0       144             2016 April           
 4 Resort Hotel           0       144             2016 April           
 5 Resort Hotel           0       144             2016 April           
 6 Resort Hotel           0       163             2016 April           
 7 Resort Hotel           1        38             2016 April           
 8 Resort Hotel           0       175             2016 April           
 9 Resort Hotel           1        39             2016 April           
10 Resort Hotel           1        32             2016 April           
# … with 119,380 more rows, and 27 more variables:
#   arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
#   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>,
#   adults <dbl>, children <dbl>, babies <dbl>, meal <chr>,
#   country <chr>, market_segment <chr>, distribution_channel <chr>,
#   is_repeated_guest <dbl>, previous_cancellations <dbl>,
#   previous_bookings_not_canceled <dbl>, reserved_room_type <chr>, …

To put the months in calendar order instead, I create a numeric month variable in the dataset using mutate() and case_when() and then arrange by that variable.

hotel_data <- hotel_data %>%
  mutate(arrival_date_month_num = case_when(
    arrival_date_month == "January" ~ 1,
    arrival_date_month == "February" ~ 2,
    arrival_date_month == "March" ~ 3,
    arrival_date_month == "April" ~ 4,
    arrival_date_month == "May" ~ 5,
    arrival_date_month == "June" ~ 6,
    arrival_date_month == "July" ~ 7,
    arrival_date_month == "August" ~ 8,
    arrival_date_month == "September" ~ 9,
    arrival_date_month == "October" ~ 10,
    arrival_date_month == "November" ~ 11,
    arrival_date_month == "December" ~ 12
  ))

arrange(hotel_data, arrival_date_month_num)

# A tibble: 119,390 × 33
   hotel        is_canceled lead_time arrival_date_ye… arrival_date_mo…
   <chr>              <dbl>     <dbl>            <dbl> <chr>           
 1 Resort Hotel           0       109             2016 January         
 2 Resort Hotel           0       109             2016 January         
 3 Resort Hotel           1         2             2016 January         
 4 Resort Hotel           0        88             2016 January         
 5 Resort Hotel           1        20             2016 January         
 6 Resort Hotel           1        76             2016 January         
 7 Resort Hotel           0        88             2016 January         
 8 Resort Hotel           1       113             2016 January         
 9 Resort Hotel           1       113             2016 January         
10 Resort Hotel           1       113             2016 January         
# … with 119,380 more rows, and 28 more variables:
#   arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
#   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>,
#   adults <dbl>, children <dbl>, babies <dbl>, meal <chr>,
#   country <chr>, market_segment <chr>, distribution_channel <chr>,
#   is_repeated_guest <dbl>, previous_cancellations <dbl>,
#   previous_bookings_not_canceled <dbl>, reserved_room_type <chr>, …
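
A shorter route to the same month numbers is to look each name up in R's built-in month.name vector with match(). This is an alternative to the case_when() approach, not a replacement for it; the sketch assumes the month names are spelled out in full English, and toy_bookings is a made-up tibble used so the example runs on its own.

```r
library(dplyr)

# Hypothetical toy data with months in no particular order
toy_bookings <- tibble(
  arrival_date_month = c("July", "April", "January", "December")
)

# match() returns the position of each name in month.name (1 to 12)
toy_sorted <- toy_bookings %>%
  mutate(arrival_date_month_num = match(arrival_date_month, month.name)) %>%
  arrange(arrival_date_month_num)

# Alternative without a helper column: sort by a factor whose levels
# are in calendar order
toy_sorted_factor <- arrange(
  toy_bookings,
  factor(arrival_date_month, levels = month.name)
)
```

A name that does not appear in month.name (for example, an abbreviation such as "Jan") would get NA from match(), so it is worth checking for NA values after the lookup.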

If I wanted to know the average number of adults per booking, I would use summarise() like so:

summarise(hotel_data, mean = mean(adults))

# A tibble: 1 × 1
   mean
  <dbl>
1  1.86
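
summarise() can also produce one mean per group when combined with group_by(). The sketch below uses a made-up tibble, toy_bookings (hypothetical values), and adds na.rm = TRUE, since mean() returns NA if any value in the column is missing.

```r
library(dplyr)

# Hypothetical toy data with one missing value in adults
toy_bookings <- tibble(
  hotel = c("Resort Hotel", "Resort Hotel", "City Hotel", "City Hotel"),
  adults = c(2, 1, 2, NA)
)

# Without na.rm = TRUE, a single missing value makes the mean NA
overall <- summarise(toy_bookings, mean_adults = mean(adults))

# One row per hotel type, ignoring missing values
by_hotel <- toy_bookings %>%
  group_by(hotel) %>%
  summarise(mean_adults = mean(adults, na.rm = TRUE))
```

Naming the result mean_adults rather than mean is a small design choice that keeps the output column from sharing a name with the function.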
