hw2
hotel_bookings.csv
Author

Cristhian Barba Garzon

Published

January 3, 2023

Code
#install.packages("dyplr")
library(tidyverse)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE)

Reading in and Describing Data

Reading in this data tells us many things about hotel reservations. It provides about 32 different variables relating to how many people reserve hotels, when they reserve their hotels, when they check-out, if they had previous cancellations and much more information. We can also see, through R functions, that this data was taken from the years 2015, 2016, and 2017. Additionally, the data was recorded from all 12 months in each year. We can see that there are about 119,390 observations, which is a lot of information to use.

Code
data = read_csv("_data/hotel_bookings.csv")
Rows: 119390 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
dbl  (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
date  (1): reservation_status_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
data %>%
  view()

data %>%
  dim() # 119390 rows by 32 columns
[1] 119390     32
Code
data %>% 
  colnames() #displays column names
 [1] "hotel"                          "is_canceled"                   
 [3] "lead_time"                      "arrival_date_year"             
 [5] "arrival_date_month"             "arrival_date_week_number"      
 [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
 [9] "stays_in_week_nights"           "adults"                        
[11] "children"                       "babies"                        
[13] "meal"                           "country"                       
[15] "market_segment"                 "distribution_channel"          
[17] "is_repeated_guest"              "previous_cancellations"        
[19] "previous_bookings_not_canceled" "reserved_room_type"            
[21] "assigned_room_type"             "booking_changes"               
[23] "deposit_type"                   "agent"                         
[25] "company"                        "days_in_waiting_list"          
[27] "customer_type"                  "adr"                           
[29] "required_car_parking_spaces"    "total_of_special_requests"     
[31] "reservation_status"             "reservation_status_date"       
Code
data %>%
  distinct(arrival_date_year) %>% #chooses distinct value of the years (only recorded over 3 years)
  arrange(desc(arrival_date_year)) #sets most recent year as most recent to view
# A tibble: 3 × 1
  arrival_date_year
              <dbl>
1              2017
2              2016
3              2015
Code
data %>% 
  distinct(arrival_date_month) #recorded over 12 months in each year
# A tibble: 12 × 1
   arrival_date_month
   <chr>             
 1 July              
 2 August            
 3 September         
 4 October           
 5 November          
 6 December          
 7 January           
 8 February          
 9 March             
10 April             
11 May               
12 June              
Code
data %>% 
  distinct(hotel)
# A tibble: 2 × 1
  hotel       
  <chr>       
1 Resort Hotel
2 City Hotel  

Tidying Data

Most, if not all, of this data is already in a tidy format. All variables have their own respective columns and values within them, and all observations have their own rows. The data is very large, but that is because there are many different variables to examine. With this in mind, we can make this data set look more presentable by mutating columns; we can reduce the amount of columns there are by combining values–specifically with the columns that have dates. We can combine the day, month, and year columns to create one singular date, further arranging our data into a better format to view. Using the case_when() function within mutate(), the string values in the month column can be converted to their ordered values; since the values in the day and year columns are already numerical, we do not have to change them. By creating an empty vector and using the sprintf() function, we can use a for loop to iterate through the length of the columns and format the values in dd/mm/yyyy. The formatted values can be appended to the empty vector, which then allows us to add the vector to the data set. After this is done, we can remove the used columns using the subset() and select() functions, and we can move our new column over using the relocate() function. Using the mutate() function again, we can change values in the columns “is_canceled” and “is_repeated_guest” to be string values for “yes” and “no”; this can be done to allow the viewer to see what data is being recorded rather than inferring what zeros and ones mean. We can also change the names of some columns, using the rename() function, to appear more viewer-friendly. Using the relocate() function, we can also move columns around to make the data set easier to read and compare column values.

Code
mutated.data = data %>%
  mutate(arrival_date_month = case_when(
    arrival_date_month == "January" ~ 1,
    arrival_date_month == "February" ~ 2,
    arrival_date_month == "March" ~ 3,
    arrival_date_month == "April" ~ 4,
    arrival_date_month == "May" ~ 5 ,
    arrival_date_month == "June" ~ 6,
    arrival_date_month == "July" ~ 7,
    arrival_date_month == "August" ~ 8,
    arrival_date_month == "September" ~ 9,
    arrival_date_month == "October" ~ 10,
    arrival_date_month == "November" ~ 11,
    arrival_date_month == "December" ~ 12
  ))

mutated.data %>%
  arrange(arrival_date_month)
# A tibble: 119,390 × 32
   hotel  is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
   <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 Resor…       0     109    2016       1       1       1       0       1      2
 2 Resor…       0     109    2016       1       1       1       0       1      2
 3 Resor…       1       2    2016       1       1       1       0       1      2
 4 Resor…       0      88    2016       1       1       1       0       2      2
 5 Resor…       1      20    2016       1       1       1       0       2      2
 6 Resor…       1      76    2016       1       1       1       1       2      2
 7 Resor…       0      88    2016       1       1       1       0       2      2
 8 Resor…       1     113    2016       1       1       2       0       1      2
 9 Resor…       1     113    2016       1       1       2       0       1      2
10 Resor…       1     113    2016       1       1       2       0       1      2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
Code
arrival_date = c() # Create an empty vector to store the dates

for (i in 1:length(mutated.data$arrival_date_year)) { #for loop -- in range 1 to total length of Year column
  date_string = sprintf("%02d/%02d/%04d",mutated.data$arrival_date_day_of_month[i], mutated.data$arrival_date_month[i], mutated.data$arrival_date_year[i]) # Format as a string
  arrival_date = c(arrival_date, date_string)  # Append the date string to the vector
}

new = mutated.data %>%
   mutate(Arrival_date = arrival_date, is_canceled = case_when(
     is_canceled == 0 ~ "no",
     is_canceled == 1 ~ "yes",
     TRUE ~ "Not Available"
   ), is_repeated_guest = case_when(
     is_repeated_guest == 0 ~ "no",
     is_repeated_guest == 1 ~ "yes",
     TRUE ~ "Not Available"
   )) %>% #adds new column Date with the vector dates from for-loop
  subset(select = -c(arrival_date_year,arrival_date_month,arrival_date_day_of_month)) %>% #drops Year and Month columns so new date column will replace
  relocate(Arrival_date,.before = 2, .after = NULL) %>%
  relocate(reservation_status_date, .before = 2, .after = NULL) %>% #moves the column to the left hand side 
  relocate(reservation_status, .before = 3, .after = NULL) %>%
  rename("Arrival Week #"="arrival_date_week_number","Weekend Nights"= "stays_in_weekend_nights","Week Nights"= "stays_in_week_nights")

new
# A tibble: 119,390 × 30
   hotel      reservat…¹ reser…² Arriv…³ is_ca…⁴ lead_…⁵ Arriv…⁶ Weeke…⁷ Week …⁸
   <chr>      <date>     <chr>   <chr>   <chr>     <dbl>   <dbl>   <dbl>   <dbl>
 1 Resort Ho… 2015-07-01 Check-… 01/07/… no          342      27       0       0
 2 Resort Ho… 2015-07-01 Check-… 01/07/… no          737      27       0       0
 3 Resort Ho… 2015-07-02 Check-… 01/07/… no            7      27       0       1
 4 Resort Ho… 2015-07-02 Check-… 01/07/… no           13      27       0       1
 5 Resort Ho… 2015-07-03 Check-… 01/07/… no           14      27       0       2
 6 Resort Ho… 2015-07-03 Check-… 01/07/… no           14      27       0       2
 7 Resort Ho… 2015-07-03 Check-… 01/07/… no            0      27       0       2
 8 Resort Ho… 2015-07-03 Check-… 01/07/… no            9      27       0       2
 9 Resort Ho… 2015-05-06 Cancel… 01/07/… yes          85      27       0       3
10 Resort Ho… 2015-04-22 Cancel… 01/07/… yes          75      27       0       3
# … with 119,380 more rows, 21 more variables: adults <dbl>, children <dbl>,
#   babies <dbl>, meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <chr>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …

Research Questions for Data

This data can be useful for many research questions. It can help answer questions that compare the amount of adults and children in reservations, which in turn could indicate if families or couples travel to hotels more often than others. We can also look at the types of hotels and see if people canceled their reservations, and this can lead to some hypotheses regarding people’s preferences. Using statistical summary functions can also help answer how many nights people stay during the weekend or week; these functions can also help create visualizations to better see any trends in the data. Two example questions that we can answer are the following:

  1. What is the most amount of adults in a reservation?
  • We can answer this question by using R functions to select the specific column that indicates number of adults in a reservation. This column would be the “adults” column, and we would use the max() function within the summarize() function to find the maximum amount of adults.
  1. Provide a statistical summary of the number of adults, children, and babies.
  • We can use the summary() function to begin summarizing this data. Functions like max(), mean(), median(), and sd() are all examples that we can use to summarize our data–specifically any columns we like. These functions can help us understand the average amount of people in a reservation, the maximum amount, and the deviation of amount of people. This information can tell us about large and small reservations, families and couples, or single people who possibly travel for work.