challenge_4
Shaunak Padhye
hotel_bookings
More data wrangling: pivoting
Author

Shaunak Padhye

Published

May 13, 2023

Setup

library(tidyverse)
library(lubridate)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations

Read in data

We will be using the following dataset for this challenge:

  • hotel_bookings.csv⭐⭐⭐⭐
Code
bookings <- read.csv("_data/hotel_bookings.csv")
head(bookings)
         hotel is_canceled lead_time arrival_date_year arrival_date_month
1 Resort Hotel           0       342              2015               July
2 Resort Hotel           0       737              2015               July
3 Resort Hotel           0         7              2015               July
4 Resort Hotel           0        13              2015               July
5 Resort Hotel           0        14              2015               July
6 Resort Hotel           0        14              2015               July
  arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
1                       27                         1                       0
2                       27                         1                       0
3                       27                         1                       0
4                       27                         1                       0
5                       27                         1                       0
6                       27                         1                       0
  stays_in_week_nights adults children babies meal country market_segment
1                    0      2        0      0   BB     PRT         Direct
2                    0      2        0      0   BB     PRT         Direct
3                    1      1        0      0   BB     GBR         Direct
4                    1      1        0      0   BB     GBR      Corporate
5                    2      2        0      0   BB     GBR      Online TA
6                    2      2        0      0   BB     GBR      Online TA
  distribution_channel is_repeated_guest previous_cancellations
1               Direct                 0                      0
2               Direct                 0                      0
3               Direct                 0                      0
4            Corporate                 0                      0
5                TA/TO                 0                      0
6                TA/TO                 0                      0
  previous_bookings_not_canceled reserved_room_type assigned_room_type
1                              0                  C                  C
2                              0                  C                  C
3                              0                  A                  C
4                              0                  A                  A
5                              0                  A                  A
6                              0                  A                  A
  booking_changes deposit_type agent company days_in_waiting_list customer_type
1               3   No Deposit  NULL    NULL                    0     Transient
2               4   No Deposit  NULL    NULL                    0     Transient
3               0   No Deposit  NULL    NULL                    0     Transient
4               0   No Deposit   304    NULL                    0     Transient
5               0   No Deposit   240    NULL                    0     Transient
6               0   No Deposit   240    NULL                    0     Transient
  adr required_car_parking_spaces total_of_special_requests reservation_status
1   0                           0                         0          Check-Out
2   0                           0                         0          Check-Out
3  75                           0                         0          Check-Out
4  75                           0                         0          Check-Out
5  98                           0                         1          Check-Out
6  98                           0                         1          Check-Out
  reservation_status_date
1              2015-07-01
2              2015-07-01
3              2015-07-02
4              2015-07-02
5              2015-07-03
6              2015-07-03

Briefly describe the data

Code
dim(bookings)
[1] 119390     32
Code
colnames(bookings)
 [1] "hotel"                          "is_canceled"                   
 [3] "lead_time"                      "arrival_date_year"             
 [5] "arrival_date_month"             "arrival_date_week_number"      
 [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
 [9] "stays_in_week_nights"           "adults"                        
[11] "children"                       "babies"                        
[13] "meal"                           "country"                       
[15] "market_segment"                 "distribution_channel"          
[17] "is_repeated_guest"              "previous_cancellations"        
[19] "previous_bookings_not_canceled" "reserved_room_type"            
[21] "assigned_room_type"             "booking_changes"               
[23] "deposit_type"                   "agent"                         
[25] "company"                        "days_in_waiting_list"          
[27] "customer_type"                  "adr"                           
[29] "required_car_parking_spaces"    "total_of_special_requests"     
[31] "reservation_status"             "reservation_status_date"       

This dataset contains the booking details for a number of hotels. This is a very comprehensive dataset having 32 variables. It covers various details such as the details of the booking party (adults, children, babies, meal, country, etc), details of the hotel(hotel, reserved room type, assigned room type, etc) and details of the booking itself(arrival date details, reservation status details, etc).

Tidy Data (as needed)

We can do some simple checks to see if the dataset is tidy. Lets check if there are any duplicate rows or null values in the dataset

print(anyDuplicated(bookings) == 0)
[1] FALSE
print(sum(is.na(bookings)) == 0)
[1] FALSE

We can see that there are no duplicate rows or null rows. The dataset is tidy.

Identify variables that need to be mutated

We can see that the arrival details have been split into 4 columns:

Code
grep("arrival",names(bookings), value = TRUE)
[1] "arrival_date_year"         "arrival_date_month"       
[3] "arrival_date_week_number"  "arrival_date_day_of_month"

Instead we can convert this into a single variable which will hold the arrival date in a standard format. In this case we will convert it to the same format as “reservation_status_date” variable (YYYY-MM-DD).

Code
bookings$date_str <- paste(bookings$arrival_date_year, bookings$arrival_date_month, bookings$arrival_date_day_of_month, sep = "-")

bookings$arrival_date <- as.Date(bookings$date_str, format = "%Y-%B-%d")

bookings <- bookings[, !(names(bookings) %in% "date_str")]

head(select(bookings, arrival_date, reservation_status_date))
  arrival_date reservation_status_date
1   2015-07-01              2015-07-01
2   2015-07-01              2015-07-01
3   2015-07-01              2015-07-02
4   2015-07-01              2015-07-02
5   2015-07-01              2015-07-03
6   2015-07-01              2015-07-03

Now we can remove the original 4 arrival variables:

Code
bookings <- select(bookings, -arrival_date_year, -arrival_date_month, -arrival_date_week_number, -arrival_date_day_of_month)
colnames(bookings)
 [1] "hotel"                          "is_canceled"                   
 [3] "lead_time"                      "stays_in_weekend_nights"       
 [5] "stays_in_week_nights"           "adults"                        
 [7] "children"                       "babies"                        
 [9] "meal"                           "country"                       
[11] "market_segment"                 "distribution_channel"          
[13] "is_repeated_guest"              "previous_cancellations"        
[15] "previous_bookings_not_canceled" "reserved_room_type"            
[17] "assigned_room_type"             "booking_changes"               
[19] "deposit_type"                   "agent"                         
[21] "company"                        "days_in_waiting_list"          
[23] "customer_type"                  "adr"                           
[25] "required_car_parking_spaces"    "total_of_special_requests"     
[27] "reservation_status"             "reservation_status_date"       
[29] "arrival_date"                  

The lead_time variable tells us the number of days between the reservation date and the arrival date. Therefore, using the lead_time variable, we can add a variable to store the day the reservation was made.

Code
bookings$reservation_booking_date <- bookings$arrival_date - days(bookings$lead_time)
head(select(bookings, lead_time, arrival_date, reservation_booking_date))
  lead_time arrival_date reservation_booking_date
1       342   2015-07-01               2014-07-24
2       737   2015-07-01               2013-06-24
3         7   2015-07-01               2015-06-24
4        13   2015-07-01               2015-06-18
5        14   2015-07-01               2015-06-17
6        14   2015-07-01               2015-06-17

We can look at the final dataset after the mutations:

Code
colnames(bookings)
 [1] "hotel"                          "is_canceled"                   
 [3] "lead_time"                      "stays_in_weekend_nights"       
 [5] "stays_in_week_nights"           "adults"                        
 [7] "children"                       "babies"                        
 [9] "meal"                           "country"                       
[11] "market_segment"                 "distribution_channel"          
[13] "is_repeated_guest"              "previous_cancellations"        
[15] "previous_bookings_not_canceled" "reserved_room_type"            
[17] "assigned_room_type"             "booking_changes"               
[19] "deposit_type"                   "agent"                         
[21] "company"                        "days_in_waiting_list"          
[23] "customer_type"                  "adr"                           
[25] "required_car_parking_spaces"    "total_of_special_requests"     
[27] "reservation_status"             "reservation_status_date"       
[29] "arrival_date"                   "reservation_booking_date"      

With this our dataset has been mutated to a more convenient format for further analysis.