challenge_4
More data wrangling: mutate
Author

Henry Mitrano

Published

January 20, 2023

Code
library(tidyverse)
library(lubridate)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations

Read in data

Code
hotel_bookings = read_csv("_data/hotel_bookings.csv")
head(hotel_bookings)
# A tibble: 6 × 32
  hotel   is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
  <chr>     <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
1 Resort…       0     342    2015 July         27       1       0       0      2
2 Resort…       0     737    2015 July         27       1       0       0      2
3 Resort…       0       7    2015 July         27       1       0       1      1
4 Resort…       0      13    2015 July         27       1       0       1      1
5 Resort…       0      14    2015 July         27       1       0       2      2
6 Resort…       0      14    2015 July         27       1       0       2      2
# … with 22 more variables: children <dbl>, babies <dbl>, meal <chr>,
#   country <chr>, market_segment <chr>, distribution_channel <chr>,
#   is_repeated_guest <dbl>, previous_cancellations <dbl>,
#   previous_bookings_not_canceled <dbl>, reserved_room_type <chr>,
#   assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>,
#   agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …

Briefly describe the data

It appears this is some kind of data regarding hotels in different countries, and statistics about their bookings. These statistics include number of people in a reservation, whether they’re kids or adults, how their room was booked, whether they’ve been there before, etc. The data isn’t organized horribly, but some stuff definitely stands out as needing tidying.

Code
tidy_bookings <- 
  hotel_bookings %>%
      mutate(arrival=str_c(arrival_date_day_of_month, arrival_date_month,arrival_date_year, sep="/"), arrival=dmy(arrival)) 

tidy_bookings = select(tidy_bookings, select = -starts_with("arrival_date") )

Any additional comments?

Identify variables that need to be mutated

For starters, one of the more confusing things I found when looking through the data was that the arrival dates of the visits are kept in 3 separate columns (arrival date year, arrival date month, and arrival date day), while the checkout or “reservation status” date is stored in 1 column, normally. I’ll want to replicate that mm-dd-yyyy format in one column for arrivals. We can doing that by importing an R library at the top of the file, “lubridate”, that includes a bunch of date formatting tools.

Code
tidy_bookings<-transform(tidy_bookings,valued_customer=ifelse(is_repeated_guest >0 && previous_cancellations == 0, "yes", "no"))
head(tidy_bookings)
         hotel is_canceled lead_time stays_in_weekend_nights
1 Resort Hotel           0       342                       0
2 Resort Hotel           0       737                       0
3 Resort Hotel           0         7                       0
4 Resort Hotel           0        13                       0
5 Resort Hotel           0        14                       0
6 Resort Hotel           0        14                       0
  stays_in_week_nights adults children babies meal country market_segment
1                    0      2        0      0   BB     PRT         Direct
2                    0      2        0      0   BB     PRT         Direct
3                    1      1        0      0   BB     GBR         Direct
4                    1      1        0      0   BB     GBR      Corporate
5                    2      2        0      0   BB     GBR      Online TA
6                    2      2        0      0   BB     GBR      Online TA
  distribution_channel is_repeated_guest previous_cancellations
1               Direct                 0                      0
2               Direct                 0                      0
3               Direct                 0                      0
4            Corporate                 0                      0
5                TA/TO                 0                      0
6                TA/TO                 0                      0
  previous_bookings_not_canceled reserved_room_type assigned_room_type
1                              0                  C                  C
2                              0                  C                  C
3                              0                  A                  C
4                              0                  A                  A
5                              0                  A                  A
6                              0                  A                  A
  booking_changes deposit_type agent company days_in_waiting_list customer_type
1               3   No Deposit  NULL    NULL                    0     Transient
2               4   No Deposit  NULL    NULL                    0     Transient
3               0   No Deposit  NULL    NULL                    0     Transient
4               0   No Deposit   304    NULL                    0     Transient
5               0   No Deposit   240    NULL                    0     Transient
6               0   No Deposit   240    NULL                    0     Transient
  adr required_car_parking_spaces total_of_special_requests reservation_status
1   0                           0                         0          Check-Out
2   0                           0                         0          Check-Out
3  75                           0                         0          Check-Out
4  75                           0                         0          Check-Out
5  98                           0                         1          Check-Out
6  98                           0                         1          Check-Out
  reservation_status_date    arrival valued_customer
1              2015-07-01 2015-07-01              no
2              2015-07-01 2015-07-01              no
3              2015-07-02 2015-07-01              no
4              2015-07-02 2015-07-01              no
5              2015-07-03 2015-07-01              no
6              2015-07-03 2015-07-01              no

Additionally, I’ve added a column at the end that distingushes certain booking as important guests, based on historical data. If a customer has been booked at the hotel before and is a repeat visitor, and has never canceled a booking, they are listed as “yes” under the “valued_customer” column at the end of the data.