Challenge 4: Mutating hotel bookings dataset

More data wrangling: mutate
Author

Saksham Kumar

Published

April 12, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations

Read in data

We read in the hotel_bookings.csv dataset, using the correct R package and command.

Code
bookings<-read_csv("_data/hotel_bookings.csv", show_col_types = FALSE)
bookings
ABCDEFGHIJ0123456789
hotel
<chr>
is_canceled
<dbl>
lead_time
<dbl>
arrival_date_year
<dbl>
arrival_date_month
<chr>
arrival_date_week_number
<dbl>
arrival_date_day_of_month
<dbl>
Resort Hotel03422015July271
Resort Hotel07372015July271
Resort Hotel072015July271
Resort Hotel0132015July271
Resort Hotel0142015July271
Resort Hotel0142015July271
Resort Hotel002015July271
Resort Hotel092015July271
Resort Hotel1852015July271
Resort Hotel1752015July271

Briefly describe the data

The data corresponds to hotel bookings. There are 119390 rows and 32 variables in our data. The 32 data types and their data types are mentioned below.

Code
spec(bookings)
cols(
  hotel = col_character(),
  is_canceled = col_double(),
  lead_time = col_double(),
  arrival_date_year = col_double(),
  arrival_date_month = col_character(),
  arrival_date_week_number = col_double(),
  arrival_date_day_of_month = col_double(),
  stays_in_weekend_nights = col_double(),
  stays_in_week_nights = col_double(),
  adults = col_double(),
  children = col_double(),
  babies = col_double(),
  meal = col_character(),
  country = col_character(),
  market_segment = col_character(),
  distribution_channel = col_character(),
  is_repeated_guest = col_double(),
  previous_cancellations = col_double(),
  previous_bookings_not_canceled = col_double(),
  reserved_room_type = col_character(),
  assigned_room_type = col_character(),
  booking_changes = col_double(),
  deposit_type = col_character(),
  agent = col_character(),
  company = col_character(),
  days_in_waiting_list = col_double(),
  customer_type = col_character(),
  adr = col_double(),
  required_car_parking_spaces = col_double(),
  total_of_special_requests = col_double(),
  reservation_status = col_character(),
  reservation_status_date = col_date(format = "")
)

Tidy Data (as needed)

The data looks clean for an initial analysis and does not need to be tidied.

Identify variables that need to be mutated

Looking at the data we see that we have 3 variables that can be coalesced into one - arrival_date_day_of_month, arrival_date_month, arrival_date_year.

Using mutate the first three fields can be combined into a single arrival date field.

Code
bookings_mutate_date_arrival<-bookings%>%
  mutate(date_arrival = str_c(arrival_date_day_of_month, arrival_date_month, arrival_date_year, sep="/"))

bookings_mutate_strToDate<-bookings_mutate_date_arrival%>%
  mutate(date_arrival = dmy(date_arrival))

bookings_mutate_strToDate[ , c("arrival_date_day_of_month", "arrival_date_month", "arrival_date_year", "date_arrival")]  
ABCDEFGHIJ0123456789
arrival_date_day_of_month
<dbl>
arrival_date_month
<chr>
arrival_date_year
<dbl>
date_arrival
<date>
1July20152015-07-01
1July20152015-07-01
1July20152015-07-01
1July20152015-07-01
1July20152015-07-01
1July20152015-07-01
1July20152015-07-01
1July20152015-07-01
1July20152015-07-01
1July20152015-07-01
Code
summary(bookings_mutate_strToDate$date_arrival)
        Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
"2015-07-01" "2016-03-13" "2016-09-06" "2016-08-28" "2017-03-18" "2017-08-31" 

We can now clean the data by removing the now redundant variables arrival_date_day_of_month, arrival_date_month and arrival_date_year

Code
bookings_final <- bookings_mutate_strToDate%>%
  select(-starts_with("arrival"))

bookings_final
ABCDEFGHIJ0123456789
hotel
<chr>
is_canceled
<dbl>
lead_time
<dbl>
stays_in_weekend_nights
<dbl>
stays_in_week_nights
<dbl>
adults
<dbl>
children
<dbl>
babies
<dbl>
meal
<chr>
country
<chr>
Resort Hotel034200200BBPRT
Resort Hotel073700200BBPRT
Resort Hotel0701100BBGBR
Resort Hotel01301100BBGBR
Resort Hotel01402200BBGBR
Resort Hotel01402200BBGBR
Resort Hotel0002200BBPRT
Resort Hotel0902200FBPRT
Resort Hotel18503200BBPRT
Resort Hotel17503200HBPRT