Challenge 4: Mutating hotel bookings dataset

challenge_4

hotel_bookings

Saksham Kumar

More data wrangling: mutate

Author

Saksham Kumar

Published

April 12, 2023

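library(tidyverse)
library(lubridate)  # provides dmy(); loaded explicitly in case the installed tidyverse does not attach it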

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
tidy data (as needed, including sanity checks)
identify variables that need to be mutated
mutate variables and sanity check all mutations

Read in data

We read in the hotel_bookings.csv dataset using read_csv from the readr package, which is loaded with the tidyverse.

bookings <- read_csv("_data/hotel_bookings.csv", show_col_types = FALSE)
bookings

Briefly describe the data

The data corresponds to hotel bookings. There are 119,390 rows and 32 variables in our data. The 32 variables and their data types are listed below.

spec(bookings)

cols(
  hotel = col_character(),
  is_canceled = col_double(),
  lead_time = col_double(),
  arrival_date_year = col_double(),
  arrival_date_month = col_character(),
  arrival_date_week_number = col_double(),
  arrival_date_day_of_month = col_double(),
  stays_in_weekend_nights = col_double(),
  stays_in_week_nights = col_double(),
  adults = col_double(),
  children = col_double(),
  babies = col_double(),
  meal = col_character(),
  country = col_character(),
  market_segment = col_character(),
  distribution_channel = col_character(),
  is_repeated_guest = col_double(),
  previous_cancellations = col_double(),
  previous_bookings_not_canceled = col_double(),
  reserved_room_type = col_character(),
  assigned_room_type = col_character(),
  booking_changes = col_double(),
  deposit_type = col_character(),
  agent = col_character(),
  company = col_character(),
  days_in_waiting_list = col_double(),
  customer_type = col_character(),
  adr = col_double(),
  required_car_parking_spaces = col_double(),
  total_of_special_requests = col_double(),
  reservation_status = col_character(),
  reservation_status_date = col_date(format = "")
)

Tidy Data (as needed)

The data looks clean enough for an initial analysis: each row is a single booking and each column a single variable, so it does not need to be reshaped.
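One quick sanity check at this stage is to count the missing values in each column. The sketch below does this on a small made-up tibble (the toy values and the name na_counts are assumptions for illustration, not taken from the real file); the same summarise call could be pointed at bookings.

```r
library(dplyr)
library(tibble)

# Toy stand-in for a few columns of the bookings data (values are made up)
toy <- tibble(
  hotel    = c("Resort Hotel", "City Hotel", "City Hotel"),
  children = c(0, NA, 1),
  adr      = c(75.0, 98.5, 120.0)
)

# Count the missing values in every column
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

A column with a non-zero count is a candidate for closer inspection before any mutation relies on it.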

Identify variables that need to be mutated

Looking at the data, we see three variables that can be combined into one: arrival_date_day_of_month, arrival_date_month and arrival_date_year.

Using mutate, we first join these three fields into a day/month/year string with str_c and then parse that string into a date with dmy, giving a single arrival date field, date_arrival.

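# Step 1: paste day, month name and year into a "day/month/year" string
bookings_mutate_date_arrival <- bookings %>%
  mutate(date_arrival = str_c(arrival_date_day_of_month, arrival_date_month, arrival_date_year, sep = "/"))

# Step 2: parse that string into a Date with lubridate's dmy()
bookings_mutate_strToDate <- bookings_mutate_date_arrival %>%
  mutate(date_arrival = dmy(date_arrival))

# Show the original components next to the new column
bookings_mutate_strToDate[ , c("arrival_date_day_of_month", "arrival_date_month", "arrival_date_year", "date_arrival")]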
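To make the two steps easier to follow, here is a minimal sketch on two made-up rows. The toy values and the intermediate column name date_string are assumptions for illustration; parsing full month names with dmy also assumes an English locale.

```r
library(dplyr)
library(stringr)
library(lubridate)
library(tibble)

# Two made-up bookings with the same three date columns as the real data
toy <- tibble(
  arrival_date_day_of_month = c(1, 15),
  arrival_date_month        = c("July", "August"),
  arrival_date_year         = c(2015, 2017)
)

toy_dates <- toy %>%
  mutate(
    # e.g. "1/July/2015"
    date_string  = str_c(arrival_date_day_of_month, arrival_date_month,
                         arrival_date_year, sep = "/"),
    # parsed as day, month, year
    date_arrival = dmy(date_string)
  )

toy_dates
```

Keeping the intermediate string in its own column, as done here, makes it easy to see what dmy was given if a row fails to parse.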

summary(bookings_mutate_strToDate$date_arrival)

        Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
"2015-07-01" "2016-03-13" "2016-09-06" "2016-08-28" "2017-03-18" "2017-08-31"

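Beyond the summary, a more explicit sanity check is to confirm that no date failed to parse and that the parsed date agrees with the original components. The following is a sketch on made-up rows (the toy values and the name checks are assumptions for illustration); month.name is base R's vector of English month names.

```r
library(dplyr)
library(lubridate)
library(tibble)

# Made-up rows: original components plus an already parsed date column
toy <- tibble(
  arrival_date_day_of_month = c(1, 31),
  arrival_date_month        = c("July", "August"),
  arrival_date_year         = c(2015, 2017),
  date_arrival              = as.Date(c("2015-07-01", "2017-08-31"))
)

checks <- toy %>%
  summarise(
    n_missing     = sum(is.na(date_arrival)),  # parse failures show up as NA
    day_matches   = all(day(date_arrival) == arrival_date_day_of_month),
    month_matches = all(month.name[month(date_arrival)] == arrival_date_month),
    year_matches  = all(year(date_arrival) == arrival_date_year)
  )

checks
```

If n_missing is zero and the three comparisons are TRUE, the new column carries the same information as the three original ones.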
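We can now clean the data by removing the now redundant variables arrival_date_day_of_month, arrival_date_month and arrival_date_year. Because the selection below uses starts_with("arrival"), it also removes arrival_date_week_number, which the arrival date determines as well.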

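# Drop every column whose name starts with "arrival" (this includes arrival_date_week_number);
# date_arrival is kept because its name starts with "date"
bookings_final <- bookings_mutate_strToDate %>%
  select(-starts_with("arrival"))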

bookings_final
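It is worth checking exactly which columns a starts_with selection removes. The sketch below uses a one-row made-up tibble with the same column names (the toy values and the names dropped, kept and kept_explicit are assumptions for illustration) and contrasts it with naming the three columns explicitly.

```r
library(dplyr)
library(tibble)

# One made-up row with the relevant column names
toy <- tibble(
  hotel                     = "Resort Hotel",
  arrival_date_year         = 2015,
  arrival_date_month        = "July",
  arrival_date_week_number  = 27,
  arrival_date_day_of_month = 1,
  date_arrival              = as.Date("2015-07-01")
)

# Columns matched by the prefix, i.e. the ones that get removed
dropped <- toy %>% select(starts_with("arrival")) %>% names()

# Columns that survive the prefix-based removal
kept <- toy %>% select(-starts_with("arrival")) %>% names()

# Alternative: drop only the three date components, keeping the week number
kept_explicit <- toy %>%
  select(-c(arrival_date_day_of_month, arrival_date_month, arrival_date_year)) %>%
  names()

dropped
kept
kept_explicit
```

Which variant is preferable depends on whether the week number is still wanted as its own column later in the analysis.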