challenge_4
hotel_bookings
dplyr
mutate
audrey_bertin
Data wrangling with mutate
Author

Audrey Bertin

Published

June 9, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

For this challenge, I’ll be continuing with the dataset I used in challenge 2: hotel_bookings.csv⭐⭐⭐⭐

Code
hotels <- readr::read_csv("_data/hotel_bookings.csv")

I provided an overview of the dataset in that challenge, which I’ll share here as well.

Data Overview

Code
glimpse(hotels)
Rows: 119,390
Columns: 32
$ hotel                          <chr> "Resort Hotel", "Resort Hotel", "Resort…
$ is_canceled                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, …
$ lead_time                      <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, …
$ arrival_date_year              <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201…
$ arrival_date_month             <chr> "July", "July", "July", "July", "July",…
$ arrival_date_week_number       <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
$ arrival_date_day_of_month      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ stays_in_weekend_nights        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ stays_in_week_nights           <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, …
$ adults                         <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ children                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ meal                           <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB…
$ country                        <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR…
$ market_segment                 <chr> "Direct", "Direct", "Direct", "Corporat…
$ distribution_channel           <chr> "Direct", "Direct", "Direct", "Corporat…
$ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ reserved_room_type             <chr> "C", "C", "A", "A", "A", "A", "C", "C",…
$ assigned_room_type             <chr> "C", "C", "C", "A", "A", "A", "C", "C",…
$ booking_changes                <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ deposit_type                   <chr> "No Deposit", "No Deposit", "No Deposit…
$ agent                          <chr> "NULL", "NULL", "NULL", "304", "240", "…
$ company                        <chr> "NULL", "NULL", "NULL", "NULL", "NULL",…
$ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ customer_type                  <chr> "Transient", "Transient", "Transient", …
$ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,…
$ required_car_parking_spaces    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ total_of_special_requests      <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, …
$ reservation_status             <chr> "Check-Out", "Check-Out", "Check-Out", …
$ reservation_status_date        <date> 2015-07-01, 2015-07-01, 2015-07-02, 20…

This dataset has 119,300 rows and 32 columns. It contains many different data points on hotel reservations, including information about the customer (how many adults/children are staying, customer type, previous bookings by them, etc), the reservation (arrival date, length of stay, room type, etc), as well as the hotel (country location, distribution channel, hotel type, etc). Each row appears to refer to a single booking at a single hotel.

The hotels appear to be located all around the world and data covers the years 2015-2017:

Code
unique(hotels$country)
  [1] "PRT"  "GBR"  "USA"  "ESP"  "IRL"  "FRA"  "NULL" "ROU"  "NOR"  "OMN" 
 [11] "ARG"  "POL"  "DEU"  "BEL"  "CHE"  "CN"   "GRC"  "ITA"  "NLD"  "DNK" 
 [21] "RUS"  "SWE"  "AUS"  "EST"  "CZE"  "BRA"  "FIN"  "MOZ"  "BWA"  "LUX" 
 [31] "SVN"  "ALB"  "IND"  "CHN"  "MEX"  "MAR"  "UKR"  "SMR"  "LVA"  "PRI" 
 [41] "SRB"  "CHL"  "AUT"  "BLR"  "LTU"  "TUR"  "ZAF"  "AGO"  "ISR"  "CYM" 
 [51] "ZMB"  "CPV"  "ZWE"  "DZA"  "KOR"  "CRI"  "HUN"  "ARE"  "TUN"  "JAM" 
 [61] "HRV"  "HKG"  "IRN"  "GEO"  "AND"  "GIB"  "URY"  "JEY"  "CAF"  "CYP" 
 [71] "COL"  "GGY"  "KWT"  "NGA"  "MDV"  "VEN"  "SVK"  "FJI"  "KAZ"  "PAK" 
 [81] "IDN"  "LBN"  "PHL"  "SEN"  "SYC"  "AZE"  "BHR"  "NZL"  "THA"  "DOM" 
 [91] "MKD"  "MYS"  "ARM"  "JPN"  "LKA"  "CUB"  "CMR"  "BIH"  "MUS"  "COM" 
[101] "SUR"  "UGA"  "BGR"  "CIV"  "JOR"  "SYR"  "SGP"  "BDI"  "SAU"  "VNM" 
[111] "PLW"  "QAT"  "EGY"  "PER"  "MLT"  "MWI"  "ECU"  "MDG"  "ISL"  "UZB" 
[121] "NPL"  "BHS"  "MAC"  "TGO"  "TWN"  "DJI"  "STP"  "KNA"  "ETH"  "IRQ" 
[131] "HND"  "RWA"  "KHM"  "MCO"  "BGD"  "IMN"  "TJK"  "NIC"  "BEN"  "VGB" 
[141] "TZA"  "GAB"  "GHA"  "TMP"  "GLP"  "KEN"  "LIE"  "GNB"  "MNE"  "UMI" 
[151] "MYT"  "FRO"  "MMR"  "PAN"  "BFA"  "LBY"  "MLI"  "NAM"  "BOL"  "PRY" 
[161] "BRB"  "ABW"  "AIA"  "SLV"  "DMA"  "PYF"  "GUY"  "LCA"  "ATA"  "GTM" 
[171] "ASM"  "MRT"  "NCL"  "KIR"  "SDN"  "ATF"  "SLE"  "LAO" 
Code
unique(hotels$arrival_date_year)
[1] 2015 2016 2017

It is not entirely clear, however, which hotels are covered or any information about their cost.

Tidy Data

This data is already in a tidy format, meeting Hadley Wickham’s 3 rules of tidy data:

  1. Each variable must have its own column – Data is correctly separated into columns, no concerns here
  2. Each observation must have its own row – Each row represents one reservation at a hotel (which is one observation)
  3. Each value must have its own cell – There are not multiple pieces of information stored in the same cell of the data

There is also no need to pivot this data, as it is not in “wide” format. To be in wide format, there would need to be multiple responses of some type per row, such as a row representing a customer and columns representing their history of bookings. Instead, a row represents a single reservation at a hotel, which cannot be broken down further.

Identify variables that need to be mutated

Looking at this data, we consider each variable and what format they should be in:

Currently character but should be factor

Variables that only have a limited set of categories should be turned into factors. It usually only makes sense to keep strings as characters if there are an undefined set of options (e.g. free text responses to a survey) or if the options are regularly changing. Otherwise, factors usually save memory and make it easier to graph things/make comparisons.

These are the following variables:

  • hotel
  • arrival_date_month
  • meal
  • country
  • market_segment
  • distribution_channel
  • reserved_room_type
  • assigned_room_type
  • deposit_type
  • customer_type
  • reservation_status

Currently double but should be factor

These are variables that technically only have a yes (1) or no (0) option, but are currently categorized as numerical instead of categorical.

  • is_canceled
  • is_repeated_guest

Currently double but should probably be int

Storing values as doubles when they can only take on integer and not decimal values can take up more memory. There are quite a few of this type of variable in the data. This sort of modification isn’t necessary for data analysis, though:

  • lead_time
  • arrival_date_week_number
  • arrival_date_day_of_month
  • stays_in_weekend_nights
  • stays_in_week_nights
  • adults
  • children
  • babies
  • previous_cancellations
  • previous_bookings_not_canceled
  • booking_changes
  • required_car_parking_spaces
  • total_of_special_requests

Not entirely clear, but should probably stay as-is (with a minor modification)

There are two variables, agent and company that are currently character variables. They seem to store numeric IDs that refer to different agents and companies.

Looking into these variables further, they both have a LOT of options.

Code
length(unique(hotels$agent))
[1] 334
Code
length(unique(hotels$company))
[1] 353

It is theoretically possible that these options change regularly. Maybe someone changes ID number or IDs get removed from the data if people quit or contracts end with companies, so it probably does not make sense to store this as a factor.

It also does not make sense to store these as numeric (int), because there is no obvious meaning to an ordering. An agent of number “243” is not necessarily of higher status or importance than agent “5”.

Therefore, these can be left as character. However, one note to improve this encoding is to replace “NULL” with R’s NA so that we can more easily track null values in the data.

Already in the optimal format

  • adr (is a double, and has decimal options so this is correct)
  • reservation_status_date (is a date, and currently formatted as one)

Mutate variables

We now conduct the mutations that we described in the previous section:

Code
hotels_mutated <- hotels %>%
  mutate(
    # Character -> Factor
    hotel = as_factor(hotel),
    arrival_date_month = as_factor(arrival_date_month),
    meal = as_factor(meal),
    country = as_factor(country), 
    market_segment = as_factor(market_segment), 
    distribution_channel = as_factor(distribution_channel),
    reserved_room_type = as_factor(reserved_room_type),
    assigned_room_type = as_factor(assigned_room_type),
    deposit_type = as_factor(deposit_type),
    customer_type = as_factor(customer_type),
    reservation_status = as_factor(reservation_status),
    # Double -> Factor
    is_canceled = as_factor(is_canceled),
    is_repeated_guest = as_factor(is_repeated_guest),
    # Double -> Int
    lead_time = as.integer(lead_time),
    arrival_date_week_number = as.integer(arrival_date_week_number),
    arrival_date_day_of_month = as.integer(arrival_date_day_of_month),
    stays_in_weekend_nights = as.integer(stays_in_weekend_nights),
    stays_in_week_nights = as.integer(stays_in_week_nights),
    adults = as.integer(adults),
    children = as.integer(children),
    babies = as.integer(babies),
    previous_cancellations = as.integer(previous_cancellations),
    previous_bookings_not_canceled = as.integer(previous_bookings_not_canceled),
    booking_changes = as.integer(booking_changes),
    required_car_parking_spaces = as.integer(required_car_parking_spaces),
    total_of_special_requests = as.integer(total_of_special_requests)
    ) %>%
    # Replace "NULL" values with NA
    mutate(across(where(is.character), ~na_if(., "NULL")))

Looking at the results, we see that all still looks in good order and none of the variables appear to have an obvious bug that was introduced:

Code
glimpse(hotels_mutated)
Rows: 119,390
Columns: 32
$ hotel                          <fct> Resort Hotel, Resort Hotel, Resort Hote…
$ is_canceled                    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, …
$ lead_time                      <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, …
$ arrival_date_year              <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201…
$ arrival_date_month             <fct> July, July, July, July, July, July, Jul…
$ arrival_date_week_number       <int> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
$ arrival_date_day_of_month      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ stays_in_weekend_nights        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ stays_in_week_nights           <int> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, …
$ adults                         <int> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ children                       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ babies                         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ meal                           <fct> BB, BB, BB, BB, BB, BB, BB, FB, BB, HB,…
$ country                        <fct> PRT, PRT, GBR, GBR, GBR, GBR, PRT, PRT,…
$ market_segment                 <fct> Direct, Direct, Direct, Corporate, Onli…
$ distribution_channel           <fct> Direct, Direct, Direct, Corporate, TA/T…
$ is_repeated_guest              <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_cancellations         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ reserved_room_type             <fct> C, C, A, A, A, A, C, C, A, D, E, D, D, …
$ assigned_room_type             <fct> C, C, C, A, A, A, C, C, A, D, E, D, E, …
$ booking_changes                <int> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ deposit_type                   <fct> No Deposit, No Deposit, No Deposit, No …
$ agent                          <chr> NA, NA, NA, "304", "240", "240", NA, "3…
$ company                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ customer_type                  <fct> Transient, Transient, Transient, Transi…
$ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,…
$ required_car_parking_spaces    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ total_of_special_requests      <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, …
$ reservation_status             <fct> Check-Out, Check-Out, Check-Out, Check-…
$ reservation_status_date        <date> 2015-07-01, 2015-07-01, 2015-07-02, 20…