Challenge_4_Solution

challenge_4
abc_poll
eggs
fed_rates
hotel_bookings
debt
More data wrangling: pivoting
Author

Zhongyue Lin

Published

June 9, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • hotel_bookings.csv⭐⭐⭐⭐

In challenge 4, I chose hotel_bookings.csv as the original dataset. This dataset is the one I used in challenge 2.

Code
data <- readr::read_csv("_data/hotel_bookings.csv")
head(data)
# A tibble: 6 × 32
  hotel   is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
  <chr>     <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
1 Resort…       0     342    2015 July         27       1       0       0      2
2 Resort…       0     737    2015 July         27       1       0       0      2
3 Resort…       0       7    2015 July         27       1       0       1      1
4 Resort…       0      13    2015 July         27       1       0       1      1
5 Resort…       0      14    2015 July         27       1       0       2      2
6 Resort…       0      14    2015 July         27       1       0       2      2
# … with 22 more variables: children <dbl>, babies <dbl>, meal <chr>,
#   country <chr>, market_segment <chr>, distribution_channel <chr>,
#   is_repeated_guest <dbl>, previous_cancellations <dbl>,
#   previous_bookings_not_canceled <dbl>, reserved_room_type <chr>,
#   assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>,
#   agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …

Briefly describe the data

According to the EDA of this dataset in Challenge 2, it is understood that this dataset is about the data records of two hotels, Resort hotel and City hotel. The dataset contains customer data, business data and other related data. The data structure of this dataset is 119,390 rows 32 columns.

Tidy Data (as needed)

Code
data %>%
  summarize_all(function(x) sum(is.na(x)))
# A tibble: 1 × 32
  hotel is_canc…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
  <int>     <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>  <int>
1     0         0       0       0       0       0       0       0       0      0
# … with 22 more variables: children <int>, babies <int>, meal <int>,
#   country <int>, market_segment <int>, distribution_channel <int>,
#   is_repeated_guest <int>, previous_cancellations <int>,
#   previous_bookings_not_canceled <int>, reserved_room_type <int>,
#   assigned_room_type <int>, booking_changes <int>, deposit_type <int>,
#   agent <int>, company <int>, days_in_waiting_list <int>,
#   customer_type <int>, adr <int>, required_car_parking_spaces <int>, …
Code
# Find the index of the row containing the missing value
missing_rows_index <- which(apply(data, 1, function(x) any(is.na(x))))

# Print out lines with missing values
missing_rows <- data[missing_rows_index, ]
print(missing_rows)
# A tibble: 4 × 32
  hotel   is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
  <chr>     <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
1 City H…       1       2    2015 August       32       3       1       0      2
2 City H…       1       1    2015 August       32       5       0       2      2
3 City H…       1       1    2015 August       32       5       0       2      3
4 City H…       1       8    2015 August       33      13       2       5      2
# … with 22 more variables: children <dbl>, babies <dbl>, meal <chr>,
#   country <chr>, market_segment <chr>, distribution_channel <chr>,
#   is_repeated_guest <dbl>, previous_cancellations <dbl>,
#   previous_bookings_not_canceled <dbl>, reserved_room_type <chr>,
#   assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>,
#   agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
Code
# Using replace()
data$children <- replace(data$children, is.na(data$children), 0)

# Using subset assignment
data$children[is.na(data$children)] <- 0

data %>%
  summarize_all(function(x) sum(is.na(x)))
# A tibble: 1 × 32
  hotel is_canc…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
  <int>     <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>  <int>
1     0         0       0       0       0       0       0       0       0      0
# … with 22 more variables: children <int>, babies <int>, meal <int>,
#   country <int>, market_segment <int>, distribution_channel <int>,
#   is_repeated_guest <int>, previous_cancellations <int>,
#   previous_bookings_not_canceled <int>, reserved_room_type <int>,
#   assigned_room_type <int>, booking_changes <int>, deposit_type <int>,
#   agent <int>, company <int>, days_in_waiting_list <int>,
#   customer_type <int>, adr <int>, required_car_parking_spaces <int>, …

As I began working with this dataset, I observed that it was already quite tidy. The dataset follows the principles of tidy data, where each variable forms a column, each observation forms a row, and each cell holds a single value of the corresponding variable for a given observation. To understand the extent of missing data across all columns, I performed an initial check using summarize_all() combined with is.na(). This gave me an overall snapshot of the missing data.

Next, I employed apply() and is.na() to identify which rows in the dataset contained missing data, and stored these row indices in the variable missing_rows_index for future use. Using these stored indices, I was able to extract and store the rows with missing data in the missing_rows variable. By printing out missing_rows, I gained a detailed view of the rows with missing data and the extent of missingness.

Then, I addressed the missing values in the “children” column, replacing them with 0 using the replace() function or subset assignment. I decided to replace missing values with 0 under the assumption that if the number of children isn’t specified, it’s likely because no children were involved in the booking. This approach to handling missing data seemed reasonable under this context.

Finally, to confirm that the missing values in the “children” column were indeed replaced with 0, I repeated the missing data check using summarize_all().

Identify variables that need to be mutated

Code
# Combine year, month, and day of month into a single date column and then remove the original columns related to date
data <- data %>%
  mutate(arrival_date = as.Date(paste(arrival_date_year, arrival_date_month, arrival_date_day_of_month, sep = "-"), format = "%Y-%B-%d")) %>%
  select(-arrival_date_year, -arrival_date_month, -arrival_date_week_number, -arrival_date_day_of_month)

# Convert the variables 'is_canceled' and 'is_repeated_guest' to logical (boolean) values, which are currently coded as integers
data <- data %>%
  mutate(is_canceled = as.logical(is_canceled), is_repeated_guest = as.logical(is_repeated_guest))

# Check and print unique values of categorical variables in order to see if there are inconsistencies, spelling mistakes, or unnecessary categories
unique_values <- data %>%
  select(hotel, meal, country, market_segment, distribution_channel, reserved_room_type, assigned_room_type, deposit_type, customer_type, reservation_status) %>%
  sapply(unique)
print(unique_values)
$hotel
[1] "Resort Hotel" "City Hotel"  

$meal
[1] "BB"        "FB"        "HB"        "SC"        "Undefined"

$country
  [1] "PRT"  "GBR"  "USA"  "ESP"  "IRL"  "FRA"  "NULL" "ROU"  "NOR"  "OMN" 
 [11] "ARG"  "POL"  "DEU"  "BEL"  "CHE"  "CN"   "GRC"  "ITA"  "NLD"  "DNK" 
 [21] "RUS"  "SWE"  "AUS"  "EST"  "CZE"  "BRA"  "FIN"  "MOZ"  "BWA"  "LUX" 
 [31] "SVN"  "ALB"  "IND"  "CHN"  "MEX"  "MAR"  "UKR"  "SMR"  "LVA"  "PRI" 
 [41] "SRB"  "CHL"  "AUT"  "BLR"  "LTU"  "TUR"  "ZAF"  "AGO"  "ISR"  "CYM" 
 [51] "ZMB"  "CPV"  "ZWE"  "DZA"  "KOR"  "CRI"  "HUN"  "ARE"  "TUN"  "JAM" 
 [61] "HRV"  "HKG"  "IRN"  "GEO"  "AND"  "GIB"  "URY"  "JEY"  "CAF"  "CYP" 
 [71] "COL"  "GGY"  "KWT"  "NGA"  "MDV"  "VEN"  "SVK"  "FJI"  "KAZ"  "PAK" 
 [81] "IDN"  "LBN"  "PHL"  "SEN"  "SYC"  "AZE"  "BHR"  "NZL"  "THA"  "DOM" 
 [91] "MKD"  "MYS"  "ARM"  "JPN"  "LKA"  "CUB"  "CMR"  "BIH"  "MUS"  "COM" 
[101] "SUR"  "UGA"  "BGR"  "CIV"  "JOR"  "SYR"  "SGP"  "BDI"  "SAU"  "VNM" 
[111] "PLW"  "QAT"  "EGY"  "PER"  "MLT"  "MWI"  "ECU"  "MDG"  "ISL"  "UZB" 
[121] "NPL"  "BHS"  "MAC"  "TGO"  "TWN"  "DJI"  "STP"  "KNA"  "ETH"  "IRQ" 
[131] "HND"  "RWA"  "KHM"  "MCO"  "BGD"  "IMN"  "TJK"  "NIC"  "BEN"  "VGB" 
[141] "TZA"  "GAB"  "GHA"  "TMP"  "GLP"  "KEN"  "LIE"  "GNB"  "MNE"  "UMI" 
[151] "MYT"  "FRO"  "MMR"  "PAN"  "BFA"  "LBY"  "MLI"  "NAM"  "BOL"  "PRY" 
[161] "BRB"  "ABW"  "AIA"  "SLV"  "DMA"  "PYF"  "GUY"  "LCA"  "ATA"  "GTM" 
[171] "ASM"  "MRT"  "NCL"  "KIR"  "SDN"  "ATF"  "SLE"  "LAO" 

$market_segment
[1] "Direct"        "Corporate"     "Online TA"     "Offline TA/TO"
[5] "Complementary" "Groups"        "Undefined"     "Aviation"     

$distribution_channel
[1] "Direct"    "Corporate" "TA/TO"     "Undefined" "GDS"      

$reserved_room_type
 [1] "C" "A" "D" "E" "G" "F" "H" "L" "P" "B"

$assigned_room_type
 [1] "C" "A" "D" "E" "G" "F" "I" "B" "H" "P" "L" "K"

$deposit_type
[1] "No Deposit" "Refundable" "Non Refund"

$customer_type
[1] "Transient"       "Contract"        "Transient-Party" "Group"          

$reservation_status
[1] "Check-Out" "Canceled"  "No-Show"  
Code
# Convert categorical variables into factors for easier analysis and visualization. 
data <- data %>%
  mutate_at(vars(hotel, meal, country, market_segment, distribution_channel, reserved_room_type, assigned_room_type, deposit_type, customer_type, reservation_status), 
            factor)

# Print the first few rows of the dataset to verify that the transformations have been applied correctly
head(data)
# A tibble: 6 × 29
  hotel      is_ca…¹ lead_…² stays…³ stays…⁴ adults child…⁵ babies meal  country
  <fct>      <lgl>     <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl> <fct> <fct>  
1 Resort Ho… FALSE       342       0       0      2       0      0 BB    PRT    
2 Resort Ho… FALSE       737       0       0      2       0      0 BB    PRT    
3 Resort Ho… FALSE         7       0       1      1       0      0 BB    GBR    
4 Resort Ho… FALSE        13       0       1      1       0      0 BB    GBR    
5 Resort Ho… FALSE        14       0       2      2       0      0 BB    GBR    
6 Resort Ho… FALSE        14       0       2      2       0      0 BB    GBR    
# … with 19 more variables: market_segment <fct>, distribution_channel <fct>,
#   is_repeated_guest <lgl>, previous_cancellations <dbl>,
#   previous_bookings_not_canceled <dbl>, reserved_room_type <fct>,
#   assigned_room_type <fct>, booking_changes <dbl>, deposit_type <fct>,
#   agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <fct>, adr <dbl>, required_car_parking_spaces <dbl>,
#   total_of_special_requests <dbl>, reservation_status <fct>, …

First, I merged the year, month, and day of arrival into a single date column and then removed the original columns. This was done using the mutate function to create a new arrival_date column, which was the combination of arrival_date_year, arrival_date_month, and arrival_date_day_of_month. After that, I used the selectfunction to remove the original date columns. I also converted the ‘is_canceled’ and ‘is_repeated_guest’ columns into logical values using the as.logical function as these were originally in integer format.

Next, I needed to check the unique values of categorical variables to identify any inconsistencies, spelling mistakes, or unnecessary categories. To do this, I selected certain categorical variables and applied the sapply and unique functions to get unique values for each column. After identifying and rectifying any issues, I converted these categorical variables into factors using the mutate_at function along with the factor function. This step was necessary because many functions in R treat factors differently from character strings, especially in statistical modeling and visualization. Finally, I used the headfunction to output the first few rows of the dataset, allowing me to verify that the transformations were correctly applied.