Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Zhongyue Lin
June 9, 2023
Today’s challenge is to:
Read in one (or more) of the following datasets, using the correct R package and command.
In challenge 4, I chose hotel_bookings.csv as the original dataset. This dataset is the one I used in challenge 2.
# A tibble: 6 × 32
hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Resort… 0 342 2015 July 27 1 0 0 2
2 Resort… 0 737 2015 July 27 1 0 0 2
3 Resort… 0 7 2015 July 27 1 0 1 1
4 Resort… 0 13 2015 July 27 1 0 1 1
5 Resort… 0 14 2015 July 27 1 0 2 2
6 Resort… 0 14 2015 July 27 1 0 2 2
# … with 22 more variables: children <dbl>, babies <dbl>, meal <chr>,
# country <chr>, market_segment <chr>, distribution_channel <chr>,
# is_repeated_guest <dbl>, previous_cancellations <dbl>,
# previous_bookings_not_canceled <dbl>, reserved_room_type <chr>,
# assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>,
# agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
According to the EDA of this dataset in Challenge 2, it is understood that this dataset is about the data records of two hotels, Resort hotel and City hotel. The dataset contains customer data, business data and other related data. The data structure of this dataset is 119,390 rows 32 columns.
# A tibble: 1 × 32
hotel is_canc…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 0 0 0 0 0 0 0 0 0 0
# … with 22 more variables: children <int>, babies <int>, meal <int>,
# country <int>, market_segment <int>, distribution_channel <int>,
# is_repeated_guest <int>, previous_cancellations <int>,
# previous_bookings_not_canceled <int>, reserved_room_type <int>,
# assigned_room_type <int>, booking_changes <int>, deposit_type <int>,
# agent <int>, company <int>, days_in_waiting_list <int>,
# customer_type <int>, adr <int>, required_car_parking_spaces <int>, …
# A tibble: 4 × 32
hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 City H… 1 2 2015 August 32 3 1 0 2
2 City H… 1 1 2015 August 32 5 0 2 2
3 City H… 1 1 2015 August 32 5 0 2 3
4 City H… 1 8 2015 August 33 13 2 5 2
# … with 22 more variables: children <dbl>, babies <dbl>, meal <chr>,
# country <chr>, market_segment <chr>, distribution_channel <chr>,
# is_repeated_guest <dbl>, previous_cancellations <dbl>,
# previous_bookings_not_canceled <dbl>, reserved_room_type <chr>,
# assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>,
# agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
# A tibble: 1 × 32
hotel is_canc…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 0 0 0 0 0 0 0 0 0 0
# … with 22 more variables: children <int>, babies <int>, meal <int>,
# country <int>, market_segment <int>, distribution_channel <int>,
# is_repeated_guest <int>, previous_cancellations <int>,
# previous_bookings_not_canceled <int>, reserved_room_type <int>,
# assigned_room_type <int>, booking_changes <int>, deposit_type <int>,
# agent <int>, company <int>, days_in_waiting_list <int>,
# customer_type <int>, adr <int>, required_car_parking_spaces <int>, …
As I began working with this dataset, I observed that it was already quite tidy. The dataset follows the principles of tidy data, where each variable forms a column, each observation forms a row, and each cell holds a single value of the corresponding variable for a given observation. To understand the extent of missing data across all columns, I performed an initial check using summarize_all()
combined with is.na()
. This gave me an overall snapshot of the missing data.
Next, I employed apply()
and is.na()
to identify which rows in the dataset contained missing data, and stored these row indices in the variable missing_rows_index
for future use. Using these stored indices, I was able to extract and store the rows with missing data in the missing_rows
variable. By printing out missing_rows
, I gained a detailed view of the rows with missing data and the extent of missingness.
Then, I addressed the missing values in the “children” column, replacing them with 0 using the replace()
function or subset assignment. I decided to replace missing values with 0 under the assumption that if the number of children isn’t specified, it’s likely because no children were involved in the booking. This approach to handling missing data seemed reasonable under this context.
Finally, to confirm that the missing values in the “children” column were indeed replaced with 0, I repeated the missing data check using summarize_all()
.
# Combine year, month, and day of month into a single date column and then remove the original columns related to date
data <- data %>%
mutate(arrival_date = as.Date(paste(arrival_date_year, arrival_date_month, arrival_date_day_of_month, sep = "-"), format = "%Y-%B-%d")) %>%
select(-arrival_date_year, -arrival_date_month, -arrival_date_week_number, -arrival_date_day_of_month)
# Convert the variables 'is_canceled' and 'is_repeated_guest' to logical (boolean) values, which are currently coded as integers
data <- data %>%
mutate(is_canceled = as.logical(is_canceled), is_repeated_guest = as.logical(is_repeated_guest))
# Check and print unique values of categorical variables in order to see if there are inconsistencies, spelling mistakes, or unnecessary categories
unique_values <- data %>%
select(hotel, meal, country, market_segment, distribution_channel, reserved_room_type, assigned_room_type, deposit_type, customer_type, reservation_status) %>%
sapply(unique)
print(unique_values)
$hotel
[1] "Resort Hotel" "City Hotel"
$meal
[1] "BB" "FB" "HB" "SC" "Undefined"
$country
[1] "PRT" "GBR" "USA" "ESP" "IRL" "FRA" "NULL" "ROU" "NOR" "OMN"
[11] "ARG" "POL" "DEU" "BEL" "CHE" "CN" "GRC" "ITA" "NLD" "DNK"
[21] "RUS" "SWE" "AUS" "EST" "CZE" "BRA" "FIN" "MOZ" "BWA" "LUX"
[31] "SVN" "ALB" "IND" "CHN" "MEX" "MAR" "UKR" "SMR" "LVA" "PRI"
[41] "SRB" "CHL" "AUT" "BLR" "LTU" "TUR" "ZAF" "AGO" "ISR" "CYM"
[51] "ZMB" "CPV" "ZWE" "DZA" "KOR" "CRI" "HUN" "ARE" "TUN" "JAM"
[61] "HRV" "HKG" "IRN" "GEO" "AND" "GIB" "URY" "JEY" "CAF" "CYP"
[71] "COL" "GGY" "KWT" "NGA" "MDV" "VEN" "SVK" "FJI" "KAZ" "PAK"
[81] "IDN" "LBN" "PHL" "SEN" "SYC" "AZE" "BHR" "NZL" "THA" "DOM"
[91] "MKD" "MYS" "ARM" "JPN" "LKA" "CUB" "CMR" "BIH" "MUS" "COM"
[101] "SUR" "UGA" "BGR" "CIV" "JOR" "SYR" "SGP" "BDI" "SAU" "VNM"
[111] "PLW" "QAT" "EGY" "PER" "MLT" "MWI" "ECU" "MDG" "ISL" "UZB"
[121] "NPL" "BHS" "MAC" "TGO" "TWN" "DJI" "STP" "KNA" "ETH" "IRQ"
[131] "HND" "RWA" "KHM" "MCO" "BGD" "IMN" "TJK" "NIC" "BEN" "VGB"
[141] "TZA" "GAB" "GHA" "TMP" "GLP" "KEN" "LIE" "GNB" "MNE" "UMI"
[151] "MYT" "FRO" "MMR" "PAN" "BFA" "LBY" "MLI" "NAM" "BOL" "PRY"
[161] "BRB" "ABW" "AIA" "SLV" "DMA" "PYF" "GUY" "LCA" "ATA" "GTM"
[171] "ASM" "MRT" "NCL" "KIR" "SDN" "ATF" "SLE" "LAO"
$market_segment
[1] "Direct" "Corporate" "Online TA" "Offline TA/TO"
[5] "Complementary" "Groups" "Undefined" "Aviation"
$distribution_channel
[1] "Direct" "Corporate" "TA/TO" "Undefined" "GDS"
$reserved_room_type
[1] "C" "A" "D" "E" "G" "F" "H" "L" "P" "B"
$assigned_room_type
[1] "C" "A" "D" "E" "G" "F" "I" "B" "H" "P" "L" "K"
$deposit_type
[1] "No Deposit" "Refundable" "Non Refund"
$customer_type
[1] "Transient" "Contract" "Transient-Party" "Group"
$reservation_status
[1] "Check-Out" "Canceled" "No-Show"
# Convert categorical variables into factors for easier analysis and visualization.
data <- data %>%
mutate_at(vars(hotel, meal, country, market_segment, distribution_channel, reserved_room_type, assigned_room_type, deposit_type, customer_type, reservation_status),
factor)
# Print the first few rows of the dataset to verify that the transformations have been applied correctly
head(data)
# A tibble: 6 × 29
hotel is_ca…¹ lead_…² stays…³ stays…⁴ adults child…⁵ babies meal country
<fct> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
1 Resort Ho… FALSE 342 0 0 2 0 0 BB PRT
2 Resort Ho… FALSE 737 0 0 2 0 0 BB PRT
3 Resort Ho… FALSE 7 0 1 1 0 0 BB GBR
4 Resort Ho… FALSE 13 0 1 1 0 0 BB GBR
5 Resort Ho… FALSE 14 0 2 2 0 0 BB GBR
6 Resort Ho… FALSE 14 0 2 2 0 0 BB GBR
# … with 19 more variables: market_segment <fct>, distribution_channel <fct>,
# is_repeated_guest <lgl>, previous_cancellations <dbl>,
# previous_bookings_not_canceled <dbl>, reserved_room_type <fct>,
# assigned_room_type <fct>, booking_changes <dbl>, deposit_type <fct>,
# agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <fct>, adr <dbl>, required_car_parking_spaces <dbl>,
# total_of_special_requests <dbl>, reservation_status <fct>, …
First, I merged the year, month, and day of arrival into a single date column and then removed the original columns. This was done using the mutate function to create a new arrival_date column, which was the combination of arrival_date_year, arrival_date_month, and arrival_date_day_of_month. After that, I used the select
function to remove the original date columns. I also converted the ‘is_canceled’ and ‘is_repeated_guest’ columns into logical values using the as.logical
function as these were originally in integer format.
Next, I needed to check the unique values of categorical variables to identify any inconsistencies, spelling mistakes, or unnecessary categories. To do this, I selected certain categorical variables and applied the sapply
and unique
functions to get unique values for each column. After identifying and rectifying any issues, I converted these categorical variables into factors using the mutate_at function along with the factor function. This step was necessary because many functions in R treat factors differently from character strings, especially in statistical modeling and visualization. Finally, I used the head
function to output the first few rows of the dataset, allowing me to verify that the transformations were correctly applied.
---
title: "Challenge4_Solution"
author: "Zhongyue Lin"
description: "More data wrangling: pivoting"
date: "6/9/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_4
- abc_poll
- eggs
- fed_rates
- hotel_bookings
- debt
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1) read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2) tidy data (as needed, including sanity checks)
3) identify variables that need to be mutated
4) mutate variables and sanity check all mutations
## Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- hotel_bookings.csv⭐⭐⭐⭐
In challenge 4, I chose hotel_bookings.csv as the original dataset. This dataset is the one I used in challenge 2.
```{r}
data <- readr::read_csv("_data/hotel_bookings.csv")
head(data)
```
### Briefly describe the data
According to the EDA of this dataset in Challenge 2, it is understood that this dataset is about the data records of two hotels, **Resort hotel** and **City hotel**. The dataset contains customer data, business data and other related data. The data structure of this dataset is **119,390** rows **32** columns.
## Tidy Data (as needed)
```{r}
data %>%
summarize_all(function(x) sum(is.na(x)))
# Find the index of the row containing the missing value
missing_rows_index <- which(apply(data, 1, function(x) any(is.na(x))))
# Print out lines with missing values
missing_rows <- data[missing_rows_index, ]
print(missing_rows)
```
```{r}
# Using replace()
data$children <- replace(data$children, is.na(data$children), 0)
# Using subset assignment
data$children[is.na(data$children)] <- 0
data %>%
summarize_all(function(x) sum(is.na(x)))
```
As I began working with this dataset, I observed that it was already quite tidy. The dataset follows the principles of tidy data, where each variable forms a column, each observation forms a row, and each cell holds a single value of the corresponding variable for a given observation. To understand the extent of missing data across all columns, I performed an initial check using `summarize_all()` combined with `is.na()`. This gave me an overall snapshot of the missing data.
Next, I employed `apply()` and `is.na()` to identify which rows in the dataset contained missing data, and stored these row indices in the variable `missing_rows_index` for future use. Using these stored indices, I was able to extract and store the rows with missing data in the `missing_rows` variable. By printing out `missing_rows`, I gained a detailed view of the rows with missing data and the extent of missingness.
Then, I addressed the missing values in the "children" column, replacing them with 0 using the `replace()` function or subset assignment. I decided to replace missing values with 0 under the assumption that if the number of children isn't specified, it's likely because no children were involved in the booking. This approach to handling missing data seemed reasonable under this context.
Finally, to confirm that the missing values in the "children" column were indeed replaced with 0, I repeated the missing data check using `summarize_all()`.
## Identify variables that need to be mutated
```{r}
# Combine year, month, and day of month into a single date column and then remove the original columns related to date
data <- data %>%
mutate(arrival_date = as.Date(paste(arrival_date_year, arrival_date_month, arrival_date_day_of_month, sep = "-"), format = "%Y-%B-%d")) %>%
select(-arrival_date_year, -arrival_date_month, -arrival_date_week_number, -arrival_date_day_of_month)
# Convert the variables 'is_canceled' and 'is_repeated_guest' to logical (boolean) values, which are currently coded as integers
data <- data %>%
mutate(is_canceled = as.logical(is_canceled), is_repeated_guest = as.logical(is_repeated_guest))
# Check and print unique values of categorical variables in order to see if there are inconsistencies, spelling mistakes, or unnecessary categories
unique_values <- data %>%
select(hotel, meal, country, market_segment, distribution_channel, reserved_room_type, assigned_room_type, deposit_type, customer_type, reservation_status) %>%
sapply(unique)
print(unique_values)
# Convert categorical variables into factors for easier analysis and visualization.
data <- data %>%
mutate_at(vars(hotel, meal, country, market_segment, distribution_channel, reserved_room_type, assigned_room_type, deposit_type, customer_type, reservation_status),
factor)
# Print the first few rows of the dataset to verify that the transformations have been applied correctly
head(data)
```
First, I merged the year, month, and day of arrival into a single date column and then removed the original columns. This was done using the mutate function to create a new arrival_date column, which was the combination of arrival_date_year, arrival_date_month, and arrival_date_day_of_month. After that, I used the `select`function to remove the original date columns. I also converted the 'is_canceled' and 'is_repeated_guest' columns into logical values using the `as.logical` function as these were originally in integer format.
Next, I needed to check the unique values of categorical variables to identify any inconsistencies, spelling mistakes, or unnecessary categories. To do this, I selected certain categorical variables and applied the `sapply` and `unique` functions to get unique values for each column. After identifying and rectifying any issues, I converted these categorical variables into factors using the mutate_at function along with the factor function. This step was necessary because many functions in R treat factors differently from character strings, especially in statistical modeling and visualization. Finally, I used the `head`function to output the first few rows of the dataset, allowing me to verify that the transformations were correctly applied.