Code
library(tidyverse)
library(ggplot2)
library(dplyr)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Siddharth Goel
January 28, 2023
cols(
hotel = col_character(),
is_canceled = col_double(),
lead_time = col_double(),
arrival_date_year = col_double(),
arrival_date_month = col_character(),
arrival_date_week_number = col_double(),
arrival_date_day_of_month = col_double(),
stays_in_weekend_nights = col_double(),
stays_in_week_nights = col_double(),
adults = col_double(),
children = col_double(),
babies = col_double(),
meal = col_character(),
country = col_character(),
market_segment = col_character(),
distribution_channel = col_character(),
is_repeated_guest = col_double(),
previous_cancellations = col_double(),
previous_bookings_not_canceled = col_double(),
reserved_room_type = col_character(),
assigned_room_type = col_character(),
booking_changes = col_double(),
deposit_type = col_character(),
agent = col_character(),
company = col_character(),
days_in_waiting_list = col_double(),
customer_type = col_character(),
adr = col_double(),
required_car_parking_spaces = col_double(),
total_of_special_requests = col_double(),
reservation_status = col_character(),
reservation_status_date = col_date(format = "")
)
# A tibble: 6 × 32
hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Resort… 0 342 2015 July 27 1 0 0 2
2 Resort… 0 737 2015 July 27 1 0 0 2
3 Resort… 0 7 2015 July 27 1 0 1 1
4 Resort… 0 13 2015 July 27 1 0 1 1
5 Resort… 0 14 2015 July 27 1 0 2 2
6 Resort… 0 14 2015 July 27 1 0 2 2
# … with 22 more variables: children <dbl>, babies <dbl>, meal <chr>,
# country <chr>, market_segment <chr>, distribution_channel <chr>,
# is_repeated_guest <dbl>, previous_cancellations <dbl>,
# previous_bookings_not_canceled <dbl>, reserved_room_type <chr>,
# assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>,
# agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
[1] "Resort Hotel" "City Hotel"
[1] 2015 2016 2017
We got the descriptors of each column by using the spec
method and the top 10 rows in the data using the head
method. As we can see, this is the hotel booking data for 2 hotels over 3 years. We can also see that the dataset has 119390
total rows and 32
columns.
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
The data is not tidy as we can see a lot of NULL
and 0
values in the data in the output of the head
command. Now, we will attempt to find the percentage of these values in the data and remove the columns if the percentage is significant
# get zero percentage in data
zero_percent <- (colSums(data == 0) / nrow(data)) * 100
# get null percent in data
null_percent <- sapply(data, function(x) sum(str_detect(x, "NULL")) / length(x))
aggregated_df <- data.frame(null_percent = null_percent, zero_percent = zero_percent)
arrange(aggregated_df, desc(null_percent), desc(zero_percent))
null_percent zero_percent
company 0.943068934 0.0000000
agent 0.136862384 0.0000000
country 0.004087445 0.0000000
babies 0.000000000 99.2319290
previous_bookings_not_canceled 0.000000000 96.9679203
days_in_waiting_list 0.000000000 96.9025882
is_repeated_guest 0.000000000 96.8087780
previous_cancellations 0.000000000 94.5690594
required_car_parking_spaces 0.000000000 93.7884245
booking_changes 0.000000000 84.8597035
is_canceled 0.000000000 62.9583717
total_of_special_requests 0.000000000 58.8977301
stays_in_weekend_nights 0.000000000 43.5530614
stays_in_week_nights 0.000000000 6.4033839
lead_time 0.000000000 5.3145155
adr 0.000000000 1.6408409
adults 0.000000000 0.3375492
hotel 0.000000000 0.0000000
arrival_date_year 0.000000000 0.0000000
arrival_date_month 0.000000000 0.0000000
arrival_date_week_number 0.000000000 0.0000000
arrival_date_day_of_month 0.000000000 0.0000000
meal 0.000000000 0.0000000
market_segment 0.000000000 0.0000000
distribution_channel 0.000000000 0.0000000
reserved_room_type 0.000000000 0.0000000
assigned_room_type 0.000000000 0.0000000
deposit_type 0.000000000 0.0000000
customer_type 0.000000000 0.0000000
reservation_status 0.000000000 0.0000000
reservation_status_date 0.000000000 0.0000000
children NA NA
As we can see from the stats above, it is safe to remove columns company
and babies
due to the high percentage of insignificant "NULL"
and 0
values.
# A tibble: 6 × 30
hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Resort… 0 342 2015 July 27 1 0 0 2
2 Resort… 0 737 2015 July 27 1 0 0 2
3 Resort… 0 7 2015 July 27 1 0 1 1
4 Resort… 0 13 2015 July 27 1 0 1 1
5 Resort… 0 14 2015 July 27 1 0 2 2
6 Resort… 0 14 2015 July 27 1 0 2 2
# … with 20 more variables: children <dbl>, meal <chr>, country <chr>,
# market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
# total_of_special_requests <dbl>, reservation_status <chr>, …
Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Even after we removed columns from the data, there still remain some columns which have NULL and 0 values which can be hard to intercept. NULL values in the agent
column and binary values in the is_canceled
and is_repeated_guest
columns can be confusing to interpret
Mutations: - NULL values in agent
to NO AGENT
- 0 values in is_canceled
to NO
- 1 values in is_canceled
to YES
- 0 values in is_repeated_guest
to NO
- 1 values in is_repeated_guest
to YES
mutated_data <- filtered_data %>% mutate(agent = str_replace(agent, "NULL", "NO AGENT"))
mutated_data$is_canceled <- as.character(mutated_data$is_canceled)
mutated_data <- mutated_data %>% mutate(is_canceled = str_replace(is_canceled, "0", "NO"))
mutated_data <- mutated_data %>% mutate(is_canceled = str_replace(is_canceled, "1", "YES"))
mutated_data <- mutated_data %>% mutate(is_repeated_guest = str_replace(is_repeated_guest, "0", "NO"))
mutated_data <- mutated_data %>% mutate(is_repeated_guest = str_replace(is_repeated_guest, "1", "YES"))
unique(mutated_data$is_canceled)
[1] "NO" "YES"
[1] "NO" "YES"
yes_percent
is_canceled 0.37041628
is_repeated_guest 0.03191222
hotel 0.00000000
lead_time 0.00000000
arrival_date_year 0.00000000
arrival_date_month 0.00000000
arrival_date_week_number 0.00000000
arrival_date_day_of_month 0.00000000
stays_in_weekend_nights 0.00000000
stays_in_week_nights 0.00000000
adults 0.00000000
meal 0.00000000
country 0.00000000
market_segment 0.00000000
distribution_channel 0.00000000
previous_cancellations 0.00000000
previous_bookings_not_canceled 0.00000000
reserved_room_type 0.00000000
assigned_room_type 0.00000000
booking_changes 0.00000000
deposit_type 0.00000000
agent 0.00000000
days_in_waiting_list 0.00000000
customer_type 0.00000000
adr 0.00000000
required_car_parking_spaces 0.00000000
total_of_special_requests 0.00000000
reservation_status 0.00000000
reservation_status_date 0.00000000
children NA
As we can see above, the final dataframe mutated_data
contains data in a very clean and understandable format as compared to the original data. Moreover, as a sanity check, it has been verified that the percentages of YES
and NO
in the final dataframe match the percentages of 0
and 1
in the original dataframe for columns is_canceled
and is_repeated_guest
---
title: "Challenge 4"
author: "Siddharth Goel"
description: "More data wrangling: mutate"
date: "01/28/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_4
- hotel_bookings.csv
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(ggplot2)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Read in data
```{r}
data = read_csv("_data/hotel_bookings.csv")
```
### Briefly describe the data
```{r}
# looking at the schema
spec(data)
# looking at the data values
head(data)
unique(data$hotel)
unique(data$arrival_date_year)
```
We got the descriptors of each column by using the `spec` method and the top 10 rows in the data using the `head` method. As we can see, this is the hotel booking data for 2 hotels over 3 years.
We can also see that the dataset has `119390` total rows and `32` columns.
## Tidy Data (as needed)
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
The data is not tidy as we can see a lot of `NULL` and `0` values in the data in the output of the `head` command. Now, we will attempt to find the percentage of these values in the data and remove the columns if the percentage is significant
```{r}
# get zero percentage in data
zero_percent <- (colSums(data == 0) / nrow(data)) * 100
# get null percent in data
null_percent <- sapply(data, function(x) sum(str_detect(x, "NULL")) / length(x))
aggregated_df <- data.frame(null_percent = null_percent, zero_percent = zero_percent)
arrange(aggregated_df, desc(null_percent), desc(zero_percent))
```
As we can see from the stats above, it is safe to remove columns `company` and `babies` due to the high percentage of insignificant `"NULL"` and `0` values.
```{r}
filtered_data = select(data, -company, -babies)
head(filtered_data)
```
## Identify variables that need to be mutated
Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Even after we removed columns from the data, there still remain some columns which have NULL and 0 values which can be hard to intercept. NULL values in the `agent` column and binary values in the `is_canceled` and `is_repeated_guest` columns can be confusing to interpret
Mutations:
- NULL values in `agent` to `NO AGENT`
- 0 values in `is_canceled` to `NO`
- 1 values in `is_canceled` to `YES`
- 0 values in `is_repeated_guest` to `NO`
- 1 values in `is_repeated_guest` to `YES`
```{r}
mutated_data <- filtered_data %>% mutate(agent = str_replace(agent, "NULL", "NO AGENT"))
mutated_data$is_canceled <- as.character(mutated_data$is_canceled)
mutated_data <- mutated_data %>% mutate(is_canceled = str_replace(is_canceled, "0", "NO"))
mutated_data <- mutated_data %>% mutate(is_canceled = str_replace(is_canceled, "1", "YES"))
mutated_data <- mutated_data %>% mutate(is_repeated_guest = str_replace(is_repeated_guest, "0", "NO"))
mutated_data <- mutated_data %>% mutate(is_repeated_guest = str_replace(is_repeated_guest, "1", "YES"))
unique(mutated_data$is_canceled)
unique(mutated_data$is_repeated_guest)
yes_percent <- sapply(mutated_data, function(x) sum(str_detect(x, "YES")) / length(x))
aggregated_df_mutated <- data.frame(yes_percent = yes_percent)
arrange(aggregated_df_mutated, desc(yes_percent))
```
As we can see above, the final dataframe `mutated_data` contains data in a very clean and understandable format as compared to the original data. Moreover, as a sanity check, it has been verified that the percentages of `YES` and `NO` in the final dataframe match the percentages of `0` and `1` in the original dataframe for columns `is_canceled` and `is_repeated_guest`