Challenge 4

challenge_4

hotel_bookings.csv

More data wrangling: mutate

Author

Siddharth Goel

Published

January 28, 2023

Code

library(tidyverse)
library(ggplot2)
library(dplyr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

Code

data = read_csv("_data/hotel_bookings.csv")

Briefly describe the data

Code

# looking at the schema
spec(data)

cols(
  hotel = col_character(),
  is_canceled = col_double(),
  lead_time = col_double(),
  arrival_date_year = col_double(),
  arrival_date_month = col_character(),
  arrival_date_week_number = col_double(),
  arrival_date_day_of_month = col_double(),
  stays_in_weekend_nights = col_double(),
  stays_in_week_nights = col_double(),
  adults = col_double(),
  children = col_double(),
  babies = col_double(),
  meal = col_character(),
  country = col_character(),
  market_segment = col_character(),
  distribution_channel = col_character(),
  is_repeated_guest = col_double(),
  previous_cancellations = col_double(),
  previous_bookings_not_canceled = col_double(),
  reserved_room_type = col_character(),
  assigned_room_type = col_character(),
  booking_changes = col_double(),
  deposit_type = col_character(),
  agent = col_character(),
  company = col_character(),
  days_in_waiting_list = col_double(),
  customer_type = col_character(),
  adr = col_double(),
  required_car_parking_spaces = col_double(),
  total_of_special_requests = col_double(),
  reservation_status = col_character(),
  reservation_status_date = col_date(format = "")
)

Code

# looking at the data values
head(data)

# A tibble: 6 × 32
  hotel   is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
  <chr>     <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
1 Resort…       0     342    2015 July         27       1       0       0      2
2 Resort…       0     737    2015 July         27       1       0       0      2
3 Resort…       0       7    2015 July         27       1       0       1      1
4 Resort…       0      13    2015 July         27       1       0       1      1
5 Resort…       0      14    2015 July         27       1       0       2      2
6 Resort…       0      14    2015 July         27       1       0       2      2
# … with 22 more variables: children <dbl>, babies <dbl>, meal <chr>,
#   country <chr>, market_segment <chr>, distribution_channel <chr>,
#   is_repeated_guest <dbl>, previous_cancellations <dbl>,
#   previous_bookings_not_canceled <dbl>, reserved_room_type <chr>,
#   assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>,
#   agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …

Code

unique(data$hotel)

[1] "Resort Hotel" "City Hotel"

Code

unique(data$arrival_date_year)

[1] 2015 2016 2017

We got the descriptors of each column by using the spec method and the top 10 rows in the data using the head method. As we can see, this is the hotel booking data for 2 hotels over 3 years. We can also see that the dataset has 119390 total rows and 32 columns.

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

The data is not tidy as we can see a lot of NULL and 0 values in the data in the output of the head command. Now, we will attempt to find the percentage of these values in the data and remove the columns if the percentage is significant

Code

# get zero percentage in data
zero_percent <- (colSums(data == 0) / nrow(data)) * 100
# get null percent in data
null_percent <- sapply(data, function(x) sum(str_detect(x, "NULL")) / length(x))

aggregated_df <- data.frame(null_percent = null_percent, zero_percent = zero_percent)

arrange(aggregated_df, desc(null_percent), desc(zero_percent))

                               null_percent zero_percent
company                         0.943068934    0.0000000
agent                           0.136862384    0.0000000
country                         0.004087445    0.0000000
babies                          0.000000000   99.2319290
previous_bookings_not_canceled  0.000000000   96.9679203
days_in_waiting_list            0.000000000   96.9025882
is_repeated_guest               0.000000000   96.8087780
previous_cancellations          0.000000000   94.5690594
required_car_parking_spaces     0.000000000   93.7884245
booking_changes                 0.000000000   84.8597035
is_canceled                     0.000000000   62.9583717
total_of_special_requests       0.000000000   58.8977301
stays_in_weekend_nights         0.000000000   43.5530614
stays_in_week_nights            0.000000000    6.4033839
lead_time                       0.000000000    5.3145155
adr                             0.000000000    1.6408409
adults                          0.000000000    0.3375492
hotel                           0.000000000    0.0000000
arrival_date_year               0.000000000    0.0000000
arrival_date_month              0.000000000    0.0000000
arrival_date_week_number        0.000000000    0.0000000
arrival_date_day_of_month       0.000000000    0.0000000
meal                            0.000000000    0.0000000
market_segment                  0.000000000    0.0000000
distribution_channel            0.000000000    0.0000000
reserved_room_type              0.000000000    0.0000000
assigned_room_type              0.000000000    0.0000000
deposit_type                    0.000000000    0.0000000
customer_type                   0.000000000    0.0000000
reservation_status              0.000000000    0.0000000
reservation_status_date         0.000000000    0.0000000
children                                 NA           NA

As we can see from the stats above, it is safe to remove columns company and babies due to the high percentage of insignificant "NULL" and 0 values.

Code

filtered_data = select(data, -company, -babies)
head(filtered_data)

# A tibble: 6 × 30
  hotel   is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
  <chr>     <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
1 Resort…       0     342    2015 July         27       1       0       0      2
2 Resort…       0     737    2015 July         27       1       0       0      2
3 Resort…       0       7    2015 July         27       1       0       1      1
4 Resort…       0      13    2015 July         27       1       0       1      1
5 Resort…       0      14    2015 July         27       1       0       2      2
6 Resort…       0      14    2015 July         27       1       0       2      2
# … with 20 more variables: children <dbl>, meal <chr>, country <chr>,
#   market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
#   total_of_special_requests <dbl>, reservation_status <chr>, …

Identify variables that need to be mutated

Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

Even after we removed columns from the data, there still remain some columns which have NULL and 0 values which can be hard to intercept. NULL values in the agent column and binary values in the is_canceled and is_repeated_guest columns can be confusing to interpret

Mutations: - NULL values in agent to NO AGENT - 0 values in is_canceled to NO - 1 values in is_canceled to YES - 0 values in is_repeated_guest to NO - 1 values in is_repeated_guest to YES

Code

mutated_data <- filtered_data %>% mutate(agent = str_replace(agent, "NULL", "NO AGENT"))
mutated_data$is_canceled <- as.character(mutated_data$is_canceled)
mutated_data <- mutated_data %>% mutate(is_canceled = str_replace(is_canceled, "0", "NO"))
mutated_data <- mutated_data %>% mutate(is_canceled = str_replace(is_canceled, "1", "YES"))
mutated_data <- mutated_data %>% mutate(is_repeated_guest = str_replace(is_repeated_guest, "0", "NO"))
mutated_data <- mutated_data %>% mutate(is_repeated_guest = str_replace(is_repeated_guest, "1", "YES"))

unique(mutated_data$is_canceled)

[1] "NO"  "YES"

Code

unique(mutated_data$is_repeated_guest)

[1] "NO"  "YES"

Code

yes_percent <- sapply(mutated_data, function(x) sum(str_detect(x, "YES")) / length(x))
aggregated_df_mutated <- data.frame(yes_percent = yes_percent)
arrange(aggregated_df_mutated, desc(yes_percent))

                               yes_percent
is_canceled                     0.37041628
is_repeated_guest               0.03191222
hotel                           0.00000000
lead_time                       0.00000000
arrival_date_year               0.00000000
arrival_date_month              0.00000000
arrival_date_week_number        0.00000000
arrival_date_day_of_month       0.00000000
stays_in_weekend_nights         0.00000000
stays_in_week_nights            0.00000000
adults                          0.00000000
meal                            0.00000000
country                         0.00000000
market_segment                  0.00000000
distribution_channel            0.00000000
previous_cancellations          0.00000000
previous_bookings_not_canceled  0.00000000
reserved_room_type              0.00000000
assigned_room_type              0.00000000
booking_changes                 0.00000000
deposit_type                    0.00000000
agent                           0.00000000
days_in_waiting_list            0.00000000
customer_type                   0.00000000
adr                             0.00000000
required_car_parking_spaces     0.00000000
total_of_special_requests       0.00000000
reservation_status              0.00000000
reservation_status_date         0.00000000
children                                NA

As we can see above, the final dataframe mutated_data contains data in a very clean and understandable format as compared to the original data. Moreover, as a sanity check, it has been verified that the percentages of YES and NO in the final dataframe match the percentages of 0 and 1 in the original dataframe for columns is_canceled and is_repeated_guest