Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Srujan Kagitala
June 26, 2023
Today’s challenge is to:
Read in one (or more) of the following datasets, using the correct R package and command.
hotel is_canceled lead_time arrival_date_year arrival_date_month
1 Resort Hotel 0 342 2015 July
2 Resort Hotel 0 737 2015 July
3 Resort Hotel 0 7 2015 July
4 Resort Hotel 0 13 2015 July
5 Resort Hotel 0 14 2015 July
6 Resort Hotel 0 14 2015 July
arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
1 27 1 0
2 27 1 0
3 27 1 0
4 27 1 0
5 27 1 0
6 27 1 0
stays_in_week_nights adults children babies meal country market_segment
1 0 2 0 0 BB PRT Direct
2 0 2 0 0 BB PRT Direct
3 1 1 0 0 BB GBR Direct
4 1 1 0 0 BB GBR Corporate
5 2 2 0 0 BB GBR Online TA
6 2 2 0 0 BB GBR Online TA
distribution_channel is_repeated_guest previous_cancellations
1 Direct 0 0
2 Direct 0 0
3 Direct 0 0
4 Corporate 0 0
5 TA/TO 0 0
6 TA/TO 0 0
previous_bookings_not_canceled reserved_room_type assigned_room_type
1 0 C C
2 0 C C
3 0 A C
4 0 A A
5 0 A A
6 0 A A
booking_changes deposit_type agent company days_in_waiting_list customer_type
1 3 No Deposit NULL NULL 0 Transient
2 4 No Deposit NULL NULL 0 Transient
3 0 No Deposit NULL NULL 0 Transient
4 0 No Deposit 304 NULL 0 Transient
5 0 No Deposit 240 NULL 0 Transient
6 0 No Deposit 240 NULL 0 Transient
adr required_car_parking_spaces total_of_special_requests reservation_status
1 0 0 0 Check-Out
2 0 0 0 Check-Out
3 75 0 0 Check-Out
4 75 0 0 Check-Out
5 98 0 1 Check-Out
6 98 0 1 Check-Out
reservation_status_date
1 2015-07-01
2 2015-07-01
3 2015-07-02
4 2015-07-02
5 2015-07-03
6 2015-07-03
This data set contains booking data of City Hotel, Resort Hotel. It contains 119390 bookings information. 79330 and 40060 observations are respectively captured from City Hotel and Resort Hotel. The information available per booking are hotel, is_canceled, lead_time, arrival_date_year, arrival_date_month, arrival_date_week_number, arrival_date_day_of_month, stays_in_weekend_nights, stays_in_week_nights, adults, children, babies, meal, country, market_segment, distribution_channel, is_repeated_guest, previous_cancellations, previous_bookings_not_canceled, reserved_room_type, assigned_room_type, booking_changes, deposit_type, agent, company, days_in_waiting_list, customer_type, adr, required_car_parking_spaces, total_of_special_requests, reservation_status, reservation_status_date.
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
hotel is_canceled lead_time arrival_date_year
Length:119390 Min. :0.0000 Min. : 0 Min. :2015
Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
Mode :character Median :0.0000 Median : 69 Median :2016
Mean :0.3704 Mean :104 Mean :2016
3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
Max. :1.0000 Max. :737 Max. :2017
arrival_date_month arrival_date_week_number arrival_date_day_of_month
Length:119390 Min. : 1.00 Min. : 1.0
Class :character 1st Qu.:16.00 1st Qu.: 8.0
Mode :character Median :28.00 Median :16.0
Mean :27.17 Mean :15.8
3rd Qu.:38.00 3rd Qu.:23.0
Max. :53.00 Max. :31.0
stays_in_weekend_nights stays_in_week_nights adults
Min. : 0.0000 Min. : 0.0 Min. : 0.000
1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
Median : 1.0000 Median : 2.0 Median : 2.000
Mean : 0.9276 Mean : 2.5 Mean : 1.856
3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
Max. :19.0000 Max. :50.0 Max. :55.000
children babies meal country
Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
Median : 0.0000 Median : 0.000000 Mode :character Mode :character
Mean : 0.1039 Mean : 0.007949
3rd Qu.: 0.0000 3rd Qu.: 0.000000
Max. :10.0000 Max. :10.000000
NA's :4
market_segment distribution_channel is_repeated_guest
Length:119390 Length:119390 Min. :0.00000
Class :character Class :character 1st Qu.:0.00000
Mode :character Mode :character Median :0.00000
Mean :0.03191
3rd Qu.:0.00000
Max. :1.00000
previous_cancellations previous_bookings_not_canceled reserved_room_type
Min. : 0.00000 Min. : 0.0000 Length:119390
1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
Median : 0.00000 Median : 0.0000 Mode :character
Mean : 0.08712 Mean : 0.1371
3rd Qu.: 0.00000 3rd Qu.: 0.0000
Max. :26.00000 Max. :72.0000
assigned_room_type booking_changes deposit_type agent
Length:119390 Min. : 0.0000 Length:119390 Length:119390
Class :character 1st Qu.: 0.0000 Class :character Class :character
Mode :character Median : 0.0000 Mode :character Mode :character
Mean : 0.2211
3rd Qu.: 0.0000
Max. :21.0000
company days_in_waiting_list customer_type adr
Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
Mode :character Median : 0.000 Mode :character Median : 94.58
Mean : 2.321 Mean : 101.83
3rd Qu.: 0.000 3rd Qu.: 126.00
Max. :391.000 Max. :5400.00
required_car_parking_spaces total_of_special_requests reservation_status
Min. :0.00000 Min. :0.0000 Length:119390
1st Qu.:0.00000 1st Qu.:0.0000 Class :character
Median :0.00000 Median :0.0000 Mode :character
Mean :0.06252 Mean :0.5714
3rd Qu.:0.00000 3rd Qu.:1.0000
Max. :8.00000 Max. :5.0000
reservation_status_date
Length:119390
Class :character
Mode :character
[1] "hotel" "is_canceled"
[3] "lead_time" "arrival_date_year"
[5] "arrival_date_month" "arrival_date_week_number"
[7] "arrival_date_day_of_month" "stays_in_weekend_nights"
[9] "stays_in_week_nights" "adults"
[11] "children" "babies"
[13] "meal" "country"
[15] "market_segment" "distribution_channel"
[17] "is_repeated_guest" "previous_cancellations"
[19] "previous_bookings_not_canceled" "reserved_room_type"
[21] "assigned_room_type" "booking_changes"
[23] "deposit_type" "agent"
[25] "company" "days_in_waiting_list"
[27] "customer_type" "adr"
[29] "required_car_parking_spaces" "total_of_special_requests"
[31] "reservation_status" "reservation_status_date"
We can see from summary that all character variables don’t give any summary of the data. They have to be converted to factor to get valuable insights from summary. We can drop ‘arrival_date_week_number’ variable because this is redundant information. If we want to do some analysis to find effective marketing strategies,we need to drop rows with NA values(Undefined) for distribution_channel and market_segment variables. After all this pre processing, we will be left with observations less than or equal to 119390 and 31 columns.
# Drop rows with NA values for market_segment, distribution_channel.
# Convert character variables to factor.
# Drop arrival_date_week_number variable.
clean_booking_data <- booking_data %>%
filter(distribution_channel != "Undefined",
market_segment != "Undefined") %>%
mutate_if(is.character, as.factor) %>%
select (-arrival_date_week_number)
summary(clean_booking_data)
hotel is_canceled lead_time arrival_date_year
City Hotel :79326 Min. :0.0000 Min. : 0 Min. :2015
Resort Hotel:40059 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
Median :0.0000 Median : 69 Median :2016
Mean :0.3704 Mean :104 Mean :2016
3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
Max. :1.0000 Max. :737 Max. :2017
arrival_date_month arrival_date_day_of_month stays_in_weekend_nights
August :13873 Min. : 1.0 Min. : 0.0000
July :12660 1st Qu.: 8.0 1st Qu.: 0.0000
May :11791 Median :16.0 Median : 1.0000
October:11160 Mean :15.8 Mean : 0.9276
April :11089 3rd Qu.:23.0 3rd Qu.: 2.0000
June :10939 Max. :31.0 Max. :19.0000
(Other):47873
stays_in_week_nights adults children babies
Min. : 0.0 Min. : 0.000 Min. : 0.0000 Min. : 0.000000
1st Qu.: 1.0 1st Qu.: 2.000 1st Qu.: 0.0000 1st Qu.: 0.000000
Median : 2.0 Median : 2.000 Median : 0.0000 Median : 0.000000
Mean : 2.5 Mean : 1.856 Mean : 0.1039 Mean : 0.007949
3rd Qu.: 3.0 3rd Qu.: 2.000 3rd Qu.: 0.0000 3rd Qu.: 0.000000
Max. :50.0 Max. :55.000 Max. :10.0000 Max. :10.000000
meal country market_segment distribution_channel
BB :92306 PRT :48585 Aviation : 237 Corporate: 6677
FB : 798 GBR :12129 Complementary: 743 Direct :14645
HB :14462 FRA :10415 Corporate : 5295 GDS : 193
SC :10650 ESP : 8568 Direct :12604 TA/TO :97870
Undefined: 1169 DEU : 7287 Groups :19811
ITA : 3766 Offline TA/TO:24219
(Other):28635 Online TA :56476
is_repeated_guest previous_cancellations previous_bookings_not_canceled
Min. :0.00000 Min. : 0.00000 Min. : 0.0000
1st Qu.:0.00000 1st Qu.: 0.00000 1st Qu.: 0.0000
Median :0.00000 Median : 0.00000 Median : 0.0000
Mean :0.03191 Mean : 0.08712 Mean : 0.1371
3rd Qu.:0.00000 3rd Qu.: 0.00000 3rd Qu.: 0.0000
Max. :1.00000 Max. :26.00000 Max. :72.0000
reserved_room_type assigned_room_type booking_changes deposit_type
A :85993 A :74052 Min. : 0.0000 No Deposit:104636
D :19201 D :25322 1st Qu.: 0.0000 Non Refund: 14587
E : 6535 E : 7806 Median : 0.0000 Refundable: 162
F : 2897 F : 3751 Mean : 0.2211
G : 2094 G : 2553 3rd Qu.: 0.0000
B : 1114 C : 2375 Max. :21.0000
(Other): 1551 (Other): 3526
agent company days_in_waiting_list customer_type
9 :31960 NULL :112588 Min. : 0.000 Contract : 4076
NULL :16337 40 : 927 1st Qu.: 0.000 Group : 577
240 :13922 223 : 784 Median : 0.000 Transient :89612
1 : 7191 67 : 267 Mean : 2.321 Transient-Party:25120
14 : 3639 45 : 250 3rd Qu.: 0.000
7 : 3539 153 : 215 Max. :391.000
(Other):42797 (Other): 4354
adr required_car_parking_spaces total_of_special_requests
Min. : -6.38 Min. :0.00000 Min. :0.0000
1st Qu.: 69.29 1st Qu.:0.00000 1st Qu.:0.0000
Median : 94.59 Median :0.00000 Median :0.0000
Mean : 101.83 Mean :0.06251 Mean :0.5713
3rd Qu.: 126.00 3rd Qu.:0.00000 3rd Qu.:1.0000
Max. :5400.00 Max. :8.00000 Max. :5.0000
reservation_status reservation_status_date
Canceled :43013 2015-10-21: 1461
Check-Out:75165 2015-07-06: 805
No-Show : 1207 2016-11-25: 790
2015-01-01: 763
2016-01-18: 625
2015-07-02: 469
(Other) :114472
So, we have 119385 observations and 31 columns after cleaning the data.
Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
There are three independent variables for arrival_date_year, arrival_date_month, arrival_date_day_of_month which is unnecessary. Hence, they can be combined to one variable arrival_date. Also, variables like is_canceled, is_repeated_guest have to be converted to Boolean instead of integer for readability. We will be left with 29 columns after this step.
# combine day, month and year columns to date column.
# Remove arrival_date_week_number_column.
booking_data_tidy <- clean_booking_data %>%
mutate(arrival_date = as.Date(paste(arrival_date_month,
arrival_date_day_of_month,
arrival_date_year, sep = "-"),
format="%B-%d-%Y"),
is_repeated_guest = as.logical(is_repeated_guest),
is_canceled = as.logical(is_canceled)) %>%
select(-arrival_date_month,
-arrival_date_day_of_month,
-arrival_date_year)
head(booking_data_tidy)
hotel is_canceled lead_time stays_in_weekend_nights
1 Resort Hotel FALSE 342 0
2 Resort Hotel FALSE 737 0
3 Resort Hotel FALSE 7 0
4 Resort Hotel FALSE 13 0
5 Resort Hotel FALSE 14 0
6 Resort Hotel FALSE 14 0
stays_in_week_nights adults children babies meal country market_segment
1 0 2 0 0 BB PRT Direct
2 0 2 0 0 BB PRT Direct
3 1 1 0 0 BB GBR Direct
4 1 1 0 0 BB GBR Corporate
5 2 2 0 0 BB GBR Online TA
6 2 2 0 0 BB GBR Online TA
distribution_channel is_repeated_guest previous_cancellations
1 Direct FALSE 0
2 Direct FALSE 0
3 Direct FALSE 0
4 Corporate FALSE 0
5 TA/TO FALSE 0
6 TA/TO FALSE 0
previous_bookings_not_canceled reserved_room_type assigned_room_type
1 0 C C
2 0 C C
3 0 A C
4 0 A A
5 0 A A
6 0 A A
booking_changes deposit_type agent company days_in_waiting_list customer_type
1 3 No Deposit NULL NULL 0 Transient
2 4 No Deposit NULL NULL 0 Transient
3 0 No Deposit NULL NULL 0 Transient
4 0 No Deposit 304 NULL 0 Transient
5 0 No Deposit 240 NULL 0 Transient
6 0 No Deposit 240 NULL 0 Transient
adr required_car_parking_spaces total_of_special_requests reservation_status
1 0 0 0 Check-Out
2 0 0 0 Check-Out
3 75 0 0 Check-Out
4 75 0 0 Check-Out
5 98 0 1 Check-Out
6 98 0 1 Check-Out
reservation_status_date arrival_date
1 2015-07-01 2015-07-01
2 2015-07-01 2015-07-01
3 2015-07-02 2015-07-01
4 2015-07-02 2015-07-01
5 2015-07-03 2015-07-01
6 2015-07-03 2015-07-01
So, we have 119385 observations and 29 columns after combing all the date related variables to “arrival_date” variable.
# A tibble: 7 × 3
# Groups: hotel [2]
hotel distribution_channel count
<fct> <fct> <int>
1 City Hotel Corporate 3408
2 City Hotel Direct 6780
3 City Hotel GDS 193
4 City Hotel TA/TO 68945
5 Resort Hotel Corporate 3269
6 Resort Hotel Direct 7865
7 Resort Hotel TA/TO 28925
# A tibble: 13 × 3
# Groups: hotel [2]
hotel market_segment count
<fct> <fct> <int>
1 City Hotel Aviation 237
2 City Hotel Complementary 542
3 City Hotel Corporate 2986
4 City Hotel Direct 6092
5 City Hotel Groups 13975
6 City Hotel Offline TA/TO 16747
7 City Hotel Online TA 38747
8 Resort Hotel Complementary 201
9 Resort Hotel Corporate 2309
10 Resort Hotel Direct 6512
11 Resort Hotel Groups 5836
12 Resort Hotel Offline TA/TO 7472
13 Resort Hotel Online TA 17729
---
title: "Challenge 4 - Hotel Bookings"
author: "Srujan Kagitala"
description: "More data wrangling: pivoting"
date: "6/26/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_4
- hotel_bookings
- srujan_kagitala
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1) read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2) tidy data (as needed, including sanity checks)
3) identify variables that need to be mutated
4) mutate variables and sanity check all mutations
## Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- hotel_bookings.csv⭐⭐⭐⭐
```{r}
#| label: read
booking_data <- read.csv("_data/hotel_bookings.csv")
head(booking_data)
```
### Briefly describe the data
```{r}
#| label: describe
#Number of bookings per hotel.
no_bookings_by_hotel <- booking_data %>%
group_by(hotel) %>%
summarise(count = n())
```
This data set contains booking data of `r unique(no_bookings_by_hotel$hotel)`. It contains `r nrow(booking_data)` bookings information. `r no_bookings_by_hotel$count[1]` and `r no_bookings_by_hotel$count[2]` observations are respectively captured from `r no_bookings_by_hotel$hotel[1]` and `r no_bookings_by_hotel$hotel[2]`. The information available per booking are `r names(booking_data)`.
## Tidy Data (as needed)
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
```{r}
#| label: summary_pre
#Overall summary of data across variables before converting character to factor.
summary(booking_data)
names(booking_data)
```
We can see from summary that all character variables don't give any summary of the data. They have to be converted to factor to get valuable insights from summary. We can drop 'arrival_date_week_number' variable because this is redundant information. If we want to do some analysis to find effective marketing strategies,we need to drop rows with NA values(Undefined) for distribution_channel and market_segment variables. After all this pre processing, we will be left with observations less than or equal to `r nrow(booking_data)` and `r ncol(booking_data)-1` columns.
```{r}
#| label: clean
# Drop rows with NA values for market_segment, distribution_channel.
# Convert character variables to factor.
# Drop arrival_date_week_number variable.
clean_booking_data <- booking_data %>%
filter(distribution_channel != "Undefined",
market_segment != "Undefined") %>%
mutate_if(is.character, as.factor) %>%
select (-arrival_date_week_number)
summary(clean_booking_data)
```
So, we have `r nrow(clean_booking_data)` observations and `r ncol(clean_booking_data)` columns after cleaning the data.
## Identify variables that need to be mutated
Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
There are three independent variables for arrival_date_year, arrival_date_month, arrival_date_day_of_month which is unnecessary. Hence, they can be combined to one variable arrival_date. Also, variables like is_canceled, is_repeated_guest have to be converted to Boolean instead of integer for readability. We will be left with `r ncol(clean_booking_data)-2` columns after this step.
```{r}
#| label: tidy
# combine day, month and year columns to date column.
# Remove arrival_date_week_number_column.
booking_data_tidy <- clean_booking_data %>%
mutate(arrival_date = as.Date(paste(arrival_date_month,
arrival_date_day_of_month,
arrival_date_year, sep = "-"),
format="%B-%d-%Y"),
is_repeated_guest = as.logical(is_repeated_guest),
is_canceled = as.logical(is_canceled)) %>%
select(-arrival_date_month,
-arrival_date_day_of_month,
-arrival_date_year)
head(booking_data_tidy)
```
So, we have `r nrow(booking_data_tidy)` observations and `r ncol(booking_data_tidy)` columns after combing all the date related variables to "arrival_date" variable.
```{r}
#| label: analysis
# Hotel wise frequency of distribution_channel
booking_data_tidy %>% group_by(hotel, distribution_channel) %>% summarise(count=n())
# Hotel wise frequency of market_segment
booking_data_tidy %>% group_by(hotel, market_segment) %>% summarise(count=n())
```