Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Srujan Kagitala
June 26, 2023
Today’s challenge is to
Read in one (or more) of the following data sets, available in the posts/_data
folder, using the correct R package and command.
hotel is_canceled lead_time arrival_date_year arrival_date_month
1 Resort Hotel 0 342 2015 July
2 Resort Hotel 0 737 2015 July
3 Resort Hotel 0 7 2015 July
4 Resort Hotel 0 13 2015 July
5 Resort Hotel 0 14 2015 July
6 Resort Hotel 0 14 2015 July
arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
1 27 1 0
2 27 1 0
3 27 1 0
4 27 1 0
5 27 1 0
6 27 1 0
stays_in_week_nights adults children babies meal country market_segment
1 0 2 0 0 BB PRT Direct
2 0 2 0 0 BB PRT Direct
3 1 1 0 0 BB GBR Direct
4 1 1 0 0 BB GBR Corporate
5 2 2 0 0 BB GBR Online TA
6 2 2 0 0 BB GBR Online TA
distribution_channel is_repeated_guest previous_cancellations
1 Direct 0 0
2 Direct 0 0
3 Direct 0 0
4 Corporate 0 0
5 TA/TO 0 0
6 TA/TO 0 0
previous_bookings_not_canceled reserved_room_type assigned_room_type
1 0 C C
2 0 C C
3 0 A C
4 0 A A
5 0 A A
6 0 A A
booking_changes deposit_type agent company days_in_waiting_list customer_type
1 3 No Deposit NULL NULL 0 Transient
2 4 No Deposit NULL NULL 0 Transient
3 0 No Deposit NULL NULL 0 Transient
4 0 No Deposit 304 NULL 0 Transient
5 0 No Deposit 240 NULL 0 Transient
6 0 No Deposit 240 NULL 0 Transient
adr required_car_parking_spaces total_of_special_requests reservation_status
1 0 0 0 Check-Out
2 0 0 0 Check-Out
3 75 0 0 Check-Out
4 75 0 0 Check-Out
5 98 0 1 Check-Out
6 98 0 1 Check-Out
reservation_status_date
1 2015-07-01
2 2015-07-01
3 2015-07-02
4 2015-07-02
5 2015-07-03
6 2015-07-03
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
This data set contains booking data of City Hotel, Resort Hotel. It contains 119390 bookings information. 79330 and 40060 observations are respectively captured from City Hotel and Resort Hotel. The information available per booking are hotel, is_canceled, lead_time, arrival_date_year, arrival_date_month, arrival_date_week_number, arrival_date_day_of_month, stays_in_weekend_nights, stays_in_week_nights, adults, children, babies, meal, country, market_segment, distribution_channel, is_repeated_guest, previous_cancellations, previous_bookings_not_canceled, reserved_room_type, assigned_room_type, booking_changes, deposit_type, agent, company, days_in_waiting_list, customer_type, adr, required_car_parking_spaces, total_of_special_requests, reservation_status, reservation_status_date.
Conduct some exploratory data analysis, using dplyr commands such as group_by()
, select()
, filter()
, and summarise()
. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
# A tibble: 12 × 4
# Groups: hotel, arrival_date_year [6]
hotel arrival_date_year is_canceled no_bookings
<chr> <int> <int> <int>
1 City Hotel 2015 0 7678
2 City Hotel 2015 1 6004
3 City Hotel 2016 0 22733
4 City Hotel 2016 1 15407
5 City Hotel 2017 0 15817
6 City Hotel 2017 1 11691
7 Resort Hotel 2015 0 6176
8 Resort Hotel 2015 1 2138
9 Resort Hotel 2016 0 13637
10 Resort Hotel 2016 1 4930
11 Resort Hotel 2017 0 9125
12 Resort Hotel 2017 1 4054
total_hotel_bookings_by_year <- hotel_booking_by_year %>%
group_by(hotel, arrival_date_year) %>%
summarise(no_bookings = sum(no_bookings))
ggplot(data = total_hotel_bookings_by_year, aes(x= arrival_date_year, y=no_bookings)) +
geom_bar(stat = "identity", mapping = aes(fill = hotel), position = "dodge") +
ggtitle("Total bookings across hotels") +
ylab("bookings") +
xlab("year")
cancelled_bookings_by_year <- hotel_booking_by_year %>%
filter(is_canceled == 1) %>%
select(hotel, arrival_date_year, no_bookings)
ggplot(data = cancelled_bookings_by_year, aes(x= arrival_date_year, y=no_bookings)) +
geom_bar(stat = "identity", mapping = aes(fill = hotel), position = "dodge") +
ggtitle("Cancelled bookings across hotels") +
ylab("bookings") +
xlab("year")
not_cancelled_bookings_by_year <- hotel_booking_by_year %>%
filter(is_canceled == 0) %>%
select(hotel, arrival_date_year, no_bookings)
ggplot(data = not_cancelled_bookings_by_year, aes(x= arrival_date_year, y=no_bookings)) +
geom_bar(stat = "identity", mapping = aes(fill = hotel), position = "dodge") +
ggtitle("Active bookings across hotels") +
ylab("bookings") +
xlab("year")
# A tibble: 2 × 2
hotel no_bookings
<chr> <int>
1 City Hotel 46228
2 Resort Hotel 28938
# Mean, median and standard deviation of night stays
# during the week and weekends per hotel
(stay_pattern <- booking_data %>%
group_by(hotel) %>%
summarise(mean_stay_week_nights = mean(stays_in_week_nights),
median_stay_week_nights = median(stays_in_week_nights),
sd_stay_week_nights = sd(stays_in_week_nights),
mean_stay_weekend_nights = mean(stays_in_weekend_nights),
median_stay_weekend_nights = median(stays_in_weekend_nights),
sd_stay_weekend_nights = sd(stays_in_weekend_nights)))
# A tibble: 2 × 7
hotel mean_stay_week_nights median_stay_week_nights sd_stay_week_nights
<chr> <dbl> <dbl> <dbl>
1 City Hotel 2.18 2 1.46
2 Resort Hotel 3.13 3 2.46
# ℹ 3 more variables: mean_stay_weekend_nights <dbl>,
# median_stay_weekend_nights <dbl>, sd_stay_weekend_nights <dbl>
# A tibble: 2 × 2
hotel count
<chr> <int>
1 City Hotel 1591
2 Resort Hotel 1667
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
Compared the number of booking across hotels on a yearly basis and tried to observe the retention rate among customers.We can see that in the period of 3 years, City Hotel has 46228 bookings that aren’t cancelled whereas Resort Hotel has 28938. Out of these, 3.4416371% and 5.7605916% are repeated bookings for City Hotel and Resort Hotel respectively. Although the number of bookings for City Hotel were higher, Resort Hotel has a better customer retention rate compared to City Hotel.
---
title: "Challenge 2 - Hotel Bookings"
author: "Srujan Kagitala"
description: "Data wrangling: using group() and summarise()"
date: "6/26/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- hotel_bookings
- srujan_kagitala
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.
- hotel_bookings.csv ⭐⭐⭐⭐
```{r}
#| label: read
booking_data <- read.csv("_data/hotel_bookings.csv")
head(booking_data)
```
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
## Describe the data
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
```{r}
#| label: summary
#Number of bookings per hotel.
no_bookings_by_hotel <- booking_data %>%
group_by(hotel) %>%
summarise(count = n())
```
This data set contains booking data of `r unique(no_bookings_by_hotel$hotel)`. It contains `r nrow(booking_data)` bookings information. `r no_bookings_by_hotel$count[1]` and `r no_bookings_by_hotel$count[2]` observations are respectively captured from `r no_bookings_by_hotel$hotel[1]` and `r no_bookings_by_hotel$hotel[2]`. The information available per booking are `r names(booking_data)`.
## Provide Grouped Summary Statistics
Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
```{r}
#| label: analysis
# Number of bookings cancelled versus not cancelled
# per hotel across years
(hotel_booking_by_year <- booking_data %>%
group_by(hotel, arrival_date_year, is_canceled) %>%
summarise(no_bookings = n()))
total_hotel_bookings_by_year <- hotel_booking_by_year %>%
group_by(hotel, arrival_date_year) %>%
summarise(no_bookings = sum(no_bookings))
ggplot(data = total_hotel_bookings_by_year, aes(x= arrival_date_year, y=no_bookings)) +
geom_bar(stat = "identity", mapping = aes(fill = hotel), position = "dodge") +
ggtitle("Total bookings across hotels") +
ylab("bookings") +
xlab("year")
cancelled_bookings_by_year <- hotel_booking_by_year %>%
filter(is_canceled == 1) %>%
select(hotel, arrival_date_year, no_bookings)
ggplot(data = cancelled_bookings_by_year, aes(x= arrival_date_year, y=no_bookings)) +
geom_bar(stat = "identity", mapping = aes(fill = hotel), position = "dodge") +
ggtitle("Cancelled bookings across hotels") +
ylab("bookings") +
xlab("year")
not_cancelled_bookings_by_year <- hotel_booking_by_year %>%
filter(is_canceled == 0) %>%
select(hotel, arrival_date_year, no_bookings)
ggplot(data = not_cancelled_bookings_by_year, aes(x= arrival_date_year, y=no_bookings)) +
geom_bar(stat = "identity", mapping = aes(fill = hotel), position = "dodge") +
ggtitle("Active bookings across hotels") +
ylab("bookings") +
xlab("year")
# Total number of bookings that are not cancelled per hotel
(actual_hotel_booking <- hotel_booking_by_year %>%
filter(is_canceled == 0) %>%
group_by(hotel) %>%
summarise(no_bookings = sum(no_bookings)))
# Mean, median and standard deviation of night stays
# during the week and weekends per hotel
(stay_pattern <- booking_data %>%
group_by(hotel) %>%
summarise(mean_stay_week_nights = mean(stays_in_week_nights),
median_stay_week_nights = median(stays_in_week_nights),
sd_stay_week_nights = sd(stays_in_week_nights),
mean_stay_weekend_nights = mean(stays_in_weekend_nights),
median_stay_weekend_nights = median(stays_in_weekend_nights),
sd_stay_weekend_nights = sd(stays_in_weekend_nights)))
# Number of repeated guests who didn't cancel per hotel.
(repeated_guests <- booking_data %>%
filter(is_repeated_guest == 1, is_canceled == 0) %>%
group_by(hotel) %>%
summarise(count = n()))
```
### Explain and Interpret
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
Compared the number of booking across hotels on a yearly basis and tried to observe the retention rate among customers.We can see that in the period of 3 years, `r actual_hotel_booking$hotel[1]` has `r actual_hotel_booking$no_bookings[1]` bookings that aren't cancelled whereas `r actual_hotel_booking$hotel[2]` has `r actual_hotel_booking$no_bookings[2]`. Out of these, `r (repeated_guests$count[1]/actual_hotel_booking$no_bookings[1])*100`% and `r (repeated_guests$count[2]/actual_hotel_booking$no_bookings[2])*100`% are repeated bookings for `r repeated_guests$hotel[1]` and `r repeated_guests$hotel[2]` respectively. Although the number of bookings for City Hotel were higher, Resort Hotel has a better customer retention rate compared to City Hotel.