Code
library(tidyverse)
library(dplyr)
library(skimr)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Ananya Pujary
August 16, 2022
I’ll be working with the ‘hotel_bookings.csv’ dataset.
First, we’ll generate a broad overview of the data.
Name | hotelbookings |
Number of rows | 119390 |
Number of columns | 32 |
_______________________ | |
Column type frequency: | |
character | 13 |
Date | 1 |
numeric | 18 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
hotel | 0 | 1 | 10 | 12 | 0 | 2 | 0 |
arrival_date_month | 0 | 1 | 3 | 9 | 0 | 12 | 0 |
meal | 0 | 1 | 2 | 9 | 0 | 5 | 0 |
country | 0 | 1 | 2 | 4 | 0 | 178 | 0 |
market_segment | 0 | 1 | 6 | 13 | 0 | 8 | 0 |
distribution_channel | 0 | 1 | 3 | 9 | 0 | 5 | 0 |
reserved_room_type | 0 | 1 | 1 | 1 | 0 | 10 | 0 |
assigned_room_type | 0 | 1 | 1 | 1 | 0 | 12 | 0 |
deposit_type | 0 | 1 | 10 | 10 | 0 | 3 | 0 |
agent | 0 | 1 | 1 | 4 | 0 | 334 | 0 |
company | 0 | 1 | 1 | 4 | 0 | 353 | 0 |
customer_type | 0 | 1 | 5 | 15 | 0 | 4 | 0 |
reservation_status | 0 | 1 | 7 | 9 | 0 | 3 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
reservation_status_date | 0 | 1 | 2014-10-17 | 2017-09-14 | 2016-08-07 | 926 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
is_canceled | 0 | 1 | 0.37 | 0.48 | 0.00 | 0.00 | 0.00 | 1 | 1 | ▇▁▁▁▅ |
lead_time | 0 | 1 | 104.01 | 106.86 | 0.00 | 18.00 | 69.00 | 160 | 737 | ▇▂▁▁▁ |
arrival_date_year | 0 | 1 | 2016.16 | 0.71 | 2015.00 | 2016.00 | 2016.00 | 2017 | 2017 | ▃▁▇▁▆ |
arrival_date_week_number | 0 | 1 | 27.17 | 13.61 | 1.00 | 16.00 | 28.00 | 38 | 53 | ▅▇▇▇▅ |
arrival_date_day_of_month | 0 | 1 | 15.80 | 8.78 | 1.00 | 8.00 | 16.00 | 23 | 31 | ▇▇▇▇▆ |
stays_in_weekend_nights | 0 | 1 | 0.93 | 1.00 | 0.00 | 0.00 | 1.00 | 2 | 19 | ▇▁▁▁▁ |
stays_in_week_nights | 0 | 1 | 2.50 | 1.91 | 0.00 | 1.00 | 2.00 | 3 | 50 | ▇▁▁▁▁ |
adults | 0 | 1 | 1.86 | 0.58 | 0.00 | 2.00 | 2.00 | 2 | 55 | ▇▁▁▁▁ |
children | 4 | 1 | 0.10 | 0.40 | 0.00 | 0.00 | 0.00 | 0 | 10 | ▇▁▁▁▁ |
babies | 0 | 1 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0 | 10 | ▇▁▁▁▁ |
is_repeated_guest | 0 | 1 | 0.03 | 0.18 | 0.00 | 0.00 | 0.00 | 0 | 1 | ▇▁▁▁▁ |
previous_cancellations | 0 | 1 | 0.09 | 0.84 | 0.00 | 0.00 | 0.00 | 0 | 26 | ▇▁▁▁▁ |
previous_bookings_not_canceled | 0 | 1 | 0.14 | 1.50 | 0.00 | 0.00 | 0.00 | 0 | 72 | ▇▁▁▁▁ |
booking_changes | 0 | 1 | 0.22 | 0.65 | 0.00 | 0.00 | 0.00 | 0 | 21 | ▇▁▁▁▁ |
days_in_waiting_list | 0 | 1 | 2.32 | 17.59 | 0.00 | 0.00 | 0.00 | 0 | 391 | ▇▁▁▁▁ |
adr | 0 | 1 | 101.83 | 50.54 | -6.38 | 69.29 | 94.58 | 126 | 5400 | ▇▁▁▁▁ |
required_car_parking_spaces | 0 | 1 | 0.06 | 0.25 | 0.00 | 0.00 | 0.00 | 0 | 8 | ▇▁▁▁▁ |
total_of_special_requests | 0 | 1 | 0.57 | 0.79 | 0.00 | 0.00 | 0.00 | 1 | 5 | ▇▁▁▁▁ |
This dataset has 32 variables and 119390 cases. There are 14 columns of character type, 1 of date type (‘reservation_status_date’), and 18 of numeric type. Only the ‘children’ column has missing values (4).
# A tibble: 3 × 1
arrival_date_year
<dbl>
1 2015
2 2016
3 2017
# A tibble: 2 × 1
is_canceled
<dbl>
1 0
2 1
From the variable names, this dataset seems to be logging the data of a hotel’s bookings for the years 2015-2017, such as their customers’ arrival date, reservation status, and the number of adults/children/babies checking in. The variable ‘is_canceled’ has binary values and indicates whether a booking has been canceled (0) or not (1).
# A tibble: 3 × 1
reservation_status
<chr>
1 Check-Out
2 Canceled
3 No-Show
# A tibble: 2 × 1
hotel
<chr>
1 Resort Hotel
2 City Hotel
# A tibble: 2 × 2
hotel n
<chr> <int>
1 City Hotel 79330
2 Resort Hotel 40060
The ‘reservation_status’ variable has three values: ‘Check-Out’, ‘Canceled’, and ‘No-Show’. There are two types of hotels from which data are collected: ‘Resort Hotel’ and ‘City Hotel’. 79330 of the rows contain data related to the City Hotel, while 40060 rows are related to the Resort Hotel.
The following command returns central tendency and dispersion values for the numeric values in the dataset.
hotel is_canceled lead_time arrival_date_year
Length:119390 Min. :0.0000 Min. : 0 Min. :2015
Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
Mode :character Median :0.0000 Median : 69 Median :2016
Mean :0.3704 Mean :104 Mean :2016
3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
Max. :1.0000 Max. :737 Max. :2017
arrival_date_month arrival_date_week_number arrival_date_day_of_month
Length:119390 Min. : 1.00 Min. : 1.0
Class :character 1st Qu.:16.00 1st Qu.: 8.0
Mode :character Median :28.00 Median :16.0
Mean :27.17 Mean :15.8
3rd Qu.:38.00 3rd Qu.:23.0
Max. :53.00 Max. :31.0
stays_in_weekend_nights stays_in_week_nights adults
Min. : 0.0000 Min. : 0.0 Min. : 0.000
1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
Median : 1.0000 Median : 2.0 Median : 2.000
Mean : 0.9276 Mean : 2.5 Mean : 1.856
3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
Max. :19.0000 Max. :50.0 Max. :55.000
children babies meal country
Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
Median : 0.0000 Median : 0.000000 Mode :character Mode :character
Mean : 0.1039 Mean : 0.007949
3rd Qu.: 0.0000 3rd Qu.: 0.000000
Max. :10.0000 Max. :10.000000
NA's :4
market_segment distribution_channel is_repeated_guest
Length:119390 Length:119390 Min. :0.00000
Class :character Class :character 1st Qu.:0.00000
Mode :character Mode :character Median :0.00000
Mean :0.03191
3rd Qu.:0.00000
Max. :1.00000
previous_cancellations previous_bookings_not_canceled reserved_room_type
Min. : 0.00000 Min. : 0.0000 Length:119390
1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
Median : 0.00000 Median : 0.0000 Mode :character
Mean : 0.08712 Mean : 0.1371
3rd Qu.: 0.00000 3rd Qu.: 0.0000
Max. :26.00000 Max. :72.0000
assigned_room_type booking_changes deposit_type agent
Length:119390 Min. : 0.0000 Length:119390 Length:119390
Class :character 1st Qu.: 0.0000 Class :character Class :character
Mode :character Median : 0.0000 Mode :character Mode :character
Mean : 0.2211
3rd Qu.: 0.0000
Max. :21.0000
company days_in_waiting_list customer_type adr
Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
Mode :character Median : 0.000 Mode :character Median : 94.58
Mean : 2.321 Mean : 101.83
3rd Qu.: 0.000 3rd Qu.: 126.00
Max. :391.000 Max. :5400.00
required_car_parking_spaces total_of_special_requests reservation_status
Min. :0.00000 Min. :0.0000 Length:119390
1st Qu.:0.00000 1st Qu.:0.0000 Class :character
Median :0.00000 Median :0.0000 Mode :character
Mean :0.06252 Mean :0.5714
3rd Qu.:0.00000 3rd Qu.:1.0000
Max. :8.00000 Max. :5.0000
reservation_status_date
Min. :2014-10-17
1st Qu.:2016-02-01
Median :2016-08-07
Mean :2016-07-30
3rd Qu.:2017-02-08
Max. :2017-09-14
# calculating the number of cancellations and number of repeat guests grouped by hotel
hotel_stats<-hotelbookings %>%
select(hotel,is_canceled,is_repeated_guest) %>%
group_by(hotel)%>%
summarize(cancellations = sum(is_canceled),guests_repeated = sum(is_repeated_guest))
hotelbookings$room_assignment = (ifelse(hotelbookings$reserved_room_type==hotelbookings$assigned_room_type, 'Same', 'Different'))
rooms <- hotelbookings %>%
select(hotel,room_assignment) %>%
group_by(hotel,room_assignment) %>%
tally()
City Hotel saw 33102 cancellations and Resort Hotel had 11122 cancellations during this time period. City Hotel also had more repeat guests (2032) than the Resort Hotel (1778). Also, I created a new column in the dataset called ‘room_assignment’ that indicates whether guests got their preferred room assignment (‘Same’) or not (‘Different’). City Hotel guests were less likely to be assigned a room not of their preference (7192) than Resort Hotel guests (7725).
Then, I compared reservation statuses (‘Canceled, ’Check-Out’,‘No Show’) across hotels. For both, the number of guests checked out were more than those that cancelled or didn’t show up.
# A tibble: 2 × 4
# Groups: hotel [2]
hotel arrival_date_year arrival_date_month n
<chr> <dbl> <chr> <int>
1 City Hotel 2017 May 4556
2 Resort Hotel 2016 October 1984
I also wanted to look at which month-year was the busiest in terms of bookings for both hotels. The busiest month was May 2017 for City Hotel (4556), and October 2016 for Resort Hotel (1984).
I chose certain variables that would objectively indicate which hotel (City Hotel or Resort Hotel) was more successful, such as the number of cancellations and reservation status. However, City Hotel had more entries in the dataset than Resort Hotel, which could skew the results. I also wanted to compare which hotel assigned most customers their preferred room type (reserved_room_type versus assigned_room_type) which could be contributing to customer satisfaction, and found that City Hotel was better in this aspect. It would also be interesting to look at both hotels’ statistics through the years (2015-2017) to check whether their businesses have improved or declined.
---
title: "Challenge 2"
author: "Ananya Pujary"
description: "Data wrangling: using group() and summarise()"
date: "08/16/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- hotel_bookings
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(dplyr)
library(skimr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Read in the Data
I'll be working with the 'hotel_bookings.csv' dataset.
```{r}
#| label: reading in the data
hotelbookings <- read_csv("_data/hotel_bookings.csv")
```
## Describe the data
First, we'll generate a broad overview of the data.
```{r}
#| label: summary 1
skim(hotelbookings)
```
This dataset has 32 variables and 119390 cases. There are 14 columns of character type, 1 of date type ('reservation_status_date'), and 18 of numeric type. Only the 'children' column has missing values (4).
```{r}
#| label: summary 2
distinct(hotelbookings,arrival_date_year)
distinct(hotelbookings,is_canceled)
```
From the variable names, this dataset seems to be logging the data of a hotel's bookings for the years 2015-2017, such as their customers' arrival date, reservation status, and the number of adults/children/babies checking in. The variable 'is_canceled' has binary values and indicates whether a booking has been canceled (0) or not (1).
```{r}
#| label: summary 3
distinct(hotelbookings,reservation_status)
distinct(hotelbookings,hotel)
hotelbookings %>%
group_by(hotel) %>%
tally()
```
The 'reservation_status' variable has three values: 'Check-Out', 'Canceled', and 'No-Show'. There are two types of hotels from which data are collected: 'Resort Hotel' and 'City Hotel'. 79330 of the rows contain data related to the City Hotel, while 40060 rows are related to the Resort Hotel.
## Provide Grouped Summary Statistics
The following command returns central tendency and dispersion values for the numeric values in the dataset.
```{r}
#| label: grouped summary 1
#|
summary(hotelbookings)
```
```{r}
#| label: grouped summary 2
# calculating the number of cancellations and number of repeat guests grouped by hotel
hotel_stats<-hotelbookings %>%
select(hotel,is_canceled,is_repeated_guest) %>%
group_by(hotel)%>%
summarize(cancellations = sum(is_canceled),guests_repeated = sum(is_repeated_guest))
hotelbookings$room_assignment = (ifelse(hotelbookings$reserved_room_type==hotelbookings$assigned_room_type, 'Same', 'Different'))
rooms <- hotelbookings %>%
select(hotel,room_assignment) %>%
group_by(hotel,room_assignment) %>%
tally()
```
City Hotel saw 33102 cancellations and Resort Hotel had 11122 cancellations during this time period. City Hotel also had more repeat guests (2032) than the Resort Hotel (1778). Also, I created a new column in the dataset called 'room_assignment' that indicates whether guests got their preferred room assignment ('Same') or not ('Different'). City Hotel guests were less likely to be assigned a room not of their preference (7192) than Resort Hotel guests (7725).
```{r}
#| label: grouped summary 3
hotel_reservations <- hotelbookings %>%
select(hotel,reservation_status)%>%
group_by(hotel, reservation_status)%>%
tally()
```
Then, I compared reservation statuses ('Canceled, 'Check-Out','No Show') across hotels. For both, the number of guests checked out were more than those that cancelled or didn't show up.
```{r}
#| label: grouped summary 4
hotelbookings %>%
select(hotel,arrival_date_year,arrival_date_month) %>%
group_by(hotel) %>%
count(arrival_date_year,arrival_date_month) %>%
slice(which.max(n))
```
I also wanted to look at which month-year was the busiest in terms of bookings for both hotels. The busiest month was May 2017 for City Hotel (4556), and October 2016 for Resort Hotel (1984).
### Explain and Interpret
I chose certain variables that would objectively indicate which hotel (City Hotel or Resort Hotel) was more successful, such as the number of cancellations and reservation status. However, City Hotel had more entries in the dataset than Resort Hotel, which could skew the results. I also wanted to compare which hotel assigned most customers their preferred room type (reserved_room_type versus assigned_room_type) which could be contributing to customer satisfaction, and found that City Hotel was better in this aspect. It would also be interesting to look at both hotels' statistics through the years (2015-2017) to check whether their businesses have improved or declined.