Code
library(tidyverse)
library(readr)
library(dplyr)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Siddharth Goel
January 21, 2023
I am reading the dataset hotel_bookings
which contains booking information of 2 hotels Resort Hotel
and City Hotel
. This data contains information ranging from the booking agency of used by the guest to their arrival and departure time. In this challenge, I will attempt to provide more statistical information relevant to our inpu dataset.
Today’s challenge is to
cols(
hotel = col_character(),
is_canceled = col_double(),
lead_time = col_double(),
arrival_date_year = col_double(),
arrival_date_month = col_character(),
arrival_date_week_number = col_double(),
arrival_date_day_of_month = col_double(),
stays_in_weekend_nights = col_double(),
stays_in_week_nights = col_double(),
adults = col_double(),
children = col_double(),
babies = col_double(),
meal = col_character(),
country = col_character(),
market_segment = col_character(),
distribution_channel = col_character(),
is_repeated_guest = col_double(),
previous_cancellations = col_double(),
previous_bookings_not_canceled = col_double(),
reserved_room_type = col_character(),
assigned_room_type = col_character(),
booking_changes = col_double(),
deposit_type = col_character(),
agent = col_character(),
company = col_character(),
days_in_waiting_list = col_double(),
customer_type = col_character(),
adr = col_double(),
required_car_parking_spaces = col_double(),
total_of_special_requests = col_double(),
reservation_status = col_character(),
reservation_status_date = col_date(format = "")
)
# A tibble: 6 × 32
hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Resort… 0 342 2015 July 27 1 0 0 2
2 Resort… 0 737 2015 July 27 1 0 0 2
3 Resort… 0 7 2015 July 27 1 0 1 1
4 Resort… 0 13 2015 July 27 1 0 1 1
5 Resort… 0 14 2015 July 27 1 0 2 2
6 Resort… 0 14 2015 July 27 1 0 2 2
# … with 22 more variables: children <dbl>, babies <dbl>, meal <chr>,
# country <chr>, market_segment <chr>, distribution_channel <chr>,
# is_repeated_guest <dbl>, previous_cancellations <dbl>,
# previous_bookings_not_canceled <dbl>, reserved_room_type <chr>,
# assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>,
# agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
[1] "Resort Hotel" "City Hotel"
[1] 2015 2016 2017
[1] "Direct" "Corporate" "TA/TO" "Undefined" "GDS"
[1] "Transient" "Contract" "Transient-Party" "Group"
The input data contains 119390 rows and 32 columns. It contains the booking data for 2 hotels over 3 years (2015 - 2017)
I will be providing Grouped Summary Statistics as well as their interpretation and explanation in this section.
# A tibble: 6 × 3
hotel arrival_date_year n
<chr> <dbl> <int>
1 City Hotel 2015 13682
2 City Hotel 2016 38140
3 City Hotel 2017 27508
4 Resort Hotel 2015 8314
5 Resort Hotel 2016 18567
6 Resort Hotel 2017 13179
Starting with a basic metric here to see which hotel gets more bookings and the trend of bookings with year.
# A tibble: 14 × 3
hotel market_segment n
<chr> <chr> <int>
1 City Hotel Online TA 38748
2 City Hotel Offline TA/TO 16747
3 City Hotel Groups 13975
4 City Hotel Direct 6093
5 City Hotel Corporate 2986
6 City Hotel Complementary 542
7 City Hotel Aviation 237
8 City Hotel Undefined 2
9 Resort Hotel Online TA 17729
10 Resort Hotel Offline TA/TO 7472
11 Resort Hotel Direct 6513
12 Resort Hotel Groups 5836
13 Resort Hotel Corporate 2309
14 Resort Hotel Complementary 201
This information can give us the count of complementary bookings per hotel, which is a very important number considering that complementary bookings are loss making for the hotels. Moreover, we also find out which market segment is contributing the most towards bookings at the hotels (Online bookings seem to account for most bookings in both the hotels)
# A tibble: 6 × 4
# Groups: hotel, arrival_date_year [6]
hotel arrival_date_year is_canceled percent
<chr> <dbl> <dbl> <dbl>
1 City Hotel 2015 1 0.439
2 City Hotel 2016 1 0.404
3 City Hotel 2017 1 0.425
4 Resort Hotel 2015 1 0.257
5 Resort Hotel 2016 1 0.266
6 Resort Hotel 2017 1 0.308
This metric gives us the percentage of bookings canceled every year per hotel. From the metric, we can see that the percentage of canceled bookings has been high in City Hotel
as compared to Resort Hotel
# A tibble: 30 × 4
# Groups: hotel, arrival_date_year [6]
hotel arrival_date_year arrival_date_month n
<chr> <dbl> <chr> <int>
1 City Hotel 2015 September 3529
2 City Hotel 2015 October 3386
3 City Hotel 2015 November 1235
4 City Hotel 2015 July 1398
5 City Hotel 2015 December 1654
6 City Hotel 2016 September 3871
7 City Hotel 2016 October 4219
8 City Hotel 2016 November 3122
9 City Hotel 2016 May 3676
10 City Hotel 2016 March 3046
# … with 20 more rows
This metric gives us the months of the year where the bookings are maximum. This helps the hotels prepare for the busy months of the year in terms of pricing, capacity, staffing, etc.
---
title: "Challenge 2"
author: "Siddharth Goel"
description: "Data wrangling: using group() and summarise()"
date: "01/21/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- hotel_bookings.csv
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(readr)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Dataset Description
I am reading the dataset `hotel_bookings` which contains booking information of 2 hotels `Resort Hotel` and `City Hotel`. This data contains information ranging from the booking agency of used by the guest to their arrival and departure time. In this challenge, I will attempt to provide more statistical information relevant to our inpu dataset.
## Challenge Overview
Today's challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
```{r}
bookings<-read_csv("_data/hotel_bookings.csv")
```
## Describe the data
```{r}
#| label: summary
spec(bookings)
head(bookings)
unique(bookings$hotel)
unique(bookings$arrival_date_year)
unique(bookings$distribution_channel)
unique(bookings$customer_type)
```
The input data contains 119390 rows and 32 columns. It contains the booking data for 2 hotels over 3 years (2015 - 2017)
## Provide Grouped Summary Statistics
I will be providing Grouped Summary Statistics as well as their interpretation and explanation in this section.
### Number of bookings per hotel per year
```{r}
count(bookings, hotel, arrival_date_year)
```
Starting with a basic metric here to see which hotel gets more bookings and the trend of bookings with year.
### Number of bookings per market segment
```{r}
count(bookings, hotel, market_segment) %>%
arrange(hotel, desc(n))
```
This information can give us the count of complementary bookings per hotel, which is a very important number considering that complementary bookings are loss making for the hotels.
Moreover, we also find out which market segment is contributing the most towards bookings at the hotels (Online bookings seem to account for most bookings in both the hotels)
### Percentage of canceled bookings per year
```{r}
canceled_df <- count(bookings, hotel, arrival_date_year, is_canceled)
canceled_df %>%
group_by(hotel, arrival_date_year) %>%
mutate(percent = n / sum(n)) %>%
subset(select = -c(n)) %>%
filter(is_canceled==1)
```
This metric gives us the percentage of bookings canceled every year per hotel. From the metric, we can see that the percentage of canceled bookings has been high in `City Hotel` as compared to `Resort Hotel`
### Percentage of canceled bookings per year
```{r}
country_df <- count(bookings, hotel, arrival_date_year, arrival_date_month)
country_df %>% group_by(hotel, arrival_date_year) %>% slice_max(order_by = arrival_date_month, n = 5)
```
This metric gives us the months of the year where the bookings are maximum. This helps the hotels prepare for the busy months of the year in terms of pricing, capacity, staffing, etc.