Challenge 2

challenge_2

hotel_bookings

Author

Steve O’Neill

Published

August 16, 2022

library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Introduction

For this challenge I’m working with the hotel_bookings dataset: ⭐⭐⭐⭐

hotel_bookings <- read_csv("_data/hotel_bookings.csv")
hotel_bookings

Reading in the data is straightforward.

Describe the data

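This dataset is available on Kaggle (https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand) and was originally published in the journal Data in Brief (https://www.sciencedirect.com/science/article/pii/S2352340918315191).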

The data describes two hotels - one ‘city’ and one ‘resort’-style.

Before I begin, I should establish some terminology based on research online:

Term	Meaning
TA	Travel Agent
TO	Tour Operator (distribution channel)
HB	Half Board (breakfast + other meal)
FB	Full Board (3 meals a day)

Provide Grouped Summary Statistics

There are just two de-identified hotels analyzed in this whole dataset:

hotel_bookings %>% group_by(hotel) %>% summarise()
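
As a small aside, count() does the same job as the empty summarise() and also reports how many bookings each hotel has. This is only a sketch: toy_bookings is a made-up stand-in with hypothetical rows, not the real hotel_bookings data.

```r
library(dplyr)

# Hypothetical stand-in for hotel_bookings (not the real data)
toy_bookings <- tibble(
  hotel = c("City Hotel", "Resort Hotel", "City Hotel", "City Hotel")
)

# count() returns one row per distinct hotel plus the number of bookings (n)
hotel_counts <- toy_bookings %>% count(hotel)
hotel_counts
```

On the real data, swapping toy_bookings for hotel_bookings should give one row per hotel.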

It is straightforward to give basic descriptive statistics on the Average Daily Rate (adr) across both hotels combined:

hotel_bookings %>% select(adr) %>% summary()

      adr         
 Min.   :  -6.38  
 1st Qu.:  69.29  
 Median :  94.58  
 Mean   : 101.83  
 3rd Qu.: 126.00  
 Max.   :5400.00
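
Since this section is about grouped statistics, the same numbers can be split by hotel. The sketch below assumes a toy tibble (toy_bookings, with invented adr values) in place of the real data; the n_negative column is my own addition, prompted by the negative minimum in the summary above.

```r
library(dplyr)

# Hypothetical stand-in for hotel_bookings (invented adr values)
toy_bookings <- tibble(
  hotel = c("City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel", "Resort Hotel"),
  adr   = c(100, 120, 80, -5, 60)
)

# One row per hotel with basic descriptive statistics for adr
adr_by_hotel <- toy_bookings %>%
  group_by(hotel) %>%
  summarise(
    n_bookings = n(),
    min_adr    = min(adr, na.rm = TRUE),
    median_adr = median(adr, na.rm = TRUE),
    mean_adr   = mean(adr, na.rm = TRUE),
    max_adr    = max(adr, na.rm = TRUE),
    n_negative = sum(adr < 0, na.rm = TRUE)  # rates below zero look suspicious
  )
adr_by_hotel
```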

I can also compute some more interesting per-hotel statistics from columns such as visitor nationality and lead time:

by_hotel <- hotel_bookings %>%
  group_by(hotel) %>%
  summarise(
    average.lead.time = mean(lead_time),
    # names(which.max(table(x))) picks the most common value of x;
    # which.max() and which.min() return only the first value when there are ties
    busiest.year = names(which.max(table(arrival_date_year))),
    busiest.month = names(which.max(table(arrival_date_month))),
    most.freq.nationality = names(which.max(table(country))),
    most.infreq.nationality = names(which.min(table(country)))
  )

by_hotel

As we can see,

- The city hotel has a longer lead time on average compared to the resort - could this be based on conferences and business travel?
- Both hotels’ busiest year in the dataset was 2016, and August was the busiest month for both.
- The most frequent nationality of guests was PRT - Portugal. That makes sense because the hotels are in Portugal.
- The least frequent visitors in the City and Resort hotels hailed from Anguilla and Burundi, respectively, though which.min() reports only the first of any tied countries, so other nationalities may be equally rare.
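
Because which.min() stops at the first minimum, a tie-aware version may give a fuller picture of the rarest nationalities. This is a sketch on a hypothetical toy tibble (toy_bookings, with made-up country codes), using count() and slice_min() with with_ties = TRUE rather than the table() approach above.

```r
library(dplyr)

# Hypothetical stand-in for hotel_bookings (made-up rows)
toy_bookings <- tibble(
  hotel   = c(rep("City Hotel", 5), rep("Resort Hotel", 4)),
  country = c("PRT", "PRT", "PRT", "AIA", "ZWE", "PRT", "PRT", "BDI", "ESP")
)

# Count bookings per hotel and country, then keep every country that
# shares the lowest count within each hotel
rarest_by_hotel <- toy_bookings %>%
  count(hotel, country) %>%
  group_by(hotel) %>%
  slice_min(order_by = n, n = 1, with_ties = TRUE) %>%
  ungroup()
rarest_by_hotel
```

In this toy example each hotel has two countries tied at a single booking, and both are kept instead of only the alphabetically first one.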

To demonstrate select(), I pull out a few columns:

visits <- select(hotel_bookings, hotel, market_segment, children, babies, country, reservation_status, reservation_status_date, arrival_date_month, adr)
visits

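Using filter(), here are the visits with a higher-than-median number of babies compared to other visits in the same market segment (e.g. Direct, Online TA, Corporate). Because most bookings include no babies, the per-segment median is effectively 0, so this keeps the bookings with at least one baby: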

visits_with_baby <- visits %>% group_by(market_segment) %>% filter(babies > median(babies, na.rm = TRUE))
visits_with_baby
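
To check what the grouped filter is really comparing against, it can help to print the per-segment medians first. The sketch below uses a hypothetical toy tibble (toy_visits) rather than the real visits data, and shows that when every median is 0 the grouped filter matches a plain babies > 0 filter.

```r
library(dplyr)

# Hypothetical stand-in for visits (made-up rows)
toy_visits <- tibble(
  market_segment = c("Direct", "Direct", "Direct", "Corporate", "Corporate",
                     "Online TA", "Online TA", "Online TA"),
  babies         = c(0, 0, 1, 0, 0, 0, 2, 0)
)

# Median number of babies within each market segment
segment_medians <- toy_visits %>%
  group_by(market_segment) %>%
  summarise(median_babies = median(babies, na.rm = TRUE))

# Grouped filter, as in the post
via_median <- toy_visits %>%
  group_by(market_segment) %>%
  filter(babies > median(babies, na.rm = TRUE)) %>%
  ungroup()

# Plain filter, equivalent only when every segment median is 0
via_zero <- toy_visits %>% filter(babies > 0)

segment_medians
```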

So let’s compare the general make-up of travelers:

table(visits$market_segment)


     Aviation Complementary     Corporate        Direct        Groups 
          237           743          5295         12606         19811 
Offline TA/TO     Online TA     Undefined 
        24219         56477             2

…with those traveling with a baby:

table(visits_with_baby$market_segment)


Complementary     Corporate        Direct        Groups Offline TA/TO 
           22             8           281            13           177 
    Online TA 
          416

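Somewhat expectedly, few business travelers bring babies along: only 8 of the 5,295 Corporate bookings (about 0.15%) include one, compared with 281 of the 12,606 Direct bookings (about 2.2%).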
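
Raw counts are hard to compare across segments of very different sizes, so a rate per segment is more telling. This sketch retypes the counts from the two tables above by hand instead of recomputing them from the data; the column names (all_bookings, with_baby, pct_with_baby) are my own.

```r
library(dplyr)

# Counts copied from the two table() outputs above
# (Aviation and Undefined have no bookings with babies, so they are left out)
segment_rates <- tibble(
  market_segment = c("Complementary", "Corporate", "Direct",
                     "Groups", "Offline TA/TO", "Online TA"),
  all_bookings   = c(743, 5295, 12606, 19811, 24219, 56477),
  with_baby      = c(22, 8, 281, 13, 177, 416)
) %>%
  # Share of bookings in each segment that include at least one baby
  mutate(pct_with_baby = round(100 * with_baby / all_bookings, 2)) %>%
  arrange(desc(pct_with_baby))
segment_rates
```

Sorted this way, Complementary and Direct bookings have the highest share with babies, while Corporate and Groups sit at the bottom.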
