Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Ryan O’Donnell
September 20, 2022
Today’s challenge is to
Read in one (or more) of the following data sets, available in the posts/_data
folder, using the correct R package and command.
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
spc_tbl_ [119,390 × 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ hotel : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
$ is_canceled : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
$ lead_time : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
$ arrival_date_year : num [1:119390] 2015 2015 2015 2015 2015 ...
$ arrival_date_month : chr [1:119390] "July" "July" "July" "July" ...
$ arrival_date_week_number : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
$ arrival_date_day_of_month : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
$ stays_in_weekend_nights : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
$ stays_in_week_nights : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
$ adults : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
$ children : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
$ babies : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
$ meal : chr [1:119390] "BB" "BB" "BB" "BB" ...
$ country : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
$ market_segment : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
$ distribution_channel : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
$ is_repeated_guest : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
$ previous_cancellations : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
$ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
$ reserved_room_type : chr [1:119390] "C" "C" "A" "A" ...
$ assigned_room_type : chr [1:119390] "C" "C" "C" "A" ...
$ booking_changes : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
$ deposit_type : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
$ agent : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
$ company : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
$ days_in_waiting_list : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
$ customer_type : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
$ adr : num [1:119390] 0 0 75 75 98 ...
$ required_car_parking_spaces : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
$ total_of_special_requests : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
$ reservation_status : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
$ reservation_status_date : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
- attr(*, "spec")=
.. cols(
.. hotel = col_character(),
.. is_canceled = col_double(),
.. lead_time = col_double(),
.. arrival_date_year = col_double(),
.. arrival_date_month = col_character(),
.. arrival_date_week_number = col_double(),
.. arrival_date_day_of_month = col_double(),
.. stays_in_weekend_nights = col_double(),
.. stays_in_week_nights = col_double(),
.. adults = col_double(),
.. children = col_double(),
.. babies = col_double(),
.. meal = col_character(),
.. country = col_character(),
.. market_segment = col_character(),
.. distribution_channel = col_character(),
.. is_repeated_guest = col_double(),
.. previous_cancellations = col_double(),
.. previous_bookings_not_canceled = col_double(),
.. reserved_room_type = col_character(),
.. assigned_room_type = col_character(),
.. booking_changes = col_double(),
.. deposit_type = col_character(),
.. agent = col_character(),
.. company = col_character(),
.. days_in_waiting_list = col_double(),
.. customer_type = col_character(),
.. adr = col_double(),
.. required_car_parking_spaces = col_double(),
.. total_of_special_requests = col_double(),
.. reservation_status = col_character(),
.. reservation_status_date = col_date(format = "")
.. )
- attr(*, "problems")=<externalptr>
“hotel_bookings.csv” contains 119,390 observations of 32 variables. The data appears to be tracking the details of reservations at some hotels. There is a lot of information recorded for each reservation. The arrival date is stored across 3 columns, 4 if you count the week number. Some of the columns have codes in them but there is not a key to fully understand these codes.
Conduct some exploratory data analysis, using dplyr commands such as group_by()
, select()
, filter()
, and summarise()
. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
# filter out cancelation and create object that is grouped by hotel
by_hotel_no_cancel <- bookings %>% filter(`is_canceled` == 0) %>% group_by(`hotel`)
# summarize adults, children, and babies for each hotel
summarize(by_hotel_no_cancel, "Avg Adults"=mean(adults), "Med Adults"=median(adults), "Avg Children"=mean(children), "Med Children"=median(children), "Avg Babies"=mean(babies), "Med Babies"=median(babies))
# A tibble: 2 × 7
hotel `Avg Adults` `Med Adults` `Avg Children` Med Ch…¹ Avg B…² Med B…³
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 City Hotel 1.83 2 0.0999 0 0.00710 0
2 Resort Hotel 1.83 2 0.106 0 0.0156 0
# … with abbreviated variable names ¹`Med Children`, ²`Avg Babies`,
# ³`Med Babies`
# A tibble: 10 × 4
# Groups: hotel [2]
hotel Adults Children Babies
<chr> <dbl> <dbl> <dbl>
1 City Hotel 0 0 0
2 City Hotel 2 0 0
3 City Hotel 2 0 0
4 City Hotel 2 0 0
5 City Hotel 4 3 10
6 Resort Hotel 0 0 0
7 Resort Hotel 2 0 0
8 Resort Hotel 2 0 0
9 Resort Hotel 2 0 0
10 Resort Hotel 4 3 2
# create columns that includes the total stay length, % of stay that is weekend, and % of stay that is weeknight, remove stays that are 0 days
stays_expanded_by_hotel <- by_hotel_no_cancel %>% select(hotel, stays_in_weekend_nights, stays_in_week_nights) %>%
mutate("Total Stay" = `stays_in_weekend_nights` + `stays_in_week_nights`,
"% Weekend" = `stays_in_weekend_nights` / (`stays_in_weekend_nights` + `stays_in_week_nights`),
"% Weeknight" = `stays_in_week_nights` / (`stays_in_weekend_nights` + `stays_in_week_nights`)
) %>%
filter(`Total Stay` > 0)
# look at the central tendencies of stays
summarize(stays_expanded_by_hotel, "Avg Stay"=mean(`Total Stay`), "Med Stay"=median(`Total Stay`), "Avg % Wknd"=mean(`% Weekend`), "Med % Wknd"=median(`% Weekend`), "Avg % Wknt"=mean(`% Weeknight`), "Med % Wknt"=median(`% Weeknight`))
# A tibble: 2 × 7
hotel `Avg Stay` `Med Stay` `Avg % Wknd` `Med % Wknd` Avg % W…¹ Med %…²
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 City Hotel 2.94 3 0.262 0.25 0.738 0.75
2 Resort Hotel 4.20 3 0.259 0.286 0.741 0.714
# … with abbreviated variable names ¹`Avg % Wknt`, ²`Med % Wknt`
First, I wondered a bit about differences between who stayed at the City Hotel and who stayed at the Resort Hotel. I filtered out canceld bookings since they didn’t end up staying. At both hotels, most groups are two adults with no children or babies. Slightly more children and babies stay at the Resort Hotel over the City Hotel. It looks like sometimes bookings include no people, but at this time I don’t think I have enough information about this data to understand why.
I also looked at how long customers tended to stay at either hotel and whether more of their stay tends to be weekends or weeknights, but the central tendencies were pretty close for both!
#| label: explore 2
# group by cancelation, removing customers who are not repeat customers
by_canceled <- bookings %>% filter(`is_repeated_guest` > 0) %>% group_by(`is_canceled`)
# See if the statistics show if people who have canceled before are more likely to have canceled their current reservation
summarize(by_canceled, "Avg Previous Cancelations"=mean(`previous_cancellations`), "Avg Repeat Booking Not Canceled"=mean(`previous_bookings_not_canceled`))
# A tibble: 2 × 3
is_canceled `Avg Previous Cancelations` `Avg Repeat Booking Not Canceled`
<dbl> <dbl> <dbl>
1 0 0.335 3.96
2 1 1.27 1.38
One other thing I noticed they were tracking is whether or not the customer had previous cancelations so I wanted to see if whether or not those who had canceled their reservation were more likely to have canceled previous bookings. The customers who had canceled their booking had a higher average number of previous cancelations than those who kept their reservation. I also looked at the average repeated booking not canceled to see if people who had canceled a lot of bookings were just making more bookings, but that doesn’t seem to be the case.
---
title: "Challenge 2 Solution"
author: "Ryan O'Donnell"
desription: "Data wrangling: using group() and summarise()"
date: "09/20/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- railroads
- faostat
- hotel_bookings
- ryan_odonnell
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.
- railroad\*.csv or StateCounty2012.xls ⭐
- FAOstat\*.csv or birds.csv ⭐⭐⭐
- hotel_bookings.csv ⭐⭐⭐⭐
```{r}
#| label: load_in
# read in data and set up initial objects
bookings <- read_csv("_data/hotel_bookings.csv")
```
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
## Describe the data
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
```{r}
#| label: summary
str(bookings)
```
"hotel_bookings.csv" contains 119,390 observations of 32 variables. The data appears to be tracking the details of reservations at some hotels. There is a lot of information recorded for each reservation. The arrival date is stored across 3 columns, 4 if you count the week number. Some of the columns have codes in them but there is not a key to fully understand these codes.
## Provide Grouped Summary Statistics
Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
```{r}
#| label: explore 1
# filter out cancelation and create object that is grouped by hotel
by_hotel_no_cancel <- bookings %>% filter(`is_canceled` == 0) %>% group_by(`hotel`)
# summarize adults, children, and babies for each hotel
summarize(by_hotel_no_cancel, "Avg Adults"=mean(adults), "Med Adults"=median(adults), "Avg Children"=mean(children), "Med Children"=median(children), "Avg Babies"=mean(babies), "Med Babies"=median(babies))
summarize(by_hotel_no_cancel, Adults=quantile(adults), Children=quantile(children), Babies=quantile(babies))
# create columns that includes the total stay length, % of stay that is weekend, and % of stay that is weeknight, remove stays that are 0 days
stays_expanded_by_hotel <- by_hotel_no_cancel %>% select(hotel, stays_in_weekend_nights, stays_in_week_nights) %>%
mutate("Total Stay" = `stays_in_weekend_nights` + `stays_in_week_nights`,
"% Weekend" = `stays_in_weekend_nights` / (`stays_in_weekend_nights` + `stays_in_week_nights`),
"% Weeknight" = `stays_in_week_nights` / (`stays_in_weekend_nights` + `stays_in_week_nights`)
) %>%
filter(`Total Stay` > 0)
# look at the central tendencies of stays
summarize(stays_expanded_by_hotel, "Avg Stay"=mean(`Total Stay`), "Med Stay"=median(`Total Stay`), "Avg % Wknd"=mean(`% Weekend`), "Med % Wknd"=median(`% Weekend`), "Avg % Wknt"=mean(`% Weeknight`), "Med % Wknt"=median(`% Weeknight`))
```
First, I wondered a bit about differences between who stayed at the City Hotel and who stayed at the Resort Hotel. I filtered out canceld bookings since they didn't end up staying. At both hotels, most groups are two adults with no children or babies. Slightly more children and babies stay at the Resort Hotel over the City Hotel. It looks like sometimes bookings include no people, but at this time I don't think I have enough information about this data to understand why.
I also looked at how long customers tended to stay at either hotel and whether more of their stay tends to be weekends or weeknights, but the central tendencies were pretty close for both!
```{r}
#| label: explore 2
# group by cancelation, removing customers who are not repeat customers
by_canceled <- bookings %>% filter(`is_repeated_guest` > 0) %>% group_by(`is_canceled`)
# See if the statistics show if people who have canceled before are more likely to have canceled their current reservation
summarize(by_canceled, "Avg Previous Cancelations"=mean(`previous_cancellations`), "Avg Repeat Booking Not Canceled"=mean(`previous_bookings_not_canceled`))
```
One other thing I noticed they were tracking is whether or not the customer had previous cancelations so I wanted to see if whether or not those who had canceled their reservation were more likely to have canceled previous bookings. The customers who had canceled their booking had a higher average number of previous cancelations than those who kept their reservation. I also looked at the average repeated booking not canceled to see if people who had canceled a lot of bookings were just making more bookings, but that doesn't seem to be the case.