Code
library(tidyverse)
library(summarytools)
library(dplyr)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Erika Nagai
September 18, 2022
First of all, let’s install all the packages convenient for this challenge.
Today’s challenge is to
Read in one (or more) of the following data sets, available in the posts/_data
folder, using the correct R package and command.
I decided to analyze “hotel_booking.csv data”.
# A tibble: 119,390 × 32
hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Resor… 0 342 2015 July 27 1 0 0 2
2 Resor… 0 737 2015 July 27 1 0 0 2
3 Resor… 0 7 2015 July 27 1 0 1 1
4 Resor… 0 13 2015 July 27 1 0 1 1
5 Resor… 0 14 2015 July 27 1 0 2 2
6 Resor… 0 14 2015 July 27 1 0 2 2
7 Resor… 0 0 2015 July 27 1 0 2 2
8 Resor… 0 9 2015 July 27 1 0 2 2
9 Resor… 1 85 2015 July 27 1 0 3 2
10 Resor… 1 75 2015 July 27 1 0 3 2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
# meal <chr>, country <chr>, market_segment <chr>,
# distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
Let’s explore the data.
From this summary, we can see that this data is about reservations made for two types of hotel (City Hotel, Resort Hotel) from 2015 until 2017.
The data includes 1. Status of reservation (cancelled, checkout, no-show) 2. The length of stay (arrival date, days in total, days during weekdays and weekend), 3. Guest information (# of adults, children, babies, repeater or not, previous cancellation history) 4. Reserved service type (room type, meal, car parking, special request) 5. Others (used agency, distribution channel etc)
Conduct some exploratory data analysis, using dplyr commands such as group_by()
, select()
, filter()
, and summarise()
. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
Let’s explore the differences in reservation trends by the type of hotel (City Hotel vs. Resort Hotel)
!! Questions: The below code shows the statistic data of only Resort Hotel.
Descriptive Statistics
adr adults arrival_date_day_of_month arrival_date_week_number
------------- --------- -------- --------------------------- --------------------------
Mean 105.30 1.85 15.79 27.18
Std.Dev 43.60 0.51 8.73 13.40
Min 0.00 0.00 1.00 1.00
Median 99.90 2.00 16.00 27.00
Max 5400.00 4.00 31.00 53.00
Table: Table continues below
arrival_date_year babies booking_changes children days_in_waiting_list
------------- ------------------- -------- ----------------- ---------- ----------------------
Mean 2016.17 0.00 0.19 0.09 3.23
Std.Dev 0.70 0.08 0.61 0.37 20.87
Min 2015.00 0.00 0.00 0.00 0.00
Median 2016.00 0.00 0.00 0.00 0.00
Max 2017.00 10.00 21.00 3.00 391.00
Table: Table continues below
is_canceled is_repeated_guest lead_time previous_bookings_not_canceled
------------- ------------- ------------------- ----------- --------------------------------
Mean 0.42 0.03 109.74 0.13
Std.Dev 0.49 0.16 110.95 1.69
Min 0.00 0.00 0.00 0.00
Median 0.00 0.00 74.00 0.00
Max 1.00 1.00 629.00 72.00
Table: Table continues below
previous_cancellations required_car_parking_spaces stays_in_week_nights
------------- ------------------------ ----------------------------- ----------------------
Mean 0.08 0.02 2.18
Std.Dev 0.42 0.15 1.46
Min 0.00 0.00 0.00
Median 0.00 0.00 2.00
Max 21.00 3.00 41.00
Table: Table continues below
stays_in_weekend_nights total_of_special_requests
------------- ------------------------- ---------------------------
Mean 0.80 0.55
Std.Dev 0.89 0.78
Min 0.00 0.00
Median 1.00 0.00
Max 16.00 5.00
Group: hotel = Resort Hotel
N: 40060
adr adults arrival_date_day_of_month arrival_date_week_number
------------- -------- -------- --------------------------- --------------------------
Mean 94.95 1.87 15.82 27.14
Std.Dev 61.44 0.70 8.88 14.01
Min -6.38 0.00 1.00 1.00
Median 75.00 2.00 16.00 28.00
Max 508.00 55.00 31.00 53.00
Table: Table continues below
arrival_date_year babies booking_changes children days_in_waiting_list
------------- ------------------- -------- ----------------- ---------- ----------------------
Mean 2016.12 0.01 0.29 0.13 0.53
Std.Dev 0.72 0.12 0.73 0.45 7.43
Min 2015.00 0.00 0.00 0.00 0.00
Median 2016.00 0.00 0.00 0.00 0.00
Max 2017.00 2.00 17.00 10.00 185.00
Table: Table continues below
is_canceled is_repeated_guest lead_time previous_bookings_not_canceled
------------- ------------- ------------------- ----------- --------------------------------
Mean 0.28 0.04 92.68 0.15
Std.Dev 0.45 0.21 97.29 1.00
Min 0.00 0.00 0.00 0.00
Median 0.00 0.00 57.00 0.00
Max 1.00 1.00 737.00 30.00
Table: Table continues below
previous_cancellations required_car_parking_spaces stays_in_week_nights
------------- ------------------------ ----------------------------- ----------------------
Mean 0.10 0.14 3.13
Std.Dev 1.34 0.35 2.46
Min 0.00 0.00 0.00
Median 0.00 0.00 3.00
Max 26.00 8.00 50.00
Table: Table continues below
stays_in_weekend_nights total_of_special_requests
------------- ------------------------- ---------------------------
Mean 1.19 0.62
Std.Dev 1.15 0.81
Min 0.00 0.00
Median 1.00 0.00
Max 19.00 5.00
Make a column of total nights of stay
Customer and Stay Trends by Hotel Type
# A tibble: 2 × 5
hotel ave_adults ave_children ave_babies avg_total_nights
<chr> <dbl> <dbl> <dbl> <dbl>
1 City Hotel 1.85 0.0914 0.00494 2.98
2 Resort Hotel 1.87 0.129 0.0139 4.32
Guest who stay at Resort hotel tend to stay longer than those who stay at Resort hotel. Guests at both hotels are less likely to be accompanied by children, but resort hotels have more people with children and babies than city hotels!
I will compare the group with children (at least one child or baby) to the group without children. I chose this group because I thought that the presence of children on a trip significantly impacts the stay trend and that hotel companies would want to know what the groups with children want.
# A tibble: 119,390 × 35
hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Resor… 0 342 2015 July 27 1 0 0 2
2 Resor… 0 737 2015 July 27 1 0 0 2
3 Resor… 0 7 2015 July 27 1 0 1 1
4 Resor… 0 13 2015 July 27 1 0 1 1
5 Resor… 0 14 2015 July 27 1 0 2 2
6 Resor… 0 14 2015 July 27 1 0 2 2
7 Resor… 0 0 2015 July 27 1 0 2 2
8 Resor… 0 9 2015 July 27 1 0 2 2
9 Resor… 1 85 2015 July 27 1 0 3 2
10 Resor… 1 75 2015 July 27 1 0 3 2
# … with 119,380 more rows, 25 more variables: children <dbl>, babies <dbl>,
# meal <chr>, country <chr>, market_segment <chr>,
# distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
Now, let’s find out the differences in trends between groups with and without children.
# A tibble: 3 × 4
Kids avg_total_nights repeated_guest lead_time
<chr> <dbl> <dbl> <dbl>
1 No Kid 3.38 0.0337 105.
2 With Kid 3.94 0.0103 88.5
3 <NA> 3 0 3
Here are some findings
Groups with kids tend to stay a little longer than groups without kids.
Groups without kids are likely to book much earlier than groups with kids.
---
title: "Challenge 2 Erika Nagai"
author: "Erika Nagai"
desription: "Data wrangling: using group() and summarise()"
date: "09/18/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- railroads
- faostat
- hotel_bookings
---
First of all, let's install all the packages convenient for this challenge.
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(summarytools)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.
- railroad\*.csv or StateCounty2012.xls ⭐
- FAOstat\*.csv or birds.csv ⭐⭐⭐
- hotel_bookings.csv ⭐⭐⭐⭐
I decided to analyze "hotel_booking.csv data".
```{r}
# read the data as a tibble
data <- read_csv("_data/hotel_bookings.csv")
data
```
## Describe the data
Let's explore the data.
```{r}
view(dfSummary(data))
```
From this summary, we can see that this data is about reservations made for two types of hotel (City Hotel, Resort Hotel) from 2015 until 2017.
The data includes 1. Status of reservation (cancelled, checkout, no-show) 2. The length of stay (arrival date, days in total, days during weekdays and weekend), 3. Guest information (# of adults, children, babies, repeater or not, previous cancellation history) 4. Reserved service type (room type, meal, car parking, special request) 5. Others (used agency, distribution channel etc)
## Provide Grouped Summary Statistics
Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
Let's explore the differences in reservation trends by the type of hotel (City Hotel vs. Resort Hotel)
!! Questions: The below code shows the statistic data of only Resort Hotel.
```{r}
data %>%
group_by(hotel) %>%
descr(stats = c("mean", "sd", "min", "med", "max"))
```
Make a column of total nights of stay
```{r}
data <- data %>%
mutate(total_nights = stays_in_week_nights + stays_in_weekend_nights)
```
Customer and Stay Trends by Hotel Type
```{r}
data %>%
group_by(hotel) %>%
summarise(
ave_adults = mean(adults, na.rm = TRUE),
ave_children = mean(children, na.rm = TRUE),
ave_babies = mean(babies, na.rm = TRUE),
avg_total_nights = mean(total_nights, nr.rm = TRUE))
```
Guest who stay at Resort hotel tend to stay longer than those who stay at Resort hotel. Guests at both hotels are less likely to be accompanied by children, but resort hotels have more people with children and babies than city hotels!
### Explain and Interpret
I will compare **the group with children (at least one child or baby) to the group without children**. I chose this group because I thought that the presence of children on a trip significantly impacts the stay trend and that hotel companies would want to know what the groups with children want.
```{r}
# First classify if the group is with at least one kid/baby.
data <- data %>%
mutate(Kids_total = children + babies) %>%
mutate(Kids = case_when(
Kids_total == 0 ~ "No Kid",
Kids_total >= 1 ~ "With Kid"
))
data
```
Now, let's find out the differences in trends between groups with and without children.
```{r}
data %>%
group_by(Kids) %>%
summarise(
avg_total_nights = mean(total_nights, nr.rm = TRUE),
repeated_guest = mean(is_repeated_guest, nr.rm = TRUE),
lead_time = mean(lead_time, nr.rm = TRUE)
)
```
Here are some findings
- Groups with kids tend to stay a little longer than groups without kids.
- Groups without kids are likely to book much earlier than groups with kids.