Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Abby Balint
September 21, 2022
# A tibble: 119,390 × 32
hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Resor… 0 342 2015 July 27 1 0 0 2
2 Resor… 0 737 2015 July 27 1 0 0 2
3 Resor… 0 7 2015 July 27 1 0 1 1
4 Resor… 0 13 2015 July 27 1 0 1 1
5 Resor… 0 14 2015 July 27 1 0 2 2
6 Resor… 0 14 2015 July 27 1 0 2 2
7 Resor… 0 0 2015 July 27 1 0 2 2
8 Resor… 0 9 2015 July 27 1 0 2 2
9 Resor… 1 85 2015 July 27 1 0 3 2
10 Resor… 1 75 2015 July 27 1 0 3 2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
# meal <chr>, country <chr>, market_segment <chr>,
# distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
This data set contains information about hotel stays in various countries from 2015-2017. There is a lot of different types of data contained here, 32 variables and 119390 rows of data. This is likely from an international hotel chain with many locations and hotel types based on the variables that this dataset contains. Because of the large number of variables, I think it would be best to look at the variables in sets as they relate to each other. For example, we could look at data about time of year, length of stay, and lead time for booking, which would tell us about behaviors around advanced booking practices. Or, we could look at the room types, number of adults/children, meals, and parking spaces used to build a profile about the demographics of customers and their habits. There is even data around marketing strategies that could help show how marketing impacts the customer experience. Really, there is an endless amount of combinations of variables that could be used here to tell various different stories.
For this challenge, I am choosing to focus on the demographics of guests and the types of stays they are most interested in given things like family structure, location, and customer type (repeat vs not).
hotel is_canceled lead_time arrival_date_year
Length:119390 Min. :0.0000 Min. : 0 Min. :2015
Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
Mode :character Median :0.0000 Median : 69 Median :2016
Mean :0.3704 Mean :104 Mean :2016
3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
Max. :1.0000 Max. :737 Max. :2017
arrival_date_month arrival_date_week_number arrival_date_day_of_month
Length:119390 Min. : 1.00 Min. : 1.0
Class :character 1st Qu.:16.00 1st Qu.: 8.0
Mode :character Median :28.00 Median :16.0
Mean :27.17 Mean :15.8
3rd Qu.:38.00 3rd Qu.:23.0
Max. :53.00 Max. :31.0
stays_in_weekend_nights stays_in_week_nights adults
Min. : 0.0000 Min. : 0.0 Min. : 0.000
1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
Median : 1.0000 Median : 2.0 Median : 2.000
Mean : 0.9276 Mean : 2.5 Mean : 1.856
3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
Max. :19.0000 Max. :50.0 Max. :55.000
children babies meal country
Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
Median : 0.0000 Median : 0.000000 Mode :character Mode :character
Mean : 0.1039 Mean : 0.007949
3rd Qu.: 0.0000 3rd Qu.: 0.000000
Max. :10.0000 Max. :10.000000
NA's :4
market_segment distribution_channel is_repeated_guest
Length:119390 Length:119390 Min. :0.00000
Class :character Class :character 1st Qu.:0.00000
Mode :character Mode :character Median :0.00000
Mean :0.03191
3rd Qu.:0.00000
Max. :1.00000
previous_cancellations previous_bookings_not_canceled reserved_room_type
Min. : 0.00000 Min. : 0.0000 Length:119390
1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
Median : 0.00000 Median : 0.0000 Mode :character
Mean : 0.08712 Mean : 0.1371
3rd Qu.: 0.00000 3rd Qu.: 0.0000
Max. :26.00000 Max. :72.0000
assigned_room_type booking_changes deposit_type agent
Length:119390 Min. : 0.0000 Length:119390 Length:119390
Class :character 1st Qu.: 0.0000 Class :character Class :character
Mode :character Median : 0.0000 Mode :character Mode :character
Mean : 0.2211
3rd Qu.: 0.0000
Max. :21.0000
company days_in_waiting_list customer_type adr
Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
Mode :character Median : 0.000 Mode :character Median : 94.58
Mean : 2.321 Mean : 101.83
3rd Qu.: 0.000 3rd Qu.: 126.00
Max. :391.000 Max. :5400.00
required_car_parking_spaces total_of_special_requests reservation_status
Min. :0.00000 Min. :0.0000 Length:119390
1st Qu.:0.00000 1st Qu.:0.0000 Class :character
Median :0.00000 Median :0.0000 Mode :character
Mean :0.06252 Mean :0.5714
3rd Qu.:0.00000 3rd Qu.:1.0000
Max. :8.00000 Max. :5.0000
reservation_status_date
Min. :2014-10-17
1st Qu.:2016-02-01
Median :2016-08-07
Mean :2016-07-30
3rd Qu.:2017-02-08
Max. :2017-09-14
[1] 119390 32
# A tibble: 119,390 × 7
stays_in_weekend_nights stays_in_week…¹ adults child…² babies country total…³
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 0 0 2 0 0 PRT 0
2 0 0 2 0 0 PRT 0
3 0 1 1 0 0 GBR 0
4 0 1 1 0 0 GBR 0
5 0 2 2 0 0 GBR 1
6 0 2 2 0 0 GBR 1
7 0 2 2 0 0 PRT 0
8 0 2 2 0 0 PRT 1
9 0 3 2 0 0 PRT 1
10 0 3 2 0 0 PRT 0
# … with 119,380 more rows, and abbreviated variable names
# ¹stays_in_week_nights, ²children, ³total_of_special_requests
stays_in_weekend_nights stays_in_week_nights adults
Min. : 0.0000 Min. : 0.0 Min. : 0.000
1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
Median : 1.0000 Median : 2.0 Median : 2.000
Mean : 0.9276 Mean : 2.5 Mean : 1.856
3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
Max. :19.0000 Max. :50.0 Max. :55.000
children babies country
Min. : 0.0000 Min. : 0.000000 Length:119390
1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character
Median : 0.0000 Median : 0.000000 Mode :character
Mean : 0.1039 Mean : 0.007949
3rd Qu.: 0.0000 3rd Qu.: 0.000000
Max. :10.0000 Max. :10.000000
NA's :4
total_of_special_requests
Min. :0.0000
1st Qu.:0.0000
Median :0.0000
Mean :0.5714
3rd Qu.:1.0000
Max. :5.0000
Above, I made a smaller version of my dataset using only some of the columns that apply to the demographics of customers, their country, and their length of stay. Below, I filtered by stays with two adults and more than one child in country PRT to see only the data from Portugal. I descended this data by the total number of special requests. I can see that the maximum number of special requests from families with 2 adults and at least one child was five. I also sliced the data for the top 10 most special request rows and can now easily see that 9 out of the 10 rows correspond with a stay 3 nights or longer, drawing the conclusion that the longer the stay the more likely families may be to submit special requests.
# A tibble: 10 × 7
stays_in_weekend_nights stays_in_week…¹ adults child…² babies country total…³
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 5 2 1 0 PRT 5
2 2 3 2 1 0 PRT 5
3 2 4 2 1 0 PRT 4
4 1 1 2 1 0 PRT 4
5 2 3 2 1 1 PRT 4
6 2 3 2 1 0 PRT 4
7 1 4 2 1 0 PRT 4
8 1 4 2 1 0 PRT 4
9 1 2 2 1 0 PRT 4
10 1 2 2 2 0 PRT 4
# … with abbreviated variable names ¹stays_in_week_nights, ²children,
# ³total_of_special_requests
# A tibble: 10 × 7
stays_in_weekend_nights stays_in_week…¹ adults child…² babies country total…³
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 3 5 2 0 2 PRT 0
2 2 3 2 0 2 PRT 3
3 2 7 2 0 2 PRT 1
4 2 0 2 0 2 PRT 1
5 2 6 2 0 2 PRT 1
6 2 2 2 0 2 PRT 0
7 1 4 2 0 2 PRT 2
8 1 0 2 0 2 PRT 2
9 1 0 2 1 2 PRT 3
10 0 2 2 0 10 PRT 1
# … with abbreviated variable names ¹stays_in_week_nights, ²children,
# ³total_of_special_requests
stays_in_weekend_nights stays_in_week_nights adults children
Min. :0.000 Min. :0.000 Min. :2 Min. :0.0000
1st Qu.:0.000 1st Qu.:1.000 1st Qu.:2 1st Qu.:0.0000
Median :1.000 Median :2.000 Median :2 Median :0.0000
Mean :1.231 Mean :2.538 Mean :2 Mean :0.2308
3rd Qu.:2.000 3rd Qu.:4.000 3rd Qu.:2 3rd Qu.:0.0000
Max. :3.000 Max. :7.000 Max. :2 Max. :1.0000
babies country total_of_special_requests
Min. : 2.000 Length:13 Min. :0.000
1st Qu.: 2.000 Class :character 1st Qu.:1.000
Median : 2.000 Mode :character Median :2.000
Mean : 2.615 Mean :1.615
3rd Qu.: 2.000 3rd Qu.:2.000
Max. :10.000 Max. :3.000
Above I switched my filtering to show that customers with two or more babies stayed up to eight days max. Customers with two or more babies stayed an average of 1-2 weekend nights, and 2-3 weeknights.
---
title: "Challenge 2 - Abby Balint"
author: "Abby Balint "
desription: "Data wrangling: using group() and summarise()"
date: "09/21/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- hotel_bookings
- abby_balint
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Read in the Data
- hotel_bookings.csv ⭐⭐⭐⭐
```{r}
read_csv("_data/hotel_bookings.csv")
hoteldata <- read_csv("_data/hotel_bookings.csv")
```
## Describe the data
This data set contains information about hotel stays in various countries from 2015-2017. There is a lot of different types of data contained here, 32 variables and 119390 rows of data. This is likely from an international hotel chain with many locations and hotel types based on the variables that this dataset contains. Because of the large number of variables, I think it would be best to look at the variables in sets as they relate to each other. For example, we could look at data about time of year, length of stay, and lead time for booking, which would tell us about behaviors around advanced booking practices. Or, we could look at the room types, number of adults/children, meals, and parking spaces used to build a profile about the demographics of customers and their habits. There is even data around marketing strategies that could help show how marketing impacts the customer experience. Really, there is an endless amount of combinations of variables that could be used here to tell various different stories.
For this challenge, I am choosing to focus on the demographics of guests and the types of stays they are most interested in given things like family structure, location, and customer type (repeat vs not).
```{r}
#| label: summary
summary (hoteldata)
dim(hoteldata)
```
```{r}
#| label: summary2
select(hoteldata, "stays_in_weekend_nights", "stays_in_week_nights", "adults", "children", "babies", "country", "total_of_special_requests")
hotelcut <- select(hoteldata, "stays_in_weekend_nights", "stays_in_week_nights", "adults", "children", "babies", "country", "total_of_special_requests")
```
```{r}
#| label: summary3
summary(hotelcut)
```
## Provide Grouped Summary Statistics
Above, I made a smaller version of my dataset using only some of the columns that apply to the demographics of customers, their country, and their length of stay.
Below, I filtered by stays with two adults and more than one child in country PRT to see only the data from Portugal. I descended this data by the total number of special requests. I can see that the maximum number of special requests from families with 2 adults and at least one child was five.
I also sliced the data for the top 10 most special request rows and can now easily see that 9 out of the 10 rows correspond with a stay 3 nights or longer, drawing the conclusion that the longer the stay the more likely families may be to submit special requests.
```{r}
#| label: summary4
filter(hotelcut, `adults` == 2 & `children` >= 1 & `country`== "PRT" & `stays_in_weekend_nights` >= 1) %>%
arrange(desc(`total_of_special_requests`)) %>%
slice(1:10)
```
```{r}
#| label: summary5
filter(hotelcut, `babies` >= 2 & `country`== "PRT") %>%
arrange(desc(`stays_in_weekend_nights`)) %>%
slice(1:10)
```
```{r}
#| label: summary6
filter(hotelcut, `babies` >= 2 & `country`== "PRT") %>%
summary()
```
### Explain and Interpret
Above I switched my filtering to show that customers with two or more babies stayed up to eight days max. Customers with two or more babies stayed an average of 1-2 weekend nights, and 2-3 weeknights.