Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Darron Bunt
October 9, 2022
Today’s challenge is to
Read in one (or more) of the following data sets, available in the posts/_data
folder, using the correct R package and command.
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
The hotels dataset contains data relating to two types of hotels - city, and resort, from 178 countries. There are 32 total columns of data that provide a high degree of insight into each hotel stay. Some of these details include: * The arrival year, month, week, and day * The number of week-end and weekday nights stayed * The number of adults, children, and babies in the booking * The ADR (Average Daily Rate) for each booking.
Overall, this dataset affords the opportunity to extrapolate a wide variety of details regarding city and resort hotel stays between October 2014 and September 2017.
Conduct some exploratory data analysis, using dplyr commands such as group_by()
, select()
, filter()
, and summarise()
. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
group_by
Group_by is going to allow me to make some groupings between columns in the dataset.
Let’s say I wanted to make a grouping of the hotel type and the arrival month. If I do this right, (and bookings of both types occurred in every month), I should get 24 groups.
# A tibble: 119,390 × 32
# Groups: hotel, arrival_date_month [24]
hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Resor… 0 342 2015 July 27 1 0 0 2
2 Resor… 0 737 2015 July 27 1 0 0 2
3 Resor… 0 7 2015 July 27 1 0 1 1
4 Resor… 0 13 2015 July 27 1 0 1 1
5 Resor… 0 14 2015 July 27 1 0 2 2
6 Resor… 0 14 2015 July 27 1 0 2 2
7 Resor… 0 0 2015 July 27 1 0 2 2
8 Resor… 0 9 2015 July 27 1 0 2 2
9 Resor… 1 85 2015 July 27 1 0 3 2
10 Resor… 1 75 2015 July 27 1 0 3 2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
# meal <chr>, country <chr>, market_segment <chr>,
# distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
select()
Select is going to let me pull a subset of the entire data based on the names of the variables.
So let’s say of the 34 columns of hotel data that I have, I’m really only interested in a few of them. I can select those.
I’m interested in the hotel type, the arrival month, and the average daily rate.
# A tibble: 119,390 × 3
hotel arrival_date_month adr
<chr> <chr> <dbl>
1 Resort Hotel July 0
2 Resort Hotel July 0
3 Resort Hotel July 75
4 Resort Hotel July 75
5 Resort Hotel July 98
6 Resort Hotel July 98
7 Resort Hotel July 107
8 Resort Hotel July 103
9 Resort Hotel July 82
10 Resort Hotel July 106.
# … with 119,380 more rows
filter()
Filter is going to let me pull some specific data that I might be interested in. Maybe I really only care about bookings that had a lead time that was greater that 400 days.
# A tibble: 2,114 × 32
hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Resor… 0 737 2015 July 27 1 0 0 2
2 Resor… 0 460 2015 August 32 3 2 5 1
3 Resor… 0 709 2016 Februa… 9 25 8 20 2
4 Resor… 1 468 2016 May 20 12 0 2 2
5 Resor… 1 468 2016 May 20 12 0 2 2
6 Resor… 1 468 2016 May 20 12 0 2 2
7 Resor… 1 468 2016 May 20 12 0 2 2
8 Resor… 1 468 2016 May 20 12 0 2 2
9 Resor… 1 468 2016 May 20 12 0 2 2
10 Resor… 1 468 2016 May 20 12 0 2 2
# … with 2,104 more rows, 22 more variables: children <dbl>, babies <dbl>,
# meal <chr>, country <chr>, market_segment <chr>,
# distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
summarize()
Summarize is going to come in handy if I want to collapse the large set of data down into something a little smaller. An example would be what the average ADR is across the full dataset
Now we know that the average ADR is $101.83.
Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
I can also start finding random and specific things, like the average lead time by hotel type for bookings with a lead time greater than 400 days.
# A tibble: 2 × 2
hotel lead_time
<chr> <dbl>
1 City Hotel 468.
2 Resort Hotel 467
I could add some additional layers to the group_by, too. Maybe I want to figure out the average lead time for bookings with a 400+ day lead time, and break that down by the month of the year of the booking and the type of hotel.
# A tibble: 18 × 3
# Groups: hotel [2]
hotel arrival_date_month lead_time
<chr> <chr> <dbl>
1 City Hotel April 443
2 City Hotel August 460.
3 City Hotel December 482.
4 City Hotel February 584.
5 City Hotel January 556.
6 City Hotel July 438.
7 City Hotel June 435.
8 City Hotel March 615
9 City Hotel May 424.
10 City Hotel November 533.
11 City Hotel October 483.
12 City Hotel September 428.
13 Resort Hotel August 424.
14 Resort Hotel February 709
15 Resort Hotel July 580.
16 Resort Hotel May 453.
17 Resort Hotel October 443.
18 Resort Hotel September 523.
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
So I’m really interested in people who plan ahead by 400+ days, and how doing so might vary between people who book at city hotels vs. resort hotels. I’m also really curious about whether the lead time varies depending on the month of the booking.
Interestingly, the highest average 400+ day lead time can be found in resort hotel bookings for the month of February (709), while for city hotels, the highest average was in March (615).
---
title: "Challenge 2"
author: "Darron Bunt"
desription: "Data wrangling: using group() and summarise()"
date: "10/09/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- hotel_bookings
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## Step 1 - Read in the Data
*Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.*
- - hotel_bookings.csv ⭐⭐⭐⭐
```{r}
hotels <- read_csv("_data/hotel_bookings.csv")
```
*Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.*
## Step 2 - Describe the data
*Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).*
The hotels dataset contains data relating to two types of hotels - city, and resort, from 178 countries. There are 32 total columns of data that provide a high degree of insight into each hotel stay. Some of these details include:
* The arrival year, month, week, and day
* The number of week-end and weekday nights stayed
* The number of adults, children, and babies in the booking
* The ADR (Average Daily Rate) for each booking.
Overall, this dataset affords the opportunity to extrapolate a wide variety of details regarding city and resort hotel stays between October 2014 and September 2017.
```{r}
#| label: summary
```
## Step 3 - Provide Grouped Summary Statistics
*Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.*
**group_by**
Group_by is going to allow me to make some groupings between columns in the dataset.
Let's say I wanted to make a grouping of the hotel type and the arrival month. If I do this right, (and bookings of both types occurred in every month), I should get 24 groups.
```{r}
group_by(hotels, hotel, arrival_date_month)
```
**select()**
Select is going to let me pull a subset of the entire data based on the names of the variables.
So let's say of the 34 columns of hotel data that I have, I'm really only interested in a few of them. I can select those.
I'm interested in the hotel type, the arrival month, and the average daily rate.
```{r}
select(hotels, hotel, arrival_date_month, adr)
```
**filter()**
Filter is going to let me pull some specific data that I might be interested in. Maybe I really only care about bookings that had a lead time that was greater that 400 days.
```{r}
filter(hotels, lead_time>400)
```
**summarize()**
Summarize is going to come in handy if I want to collapse the large set of data down into something a little smaller. An example would be what the average ADR is across the full dataset
```{r}
summarize(hotels, adr = mean(adr))
```
Now we know that the average ADR is $101.83.
**Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.**
I can also start finding random and specific things, like the average lead time by hotel type for bookings with a lead time greater than 400 days.
```{r}
hotels %>%
filter(lead_time>400) %>%
group_by(hotel) %>%
summarize(lead_time = mean(lead_time))
```
I could add some additional layers to the group_by, too. Maybe I want to figure out the average lead time for bookings with a 400+ day lead time, and break that down by the month of the year of the booking and the type of hotel.
```{r}
hotels%>%
filter(lead_time>400)%>%
group_by(hotel, arrival_date_month)%>%
summarize(lead_time = mean(lead_time))
```
### Step 4 - Explain and Interpret
*Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.*
So I'm really interested in people who plan ahead by 400+ days, and how doing so might vary between people who book at city hotels vs. resort hotels. I'm also really curious about whether the lead time varies depending on the month of the booking.
Interestingly, the highest average 400+ day lead time can be found in resort hotel bookings for the month of February (709), while for city hotels, the highest average was in March (615).