DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 2

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Challenge Overview
  • Step 1 - Read in the Data
  • Step 2 - Describe the data
  • Step 3 - Provide Grouped Summary Statistics
    • Step 4 - Explain and Interpret

Challenge 2

  • Show All Code
  • Hide All Code

  • View Source
challenge_2
hotel_bookings
Author

Darron Bunt

Published

October 9, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Step 1 - Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

    • hotel_bookings.csv ⭐⭐⭐⭐
Code
hotels <- read_csv("_data/hotel_bookings.csv")

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

Step 2 - Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

The hotels dataset contains data relating to two types of hotels - city, and resort, from 178 countries. There are 32 total columns of data that provide a high degree of insight into each hotel stay. Some of these details include: * The arrival year, month, week, and day * The number of week-end and weekday nights stayed * The number of adults, children, and babies in the booking * The ADR (Average Daily Rate) for each booking.

Overall, this dataset affords the opportunity to extrapolate a wide variety of details regarding city and resort hotel stays between October 2014 and September 2017.

Step 3 - Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

group_by

Group_by is going to allow me to make some groupings between columns in the dataset.

Let’s say I wanted to make a grouping of the hotel type and the arrival month. If I do this right, (and bookings of both types occurred in every month), I should get 24 groups.

Code
group_by(hotels, hotel, arrival_date_month)
# A tibble: 119,390 × 32
# Groups:   hotel, arrival_date_month [24]
   hotel  is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
   <chr>    <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 Resor…       0     342    2015 July         27       1       0       0      2
 2 Resor…       0     737    2015 July         27       1       0       0      2
 3 Resor…       0       7    2015 July         27       1       0       1      1
 4 Resor…       0      13    2015 July         27       1       0       1      1
 5 Resor…       0      14    2015 July         27       1       0       2      2
 6 Resor…       0      14    2015 July         27       1       0       2      2
 7 Resor…       0       0    2015 July         27       1       0       2      2
 8 Resor…       0       9    2015 July         27       1       0       2      2
 9 Resor…       1      85    2015 July         27       1       0       3      2
10 Resor…       1      75    2015 July         27       1       0       3      2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …

select()

Select is going to let me pull a subset of the entire data based on the names of the variables.

So let’s say of the 34 columns of hotel data that I have, I’m really only interested in a few of them. I can select those.

I’m interested in the hotel type, the arrival month, and the average daily rate.

Code
select(hotels, hotel, arrival_date_month, adr)
# A tibble: 119,390 × 3
   hotel        arrival_date_month   adr
   <chr>        <chr>              <dbl>
 1 Resort Hotel July                  0 
 2 Resort Hotel July                  0 
 3 Resort Hotel July                 75 
 4 Resort Hotel July                 75 
 5 Resort Hotel July                 98 
 6 Resort Hotel July                 98 
 7 Resort Hotel July                107 
 8 Resort Hotel July                103 
 9 Resort Hotel July                 82 
10 Resort Hotel July                106.
# … with 119,380 more rows

filter()

Filter is going to let me pull some specific data that I might be interested in. Maybe I really only care about bookings that had a lead time that was greater that 400 days.

Code
filter(hotels, lead_time>400)
# A tibble: 2,114 × 32
   hotel  is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
   <chr>    <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 Resor…       0     737    2015 July         27       1       0       0      2
 2 Resor…       0     460    2015 August       32       3       2       5      1
 3 Resor…       0     709    2016 Februa…       9      25       8      20      2
 4 Resor…       1     468    2016 May          20      12       0       2      2
 5 Resor…       1     468    2016 May          20      12       0       2      2
 6 Resor…       1     468    2016 May          20      12       0       2      2
 7 Resor…       1     468    2016 May          20      12       0       2      2
 8 Resor…       1     468    2016 May          20      12       0       2      2
 9 Resor…       1     468    2016 May          20      12       0       2      2
10 Resor…       1     468    2016 May          20      12       0       2      2
# … with 2,104 more rows, 22 more variables: children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …

summarize()

Summarize is going to come in handy if I want to collapse the large set of data down into something a little smaller. An example would be what the average ADR is across the full dataset

Code
summarize(hotels, adr = mean(adr))
# A tibble: 1 × 1
    adr
  <dbl>
1  102.

Now we know that the average ADR is $101.83.

Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

I can also start finding random and specific things, like the average lead time by hotel type for bookings with a lead time greater than 400 days.

Code
hotels %>% 
  filter(lead_time>400) %>% 
  group_by(hotel) %>% 
  summarize(lead_time = mean(lead_time))
# A tibble: 2 × 2
  hotel        lead_time
  <chr>            <dbl>
1 City Hotel        468.
2 Resort Hotel      467 

I could add some additional layers to the group_by, too. Maybe I want to figure out the average lead time for bookings with a 400+ day lead time, and break that down by the month of the year of the booking and the type of hotel.

Code
hotels%>%
  filter(lead_time>400)%>%
  group_by(hotel, arrival_date_month)%>%
  summarize(lead_time = mean(lead_time))
# A tibble: 18 × 3
# Groups:   hotel [2]
   hotel        arrival_date_month lead_time
   <chr>        <chr>                  <dbl>
 1 City Hotel   April                   443 
 2 City Hotel   August                  460.
 3 City Hotel   December                482.
 4 City Hotel   February                584.
 5 City Hotel   January                 556.
 6 City Hotel   July                    438.
 7 City Hotel   June                    435.
 8 City Hotel   March                   615 
 9 City Hotel   May                     424.
10 City Hotel   November                533.
11 City Hotel   October                 483.
12 City Hotel   September               428.
13 Resort Hotel August                  424.
14 Resort Hotel February                709 
15 Resort Hotel July                    580.
16 Resort Hotel May                     453.
17 Resort Hotel October                 443.
18 Resort Hotel September               523.

Step 4 - Explain and Interpret

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.

So I’m really interested in people who plan ahead by 400+ days, and how doing so might vary between people who book at city hotels vs. resort hotels. I’m also really curious about whether the lead time varies depending on the month of the booking.

Interestingly, the highest average 400+ day lead time can be found in resort hotel bookings for the month of February (709), while for city hotels, the highest average was in March (615).

Source Code
---
title: "Challenge 2"
author: "Darron Bunt"
desription: "Data wrangling: using group() and summarise()"
date: "10/09/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
  - hotel_bookings
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

Today's challenge is to

1)  read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2)  provide summary statistics for different interesting groups within the data, and interpret those statistics

## Step 1 - Read in the Data

*Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.*

-   -   hotel_bookings.csv ⭐⭐⭐⭐

```{r}
hotels <- read_csv("_data/hotel_bookings.csv")
```

*Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.*

## Step 2 - Describe the data

*Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).*

The hotels dataset contains data relating to two types of hotels - city, and resort, from 178 countries. There are 32 total columns of data that provide a high degree of insight into each hotel stay. Some of these details include: 
* The arrival year, month, week, and day
* The number of week-end and weekday nights stayed
* The number of adults, children, and babies in the booking
* The ADR (Average Daily Rate) for each booking.

Overall, this dataset affords the opportunity to extrapolate a wide variety of details regarding city and resort hotel stays between October 2014 and September 2017.


```{r}
#| label: summary

```

## Step 3 - Provide Grouped Summary Statistics

*Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.*

**group_by**

Group_by is going to allow me to make some groupings between columns in the dataset. 

Let's say I wanted to make a grouping of the hotel type and the arrival month. If I do this right, (and bookings of both types occurred in every month), I should get 24 groups.

```{r}
group_by(hotels, hotel, arrival_date_month)
```
**select()**

Select is going to let me pull a subset of the entire data based on the names of the variables.

So let's say of the 34 columns of hotel data that I have, I'm really only interested in a few of them. I can select those.

I'm interested in the hotel type, the arrival month, and the average daily rate.

```{r}
select(hotels, hotel, arrival_date_month, adr)
```
**filter()**

Filter is going to let me pull some specific data that I might be interested in. Maybe I really only care about bookings that had a lead time that was greater that 400 days. 

```{r}
filter(hotels, lead_time>400)
```

**summarize()**

Summarize is going to come in handy if I want to collapse the large set of data down into something a little smaller. An example would be what the average ADR is across the full dataset

```{r}
summarize(hotels, adr = mean(adr))
```
Now we know that the average ADR is $101.83.

**Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.**

I can also start finding random and specific things, like the average lead time by hotel type for bookings with a lead time greater than 400 days.

```{r}
hotels %>% 
  filter(lead_time>400) %>% 
  group_by(hotel) %>% 
  summarize(lead_time = mean(lead_time))
```

I could add some additional layers to the group_by, too. Maybe I want to figure out the average lead time for bookings with a 400+ day lead time, and break that down by the month of the year of the booking and the type of hotel. 

```{r}
hotels%>%
  filter(lead_time>400)%>%
  group_by(hotel, arrival_date_month)%>%
  summarize(lead_time = mean(lead_time))
```

### Step 4 - Explain and Interpret

*Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.*

So I'm really interested in people who plan ahead by 400+ days, and how doing so might vary between people who book at city hotels vs. resort hotels. I'm also really curious about whether the lead time varies depending on the month of the booking. 

Interestingly, the highest average 400+ day lead time can be found in resort hotel bookings for the month of February (709), while for city hotels, the highest average was in March (615).