DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 2 - Abby Balint

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Read in the Data
  • Describe the data
  • Provide Grouped Summary Statistics
    • Explain and Interpret

Challenge 2 - Abby Balint

  • Show All Code
  • Hide All Code

  • View Source
challenge_2
hotel_bookings
abby_balint
Author

Abby Balint

Published

September 21, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in the Data

  • hotel_bookings.csv ⭐⭐⭐⭐
Code
read_csv("_data/hotel_bookings.csv")
# A tibble: 119,390 × 32
   hotel  is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
   <chr>    <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 Resor…       0     342    2015 July         27       1       0       0      2
 2 Resor…       0     737    2015 July         27       1       0       0      2
 3 Resor…       0       7    2015 July         27       1       0       1      1
 4 Resor…       0      13    2015 July         27       1       0       1      1
 5 Resor…       0      14    2015 July         27       1       0       2      2
 6 Resor…       0      14    2015 July         27       1       0       2      2
 7 Resor…       0       0    2015 July         27       1       0       2      2
 8 Resor…       0       9    2015 July         27       1       0       2      2
 9 Resor…       1      85    2015 July         27       1       0       3      2
10 Resor…       1      75    2015 July         27       1       0       3      2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
Code
hoteldata <- read_csv("_data/hotel_bookings.csv")

Describe the data

This data set contains information about hotel stays in various countries from 2015-2017. There is a lot of different types of data contained here, 32 variables and 119390 rows of data. This is likely from an international hotel chain with many locations and hotel types based on the variables that this dataset contains. Because of the large number of variables, I think it would be best to look at the variables in sets as they relate to each other. For example, we could look at data about time of year, length of stay, and lead time for booking, which would tell us about behaviors around advanced booking practices. Or, we could look at the room types, number of adults/children, meals, and parking spaces used to build a profile about the demographics of customers and their habits. There is even data around marketing strategies that could help show how marketing impacts the customer experience. Really, there is an endless amount of combinations of variables that could be used here to tell various different stories.

For this challenge, I am choosing to focus on the demographics of guests and the types of stays they are most interested in given things like family structure, location, and customer type (repeat vs not).

Code
summary (hoteldata)
    hotel            is_canceled       lead_time   arrival_date_year
 Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
 Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
 Mode  :character   Median :0.0000   Median : 69   Median :2016     
                    Mean   :0.3704   Mean   :104   Mean   :2016     
                    3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
                    Max.   :1.0000   Max.   :737   Max.   :2017     
                                                                    
 arrival_date_month arrival_date_week_number arrival_date_day_of_month
 Length:119390      Min.   : 1.00            Min.   : 1.0             
 Class :character   1st Qu.:16.00            1st Qu.: 8.0             
 Mode  :character   Median :28.00            Median :16.0             
                    Mean   :27.17            Mean   :15.8             
                    3rd Qu.:38.00            3rd Qu.:23.0             
                    Max.   :53.00            Max.   :31.0             
                                                                      
 stays_in_weekend_nights stays_in_week_nights     adults      
 Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
 1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
 Median : 1.0000         Median : 2.0         Median : 2.000  
 Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
 3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
 Max.   :19.0000         Max.   :50.0         Max.   :55.000  
                                                              
    children           babies              meal             country         
 Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
 1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
 Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
 Mean   : 0.1039   Mean   : 0.007949                                        
 3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
 Max.   :10.0000   Max.   :10.000000                                        
 NA's   :4                                                                  
 market_segment     distribution_channel is_repeated_guest
 Length:119390      Length:119390        Min.   :0.00000  
 Class :character   Class :character     1st Qu.:0.00000  
 Mode  :character   Mode  :character     Median :0.00000  
                                         Mean   :0.03191  
                                         3rd Qu.:0.00000  
                                         Max.   :1.00000  
                                                          
 previous_cancellations previous_bookings_not_canceled reserved_room_type
 Min.   : 0.00000       Min.   : 0.0000                Length:119390     
 1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
 Median : 0.00000       Median : 0.0000                Mode  :character  
 Mean   : 0.08712       Mean   : 0.1371                                  
 3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
 Max.   :26.00000       Max.   :72.0000                                  
                                                                         
 assigned_room_type booking_changes   deposit_type          agent          
 Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
 Class :character   1st Qu.: 0.0000   Class :character   Class :character  
 Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
                    Mean   : 0.2211                                        
                    3rd Qu.: 0.0000                                        
                    Max.   :21.0000                                        
                                                                           
   company          days_in_waiting_list customer_type           adr         
 Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
 Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
 Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
                    Mean   :  2.321                         Mean   : 101.83  
                    3rd Qu.:  0.000                         3rd Qu.: 126.00  
                    Max.   :391.000                         Max.   :5400.00  
                                                                             
 required_car_parking_spaces total_of_special_requests reservation_status
 Min.   :0.00000             Min.   :0.0000            Length:119390     
 1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
 Median :0.00000             Median :0.0000            Mode  :character  
 Mean   :0.06252             Mean   :0.5714                              
 3rd Qu.:0.00000             3rd Qu.:1.0000                              
 Max.   :8.00000             Max.   :5.0000                              
                                                                         
 reservation_status_date
 Min.   :2014-10-17     
 1st Qu.:2016-02-01     
 Median :2016-08-07     
 Mean   :2016-07-30     
 3rd Qu.:2017-02-08     
 Max.   :2017-09-14     
                        
Code
dim(hoteldata)
[1] 119390     32
Code
select(hoteldata, "stays_in_weekend_nights", "stays_in_week_nights", "adults", "children", "babies", "country", "total_of_special_requests")
# A tibble: 119,390 × 7
   stays_in_weekend_nights stays_in_week…¹ adults child…² babies country total…³
                     <dbl>           <dbl>  <dbl>   <dbl>  <dbl> <chr>     <dbl>
 1                       0               0      2       0      0 PRT           0
 2                       0               0      2       0      0 PRT           0
 3                       0               1      1       0      0 GBR           0
 4                       0               1      1       0      0 GBR           0
 5                       0               2      2       0      0 GBR           1
 6                       0               2      2       0      0 GBR           1
 7                       0               2      2       0      0 PRT           0
 8                       0               2      2       0      0 PRT           1
 9                       0               3      2       0      0 PRT           1
10                       0               3      2       0      0 PRT           0
# … with 119,380 more rows, and abbreviated variable names
#   ¹​stays_in_week_nights, ²​children, ³​total_of_special_requests
Code
hotelcut <- select(hoteldata, "stays_in_weekend_nights", "stays_in_week_nights", "adults", "children", "babies", "country", "total_of_special_requests")
Code
summary(hotelcut)
 stays_in_weekend_nights stays_in_week_nights     adults      
 Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
 1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
 Median : 1.0000         Median : 2.0         Median : 2.000  
 Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
 3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
 Max.   :19.0000         Max.   :50.0         Max.   :55.000  
                                                              
    children           babies            country         
 Min.   : 0.0000   Min.   : 0.000000   Length:119390     
 1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character  
 Median : 0.0000   Median : 0.000000   Mode  :character  
 Mean   : 0.1039   Mean   : 0.007949                     
 3rd Qu.: 0.0000   3rd Qu.: 0.000000                     
 Max.   :10.0000   Max.   :10.000000                     
 NA's   :4                                               
 total_of_special_requests
 Min.   :0.0000           
 1st Qu.:0.0000           
 Median :0.0000           
 Mean   :0.5714           
 3rd Qu.:1.0000           
 Max.   :5.0000           
                          

Provide Grouped Summary Statistics

Above, I made a smaller version of my dataset using only some of the columns that apply to the demographics of customers, their country, and their length of stay. Below, I filtered by stays with two adults and more than one child in country PRT to see only the data from Portugal. I descended this data by the total number of special requests. I can see that the maximum number of special requests from families with 2 adults and at least one child was five. I also sliced the data for the top 10 most special request rows and can now easily see that 9 out of the 10 rows correspond with a stay 3 nights or longer, drawing the conclusion that the longer the stay the more likely families may be to submit special requests.

Code
filter(hotelcut, `adults` == 2 & `children` >= 1 & `country`== "PRT" & `stays_in_weekend_nights` >= 1) %>%
  arrange(desc(`total_of_special_requests`)) %>%
  slice(1:10)
# A tibble: 10 × 7
   stays_in_weekend_nights stays_in_week…¹ adults child…² babies country total…³
                     <dbl>           <dbl>  <dbl>   <dbl>  <dbl> <chr>     <dbl>
 1                       1               5      2       1      0 PRT           5
 2                       2               3      2       1      0 PRT           5
 3                       2               4      2       1      0 PRT           4
 4                       1               1      2       1      0 PRT           4
 5                       2               3      2       1      1 PRT           4
 6                       2               3      2       1      0 PRT           4
 7                       1               4      2       1      0 PRT           4
 8                       1               4      2       1      0 PRT           4
 9                       1               2      2       1      0 PRT           4
10                       1               2      2       2      0 PRT           4
# … with abbreviated variable names ¹​stays_in_week_nights, ²​children,
#   ³​total_of_special_requests
Code
filter(hotelcut, `babies` >= 2 & `country`== "PRT") %>%
  arrange(desc(`stays_in_weekend_nights`)) %>%
  slice(1:10)
# A tibble: 10 × 7
   stays_in_weekend_nights stays_in_week…¹ adults child…² babies country total…³
                     <dbl>           <dbl>  <dbl>   <dbl>  <dbl> <chr>     <dbl>
 1                       3               5      2       0      2 PRT           0
 2                       2               3      2       0      2 PRT           3
 3                       2               7      2       0      2 PRT           1
 4                       2               0      2       0      2 PRT           1
 5                       2               6      2       0      2 PRT           1
 6                       2               2      2       0      2 PRT           0
 7                       1               4      2       0      2 PRT           2
 8                       1               0      2       0      2 PRT           2
 9                       1               0      2       1      2 PRT           3
10                       0               2      2       0     10 PRT           1
# … with abbreviated variable names ¹​stays_in_week_nights, ²​children,
#   ³​total_of_special_requests
Code
filter(hotelcut, `babies` >= 2 & `country`== "PRT") %>%
  summary()
 stays_in_weekend_nights stays_in_week_nights     adults     children     
 Min.   :0.000           Min.   :0.000        Min.   :2   Min.   :0.0000  
 1st Qu.:0.000           1st Qu.:1.000        1st Qu.:2   1st Qu.:0.0000  
 Median :1.000           Median :2.000        Median :2   Median :0.0000  
 Mean   :1.231           Mean   :2.538        Mean   :2   Mean   :0.2308  
 3rd Qu.:2.000           3rd Qu.:4.000        3rd Qu.:2   3rd Qu.:0.0000  
 Max.   :3.000           Max.   :7.000        Max.   :2   Max.   :1.0000  
     babies         country          total_of_special_requests
 Min.   : 2.000   Length:13          Min.   :0.000            
 1st Qu.: 2.000   Class :character   1st Qu.:1.000            
 Median : 2.000   Mode  :character   Median :2.000            
 Mean   : 2.615                      Mean   :1.615            
 3rd Qu.: 2.000                      3rd Qu.:2.000            
 Max.   :10.000                      Max.   :3.000            

Explain and Interpret

Above I switched my filtering to show that customers with two or more babies stayed up to eight days max. Customers with two or more babies stayed an average of 1-2 weekend nights, and 2-3 weeknights.

Source Code
---
title: "Challenge 2 - Abby Balint"
author: "Abby Balint "
desription: "Data wrangling: using group() and summarise()"
date: "09/21/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
  - hotel_bookings
  - abby_balint
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Read in the Data

-   hotel_bookings.csv ⭐⭐⭐⭐

```{r}
read_csv("_data/hotel_bookings.csv")
hoteldata <- read_csv("_data/hotel_bookings.csv")
```


## Describe the data

This data set contains information about hotel stays in various countries from 2015-2017. There is a lot of different types of data contained here, 32 variables and 119390 rows of data. This is likely from an international hotel chain with many locations and hotel types based on the variables that this dataset contains. Because of the large number of variables, I think it would be best to look at the variables in sets as they relate to each other. For example, we could look at data about time of year, length of stay, and lead time for booking, which would tell us about behaviors around advanced booking practices. Or, we could look at the room types, number of adults/children, meals, and parking spaces used to build a profile about the demographics of customers and their habits. There is even data around marketing strategies that could help show how marketing impacts the customer experience. Really, there is an endless amount of combinations of variables that could be used here to tell various different stories.

For this challenge, I am choosing to focus on the demographics of guests and the types of stays they are most interested in given things like family structure, location, and customer type (repeat vs not).

```{r}
#| label: summary
summary (hoteldata)
dim(hoteldata)
```
```{r}
#| label: summary2
select(hoteldata, "stays_in_weekend_nights", "stays_in_week_nights", "adults", "children", "babies", "country", "total_of_special_requests")
hotelcut <- select(hoteldata, "stays_in_weekend_nights", "stays_in_week_nights", "adults", "children", "babies", "country", "total_of_special_requests")
```
```{r}
#| label: summary3
summary(hotelcut)
```

## Provide Grouped Summary Statistics


Above, I made a smaller version of my dataset using only some of the columns that apply to the demographics of customers, their country, and their length of stay.
Below, I filtered by stays with two adults and more than one child in country PRT to see only the data from Portugal. I descended this data by the total number of special requests. I can see that the maximum number of special requests from families with 2 adults and at least one child was five.
I also sliced the data for the top 10 most special request rows and can now easily see that 9 out of the 10 rows correspond with a stay 3 nights or longer, drawing the conclusion that the longer the stay the more likely families may be to submit special requests.

```{r}
#| label: summary4
filter(hotelcut, `adults` == 2 & `children` >= 1 & `country`== "PRT" & `stays_in_weekend_nights` >= 1) %>%
  arrange(desc(`total_of_special_requests`)) %>%
  slice(1:10)

```
```{r}
#| label: summary5
filter(hotelcut, `babies` >= 2 & `country`== "PRT") %>%
  arrange(desc(`stays_in_weekend_nights`)) %>%
  slice(1:10)
```
```{r}
#| label: summary6
filter(hotelcut, `babies` >= 2 & `country`== "PRT") %>%
  summary()
```


### Explain and Interpret

Above I switched my filtering to show that customers with two or more babies stayed up to eight days max. Customers with two or more babies stayed an average of 1-2 weekend nights, and 2-3 weeknights.