DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 2

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Challenge Overview
  • Read in the Data
  • Describe the data
  • Provide Grouped Summary Statistics
    • Explain and Interpret

Challenge 2

  • Show All Code
  • Hide All Code

  • View Source
challenge_2
railroads
faostat
hotel_bookings
Author

Janhvi Joshi

Published

October 25, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

  • railroad*.csv or StateCounty2012.xls ⭐
  • FAOstat*.csv or birds.csv ⭐⭐⭐
  • hotel_bookings.csv ⭐⭐⭐⭐
Code
hotel_bookings <- read_csv('_data/hotel_bookings.csv')
hotel_bookings
# A tibble: 119,390 × 32
   hotel  is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
   <chr>    <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 Resor…       0     342    2015 July         27       1       0       0      2
 2 Resor…       0     737    2015 July         27       1       0       0      2
 3 Resor…       0       7    2015 July         27       1       0       1      1
 4 Resor…       0      13    2015 July         27       1       0       1      1
 5 Resor…       0      14    2015 July         27       1       0       2      2
 6 Resor…       0      14    2015 July         27       1       0       2      2
 7 Resor…       0       0    2015 July         27       1       0       2      2
 8 Resor…       0       9    2015 July         27       1       0       2      2
 9 Resor…       1      85    2015 July         27       1       0       3      2
10 Resor…       1      75    2015 July         27       1       0       3      2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …

This dataset is a summarising details of hotel bookings which includes data like arrival time, month, year, number of nights, number of adults, children and babies staying. Other interesting details includes whether the guest has previously booked or cancelled bookings at the hotel, hotel room type etc.

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

This dataset summarises various details about a hotel booking and contains 120k records dating from year 2015 to 2017. It can be seen by the summary command and other analysis of data done below that there are two types of hotels in this dataset - Resort Hotel and City Hotel. Customers from around the world; around 160-170 countries, book these hotels. It can also be seen that on an average, around 37% of the bookings are cancelled and around 3% of the guests are repeated. On an average, customers may need to wait for 2.3 days in the waitlist to confirm a booking and around 57% of these bookings include some special requests. The hotels provide 4 different types of meals to their customers. This dataset is likely gathered from various online and offline channels - from where the booking was made.

Code
summary(hotel_bookings)
    hotel            is_canceled       lead_time   arrival_date_year
 Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
 Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
 Mode  :character   Median :0.0000   Median : 69   Median :2016     
                    Mean   :0.3704   Mean   :104   Mean   :2016     
                    3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
                    Max.   :1.0000   Max.   :737   Max.   :2017     
                                                                    
 arrival_date_month arrival_date_week_number arrival_date_day_of_month
 Length:119390      Min.   : 1.00            Min.   : 1.0             
 Class :character   1st Qu.:16.00            1st Qu.: 8.0             
 Mode  :character   Median :28.00            Median :16.0             
                    Mean   :27.17            Mean   :15.8             
                    3rd Qu.:38.00            3rd Qu.:23.0             
                    Max.   :53.00            Max.   :31.0             
                                                                      
 stays_in_weekend_nights stays_in_week_nights     adults      
 Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
 1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
 Median : 1.0000         Median : 2.0         Median : 2.000  
 Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
 3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
 Max.   :19.0000         Max.   :50.0         Max.   :55.000  
                                                              
    children           babies              meal             country         
 Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
 1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
 Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
 Mean   : 0.1039   Mean   : 0.007949                                        
 3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
 Max.   :10.0000   Max.   :10.000000                                        
 NA's   :4                                                                  
 market_segment     distribution_channel is_repeated_guest
 Length:119390      Length:119390        Min.   :0.00000  
 Class :character   Class :character     1st Qu.:0.00000  
 Mode  :character   Mode  :character     Median :0.00000  
                                         Mean   :0.03191  
                                         3rd Qu.:0.00000  
                                         Max.   :1.00000  
                                                          
 previous_cancellations previous_bookings_not_canceled reserved_room_type
 Min.   : 0.00000       Min.   : 0.0000                Length:119390     
 1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
 Median : 0.00000       Median : 0.0000                Mode  :character  
 Mean   : 0.08712       Mean   : 0.1371                                  
 3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
 Max.   :26.00000       Max.   :72.0000                                  
                                                                         
 assigned_room_type booking_changes   deposit_type          agent          
 Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
 Class :character   1st Qu.: 0.0000   Class :character   Class :character  
 Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
                    Mean   : 0.2211                                        
                    3rd Qu.: 0.0000                                        
                    Max.   :21.0000                                        
                                                                           
   company          days_in_waiting_list customer_type           adr         
 Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
 Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
 Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
                    Mean   :  2.321                         Mean   : 101.83  
                    3rd Qu.:  0.000                         3rd Qu.: 126.00  
                    Max.   :391.000                         Max.   :5400.00  
                                                                             
 required_car_parking_spaces total_of_special_requests reservation_status
 Min.   :0.00000             Min.   :0.0000            Length:119390     
 1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
 Median :0.00000             Median :0.0000            Mode  :character  
 Mean   :0.06252             Mean   :0.5714                              
 3rd Qu.:0.00000             3rd Qu.:1.0000                              
 Max.   :8.00000             Max.   :5.0000                              
                                                                         
 reservation_status_date
 Min.   :2014-10-17     
 1st Qu.:2016-02-01     
 Median :2016-08-07     
 Mean   :2016-07-30     
 3rd Qu.:2017-02-08     
 Max.   :2017-09-14     
                        
Code
as_tibble(hotel_bookings)
# A tibble: 119,390 × 32
   hotel  is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
   <chr>    <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 Resor…       0     342    2015 July         27       1       0       0      2
 2 Resor…       0     737    2015 July         27       1       0       0      2
 3 Resor…       0       7    2015 July         27       1       0       1      1
 4 Resor…       0      13    2015 July         27       1       0       1      1
 5 Resor…       0      14    2015 July         27       1       0       2      2
 6 Resor…       0      14    2015 July         27       1       0       2      2
 7 Resor…       0       0    2015 July         27       1       0       2      2
 8 Resor…       0       9    2015 July         27       1       0       2      2
 9 Resor…       1      85    2015 July         27       1       0       3      2
10 Resor…       1      75    2015 July         27       1       0       3      2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
Code
colnames(hotel_bookings)[1] <- c("hotel")
unique_hotel <- unique(bird$hotel)
Error in unique(bird$hotel): object 'bird' not found
Code
unique_hotel
Error in eval(expr, envir, enclos): object 'unique_hotel' not found
Code
colnames(hotel_bookings)[14] <- c("country")
unique_country <- unique(bird$country)
Error in unique(bird$country): object 'bird' not found
Code
unique_country
Error in eval(expr, envir, enclos): object 'unique_country' not found
Code
nrow(hotel_bookings)
[1] 119390
Code
colnames(hotel_bookings)[13] <- c("meal")
unique_meals <- unique(bird$meal)
Error in unique(bird$meal): object 'bird' not found
Code
unique_meals
Error in eval(expr, envir, enclos): object 'unique_meals' not found
Code
colnames(hotel_bookings)[15] <- c("market_segment")
unique_m <- unique(bird$market_segment)
Error in unique(bird$market_segment): object 'bird' not found
Code
unique_m
Error in eval(expr, envir, enclos): object 'unique_m' not found

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

Code
grouped_by_customer_type <- hotel_bookings %>%
  group_by(customer_type)
grouped_by_customer_type %>%
  summarise(
    avg_stays_in_week_nights = mean(stays_in_week_nights, nr.rm = TRUE),
    avg_stays_in_weekend_nights = mean(stays_in_weekend_nights, nr.rm = TRUE), 
    avg_days_in_waiting_list = mean(days_in_waiting_list, nr.rm=TRUE), 
    avg_total_of_special_requests = mean(total_of_special_requests, nr.rm=TRUE)
  )
# A tibble: 4 × 5
  customer_type   avg_stays_in_week_nights avg_stays_in_weeken…¹ avg_d…² avg_t…³
  <chr>                              <dbl>                 <dbl>   <dbl>   <dbl>
1 Contract                            3.85                 1.47   0.0395   0.729
2 Group                               2.06                 0.825  0.369    0.645
3 Transient                           2.51                 0.939  1.32     0.632
4 Transient-Party                     2.26                 0.802  6.32     0.329
# … with abbreviated variable names ¹​avg_stays_in_weekend_nights,
#   ²​avg_days_in_waiting_list, ³​avg_total_of_special_requests

Explain and Interpret

I chose to group by the type of customers booking the two hotels and found the average days these customers stayed in week and weekend nights, the number of days they had to wait for and their special requests. I chose this group and these values because I think this shows important insights of the trends and patterns of different groups of customers. This type of information can be useful for the hotels to provide better service and maybe tier based prices based on the type of customer. For example, the “Transient-Party” customer type typically stay the longest on the waiting list while they stay on weekend nights the least. But we can see that “Transient” customers stay more in weekend nights and still have the second highest average days in waiting list. The hotels could try to make a priority queue from their waiting list and prioritise “Transient” customers on weekend nights over “Transient-Party”. This might lead to better customer service and satisfaction.

Source Code
---
title: "Challenge 2"
author: "Janhvi Joshi"
desription: "Data wrangling: using group() and summarise()"
date: "10/25/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
  - railroads
  - faostat
  - hotel_bookings
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

Today's challenge is to

1)  read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2)  provide summary statistics for different interesting groups within the data, and interpret those statistics

## Read in the Data

Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.

-   railroad\*.csv or StateCounty2012.xls ⭐
-   FAOstat\*.csv or birds.csv ⭐⭐⭐
-   hotel_bookings.csv ⭐⭐⭐⭐

```{r}
hotel_bookings <- read_csv('_data/hotel_bookings.csv')
hotel_bookings
```
This dataset is a summarising details of hotel bookings which includes data like arrival time, month, year, number of nights, number of adults, children and babies staying. Other interesting details includes whether the guest has previously booked or cancelled bookings at the hotel, hotel room type etc. 

## Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

This dataset summarises various details about a hotel booking and contains 120k records dating from year 2015 to 2017. It can be seen by the summary command and other analysis of data done below that there are two types of hotels in this dataset - Resort Hotel and City Hotel. Customers from around the world; around 160-170 countries, book these hotels. It can also be seen that on an average, around 37% of the bookings are cancelled and around 3% of the guests are repeated. On an average, customers may need to wait for 2.3 days in the waitlist to confirm a booking and around 57% of these bookings include some special requests. The hotels provide 4 different types of meals to their customers. This dataset is likely gathered from various online and offline channels - from where the booking was made.

```{r}
#| label: summary
summary(hotel_bookings)
```
```{r}
as_tibble(hotel_bookings)
```
```{r}
colnames(hotel_bookings)[1] <- c("hotel")
unique_hotel <- unique(bird$hotel)
unique_hotel
```

```{r}
colnames(hotel_bookings)[14] <- c("country")
unique_country <- unique(bird$country)
unique_country
```
```{r}
nrow(hotel_bookings)
colnames(hotel_bookings)[13] <- c("meal")
unique_meals <- unique(bird$meal)
unique_meals
```
```{r}
colnames(hotel_bookings)[15] <- c("market_segment")
unique_m <- unique(bird$market_segment)
unique_m
```
## Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

```{r}
grouped_by_customer_type <- hotel_bookings %>%
  group_by(customer_type)
grouped_by_customer_type %>%
  summarise(
    avg_stays_in_week_nights = mean(stays_in_week_nights, nr.rm = TRUE),
    avg_stays_in_weekend_nights = mean(stays_in_weekend_nights, nr.rm = TRUE), 
    avg_days_in_waiting_list = mean(days_in_waiting_list, nr.rm=TRUE), 
    avg_total_of_special_requests = mean(total_of_special_requests, nr.rm=TRUE)
  )
```

### Explain and Interpret

I chose to group by the type of customers booking the two hotels and found the average days these customers stayed in week and weekend nights, the number of days they had to wait for and their special requests. I chose this group and these values because I think this shows important insights of the trends and patterns of different groups of customers. This type of information can be useful for the hotels to provide better service and maybe tier based prices based on the type of customer. For example, the "Transient-Party" customer type typically stay the longest on the waiting list while they stay on weekend nights the least. But we can see that "Transient" customers stay more in weekend nights and still have the second highest average days in waiting list. The hotels could try to make a priority queue from their waiting list and prioritise "Transient" customers on weekend nights over "Transient-Party". This might lead to better customer service and satisfaction.