DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 2 Erika Nagai

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Challenge Overview
  • Read in the Data
  • Describe the data
  • Provide Grouped Summary Statistics
    • Explain and Interpret

Challenge 2 Erika Nagai

  • Show All Code
  • Hide All Code

  • View Source
challenge_2
railroads
faostat
hotel_bookings
Author

Erika Nagai

Published

September 18, 2022

First of all, let’s install all the packages convenient for this challenge.

Code
library(tidyverse)
library(summarytools)
library(dplyr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

  • railroad*.csv or StateCounty2012.xls ⭐
  • FAOstat*.csv or birds.csv ⭐⭐⭐
  • hotel_bookings.csv ⭐⭐⭐⭐

I decided to analyze “hotel_booking.csv data”.

Code
# read the data as a tibble
data <- read_csv("_data/hotel_bookings.csv")
data
# A tibble: 119,390 × 32
   hotel  is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
   <chr>    <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 Resor…       0     342    2015 July         27       1       0       0      2
 2 Resor…       0     737    2015 July         27       1       0       0      2
 3 Resor…       0       7    2015 July         27       1       0       1      1
 4 Resor…       0      13    2015 July         27       1       0       1      1
 5 Resor…       0      14    2015 July         27       1       0       2      2
 6 Resor…       0      14    2015 July         27       1       0       2      2
 7 Resor…       0       0    2015 July         27       1       0       2      2
 8 Resor…       0       9    2015 July         27       1       0       2      2
 9 Resor…       1      85    2015 July         27       1       0       3      2
10 Resor…       1      75    2015 July         27       1       0       3      2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …

Describe the data

Let’s explore the data.

Code
view(dfSummary(data))

From this summary, we can see that this data is about reservations made for two types of hotel (City Hotel, Resort Hotel) from 2015 until 2017.

The data includes 1. Status of reservation (cancelled, checkout, no-show) 2. The length of stay (arrival date, days in total, days during weekdays and weekend), 3. Guest information (# of adults, children, babies, repeater or not, previous cancellation history) 4. Reserved service type (room type, meal, car parking, special request) 5. Others (used agency, distribution channel etc)

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

Let’s explore the differences in reservation trends by the type of hotel (City Hotel vs. Resort Hotel)

!! Questions: The below code shows the statistic data of only Resort Hotel.

Code
data %>% 
  group_by(hotel) %>%
  descr(stats = c("mean", "sd", "min", "med", "max"))
Descriptive Statistics  

                    adr   adults   arrival_date_day_of_month   arrival_date_week_number
------------- --------- -------- --------------------------- --------------------------
         Mean    105.30     1.85                       15.79                      27.18
      Std.Dev     43.60     0.51                        8.73                      13.40
          Min      0.00     0.00                        1.00                       1.00
       Median     99.90     2.00                       16.00                      27.00
          Max   5400.00     4.00                       31.00                      53.00

Table: Table continues below

 

                arrival_date_year   babies   booking_changes   children   days_in_waiting_list
------------- ------------------- -------- ----------------- ---------- ----------------------
         Mean             2016.17     0.00              0.19       0.09                   3.23
      Std.Dev                0.70     0.08              0.61       0.37                  20.87
          Min             2015.00     0.00              0.00       0.00                   0.00
       Median             2016.00     0.00              0.00       0.00                   0.00
          Max             2017.00    10.00             21.00       3.00                 391.00

Table: Table continues below

 

                is_canceled   is_repeated_guest   lead_time   previous_bookings_not_canceled
------------- ------------- ------------------- ----------- --------------------------------
         Mean          0.42                0.03      109.74                             0.13
      Std.Dev          0.49                0.16      110.95                             1.69
          Min          0.00                0.00        0.00                             0.00
       Median          0.00                0.00       74.00                             0.00
          Max          1.00                1.00      629.00                            72.00

Table: Table continues below

 

                previous_cancellations   required_car_parking_spaces   stays_in_week_nights
------------- ------------------------ ----------------------------- ----------------------
         Mean                     0.08                          0.02                   2.18
      Std.Dev                     0.42                          0.15                   1.46
          Min                     0.00                          0.00                   0.00
       Median                     0.00                          0.00                   2.00
          Max                    21.00                          3.00                  41.00

Table: Table continues below

 

                stays_in_weekend_nights   total_of_special_requests
------------- ------------------------- ---------------------------
         Mean                      0.80                        0.55
      Std.Dev                      0.89                        0.78
          Min                      0.00                        0.00
       Median                      1.00                        0.00
          Max                     16.00                        5.00

Group: hotel = Resort Hotel  
N: 40060  

                   adr   adults   arrival_date_day_of_month   arrival_date_week_number
------------- -------- -------- --------------------------- --------------------------
         Mean    94.95     1.87                       15.82                      27.14
      Std.Dev    61.44     0.70                        8.88                      14.01
          Min    -6.38     0.00                        1.00                       1.00
       Median    75.00     2.00                       16.00                      28.00
          Max   508.00    55.00                       31.00                      53.00

Table: Table continues below

 

                arrival_date_year   babies   booking_changes   children   days_in_waiting_list
------------- ------------------- -------- ----------------- ---------- ----------------------
         Mean             2016.12     0.01              0.29       0.13                   0.53
      Std.Dev                0.72     0.12              0.73       0.45                   7.43
          Min             2015.00     0.00              0.00       0.00                   0.00
       Median             2016.00     0.00              0.00       0.00                   0.00
          Max             2017.00     2.00             17.00      10.00                 185.00

Table: Table continues below

 

                is_canceled   is_repeated_guest   lead_time   previous_bookings_not_canceled
------------- ------------- ------------------- ----------- --------------------------------
         Mean          0.28                0.04       92.68                             0.15
      Std.Dev          0.45                0.21       97.29                             1.00
          Min          0.00                0.00        0.00                             0.00
       Median          0.00                0.00       57.00                             0.00
          Max          1.00                1.00      737.00                            30.00

Table: Table continues below

 

                previous_cancellations   required_car_parking_spaces   stays_in_week_nights
------------- ------------------------ ----------------------------- ----------------------
         Mean                     0.10                          0.14                   3.13
      Std.Dev                     1.34                          0.35                   2.46
          Min                     0.00                          0.00                   0.00
       Median                     0.00                          0.00                   3.00
          Max                    26.00                          8.00                  50.00

Table: Table continues below

 

                stays_in_weekend_nights   total_of_special_requests
------------- ------------------------- ---------------------------
         Mean                      1.19                        0.62
      Std.Dev                      1.15                        0.81
          Min                      0.00                        0.00
       Median                      1.00                        0.00
          Max                     19.00                        5.00

Make a column of total nights of stay

Code
data <- data %>% 
  mutate(total_nights = stays_in_week_nights + stays_in_weekend_nights)

Customer and Stay Trends by Hotel Type

Code
data %>%
  group_by(hotel) %>%
  summarise(
    ave_adults = mean(adults, na.rm = TRUE),
    ave_children = mean(children, na.rm = TRUE),
    ave_babies = mean(babies, na.rm = TRUE),
    avg_total_nights = mean(total_nights, nr.rm = TRUE))
# A tibble: 2 × 5
  hotel        ave_adults ave_children ave_babies avg_total_nights
  <chr>             <dbl>        <dbl>      <dbl>            <dbl>
1 City Hotel         1.85       0.0914    0.00494             2.98
2 Resort Hotel       1.87       0.129     0.0139              4.32

Guest who stay at Resort hotel tend to stay longer than those who stay at Resort hotel. Guests at both hotels are less likely to be accompanied by children, but resort hotels have more people with children and babies than city hotels!

Explain and Interpret

I will compare the group with children (at least one child or baby) to the group without children. I chose this group because I thought that the presence of children on a trip significantly impacts the stay trend and that hotel companies would want to know what the groups with children want.

Code
# First classify if the group is with at least one kid/baby.

data <- data %>% 
  mutate(Kids_total = children + babies) %>%
  mutate(Kids = case_when(
    Kids_total == 0 ~ "No Kid",
    Kids_total >= 1 ~ "With Kid"
  ))

data
# A tibble: 119,390 × 35
   hotel  is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
   <chr>    <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 Resor…       0     342    2015 July         27       1       0       0      2
 2 Resor…       0     737    2015 July         27       1       0       0      2
 3 Resor…       0       7    2015 July         27       1       0       1      1
 4 Resor…       0      13    2015 July         27       1       0       1      1
 5 Resor…       0      14    2015 July         27       1       0       2      2
 6 Resor…       0      14    2015 July         27       1       0       2      2
 7 Resor…       0       0    2015 July         27       1       0       2      2
 8 Resor…       0       9    2015 July         27       1       0       2      2
 9 Resor…       1      85    2015 July         27       1       0       3      2
10 Resor…       1      75    2015 July         27       1       0       3      2
# … with 119,380 more rows, 25 more variables: children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …

Now, let’s find out the differences in trends between groups with and without children.

Code
data %>%
  group_by(Kids) %>%
  summarise(
    avg_total_nights = mean(total_nights, nr.rm = TRUE),
    repeated_guest = mean(is_repeated_guest, nr.rm = TRUE),
    lead_time = mean(lead_time, nr.rm = TRUE)
  )
# A tibble: 3 × 4
  Kids     avg_total_nights repeated_guest lead_time
  <chr>               <dbl>          <dbl>     <dbl>
1 No Kid               3.38         0.0337     105. 
2 With Kid             3.94         0.0103      88.5
3 <NA>                 3            0            3  

Here are some findings

  • Groups with kids tend to stay a little longer than groups without kids.

  • Groups without kids are likely to book much earlier than groups with kids.

Source Code
---
title: "Challenge 2 Erika Nagai"
author: "Erika Nagai"
desription: "Data wrangling: using group() and summarise()"
date: "09/18/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
  - railroads
  - faostat
  - hotel_bookings
---

First of all, let's install all the packages convenient for this challenge.

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)
library(summarytools)
library(dplyr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

Today's challenge is to

1)  read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2)  provide summary statistics for different interesting groups within the data, and interpret those statistics

## Read in the Data

Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.

-   railroad\*.csv or StateCounty2012.xls ⭐
-   FAOstat\*.csv or birds.csv ⭐⭐⭐
-   hotel_bookings.csv ⭐⭐⭐⭐

I decided to analyze "hotel_booking.csv data".

```{r}
# read the data as a tibble
data <- read_csv("_data/hotel_bookings.csv")
data
```

## Describe the data

Let's explore the data.

```{r}
view(dfSummary(data))
```

From this summary, we can see that this data is about reservations made for two types of hotel (City Hotel, Resort Hotel) from 2015 until 2017.

The data includes 1. Status of reservation (cancelled, checkout, no-show) 2. The length of stay (arrival date, days in total, days during weekdays and weekend), 3. Guest information (# of adults, children, babies, repeater or not, previous cancellation history) 4. Reserved service type (room type, meal, car parking, special request) 5. Others (used agency, distribution channel etc)

## Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

Let's explore the differences in reservation trends by the type of hotel (City Hotel vs. Resort Hotel)

!! Questions: The below code shows the statistic data of only Resort Hotel.

```{r}
data %>% 
  group_by(hotel) %>%
  descr(stats = c("mean", "sd", "min", "med", "max"))

```

Make a column of total nights of stay

```{r}
data <- data %>% 
  mutate(total_nights = stays_in_week_nights + stays_in_weekend_nights)

```

Customer and Stay Trends by Hotel Type

```{r}
data %>%
  group_by(hotel) %>%
  summarise(
    ave_adults = mean(adults, na.rm = TRUE),
    ave_children = mean(children, na.rm = TRUE),
    ave_babies = mean(babies, na.rm = TRUE),
    avg_total_nights = mean(total_nights, nr.rm = TRUE))


```

Guest who stay at Resort hotel tend to stay longer than those who stay at Resort hotel. Guests at both hotels are less likely to be accompanied by children, but resort hotels have more people with children and babies than city hotels!

### Explain and Interpret

I will compare **the group with children (at least one child or baby) to the group without children**. I chose this group because I thought that the presence of children on a trip significantly impacts the stay trend and that hotel companies would want to know what the groups with children want.

```{r}
# First classify if the group is with at least one kid/baby.

data <- data %>% 
  mutate(Kids_total = children + babies) %>%
  mutate(Kids = case_when(
    Kids_total == 0 ~ "No Kid",
    Kids_total >= 1 ~ "With Kid"
  ))

data
```

Now, let's find out the differences in trends between groups with and without children.

```{r}
data %>%
  group_by(Kids) %>%
  summarise(
    avg_total_nights = mean(total_nights, nr.rm = TRUE),
    repeated_guest = mean(is_repeated_guest, nr.rm = TRUE),
    lead_time = mean(lead_time, nr.rm = TRUE)
  )
```

Here are some findings

-   Groups with kids tend to stay a little longer than groups without kids.

-   Groups without kids are likely to book much earlier than groups with kids.