DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 2 Solution

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Challenge Overview
  • Read in the Data
  • Describe the data
  • Provide Grouped Summary Statistics

Challenge 2 Solution

  • Show All Code
  • Hide All Code

  • View Source
challenge_2
railroads
faostat
hotel_bookings
ryan_odonnell
Author

Ryan O’Donnell

Published

September 20, 2022

Code
library(tidyverse)


knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

  • railroad*.csv or StateCounty2012.xls ⭐
  • FAOstat*.csv or birds.csv ⭐⭐⭐
  • hotel_bookings.csv ⭐⭐⭐⭐
Code
# read in data and set up initial objects

bookings <- read_csv("_data/hotel_bookings.csv")

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

Code
str(bookings)
spc_tbl_ [119,390 × 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ hotel                         : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
 $ is_canceled                   : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
 $ lead_time                     : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
 $ arrival_date_year             : num [1:119390] 2015 2015 2015 2015 2015 ...
 $ arrival_date_month            : chr [1:119390] "July" "July" "July" "July" ...
 $ arrival_date_week_number      : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
 $ arrival_date_day_of_month     : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
 $ stays_in_weekend_nights       : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ stays_in_week_nights          : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
 $ adults                        : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
 $ children                      : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ babies                        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ meal                          : chr [1:119390] "BB" "BB" "BB" "BB" ...
 $ country                       : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
 $ market_segment                : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
 $ distribution_channel          : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
 $ is_repeated_guest             : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ previous_cancellations        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ reserved_room_type            : chr [1:119390] "C" "C" "A" "A" ...
 $ assigned_room_type            : chr [1:119390] "C" "C" "C" "A" ...
 $ booking_changes               : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
 $ deposit_type                  : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
 $ agent                         : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
 $ company                       : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
 $ days_in_waiting_list          : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ customer_type                 : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
 $ adr                           : num [1:119390] 0 0 75 75 98 ...
 $ required_car_parking_spaces   : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
 $ total_of_special_requests     : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
 $ reservation_status            : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
 $ reservation_status_date       : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
 - attr(*, "spec")=
  .. cols(
  ..   hotel = col_character(),
  ..   is_canceled = col_double(),
  ..   lead_time = col_double(),
  ..   arrival_date_year = col_double(),
  ..   arrival_date_month = col_character(),
  ..   arrival_date_week_number = col_double(),
  ..   arrival_date_day_of_month = col_double(),
  ..   stays_in_weekend_nights = col_double(),
  ..   stays_in_week_nights = col_double(),
  ..   adults = col_double(),
  ..   children = col_double(),
  ..   babies = col_double(),
  ..   meal = col_character(),
  ..   country = col_character(),
  ..   market_segment = col_character(),
  ..   distribution_channel = col_character(),
  ..   is_repeated_guest = col_double(),
  ..   previous_cancellations = col_double(),
  ..   previous_bookings_not_canceled = col_double(),
  ..   reserved_room_type = col_character(),
  ..   assigned_room_type = col_character(),
  ..   booking_changes = col_double(),
  ..   deposit_type = col_character(),
  ..   agent = col_character(),
  ..   company = col_character(),
  ..   days_in_waiting_list = col_double(),
  ..   customer_type = col_character(),
  ..   adr = col_double(),
  ..   required_car_parking_spaces = col_double(),
  ..   total_of_special_requests = col_double(),
  ..   reservation_status = col_character(),
  ..   reservation_status_date = col_date(format = "")
  .. )
 - attr(*, "problems")=<externalptr> 

“hotel_bookings.csv” contains 119,390 observations of 32 variables. The data appears to be tracking the details of reservations at some hotels. There is a lot of information recorded for each reservation. The arrival date is stored across 3 columns, 4 if you count the week number. Some of the columns have codes in them but there is not a key to fully understand these codes.

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

Code
# filter out cancelation and create object that is grouped by hotel

by_hotel_no_cancel <- bookings %>% filter(`is_canceled` == 0) %>% group_by(`hotel`)

# summarize adults, children, and babies for each hotel
summarize(by_hotel_no_cancel, "Avg Adults"=mean(adults), "Med Adults"=median(adults), "Avg Children"=mean(children), "Med Children"=median(children), "Avg Babies"=mean(babies), "Med Babies"=median(babies))
# A tibble: 2 × 7
  hotel        `Avg Adults` `Med Adults` `Avg Children` Med Ch…¹ Avg B…² Med B…³
  <chr>               <dbl>        <dbl>          <dbl>    <dbl>   <dbl>   <dbl>
1 City Hotel           1.83            2         0.0999        0 0.00710       0
2 Resort Hotel         1.83            2         0.106         0 0.0156        0
# … with abbreviated variable names ¹​`Med Children`, ²​`Avg Babies`,
#   ³​`Med Babies`
Code
summarize(by_hotel_no_cancel, Adults=quantile(adults), Children=quantile(children), Babies=quantile(babies))
# A tibble: 10 × 4
# Groups:   hotel [2]
   hotel        Adults Children Babies
   <chr>         <dbl>    <dbl>  <dbl>
 1 City Hotel        0        0      0
 2 City Hotel        2        0      0
 3 City Hotel        2        0      0
 4 City Hotel        2        0      0
 5 City Hotel        4        3     10
 6 Resort Hotel      0        0      0
 7 Resort Hotel      2        0      0
 8 Resort Hotel      2        0      0
 9 Resort Hotel      2        0      0
10 Resort Hotel      4        3      2
Code
# create columns that includes the total stay length, % of stay that is weekend, and % of stay that is weeknight, remove stays that are 0 days

stays_expanded_by_hotel <- by_hotel_no_cancel %>% select(hotel, stays_in_weekend_nights, stays_in_week_nights) %>%
  mutate("Total Stay" = `stays_in_weekend_nights` + `stays_in_week_nights`,
         "% Weekend" = `stays_in_weekend_nights` / (`stays_in_weekend_nights` + `stays_in_week_nights`),
         "% Weeknight" = `stays_in_week_nights` / (`stays_in_weekend_nights` + `stays_in_week_nights`)
         ) %>%
  filter(`Total Stay` > 0)

# look at the central tendencies of stays 

summarize(stays_expanded_by_hotel, "Avg Stay"=mean(`Total Stay`), "Med Stay"=median(`Total Stay`), "Avg % Wknd"=mean(`% Weekend`), "Med % Wknd"=median(`% Weekend`), "Avg % Wknt"=mean(`% Weeknight`), "Med % Wknt"=median(`% Weeknight`))
# A tibble: 2 × 7
  hotel        `Avg Stay` `Med Stay` `Avg % Wknd` `Med % Wknd` Avg % W…¹ Med %…²
  <chr>             <dbl>      <dbl>        <dbl>        <dbl>     <dbl>   <dbl>
1 City Hotel         2.94          3        0.262        0.25      0.738   0.75 
2 Resort Hotel       4.20          3        0.259        0.286     0.741   0.714
# … with abbreviated variable names ¹​`Avg % Wknt`, ²​`Med % Wknt`

First, I wondered a bit about differences between who stayed at the City Hotel and who stayed at the Resort Hotel. I filtered out canceld bookings since they didn’t end up staying. At both hotels, most groups are two adults with no children or babies. Slightly more children and babies stay at the Resort Hotel over the City Hotel. It looks like sometimes bookings include no people, but at this time I don’t think I have enough information about this data to understand why.

I also looked at how long customers tended to stay at either hotel and whether more of their stay tends to be weekends or weeknights, but the central tendencies were pretty close for both!

Code
#| label: explore 2
# group by cancelation, removing customers who are not repeat customers

by_canceled <- bookings %>% filter(`is_repeated_guest` > 0) %>% group_by(`is_canceled`)

# See if the statistics show if people who have canceled before are more likely to have canceled their current reservation

summarize(by_canceled, "Avg Previous Cancelations"=mean(`previous_cancellations`), "Avg Repeat Booking Not Canceled"=mean(`previous_bookings_not_canceled`))
# A tibble: 2 × 3
  is_canceled `Avg Previous Cancelations` `Avg Repeat Booking Not Canceled`
        <dbl>                       <dbl>                             <dbl>
1           0                       0.335                              3.96
2           1                       1.27                               1.38

One other thing I noticed they were tracking is whether or not the customer had previous cancelations so I wanted to see if whether or not those who had canceled their reservation were more likely to have canceled previous bookings. The customers who had canceled their booking had a higher average number of previous cancelations than those who kept their reservation. I also looked at the average repeated booking not canceled to see if people who had canceled a lot of bookings were just making more bookings, but that doesn’t seem to be the case.

Source Code
---
title: "Challenge 2 Solution"
author: "Ryan O'Donnell"
desription: "Data wrangling: using group() and summarise()"
date: "09/20/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
  - railroads
  - faostat
  - hotel_bookings
  - ryan_odonnell
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)


knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

Today's challenge is to

1)  read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2)  provide summary statistics for different interesting groups within the data, and interpret those statistics

## Read in the Data

Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.

-   railroad\*.csv or StateCounty2012.xls ⭐
-   FAOstat\*.csv or birds.csv ⭐⭐⭐
-   hotel_bookings.csv ⭐⭐⭐⭐

```{r}
#| label: load_in
# read in data and set up initial objects

bookings <- read_csv("_data/hotel_bookings.csv")

```

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

## Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

```{r}
#| label: summary
str(bookings)

```

"hotel_bookings.csv" contains 119,390 observations of 32 variables. The data appears to be tracking the details of reservations at some hotels. There is a lot of information recorded for each reservation. The arrival date is stored across 3 columns, 4 if you count the week number. Some of the columns have codes in them but there is not a key to fully understand these codes.

## Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

```{r}
#| label: explore 1
# filter out cancelation and create object that is grouped by hotel

by_hotel_no_cancel <- bookings %>% filter(`is_canceled` == 0) %>% group_by(`hotel`)

# summarize adults, children, and babies for each hotel
summarize(by_hotel_no_cancel, "Avg Adults"=mean(adults), "Med Adults"=median(adults), "Avg Children"=mean(children), "Med Children"=median(children), "Avg Babies"=mean(babies), "Med Babies"=median(babies))

summarize(by_hotel_no_cancel, Adults=quantile(adults), Children=quantile(children), Babies=quantile(babies))

# create columns that includes the total stay length, % of stay that is weekend, and % of stay that is weeknight, remove stays that are 0 days

stays_expanded_by_hotel <- by_hotel_no_cancel %>% select(hotel, stays_in_weekend_nights, stays_in_week_nights) %>%
  mutate("Total Stay" = `stays_in_weekend_nights` + `stays_in_week_nights`,
         "% Weekend" = `stays_in_weekend_nights` / (`stays_in_weekend_nights` + `stays_in_week_nights`),
         "% Weeknight" = `stays_in_week_nights` / (`stays_in_weekend_nights` + `stays_in_week_nights`)
         ) %>%
  filter(`Total Stay` > 0)

# look at the central tendencies of stays 

summarize(stays_expanded_by_hotel, "Avg Stay"=mean(`Total Stay`), "Med Stay"=median(`Total Stay`), "Avg % Wknd"=mean(`% Weekend`), "Med % Wknd"=median(`% Weekend`), "Avg % Wknt"=mean(`% Weeknight`), "Med % Wknt"=median(`% Weeknight`))

```

First, I wondered a bit about differences between who stayed at the City Hotel and who stayed at the Resort Hotel. I filtered out canceld bookings since they didn't end up staying. At both hotels, most groups are two adults with no children or babies. Slightly more children and babies stay at the Resort Hotel over the City Hotel. It looks like sometimes bookings include no people, but at this time I don't think I have enough information about this data to understand why.

I also looked at how long customers tended to stay at either hotel and whether more of their stay tends to be weekends or weeknights, but the central tendencies were pretty close for both!

```{r}
#| label: explore 2
# group by cancelation, removing customers who are not repeat customers

by_canceled <- bookings %>% filter(`is_repeated_guest` > 0) %>% group_by(`is_canceled`)

# See if the statistics show if people who have canceled before are more likely to have canceled their current reservation

summarize(by_canceled, "Avg Previous Cancelations"=mean(`previous_cancellations`), "Avg Repeat Booking Not Canceled"=mean(`previous_bookings_not_canceled`))

```
One other thing I noticed they were tracking is whether or not the customer had previous cancelations so I wanted to see if whether or not those who had canceled their reservation were more likely to have canceled previous bookings. The customers who had canceled their booking had a higher average number of previous cancelations than those who kept their reservation. I also looked at the average repeated booking not canceled to see if people who had canceled a lot of bookings were just making more bookings, but that doesn't seem to be the case.