Miranda Manka
August 16, 2022
[1] 119390 32
hotel is_canceled lead_time arrival_date_year
Length:119390 Min. :0.0000 Min. : 0 Min. :2015
Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
Mode :character Median :0.0000 Median : 69 Median :2016
Mean :0.3704 Mean :104 Mean :2016
3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
Max. :1.0000 Max. :737 Max. :2017
arrival_date_month arrival_date_week_number arrival_date_day_of_month
Length:119390 Min. : 1.00 Min. : 1.0
Class :character 1st Qu.:16.00 1st Qu.: 8.0
Mode :character Median :28.00 Median :16.0
Mean :27.17 Mean :15.8
3rd Qu.:38.00 3rd Qu.:23.0
Max. :53.00 Max. :31.0
stays_in_weekend_nights stays_in_week_nights adults
Min. : 0.0000 Min. : 0.0 Min. : 0.000
1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
Median : 1.0000 Median : 2.0 Median : 2.000
Mean : 0.9276 Mean : 2.5 Mean : 1.856
3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
Max. :19.0000 Max. :50.0 Max. :55.000
children babies meal country
Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
Median : 0.0000 Median : 0.000000 Mode :character Mode :character
Mean : 0.1039 Mean : 0.007949
3rd Qu.: 0.0000 3rd Qu.: 0.000000
Max. :10.0000 Max. :10.000000
NA's :4
market_segment distribution_channel is_repeated_guest
Length:119390 Length:119390 Min. :0.00000
Class :character Class :character 1st Qu.:0.00000
Mode :character Mode :character Median :0.00000
Mean :0.03191
3rd Qu.:0.00000
Max. :1.00000
previous_cancellations previous_bookings_not_canceled reserved_room_type
Min. : 0.00000 Min. : 0.0000 Length:119390
1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
Median : 0.00000 Median : 0.0000 Mode :character
Mean : 0.08712 Mean : 0.1371
3rd Qu.: 0.00000 3rd Qu.: 0.0000
Max. :26.00000 Max. :72.0000
assigned_room_type booking_changes deposit_type agent
Length:119390 Min. : 0.0000 Length:119390 Length:119390
Class :character 1st Qu.: 0.0000 Class :character Class :character
Mode :character Median : 0.0000 Mode :character Mode :character
Mean : 0.2211
3rd Qu.: 0.0000
Max. :21.0000
company days_in_waiting_list customer_type adr
Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
Mode :character Median : 0.000 Mode :character Median : 94.58
Mean : 2.321 Mean : 101.83
3rd Qu.: 0.000 3rd Qu.: 126.00
Max. :391.000 Max. :5400.00
required_car_parking_spaces total_of_special_requests reservation_status
Min. :0.00000 Min. :0.0000 Length:119390
1st Qu.:0.00000 1st Qu.:0.0000 Class :character
Median :0.00000 Median :0.0000 Mode :character
Mean :0.06252 Mean :0.5714
3rd Qu.:0.00000 3rd Qu.:1.0000
Max. :8.00000 Max. :5.0000
reservation_status_date
Min. :2014-10-17
1st Qu.:2016-02-01
Median :2016-08-07
Mean :2016-07-30
3rd Qu.:2017-02-08
Max. :2017-09-14
# A tibble: 2 × 3
hotel mean sd
<chr> <dbl> <dbl>
1 City Hotel 2.18 1.46
2 Resort Hotel 3.13 2.46
# A tibble: 2 × 3
hotel mean sd
<chr> <dbl> <dbl>
1 City Hotel 0.795 0.885
2 Resort Hotel 1.19 1.15
# A tibble: 2 × 3
is_repeated_guest mean sd
<dbl> <dbl> <dbl>
1 0 2.53 1.91
2 1 1.48 1.62
# A tibble: 2 × 8
hotel mean median min max sd var IQR
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 City Hotel 110. 74 0 629 111. 12310. 140
2 Resort Hotel 92.7 57 0 737 97.3 9464. 145
different_room
0 1
0.8750565 0.1249435
---
title: "Challenge 2"
author: "Miranda Manka"
description: "Data wrangling: using group_by() and summarise()"
date: "08/16/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
- challenge_2
- hotel_bookings
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
```{r}
hotel_bookings <- read_csv("_data/hotel_bookings.csv", show_col_types = FALSE)
```
## Describe the data
```{r}
#| label: summary
# Look at the data (view() only opens a viewer in interactive sessions)
view(hotel_bookings)
# Dimensions of the data (rows, columns)
dim(hotel_bookings)
# Summary of the variables in the dataset
summary(hotel_bookings)
```
This dataset has 32 variables and 119,390 observations. Each observation (case) is a single hotel booking, and the variables describe that booking. They include the hotel type (city hotel vs resort hotel), whether the booking was canceled, the arrival date, the number of nights stayed (week and weekend), the number of adults, children, and babies, the market segment, whether the guest is a repeat guest, and the room type (there are more; these are just a few). Some of the variables are categorical (hotel type: city vs resort), some are numeric (lead time in days, for example 14), and some are numeric but binary (is_canceled, 0 or 1). The data likely came from a hotel chain or from multiple hotels, since there are two hotel types. The country variable could suggest hotels in different countries, though it may instead record where guests come from, so this is not certain.
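As a rough illustration of the kinds of variables described above, the sketch below builds a small made-up tibble and reports each column's class and number of distinct values. The name `toy_bookings` and all of its values are hypothetical and are not taken from the real file; the same two calls could be pointed at `hotel_bookings` instead.

```{r}
library(dplyr)

# Hypothetical toy data that mimics three columns of hotel_bookings
toy_bookings <- tibble(
  hotel       = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time   = c(14, 120, 0, 45),
  is_canceled = c(0, 1, 0, 0)
)

# Class of each column (character vs numeric)
toy_classes <- sapply(toy_bookings, class)

# Number of distinct values per column; a binary indicator has exactly 2
toy_distinct <- toy_bookings %>%
  summarise(across(everything(), n_distinct))

toy_classes
toy_distinct
```

A numeric column with only two distinct values (here `is_canceled`) is a hint that it is really a binary indicator rather than a measurement.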
## Provide Grouped Summary Statistics & Explain and Interpret
```{r}
# Find mean and sd of week nights stayed, grouped by hotel type
hotel_bookings %>%
  group_by(hotel) %>%
  summarise(mean = mean(stays_in_week_nights), sd = sd(stays_in_week_nights))
# Find mean and sd of weekend nights stayed, grouped by hotel type
hotel_bookings %>%
  group_by(hotel) %>%
  summarise(mean = mean(stays_in_weekend_nights), sd = sd(stays_in_weekend_nights))
# Find mean and sd of week nights stayed, grouped by whether the guest is a repeat guest
hotel_bookings %>%
  group_by(is_repeated_guest) %>%
  summarise(mean = mean(stays_in_week_nights), sd = sd(stays_in_week_nights))
```
I started by picking out a few interesting variables and looking at them. First, I grouped by hotel and looked at the number of nights stayed during the week to see if there was any difference. The resort hotels had a higher mean (3.1 compared to 2.2 for the city hotels), which was interesting. I thought maybe people staying at resorts plan an extra weekday for their trip. I also looked at weekend nights with the same hotel grouping, and the mean for resort hotels was still higher (1.2 vs 0.8 for city), so maybe people staying at resort hotels simply stay longer. The standard deviations were also larger for resort hotels (2.46 vs 1.46 for week nights), so their stay lengths vary more. This could be explored more in the future.
I also grouped by whether someone is a repeated guest (0 for no, 1 for yes) and examined the mean number of week nights stayed. Repeat guests tend to stay fewer week nights (1.48 vs 2.53 for non-repeat guests). I thought this was interesting because people who come back aren't staying as long (maybe more business travelers staying a night rather than families on vacation).
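One way to follow up on the idea that resort guests simply stay longer would be to add week and weekend nights into a single total and summarise that by hotel. This is only a sketch on made-up data: `toy_stays`, `total_nights`, and the values are hypothetical, while the two night-count column names match the real dataset.

```{r}
library(dplyr)

# Hypothetical toy data; column names follow hotel_bookings
toy_stays <- tibble(
  hotel                   = c("City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"),
  stays_in_week_nights    = c(2, 1, 3, 5),
  stays_in_weekend_nights = c(0, 1, 2, 2)
)

# Total nights per booking, then mean and sd by hotel type
toy_total_nights <- toy_stays %>%
  mutate(total_nights = stays_in_week_nights + stays_in_weekend_nights) %>%
  group_by(hotel) %>%
  summarise(mean_total = mean(total_nights), sd_total = sd(total_nights))

toy_total_nights
```

Replacing `toy_stays` with `hotel_bookings` would give the overall length of stay for each hotel type in one table instead of two.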
```{r}
# Find summary statistics for lead time for booking, grouped by hotel type
hotel_bookings %>%
  group_by(hotel) %>%
  select(lead_time, hotel) %>%
  summarize_all(
    list(mean = mean, median = median, min = min, max = max, sd = sd, var = var, IQR = IQR),
    na.rm = TRUE
  )
```
I wanted to look at the lead time for each type of hotel. Lead time is how many days ahead of their stay someone booked; for example, 7 would mean they booked their hotel a week before they showed up. The mean lead time for city hotels is 109.7 days (about 3.5 months ahead of time), while the mean lead time for resort hotels is 92.7 days (about 3 months ahead of time). That is interesting, but the difference is only 17 days, or about two and a half weeks. The medians for both groups (74 and 57 days) were much lower than the means, which indicates the data are positively (right) skewed: most lead times are relatively short, and a smaller number of very long ones pull the mean up. The maximums were high, 629 days for city hotels and 737 days for resort hotels, while both groups had a minimum of 0 (same-day booking or walk-in). The standard deviations (about 111 and 97 days) and the other measures of dispersion were large, indicating the data are spread out, which makes sense given the minimums and maximums.
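The mean-versus-median comparison used above can be made explicit by computing the gap between the two for each group. The sketch below does this on made-up lead times (`toy_lead` and its values are hypothetical); a positive gap is only a rough sign of right skew, not a formal test.

```{r}
library(dplyr)

# Hypothetical lead times in days: mostly short, with one very long booking per hotel
toy_lead <- tibble(
  hotel     = rep(c("City Hotel", "Resort Hotel"), each = 5),
  lead_time = c(0, 5, 10, 20, 400, 1, 3, 7, 30, 300)
)

toy_skew <- toy_lead %>%
  group_by(hotel) %>%
  summarise(
    mean_lead   = mean(lead_time),
    median_lead = median(lead_time),
    # Mean above median suggests a long right tail
    mean_minus_median = mean_lead - median_lead
  )

toy_skew
```

In this toy data a single long lead time per hotel is enough to pull the mean far above the median, which is the same pattern seen in the real summary table.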
```{r}
# Create an indicator: 1 if the assigned room type differs from the reserved one, else 0
different_room <- ifelse(hotel_bookings$reserved_room_type != hotel_bookings$assigned_room_type, 1, 0)
# Proportion of bookings in each category
prop.table(table(different_room))
```
Finally, I thought it would be interesting to look at how many bookings got the room that was reserved. I made a binary indicator variable to do this: different_room is 1 if the assigned room type differs from the reserved room type, and 0 if they match. The proportion table shows that 87.5% of bookings were assigned the room type they reserved, and 12.5% were not.
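The same indicator could also be kept inside the data frame, which makes it easy to break the proportion down by group. A minimal sketch on made-up data, assuming a logical version of the indicator (the mean of a logical vector is the share of `TRUE` values); `toy_rooms`, `share_different`, and the room codes are hypothetical:

```{r}
library(dplyr)

# Hypothetical toy data with reserved and assigned room codes
toy_rooms <- tibble(
  hotel              = c("City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"),
  reserved_room_type = c("A", "A", "D", "E"),
  assigned_room_type = c("A", "B", "D", "E")
)

# Share of bookings assigned a different room type, by hotel
toy_room_share <- toy_rooms %>%
  mutate(different_room = reserved_room_type != assigned_room_type) %>%
  group_by(hotel) %>%
  summarise(share_different = mean(different_room))

toy_room_share
```

Applied to `hotel_bookings`, this would show whether room changes are more common at one hotel type than the other.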