Code
library(tidyverse)
library(lubridate)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Henry Mitrano
January 20, 2023
Today’s challenge is to:
# A tibble: 6 × 32
hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Resort… 0 342 2015 July 27 1 0 0 2
2 Resort… 0 737 2015 July 27 1 0 0 2
3 Resort… 0 7 2015 July 27 1 0 1 1
4 Resort… 0 13 2015 July 27 1 0 1 1
5 Resort… 0 14 2015 July 27 1 0 2 2
6 Resort… 0 14 2015 July 27 1 0 2 2
# … with 22 more variables: children <dbl>, babies <dbl>, meal <chr>,
# country <chr>, market_segment <chr>, distribution_channel <chr>,
# is_repeated_guest <dbl>, previous_cancellations <dbl>,
# previous_bookings_not_canceled <dbl>, reserved_room_type <chr>,
# assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>,
# agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
It appears this is some kind of data regarding hotels in different countries, and statistics about their bookings. These statistics include number of people in a reservation, whether they’re kids or adults, how their room was booked, whether they’ve been there before, etc. The data isn’t organized horribly, but some stuff definitely stands out as needing tidying.
Any additional comments?
For starters, one of the more confusing things I found when looking through the data was that the arrival dates of the visits are kept in 3 separate columns (arrival date year, arrival date month, and arrival date day), while the checkout or “reservation status” date is stored in 1 column, normally. I’ll want to replicate that mm-dd-yyyy format in one column for arrivals. We can doing that by importing an R library at the top of the file, “lubridate”, that includes a bunch of date formatting tools.
hotel is_canceled lead_time stays_in_weekend_nights
1 Resort Hotel 0 342 0
2 Resort Hotel 0 737 0
3 Resort Hotel 0 7 0
4 Resort Hotel 0 13 0
5 Resort Hotel 0 14 0
6 Resort Hotel 0 14 0
stays_in_week_nights adults children babies meal country market_segment
1 0 2 0 0 BB PRT Direct
2 0 2 0 0 BB PRT Direct
3 1 1 0 0 BB GBR Direct
4 1 1 0 0 BB GBR Corporate
5 2 2 0 0 BB GBR Online TA
6 2 2 0 0 BB GBR Online TA
distribution_channel is_repeated_guest previous_cancellations
1 Direct 0 0
2 Direct 0 0
3 Direct 0 0
4 Corporate 0 0
5 TA/TO 0 0
6 TA/TO 0 0
previous_bookings_not_canceled reserved_room_type assigned_room_type
1 0 C C
2 0 C C
3 0 A C
4 0 A A
5 0 A A
6 0 A A
booking_changes deposit_type agent company days_in_waiting_list customer_type
1 3 No Deposit NULL NULL 0 Transient
2 4 No Deposit NULL NULL 0 Transient
3 0 No Deposit NULL NULL 0 Transient
4 0 No Deposit 304 NULL 0 Transient
5 0 No Deposit 240 NULL 0 Transient
6 0 No Deposit 240 NULL 0 Transient
adr required_car_parking_spaces total_of_special_requests reservation_status
1 0 0 0 Check-Out
2 0 0 0 Check-Out
3 75 0 0 Check-Out
4 75 0 0 Check-Out
5 98 0 1 Check-Out
6 98 0 1 Check-Out
reservation_status_date arrival valued_customer
1 2015-07-01 2015-07-01 no
2 2015-07-01 2015-07-01 no
3 2015-07-02 2015-07-01 no
4 2015-07-02 2015-07-01 no
5 2015-07-03 2015-07-01 no
6 2015-07-03 2015-07-01 no
Additionally, I’ve added a column at the end that distingushes certain booking as important guests, based on historical data. If a customer has been booked at the hotel before and is a repeat visitor, and has never canceled a booking, they are listed as “yes” under the “valued_customer” column at the end of the data.
---
title: "Challenge 4"
author: "Henry Mitrano"
description: "More data wrangling: mutate"
date: "1/20/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_4
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(lubridate)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1) read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2) tidy data (as needed, including sanity checks)
3) identify variables that need to be mutated
4) mutate variables and sanity check all mutations
## Read in data
```{r}
hotel_bookings = read_csv("_data/hotel_bookings.csv")
head(hotel_bookings)
```
### Briefly describe the data
It appears this is some kind of data regarding hotels in different countries, and statistics about their bookings. These statistics include number of people in a reservation, whether they're kids or adults, how their room was booked, whether they've been there before, etc. The data isn't organized horribly, but some stuff definitely stands out as needing tidying.
```{r}
tidy_bookings <-
hotel_bookings %>%
mutate(arrival=str_c(arrival_date_day_of_month, arrival_date_month,arrival_date_year, sep="/"), arrival=dmy(arrival))
tidy_bookings = select(tidy_bookings, select = -starts_with("arrival_date") )
```
Any additional comments?
## Identify variables that need to be mutated
For starters, one of the more confusing things I found when looking through the data was that the arrival dates of the visits are kept in 3 separate columns (arrival date year, arrival date month, and arrival date day), while the checkout or "reservation status" date is stored in 1 column, normally. I'll want to replicate that mm-dd-yyyy format in one column for arrivals. We can doing that by importing an R library at the top of the file, "lubridate", that includes a bunch of date formatting tools.
```{r}
tidy_bookings<-transform(tidy_bookings,valued_customer=ifelse(is_repeated_guest >0 && previous_cancellations == 0, "yes", "no"))
head(tidy_bookings)
```
Additionally, I've added a column at the end that distingushes certain booking as important guests, based on historical data. If a customer has been booked at the hotel before and is a repeat visitor, and has never canceled a booking, they are listed as "yes" under the "valued_customer" column at the end of the data.