Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Shreya Varma
May 30, 2023
Today’s challenge is to:
Read in one (or more) of the following datasets, using the correct R package and command.
# A tibble: 119,390 × 32
hotel is_canceled lead_time arrival_date_year arrival_date_month
<chr> <dbl> <dbl> <dbl> <chr>
1 Resort Hotel 0 342 2015 July
2 Resort Hotel 0 737 2015 July
3 Resort Hotel 0 7 2015 July
4 Resort Hotel 0 13 2015 July
5 Resort Hotel 0 14 2015 July
6 Resort Hotel 0 14 2015 July
7 Resort Hotel 0 0 2015 July
8 Resort Hotel 0 9 2015 July
9 Resort Hotel 1 85 2015 July
10 Resort Hotel 1 75 2015 July
# ℹ 119,380 more rows
# ℹ 27 more variables: arrival_date_week_number <dbl>,
# arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
# stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
# meal <chr>, country <chr>, market_segment <chr>,
# distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>, …
Rows: 119,390
Columns: 32
$ hotel <chr> "Resort Hotel", "Resort Hotel", "Resort…
$ is_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, …
$ lead_time <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, …
$ arrival_date_year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201…
$ arrival_date_month <chr> "July", "July", "July", "July", "July",…
$ arrival_date_week_number <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
$ arrival_date_day_of_month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ stays_in_weekend_nights <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ stays_in_week_nights <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, …
$ adults <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ children <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ babies <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ meal <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB…
$ country <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR…
$ market_segment <chr> "Direct", "Direct", "Direct", "Corporat…
$ distribution_channel <chr> "Direct", "Direct", "Direct", "Corporat…
$ is_repeated_guest <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_cancellations <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ reserved_room_type <chr> "C", "C", "A", "A", "A", "A", "C", "C",…
$ assigned_room_type <chr> "C", "C", "C", "A", "A", "A", "C", "C",…
$ booking_changes <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ deposit_type <chr> "No Deposit", "No Deposit", "No Deposit…
$ agent <chr> "NULL", "NULL", "NULL", "304", "240", "…
$ company <chr> "NULL", "NULL", "NULL", "NULL", "NULL",…
$ days_in_waiting_list <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ customer_type <chr> "Transient", "Transient", "Transient", …
$ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,…
$ required_car_parking_spaces <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ total_of_special_requests <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, …
$ reservation_status <chr> "Check-Out", "Check-Out", "Check-Out", …
$ reservation_status_date <date> 2015-07-01, 2015-07-01, 2015-07-02, 20…
I will be using the Hotel Bookings dataset for my Challenge I have imported it using the read_csv() function and will use the glimpse() function the see the columns it has. On a high level it seems to have the information of hotel type and its customer data like arrival departure information, number of people, their booking details, payment type and reservation details. The data has 119,390 rows and 32 columns and seems to be captured from different hotels between 2015 to 2015 when people checked in and out.
We can see that there are separate columns for arrival day of month, month and year. I will combine these to get a single arrival date colummn which is easier to read. For this purpose I will change the month name to number and then mutate the three columns. I will also mutate the meal column to its full forms to make it more readable.
hotel_bookings <- hotel_bookings %>%
mutate(arrival_date_month = case_when(
arrival_date_month == "January" ~ 1,
arrival_date_month == "Febuary" ~ 2,
arrival_date_month == "March" ~ 3,
arrival_date_month == "April" ~ 4,
arrival_date_month == "May" ~ 5,
arrival_date_month == "June" ~ 6,
arrival_date_month == "July" ~ 7,
arrival_date_month == "August" ~ 8,
arrival_date_month == "September" ~ 9,
arrival_date_month == "October" ~ 10,
arrival_date_month == "November" ~ 11,
arrival_date_month == "December" ~ 12
))
hotel_bookings <- mutate(hotel_bookings, meal = case_when(
meal == "BB" ~ "Bed and Breakfast",
meal == "HB" ~ "Half Board",
meal == "FB" ~ "Full Board",
meal == "SC" ~ "Self Catering",
))
head(hotel_bookings)
# A tibble: 6 × 32
hotel is_canceled lead_time arrival_date_year arrival_date_month
<chr> <dbl> <dbl> <dbl> <dbl>
1 Resort Hotel 0 342 2015 7
2 Resort Hotel 0 737 2015 7
3 Resort Hotel 0 7 2015 7
4 Resort Hotel 0 13 2015 7
5 Resort Hotel 0 14 2015 7
6 Resort Hotel 0 14 2015 7
# ℹ 27 more variables: arrival_date_week_number <dbl>,
# arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
# stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
# meal <chr>, country <chr>, market_segment <chr>,
# distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, …
Any additional comments?
Thus, now we can combine three columns of date, month and year to form a date. We also need to remove the three columns as they are now redundant.
# A tibble: 6 × 30
hotel is_canceled lead_time arrival_date_week_nu…¹ stays_in_weekend_nig…²
<chr> <dbl> <dbl> <dbl> <dbl>
1 Resort Ho… 0 342 27 0
2 Resort Ho… 0 737 27 0
3 Resort Ho… 0 7 27 0
4 Resort Ho… 0 13 27 0
5 Resort Ho… 0 14 27 0
6 Resort Ho… 0 14 27 0
# ℹ abbreviated names: ¹arrival_date_week_number, ²stays_in_weekend_nights
# ℹ 25 more variables: stays_in_week_nights <dbl>, adults <dbl>,
# children <dbl>, babies <dbl>, meal <chr>, country <chr>,
# market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, company <chr>, …
---
title: "Challenge 4 Solution"
author: "Shreya Varma"
description: "More data wrangling: pivoting"
date: "5/30/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_4
- abc_poll
- eggs
- fed_rates
- hotel_bookings
- debt
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1) read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2) tidy data (as needed, including sanity checks)
3) identify variables that need to be mutated
4) mutate variables and sanity check all mutations
## Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- abc_poll.csv ⭐
- poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
- FedFundsRate.csv⭐⭐⭐
- hotel_bookings.csv⭐⭐⭐⭐
- debt_in_trillions.xlsx ⭐⭐⭐⭐⭐
```{r}
hotel_bookings <- read_csv("_data/hotel_bookings.csv")
hotel_bookings
glimpse(hotel_bookings)
```
### Briefly describe the data
I will be using the Hotel Bookings dataset for my Challenge I have imported it using the read_csv() function and will use the glimpse() function the see the columns it has. On a high level it seems to have the information of hotel type and its customer data like arrival departure information, number of people, their booking details, payment type and reservation details. The data has 119,390 rows and 32 columns and seems to be captured from different hotels between 2015 to 2015 when people checked in and out.
## Tidy Data (as needed)
We can see that there are separate columns for arrival day of month, month and year. I will combine these to get a single arrival date colummn which is easier to read. For this purpose I will change the month name to number and then mutate the three columns. I will also mutate the meal column to its full forms to make it more readable.
```{r}
hotel_bookings <- hotel_bookings %>%
mutate(arrival_date_month = case_when(
arrival_date_month == "January" ~ 1,
arrival_date_month == "Febuary" ~ 2,
arrival_date_month == "March" ~ 3,
arrival_date_month == "April" ~ 4,
arrival_date_month == "May" ~ 5,
arrival_date_month == "June" ~ 6,
arrival_date_month == "July" ~ 7,
arrival_date_month == "August" ~ 8,
arrival_date_month == "September" ~ 9,
arrival_date_month == "October" ~ 10,
arrival_date_month == "November" ~ 11,
arrival_date_month == "December" ~ 12
))
hotel_bookings <- mutate(hotel_bookings, meal = case_when(
meal == "BB" ~ "Bed and Breakfast",
meal == "HB" ~ "Half Board",
meal == "FB" ~ "Full Board",
meal == "SC" ~ "Self Catering",
))
head(hotel_bookings)
```
Any additional comments?
## Identify variables that need to be mutated
Thus, now we can combine three columns of date, month and year to form a date. We also need to remove the three columns as they are now redundant.
```{r}
final_hotel_bookings <- hotel_bookings %>%
mutate(
arrival_date = make_date(arrival_date_year, arrival_date_month, arrival_date_day_of_month)
)
final_hotel_bookings <- select(final_hotel_bookings,-c(arrival_date_day_of_month, arrival_date_month, arrival_date_year))
head(final_hotel_bookings)
```