Challenge 4 Solution

challenge_4
abc_poll
eggs
fed_rates
hotel_bookings
debt
More data wrangling: pivoting
Author

Shreya Varma

Published

May 30, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • abc_poll.csv ⭐
  • poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
  • FedFundsRate.csv⭐⭐⭐
  • hotel_bookings.csv⭐⭐⭐⭐
  • debt_in_trillions.xlsx ⭐⭐⭐⭐⭐
Code
hotel_bookings <- read_csv("_data/hotel_bookings.csv")
hotel_bookings
# A tibble: 119,390 × 32
   hotel        is_canceled lead_time arrival_date_year arrival_date_month
   <chr>              <dbl>     <dbl>             <dbl> <chr>             
 1 Resort Hotel           0       342              2015 July              
 2 Resort Hotel           0       737              2015 July              
 3 Resort Hotel           0         7              2015 July              
 4 Resort Hotel           0        13              2015 July              
 5 Resort Hotel           0        14              2015 July              
 6 Resort Hotel           0        14              2015 July              
 7 Resort Hotel           0         0              2015 July              
 8 Resort Hotel           0         9              2015 July              
 9 Resort Hotel           1        85              2015 July              
10 Resort Hotel           1        75              2015 July              
# ℹ 119,380 more rows
# ℹ 27 more variables: arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
#   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>, …
Code
glimpse(hotel_bookings)
Rows: 119,390
Columns: 32
$ hotel                          <chr> "Resort Hotel", "Resort Hotel", "Resort…
$ is_canceled                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, …
$ lead_time                      <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, …
$ arrival_date_year              <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201…
$ arrival_date_month             <chr> "July", "July", "July", "July", "July",…
$ arrival_date_week_number       <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
$ arrival_date_day_of_month      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ stays_in_weekend_nights        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ stays_in_week_nights           <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, …
$ adults                         <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ children                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ meal                           <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB…
$ country                        <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR…
$ market_segment                 <chr> "Direct", "Direct", "Direct", "Corporat…
$ distribution_channel           <chr> "Direct", "Direct", "Direct", "Corporat…
$ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ reserved_room_type             <chr> "C", "C", "A", "A", "A", "A", "C", "C",…
$ assigned_room_type             <chr> "C", "C", "C", "A", "A", "A", "C", "C",…
$ booking_changes                <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ deposit_type                   <chr> "No Deposit", "No Deposit", "No Deposit…
$ agent                          <chr> "NULL", "NULL", "NULL", "304", "240", "…
$ company                        <chr> "NULL", "NULL", "NULL", "NULL", "NULL",…
$ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ customer_type                  <chr> "Transient", "Transient", "Transient", …
$ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,…
$ required_car_parking_spaces    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ total_of_special_requests      <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, …
$ reservation_status             <chr> "Check-Out", "Check-Out", "Check-Out", …
$ reservation_status_date        <date> 2015-07-01, 2015-07-01, 2015-07-02, 20…

Briefly describe the data

I will be using the Hotel Bookings dataset for my Challenge I have imported it using the read_csv() function and will use the glimpse() function the see the columns it has. On a high level it seems to have the information of hotel type and its customer data like arrival departure information, number of people, their booking details, payment type and reservation details. The data has 119,390 rows and 32 columns and seems to be captured from different hotels between 2015 to 2015 when people checked in and out.

Tidy Data (as needed)

We can see that there are separate columns for arrival day of month, month and year. I will combine these to get a single arrival date colummn which is easier to read. For this purpose I will change the month name to number and then mutate the three columns. I will also mutate the meal column to its full forms to make it more readable.

Code
hotel_bookings <- hotel_bookings %>%
                  mutate(arrival_date_month = case_when(
                    arrival_date_month == "January" ~ 1,
                    arrival_date_month == "Febuary" ~ 2,
                    arrival_date_month == "March" ~ 3,
                    arrival_date_month == "April" ~ 4,
                    arrival_date_month == "May" ~ 5,
                    arrival_date_month == "June" ~ 6,
                    arrival_date_month == "July" ~ 7,
                    arrival_date_month == "August" ~ 8,
                    arrival_date_month == "September" ~ 9,
                    arrival_date_month == "October" ~ 10,
                    arrival_date_month == "November" ~ 11,
                    arrival_date_month == "December" ~ 12
                  ))

hotel_bookings <- mutate(hotel_bookings, meal = case_when(
                    meal == "BB" ~ "Bed and Breakfast",
                    meal == "HB" ~ "Half Board",
                    meal == "FB" ~ "Full Board",
                    meal == "SC" ~ "Self Catering",
                  ))


head(hotel_bookings)
# A tibble: 6 × 32
  hotel        is_canceled lead_time arrival_date_year arrival_date_month
  <chr>              <dbl>     <dbl>             <dbl>              <dbl>
1 Resort Hotel           0       342              2015                  7
2 Resort Hotel           0       737              2015                  7
3 Resort Hotel           0         7              2015                  7
4 Resort Hotel           0        13              2015                  7
5 Resort Hotel           0        14              2015                  7
6 Resort Hotel           0        14              2015                  7
# ℹ 27 more variables: arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
#   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, …

Any additional comments?

Identify variables that need to be mutated

Thus, now we can combine three columns of date, month and year to form a date. We also need to remove the three columns as they are now redundant.

Code
final_hotel_bookings <- hotel_bookings %>%
                mutate(
                  arrival_date = make_date(arrival_date_year, arrival_date_month, arrival_date_day_of_month)
                ) 

final_hotel_bookings <- select(final_hotel_bookings,-c(arrival_date_day_of_month, arrival_date_month, arrival_date_year))

head(final_hotel_bookings)
# A tibble: 6 × 30
  hotel      is_canceled lead_time arrival_date_week_nu…¹ stays_in_weekend_nig…²
  <chr>            <dbl>     <dbl>                  <dbl>                  <dbl>
1 Resort Ho…           0       342                     27                      0
2 Resort Ho…           0       737                     27                      0
3 Resort Ho…           0         7                     27                      0
4 Resort Ho…           0        13                     27                      0
5 Resort Ho…           0        14                     27                      0
6 Resort Ho…           0        14                     27                      0
# ℹ abbreviated names: ¹​arrival_date_week_number, ²​stays_in_weekend_nights
# ℹ 25 more variables: stays_in_week_nights <dbl>, adults <dbl>,
#   children <dbl>, babies <dbl>, meal <chr>, country <chr>,
#   market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, …