Challenge 4 Paritosh

challenge_4
abc_poll
eggs
fed_rates
hotel_bookings
debt
More data wrangling: pivoting
Author

Paritosh G

Published

May 27, 2023

Code
library(tidyverse)
library(lubridate)  # provides dmy(), days(), and interval() used below

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • abc_poll.csv ⭐
  • poultry_tidy.xlsx or organiceggpoultry.xls ⭐⭐
  • FedFundsRate.csv ⭐⭐⭐
  • hotel_bookings.csv ⭐⭐⭐⭐
  • debt_in_trillions.xlsx ⭐⭐⭐⭐⭐
Code
htb <- read_csv("_data/hotel_bookings.csv")

Briefly describe the data
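
A minimal sketch of one way to back up a written description with supporting tables. The `toy` tibble below is made-up data that only borrows a few column names from the output in this post; with the real file, the same calls would be run on `htb` instead.

```r
library(tidyverse)

# Made-up rows that mimic a few hotel_bookings columns; not real data.
toy <- tibble(
  arrival_date_year         = c(2015, 2015, 2016),
  arrival_date_month        = c("July", "August", "January"),
  arrival_date_day_of_month = c(1, 15, 31),
  lead_time                 = c(342, 7, 0)
)

# Number of rows and columns, then column types with a preview of values
dim(toy)
glimpse(toy)

# Minimum and maximum of every numeric column, pivoted so that
# each variable gets one row with a `min` and a `max` column
num_summary <- toy %>%
  summarise(across(where(is.numeric), list(min = min, max = max))) %>%
  pivot_longer(everything(),
               names_to = c("variable", ".value"),
               names_pattern = "(.*)_(min|max)$")
num_summary
```

The pivot step is optional, but a long table with one row per variable is usually easier to read than one very wide row.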

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

Code
htb %>% 
  select(starts_with("arrival"))
# A tibble: 119,390 × 4
   arrival_date_year arrival_date_month arrival_date_week_number arrival_date_…¹
               <dbl> <chr>                                 <dbl>           <dbl>
 1              2015 July                                     27               1
 2              2015 July                                     27               1
 3              2015 July                                     27               1
 4              2015 July                                     27               1
 5              2015 July                                     27               1
 6              2015 July                                     27               1
 7              2015 July                                     27               1
 8              2015 July                                     27               1
 9              2015 July                                     27               1
10              2015 July                                     27               1
# … with 119,380 more rows, and abbreviated variable name
#   ¹​arrival_date_day_of_month
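
Since the arrival date is spread over several columns, it may help to check the pieces before combining them. The sketch below uses made-up rows with the same `arrival_*` column names (one month is misspelled on purpose) and assumes month names are written in English, as in the output above.

```r
library(tidyverse)

# Made-up rows with the same arrival_* column names; not real data.
toy <- tibble(
  arrival_date_year         = c(2015, 2016, 2017),
  arrival_date_month        = c("July", "February", "Augst"),  # typo on purpose
  arrival_date_day_of_month = c(1, 29, 31)
)

# Count rows that would cause trouble when building a single date column
checks <- toy %>%
  summarise(
    n_missing   = sum(is.na(arrival_date_year) |
                        is.na(arrival_date_month) |
                        is.na(arrival_date_day_of_month)),
    n_bad_month = sum(!arrival_date_month %in% month.name),
    n_bad_day   = sum(!between(arrival_date_day_of_month, 1, 31))
  )
checks

# Inspect the offending rows directly
bad_rows <- toy %>%
  filter(!arrival_date_month %in% month.name)
bad_rows
```

The day check is deliberately coarse: it does not catch impossible combinations such as 30 February, which only show up once the date is actually parsed.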

Any additional comments?

Identify variables that need to be mutated

Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

Document your work here.
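
To answer the question about time variables, one option is to parse the date pieces and count how many rows fail. This is a sketch on made-up rows, assuming an English locale for month names; `dmy()` returns `NA` for strings it cannot turn into a valid date, so counting `NA` values works as a sanity check.

```r
library(tidyverse)
library(lubridate)

# Made-up rows; the last one (30 February) is not a valid date.
toy <- tibble(
  arrival_date_day_of_month = c(1, 29, 30),
  arrival_date_month        = c("July", "February", "February"),
  arrival_date_year         = c(2015, 2016, 2017)
)

parsed <- toy %>%
  mutate(date_string  = str_c(arrival_date_day_of_month,
                              arrival_date_month,
                              arrival_date_year, sep = "/"),
         # quiet = TRUE silences the "failed to parse" warning;
         # failures are counted explicitly below instead
         date_arrival = dmy(date_string, quiet = TRUE))

# Number of rows that did not parse, and the rows themselves
n_failed <- sum(is.na(parsed$date_arrival))
n_failed
parsed %>% filter(is.na(date_arrival))
```

Keeping the intermediate `date_string` column makes it easy to see exactly which input caused a failure; it can be dropped once the check passes.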

Code
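# Combine day, month name, and year into one string, then parse it as a date
htb_1 <- htb %>%
  mutate(date_arrival = str_c(arrival_date_day_of_month,
                              arrival_date_month,
                              arrival_date_year, sep = "/"),
         date_arrival = dmy(date_arrival)) %>%
  # date_arrival does not start with "arrival", so it is kept
  select(-starts_with("arrival"))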

summary(htb_1$date_arrival)
        Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
"2015-07-01" "2016-03-13" "2016-09-06" "2016-08-28" "2017-03-18" "2017-08-31" 
Code
# Booking date = arrival date minus lead_time, treated as a number of days
htb_2 <- htb_1 %>%
  mutate(date_booking = date_arrival - days(lead_time))

summary(htb_2$date_booking)
        Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
"2013-06-24" "2015-11-28" "2016-05-04" "2016-05-16" "2016-12-09" "2017-08-31" 
Code
summary(htb_2$reservation_status_date)
        Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
"2014-10-17" "2016-02-01" "2016-08-07" "2016-07-30" "2017-02-08" "2017-09-14" 
Code
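# Whole days from reservation_status_date to date_arrival;
# negative values mean the status date falls after the arrival date.
# Built on htb_2 so that date_booking is carried forward.
htb_3 <- htb_2 %>%
  mutate(change_days = interval(reservation_status_date,
                                date_arrival),
         change_days = change_days %/% days(1))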

summary(htb_3$change_days)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -69.00   -3.00   -1.00   29.68   26.00  526.00
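
The two derived columns can be sanity checked by recomputing them a second way. This is a sketch on made-up rows, not the real data: plain `Date` subtraction should agree with both the `days()` arithmetic and the `interval()` division, and the sign of `change_days` can be turned into a readable label. The labels and the names `toy`, `lead_check`, `change_check`, and `status_timing` are only illustrative.

```r
library(tidyverse)
library(lubridate)

# Made-up bookings; not real data.
toy <- tibble(
  date_arrival            = ymd(c("2016-05-10", "2016-05-10", "2016-05-10")),
  lead_time               = c(342, 7, 0),
  reservation_status_date = ymd(c("2016-05-13", "2016-05-10", "2016-04-01"))
)

toy_checked <- toy %>%
  mutate(
    date_booking = date_arrival - days(lead_time),
    change_days  = interval(reservation_status_date, date_arrival) %/% days(1),
    # Independent recomputation with plain Date subtraction (result in days)
    lead_check   = as.numeric(date_arrival - date_booking),
    change_check = as.numeric(date_arrival - reservation_status_date)
  )

# Both should be TRUE if the mutations behave as intended
lead_ok   <- all(toy_checked$lead_check == toy_checked$lead_time)
change_ok <- all(toy_checked$change_check == toy_checked$change_days)
c(lead_ok = lead_ok, change_ok = change_ok)

# Label the sign of change_days and count each case
timing_counts <- toy_checked %>%
  mutate(status_timing = case_when(
    change_days < 0  ~ "status date after arrival",
    change_days == 0 ~ "status date on arrival day",
    TRUE             ~ "status date before arrival"
  )) %>%
  count(status_timing)
timing_counts
```

On the real data, the large positive maximum in the summary above would fall into the "before arrival" group, while the negative values would fall into the "after arrival" group.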