challenge4_danielmanning

title: “Challenge 4”
author: “Daniel Manning”
description: “More data wrangling: mutate”
date: “1/7/2022”
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_4

library(tidyverse)
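# Install 'here' only if it is missing, so the document does not reinstall it on every render
if (!requireNamespace("here", quietly = TRUE)) install.packages("here")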
library(here)
library(lubridate)
library(readxl)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

read in a data set, and describe the data set using both words and any supporting information (e.g., tables)
tidy data (as needed, including sanity checks)
identify variables that need to be mutated
mutate variables and sanity check all mutations

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

abc_poll.csv ⭐
poultry_tidy.csv⭐⭐
FedFundsRate.csv⭐⭐⭐
hotel_bookings.csv⭐⭐⭐⭐
debt_in_trillions ⭐⭐⭐⭐⭐

hotel_bookings <- here("posts", "_data", "hotel_bookings.csv") %>%
  read_csv()

Briefly describe the data

The “hotel_bookings.csv” dataset consists of 119,390 rows and 32 columns. Each row corresponds to one booking at one of two hotels (“Resort Hotel” or “City Hotel”) between July 2015 and August 2017. The variables include which hotel was booked, whether the reservation was canceled, the length of the stay (split into weekend nights and week nights), who stayed (adults, children, and babies), and some history on previous bookings.
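A description like this can be backed by a few summary calls. The sketch below is only illustrative: `toy_bookings` is a made-up three-row stand-in that borrows column names from the real data, and the summary columns `n_bookings` and `cancel_rate` are hypothetical names, not part of the dataset.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for hotel_bookings; the values are invented for illustration
toy_bookings <- tibble(
  hotel = c("Resort Hotel", "Resort Hotel", "City Hotel"),
  is_canceled = c(0, 1, 0),
  arrival_date_year = c(2015, 2016, 2017)
)

# Number of bookings and share of canceled bookings per hotel
hotel_summary <- toy_bookings %>%
  group_by(hotel) %>%
  summarise(
    n_bookings = n(),
    cancel_rate = mean(is_canceled),
    .groups = "drop"
  )

# First and last arrival year covered by the data
year_range <- range(toy_bookings$arrival_date_year)

hotel_summary
year_range
```

Run against the real `hotel_bookings` tibble, the same calls would give the per-hotel counts and the years spanned by the bookings.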

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

Some tidying is needed, since the arrival date is split across four variables (“arrival_date_year”, “arrival_date_month”, “arrival_date_week_number”, and “arrival_date_day_of_month”). Consolidating these four into a single date variable should reduce the column count by three. Other than that, the dataset is already tidy, since each column represents a variable and each row represents a reservation.

hotel_bookings

# A tibble: 119,390 × 32
   hotel  is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
   <chr>    <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 Resor…       0     342    2015 July         27       1       0       0      2
 2 Resor…       0     737    2015 July         27       1       0       0      2
 3 Resor…       0       7    2015 July         27       1       0       1      1
 4 Resor…       0      13    2015 July         27       1       0       1      1
 5 Resor…       0      14    2015 July         27       1       0       2      2
 6 Resor…       0      14    2015 July         27       1       0       2      2
 7 Resor…       0       0    2015 July         27       1       0       2      2
 8 Resor…       0       9    2015 July         27       1       0       2      2
 9 Resor…       1      85    2015 July         27       1       0       3      2
10 Resor…       1      75    2015 July         27       1       0       3      2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …

nrow(hotel_bookings)

[1] 119390

ncol(hotel_bookings)

[1] 32

# Expected number of columns after consolidating the four arrival-date
# variables into one (net change: -4 + 1 = -3)
ncol(hotel_bookings) - 3

[1] 29
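The expected count can also be derived from the column names instead of hard-coding the 3. This is a minimal sketch on an invented tibble (`toy_cols`, with made-up values); `arrival_cols` and `expected_cols` are illustrative names, and the prefix `"arrival_date"` is assumed to match only the columns being consolidated.

```r
library(tibble)

# Hypothetical stand-in with the four arrival-date columns plus one other column
toy_cols <- tibble(
  hotel = c("Resort Hotel", "City Hotel"),
  arrival_date_year = c(2015, 2017),
  arrival_date_month = c("July", "August"),
  arrival_date_week_number = c(27, 35),
  arrival_date_day_of_month = c(1, 31)
)

# Columns that will be replaced by a single date column
arrival_cols <- names(toy_cols)[startsWith(names(toy_cols), "arrival_date")]

# Drop every arrival-date column, then add one consolidated date column
expected_cols <- ncol(toy_cols) - length(arrival_cols) + 1

arrival_cols
expected_cols
```

Computing the number this way keeps the check valid if the set of arrival-date columns ever changes.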

Any additional comments?

Identify variables that need to be mutated

Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

Document your work here.

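I combined the variables that code the arrival date into a single Date variable called “date_arrival” by pasting the day, month, and year into a day/month/year string and parsing it with dmy(); the original arrival variables were then dropped. I also created a new variable “total_stay” by summing “stays_in_weekend_nights” and “stays_in_week_nights”. Lastly, I created a new variable “total_people” by summing “adults”, “children”, and “babies”. As a sanity check, the result should have 31 columns: the 29 expected after consolidating the arrival date, plus the two new totals.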

bookings <- hotel_bookings %>%
  mutate(date_arrival=str_c(arrival_date_day_of_month, 
                            arrival_date_month, 
                            arrival_date_year, sep="/"),
         date_arrival=dmy(date_arrival)) %>%
  select(-starts_with("arrival"))

bookings <- bookings %>%
  mutate(total_stay = stays_in_weekend_nights + stays_in_week_nights,
         total_people = adults + children + babies)
bookings

# A tibble: 119,390 × 31
   hotel     is_ca…¹ lead_…² stays…³ stays…⁴ adults child…⁵ babies meal  country
   <chr>       <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl> <chr> <chr>  
 1 Resort H…       0     342       0       0      2       0      0 BB    PRT    
 2 Resort H…       0     737       0       0      2       0      0 BB    PRT    
 3 Resort H…       0       7       0       1      1       0      0 BB    GBR    
 4 Resort H…       0      13       0       1      1       0      0 BB    GBR    
 5 Resort H…       0      14       0       2      2       0      0 BB    GBR    
 6 Resort H…       0      14       0       2      2       0      0 BB    GBR    
 7 Resort H…       0       0       0       2      2       0      0 BB    PRT    
 8 Resort H…       0       9       0       2      2       0      0 FB    PRT    
 9 Resort H…       1      85       0       3      2       0      0 BB    PRT    
10 Resort H…       1      75       0       3      2       0      0 HB    PRT    
# … with 119,380 more rows, 21 more variables: market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
#   total_of_special_requests <dbl>, reservation_status <chr>, …
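The mutations above can be sanity checked beyond the column count. The following is a sketch on an invented tibble (`toy_dates`), not the author's code: it keeps the arrival columns so the parsed date can be compared against them, and it deliberately includes one missing value in `children` to show that `+` propagates `NA` into `total_people`. The check names (`n_bad_dates`, `dates_match`, `n_na_people`) are hypothetical, and parsing month names with `dmy()` is assumed to run in an English locale.

```r
library(dplyr)
library(tibble)
library(stringr)
library(lubridate)

# Hypothetical stand-in for hotel_bookings; the values are invented for illustration
toy_dates <- tibble(
  arrival_date_year = c(2015, 2016, 2017),
  arrival_date_month = c("July", "February", "August"),
  arrival_date_day_of_month = c(1, 29, 31),
  stays_in_weekend_nights = c(0, 2, 1),
  stays_in_week_nights = c(1, 3, 0),
  adults = c(2, 1, 2),
  children = c(0, NA, 1),
  babies = c(0, 0, 1)
)

toy_mutated <- toy_dates %>%
  mutate(
    date_arrival = dmy(str_c(arrival_date_day_of_month,
                             arrival_date_month,
                             arrival_date_year, sep = "/")),
    total_stay = stays_in_weekend_nights + stays_in_week_nights,
    total_people = adults + children + babies
  )

# Check 1: count dates that failed to parse (dmy() returns NA for those)
n_bad_dates <- sum(is.na(toy_mutated$date_arrival))

# Check 2: the parsed date agrees with the original year and day columns
dates_match <- all(
  year(toy_mutated$date_arrival) == toy_mutated$arrival_date_year &
    day(toy_mutated$date_arrival) == toy_mutated$arrival_date_day_of_month
)

# Check 3: count rows where a missing input made total_people missing
n_na_people <- sum(is.na(toy_mutated$total_people))

n_bad_dates
dates_match
n_na_people
```

If the third check reports missing totals on the real data, one option is to decide explicitly how to treat them, for example by replacing missing counts with zero before summing, and to document that choice.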

Any additional comments?