challenge_4
More data wrangling: mutate
Author

Rishita Golla

Published

January 2, 2022

Code
 #| label: setup
 #| warning: false
 #| message: false

 library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
 library(lubridate)

Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union
Code
 knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

Code
 library(readr)
 data <- read_csv("_data/hotel_bookings.csv")
 view(data)

Briefly describe the data

Describing data based on challenge 2’s observations- Using the view() command, we see that the there are 119,390 rows for hotel entries and 32 columns for information such as reserved room type, deposit type, reservation status, adults (number), country etc. After analyzing the summary() and above commands in R, we can conclude that the data has been collected from 178 countries over a span of 4 years (2014 - 17), going by the reservation dates. We can further conclude that most of the data collected is from 2016 - 17 as the Q1 of reservation status date lies in 2016. It is also clear that there are two kind of hotels (Resort and City Hotel), and 8 types of market segments. From the summary we can also observe that most bookings have adults followed by children and then babies. Further, on an average stay is booked for 2.5 days during week days and around for a day during weekends.

Tidy Data

Looking at the data wrt arrival we have four fields - arrival_date_day_of_month, arrival_date_month, arrival_date_year, arrival_date_week_number. Using mutate the first three fields can be combined into a single arrival date field and the fourth field - arrival_date_week_number can be dropped as this is not really useful for our analysis.

Code
 data <- data[,-6]

Identify variables that need to be mutated

Since we have arrival date, month, year in three different columns we can combine the fields into one single field. The final column is in date/month/year format. We combine the fields using str_c command.

Code
 bookings<-data%>%
   mutate(date_arrival = str_c(arrival_date_day_of_month,
                               arrival_date_month,
                               arrival_date_year, sep="/"),
          date_arrival = dmy(date_arrival))%>%
  select(-starts_with("arrival"))
Code
 view(bookings)

We see that there is now a single field for arrival date information.

Additionally, we could also get booking time using the arrival date-lead time. This helps us understand how ahead people plan a vacation given the season.

Code
bookings<-bookings%>%
   mutate(date_booking = date_arrival-days(lead_time))

 view(bookings$date_booking)