Challenge 4: Hotel Data

challenge_4
Dirichi Umunna
hotel_bookings
Making Hotel Data useable
Author

Dirichi Umunna

Published

April 18, 2023

Code
library(tidyverse)
library(dplyr)
library(lubridate)
library(stringr)
library(summarytools)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Introduction

In this blog post we explore the hotel bookings dataset with the aim of transforming it into a tidy format. Our tasks for today are: describing the dataset in words and with supporting output, tidying the data and performing sanity checks, identifying variables that require mutation, mutating those variables, and performing a final sanity check to confirm that the mutations worked as intended. By the end of this exercise, we should have a clean and well-organized dataset, ready for further analysis and interpretation.

Code
#read in data
newhotel <- read_csv("_data/hotel_bookings.csv")
head(newhotel)
# A tibble: 6 × 32
  hotel   is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
  <chr>     <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
1 Resort…       0     342    2015 July         27       1       0       0      2
2 Resort…       0     737    2015 July         27       1       0       0      2
3 Resort…       0       7    2015 July         27       1       0       1      1
4 Resort…       0      13    2015 July         27       1       0       1      1
5 Resort…       0      14    2015 July         27       1       0       2      2
6 Resort…       0      14    2015 July         27       1       0       2      2
# … with 22 more variables: children <dbl>, babies <dbl>, meal <chr>,
#   country <chr>, market_segment <chr>, distribution_channel <chr>,
#   is_repeated_guest <dbl>, previous_cancellations <dbl>,
#   previous_bookings_not_canceled <dbl>, reserved_room_type <chr>,
#   assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>,
#   agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
Code
dim(newhotel)
[1] 119390     32

Description

The hotel dataset is a collection of hotel booking records: 119,390 rows and 32 columns, with one row per booking. It records which hotel the booking was made at (the hotel column) and the type of meal booked (the meal column). It also includes information about the booking itself, such as the lead time, the number of adults, children, and babies, and whether the booking was cancelled. This makes it a potentially valuable resource for exploring patterns in hotel bookings and understanding customer behavior.

The Problem

Suppose we want to analyze the average daily rate (the adr column) of hotel bookings over a certain period of time. To achieve this, we first need to clean the dataset and extract the specific variables of interest. One challenge is that the arrival date is split across several columns (year, month name, week number, and day of month) rather than stored as a single date. Combining these into one date column makes the data more manageable and informative for our analysis.

Code
#let us collect the names of the arrival columns (a character vector, not a dataframe)

arrival_cols <- grep("^arrival", colnames(newhotel), value = TRUE)
arrival_cols
[1] "arrival_date_year"         "arrival_date_month"       
[3] "arrival_date_week_number"  "arrival_date_day_of_month"
Code
# Concatenate day, month, and year into one "day/month/year" string, then parse it as a date with dmy().
# select(-starts_with("arrival")) then drops all four original arrival columns, including arrival_date_week_number.
newhotel <- newhotel %>%
  mutate(dateofarrival = str_c(arrival_date_day_of_month,
                               arrival_date_month,
                               arrival_date_year, sep = "/"),
         dateofarrival = dmy(dateofarrival)) %>%
  select(-starts_with("arrival"))

head(newhotel)
# A tibble: 6 × 29
  hotel      is_ca…¹ lead_…² stays…³ stays…⁴ adults child…⁵ babies meal  country
  <chr>        <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl> <chr> <chr>  
1 Resort Ho…       0     342       0       0      2       0      0 BB    PRT    
2 Resort Ho…       0     737       0       0      2       0      0 BB    PRT    
3 Resort Ho…       0       7       0       1      1       0      0 BB    GBR    
4 Resort Ho…       0      13       0       1      1       0      0 BB    GBR    
5 Resort Ho…       0      14       0       2      2       0      0 BB    GBR    
6 Resort Ho…       0      14       0       2      2       0      0 BB    GBR    
# … with 19 more variables: market_segment <chr>, distribution_channel <chr>,
#   is_repeated_guest <dbl>, previous_cancellations <dbl>,
#   previous_bookings_not_canceled <dbl>, reserved_room_type <chr>,
#   assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>,
#   agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
#   total_of_special_requests <dbl>, reservation_status <chr>, …
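
Since the new date column replaces the original arrival columns, it is worth checking that every row parsed before those columns are dropped. The following is a minimal sketch on made-up toy data (the toy and toy_dates names and all values are assumptions for illustration, not part of the dataset). It relies on dmy() returning NA for strings it cannot parse, and on month names being parsed according to the session locale (English here).

```r
library(tidyverse)
library(lubridate)

# Toy data mirroring the three arrival columns (values are made up)
toy <- tibble(
  arrival_date_year = c(2015, 2016, 2017),
  arrival_date_month = c("July", "February", "December"),
  arrival_date_day_of_month = c(1, 29, 31)
)

# Same concatenate-then-parse step as above, but keeping the source columns
toy_dates <- toy %>%
  mutate(dateofarrival = dmy(str_c(arrival_date_day_of_month,
                                   arrival_date_month,
                                   arrival_date_year, sep = "/")))

# dmy() yields NA for unparseable strings, so a count of NAs
# is a quick check that every row converted
sum(is.na(toy_dates$dateofarrival))

# The parsed year and day should agree with the original columns
all(year(toy_dates$dateofarrival) == toy_dates$arrival_date_year)
all(day(toy_dates$dateofarrival) == toy_dates$arrival_date_day_of_month)
```

Running the same two checks on newhotel before the select() step would confirm the mutation on the real data.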

Tidy Data

Now that the dates are easier to manipulate, let’s shift our focus to the adr (average daily rate) column. Our next step is to check whether this variable requires any modification, starting with missing values.

Code
# Check for missing values in adr column using is.na()
sum(is.na(newhotel$adr))
[1] 0
Code
#thankfully there is no missing data here.

#let us perform a sanity check on our dimensions
dim(newhotel)
[1] 119390     29
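
A missing-value count does not catch implausible values, so a range check may be a useful companion. The sketch below uses made-up toy values (toy_adr, suspect, and the 0 and 1000 bounds are all assumptions for illustration, not properties of the hotel data).

```r
library(tidyverse)

# Toy ADR values (made up): one negative and one extreme value
toy_adr <- tibble(adr = c(75.5, 0, 120, -10, 2500))

# Minimum, quartiles, mean, and maximum give a first look at the range
summary(toy_adr$adr)

# Flag rows outside a plausible range.
# The bounds 0 and 1000 are illustrative assumptions, not rules from the dataset.
suspect <- toy_adr %>%
  filter(adr < 0 | adr > 1000)
suspect
```

Whether flagged rows should be dropped, capped, or kept depends on the analysis, so this only surfaces them for inspection.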
Code
#let us rearrange this for further emphasis on our variables

newhotelfinal <- newhotel %>% 
  select(dateofarrival, adr, everything())
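
With a single date column in place, summarising adr over time becomes a short pipeline. The following is a minimal sketch on made-up toy data (toy_final, monthly_adr, and the column names arrival_month, mean_adr, and n_bookings are assumptions for illustration); the same steps should apply to newhotelfinal.

```r
library(tidyverse)
library(lubridate)

# Toy stand-in for newhotelfinal (values are made up)
toy_final <- tibble(
  dateofarrival = as.Date(c("2015-07-01", "2015-07-15",
                            "2015-08-03", "2015-08-20")),
  adr = c(80, 100, 120, 140)
)

# Round each arrival date down to the first of its month,
# then average adr within each month
monthly_adr <- toy_final %>%
  mutate(arrival_month = floor_date(dateofarrival, unit = "month")) %>%
  group_by(arrival_month) %>%
  summarise(mean_adr = mean(adr),
            n_bookings = n(),
            .groups = "drop")
monthly_adr
```

Grouping on floor_date() rather than on the month name keeps the year attached, so July 2015 and July 2016 stay separate.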

Conclusion

With a single arrival date column and an adr column confirmed to have no missing values, our dataset is now ready for deeper analysis and exploration of trends over time. By combining and tidying these key variables, we have set the stage for uncovering patterns in the data.