library(tidyverse)
library(lubridate)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Challenge 4
Setup
Challenge Overview
Today’s challenge is to:
- read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data (as needed, including sanity checks)
- identify variables that need to be mutated
- mutate variables and sanity check all mutations
Read in data
We will be using the following dataset for this challenge:
- hotel_bookings.csv⭐⭐⭐⭐
Code
<- read.csv("_data/hotel_bookings.csv")
bookings head(bookings)
hotel is_canceled lead_time arrival_date_year arrival_date_month
1 Resort Hotel 0 342 2015 July
2 Resort Hotel 0 737 2015 July
3 Resort Hotel 0 7 2015 July
4 Resort Hotel 0 13 2015 July
5 Resort Hotel 0 14 2015 July
6 Resort Hotel 0 14 2015 July
arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
1 27 1 0
2 27 1 0
3 27 1 0
4 27 1 0
5 27 1 0
6 27 1 0
stays_in_week_nights adults children babies meal country market_segment
1 0 2 0 0 BB PRT Direct
2 0 2 0 0 BB PRT Direct
3 1 1 0 0 BB GBR Direct
4 1 1 0 0 BB GBR Corporate
5 2 2 0 0 BB GBR Online TA
6 2 2 0 0 BB GBR Online TA
distribution_channel is_repeated_guest previous_cancellations
1 Direct 0 0
2 Direct 0 0
3 Direct 0 0
4 Corporate 0 0
5 TA/TO 0 0
6 TA/TO 0 0
previous_bookings_not_canceled reserved_room_type assigned_room_type
1 0 C C
2 0 C C
3 0 A C
4 0 A A
5 0 A A
6 0 A A
booking_changes deposit_type agent company days_in_waiting_list customer_type
1 3 No Deposit NULL NULL 0 Transient
2 4 No Deposit NULL NULL 0 Transient
3 0 No Deposit NULL NULL 0 Transient
4 0 No Deposit 304 NULL 0 Transient
5 0 No Deposit 240 NULL 0 Transient
6 0 No Deposit 240 NULL 0 Transient
adr required_car_parking_spaces total_of_special_requests reservation_status
1 0 0 0 Check-Out
2 0 0 0 Check-Out
3 75 0 0 Check-Out
4 75 0 0 Check-Out
5 98 0 1 Check-Out
6 98 0 1 Check-Out
reservation_status_date
1 2015-07-01
2 2015-07-01
3 2015-07-02
4 2015-07-02
5 2015-07-03
6 2015-07-03
Briefly describe the data
Code
dim(bookings)
[1] 119390 32
Code
colnames(bookings)
[1] "hotel" "is_canceled"
[3] "lead_time" "arrival_date_year"
[5] "arrival_date_month" "arrival_date_week_number"
[7] "arrival_date_day_of_month" "stays_in_weekend_nights"
[9] "stays_in_week_nights" "adults"
[11] "children" "babies"
[13] "meal" "country"
[15] "market_segment" "distribution_channel"
[17] "is_repeated_guest" "previous_cancellations"
[19] "previous_bookings_not_canceled" "reserved_room_type"
[21] "assigned_room_type" "booking_changes"
[23] "deposit_type" "agent"
[25] "company" "days_in_waiting_list"
[27] "customer_type" "adr"
[29] "required_car_parking_spaces" "total_of_special_requests"
[31] "reservation_status" "reservation_status_date"
This dataset contains the booking details for a number of hotels. This is a very comprehensive dataset having 32 variables. It covers various details such as the details of the booking party (adults, children, babies, meal, country, etc), details of the hotel(hotel, reserved room type, assigned room type, etc) and details of the booking itself(arrival date details, reservation status details, etc).
Tidy Data (as needed)
We can do some simple checks to see if the dataset is tidy. Lets check if there are any duplicate rows or null values in the dataset
print(anyDuplicated(bookings) == 0)
[1] FALSE
print(sum(is.na(bookings)) == 0)
[1] FALSE
We can see that there are no duplicate rows or null rows. The dataset is tidy.
Identify variables that need to be mutated
We can see that the arrival details have been split into 4 columns:
Code
grep("arrival",names(bookings), value = TRUE)
[1] "arrival_date_year" "arrival_date_month"
[3] "arrival_date_week_number" "arrival_date_day_of_month"
Instead we can convert this into a single variable which will hold the arrival date in a standard format. In this case we will convert it to the same format as “reservation_status_date” variable (YYYY-MM-DD).
Code
$date_str <- paste(bookings$arrival_date_year, bookings$arrival_date_month, bookings$arrival_date_day_of_month, sep = "-")
bookings
$arrival_date <- as.Date(bookings$date_str, format = "%Y-%B-%d")
bookings
<- bookings[, !(names(bookings) %in% "date_str")]
bookings
head(select(bookings, arrival_date, reservation_status_date))
arrival_date reservation_status_date
1 2015-07-01 2015-07-01
2 2015-07-01 2015-07-01
3 2015-07-01 2015-07-02
4 2015-07-01 2015-07-02
5 2015-07-01 2015-07-03
6 2015-07-01 2015-07-03
Now we can remove the original 4 arrival variables:
Code
<- select(bookings, -arrival_date_year, -arrival_date_month, -arrival_date_week_number, -arrival_date_day_of_month)
bookings colnames(bookings)
[1] "hotel" "is_canceled"
[3] "lead_time" "stays_in_weekend_nights"
[5] "stays_in_week_nights" "adults"
[7] "children" "babies"
[9] "meal" "country"
[11] "market_segment" "distribution_channel"
[13] "is_repeated_guest" "previous_cancellations"
[15] "previous_bookings_not_canceled" "reserved_room_type"
[17] "assigned_room_type" "booking_changes"
[19] "deposit_type" "agent"
[21] "company" "days_in_waiting_list"
[23] "customer_type" "adr"
[25] "required_car_parking_spaces" "total_of_special_requests"
[27] "reservation_status" "reservation_status_date"
[29] "arrival_date"
The lead_time variable tells us the number of days between the reservation date and the arrival date. Therefore, using the lead_time variable, we can add a variable to store the day the reservation was made.
Code
$reservation_booking_date <- bookings$arrival_date - days(bookings$lead_time)
bookingshead(select(bookings, lead_time, arrival_date, reservation_booking_date))
lead_time arrival_date reservation_booking_date
1 342 2015-07-01 2014-07-24
2 737 2015-07-01 2013-06-24
3 7 2015-07-01 2015-06-24
4 13 2015-07-01 2015-06-18
5 14 2015-07-01 2015-06-17
6 14 2015-07-01 2015-06-17
We can look at the final dataset after the mutations:
Code
colnames(bookings)
[1] "hotel" "is_canceled"
[3] "lead_time" "stays_in_weekend_nights"
[5] "stays_in_week_nights" "adults"
[7] "children" "babies"
[9] "meal" "country"
[11] "market_segment" "distribution_channel"
[13] "is_repeated_guest" "previous_cancellations"
[15] "previous_bookings_not_canceled" "reserved_room_type"
[17] "assigned_room_type" "booking_changes"
[19] "deposit_type" "agent"
[21] "company" "days_in_waiting_list"
[23] "customer_type" "adr"
[25] "required_car_parking_spaces" "total_of_special_requests"
[27] "reservation_status" "reservation_status_date"
[29] "arrival_date" "reservation_booking_date"
With this our dataset has been mutated to a more convenient format for further analysis.