Code
library(tidyverse)
library(lubridate)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Neeharika Karanam
December 2, 2022
# A tibble: 119,390 × 32
hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Resor… 0 342 2015 July 27 1 0 0 2
2 Resor… 0 737 2015 July 27 1 0 0 2
3 Resor… 0 7 2015 July 27 1 0 1 1
4 Resor… 0 13 2015 July 27 1 0 1 1
5 Resor… 0 14 2015 July 27 1 0 2 2
6 Resor… 0 14 2015 July 27 1 0 2 2
7 Resor… 0 0 2015 July 27 1 0 2 2
8 Resor… 0 9 2015 July 27 1 0 2 2
9 Resor… 1 85 2015 July 27 1 0 3 2
10 Resor… 1 75 2015 July 27 1 0 3 2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
# meal <chr>, country <chr>, market_segment <chr>,
# distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
In order to check the different unique values in all of the columns I have used the lubridate package.
hotel is_canceled
2 2
lead_time arrival_date_year
479 3
arrival_date_month arrival_date_week_number
12 53
arrival_date_day_of_month stays_in_weekend_nights
31 17
stays_in_week_nights adults
35 14
children babies
6 5
meal country
5 178
market_segment distribution_channel
8 5
is_repeated_guest previous_cancellations
2 15
previous_bookings_not_canceled reserved_room_type
73 10
assigned_room_type booking_changes
12 21
deposit_type agent
3 334
company days_in_waiting_list
353 128
customer_type adr
4 8879
required_car_parking_spaces total_of_special_requests
5 6
reservation_status reservation_status_date
3 926
From the above dataset I have observed that the hotel bookings dataset has 119390 rows and 31 columns also consisting of data from mainly two different hotels called the City Hotel and the Resort Hotel. Each row in the dataset corresponds to the hotel bookings made by each of the customer and it also has a lot of additional information like the arrival date, stay days, meal, type of the room, customer, type, reservation status, number of the adult/children/babies in the room, previous cancellations, if there is a car parking space and lots more information and this data is from different parts of the world.
I would like to tidy the data in the following way.
ABW AGO AIA ALB AND ARE ARG ARM ASM ATA ATF AUS AUT
2 362 1 12 7 51 214 8 1 2 1 426 1263
AZE BDI BEL BEN BFA BGD BGR BHR BHS BIH BLR BOL BRA
17 1 2342 3 1 12 75 5 1 13 26 10 2224
BRB BWA CAF CHE CHL CHN CIV CMR CN COL COM CPV CRI
4 1 5 1730 65 999 6 10 1279 71 2 24 19
CUB CYM CYP CZE DEU DJI DMA DNK DOM DZA ECU EGY ESP
8 1 51 171 7287 1 1 435 14 103 27 32 8568
EST ETH FIN FJI FRA FRO GAB GBR GEO GGY GHA GIB GLP
83 3 447 1 10415 5 4 12129 22 3 4 18 2
GNB GRC GTM GUY HKG HND HRV HUN IDN IMN IND IRL IRN
9 128 4 1 29 1 100 230 35 2 152 3375 83
IRQ ISL ISR ITA JAM JEY JOR JPN KAZ KEN KHM KIR KNA
14 57 669 3766 6 8 21 197 19 6 2 1 2
KOR KWT LAO LBN LBY LCA LIE LKA LTU LUX LVA MAC MAR
133 16 2 31 8 1 3 7 81 287 55 16 259
MCO MDG MDV MEX MKD MLI MLT MMR MNE MOZ MRT MUS MWI
4 1 12 85 10 1 18 1 5 67 1 7 2
MYS MYT NAM NCL NGA NIC NLD NOR NPL NULL NZL OMN PAK
28 2 1 1 34 1 2104 607 1 488 74 18 14
PAN PER PHL PLW POL PRI PRT PRY PYF QAT ROU RUS RWA
9 29 40 1 919 12 48590 4 1 15 500 632 2
SAU SDN SEN SGP SLE SLV SMR SRB STP SUR SVK SVN SWE
48 1 11 39 1 2 1 101 2 5 65 57 1024
SYC SYR TGO THA TJK TMP TUN TUR TWN TZA UGA UKR UMI
2 3 2 59 9 3 39 248 51 5 2 68 1
URY USA UZB VEN VGB VNM ZAF ZMB ZWE
32 2097 4 26 1 8 80 2 4
From the above we can observe that there is a NULL value in the country column and we can remove this value because this information will not help us in futher analysis.
hotel is_canceled
"character" "numeric"
lead_time arrival_date_year
"numeric" "numeric"
arrival_date_month arrival_date_week_number
"character" "numeric"
arrival_date_day_of_month stays_in_weekend_nights
"numeric" "numeric"
stays_in_week_nights adults
"numeric" "numeric"
children babies
"numeric" "numeric"
meal country
"character" "character"
market_segment distribution_channel
"character" "character"
is_repeated_guest previous_cancellations
"numeric" "numeric"
previous_bookings_not_canceled reserved_room_type
"numeric" "character"
assigned_room_type booking_changes
"character" "numeric"
deposit_type agent
"character" "character"
company days_in_waiting_list
"character" "numeric"
customer_type adr
"character" "numeric"
required_car_parking_spaces total_of_special_requests
"numeric" "numeric"
reservation_status reservation_status_date
"character" "Date"
We can now understand that there are 2 different fields called the agents and the company which have numerical values in them but has their class described as character. Therefore, these NULL values can be modified from NULL to NA and we can change the type of the column to numeric. We can also combine the arrival date in year, month and the date into one single field and name it as the arrival date.
I have identified the various variables that can be mutated and have performed the mutation.
#Mutating the arrival date into a single field and also mutate the adults, babies and children in order to get the total guests in the hotel.
hotels_booking_mutate <- hotels_booking %>%
mutate(arrival_date = str_c(arrival_date_day_of_month,
arrival_date_month,
arrival_date_year, sep="/"),
arrival_date = dmy(arrival_date),
total_guests = adults + children + babies) %>%
select(-c(arrival_date_day_of_month,arrival_date_month,arrival_date_year))
hotels_booking_mutate
# A tibble: 118,902 × 31
hotel is_ca…¹ lead_…² arriv…³ stays…⁴ stays…⁵ adults child…⁶ babies meal
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Resort H… 0 342 27 0 0 2 0 0 BB
2 Resort H… 0 737 27 0 0 2 0 0 BB
3 Resort H… 0 7 27 0 1 1 0 0 BB
4 Resort H… 0 13 27 0 1 1 0 0 BB
5 Resort H… 0 14 27 0 2 2 0 0 BB
6 Resort H… 0 14 27 0 2 2 0 0 BB
7 Resort H… 0 0 27 0 2 2 0 0 BB
8 Resort H… 0 9 27 0 2 2 0 0 FB
9 Resort H… 1 85 27 0 3 2 0 0 BB
10 Resort H… 1 75 27 0 3 2 0 0 HB
# … with 118,892 more rows, 21 more variables: country <chr>,
# market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
# total_of_special_requests <dbl>, reservation_status <chr>, …
After the mutation now lets find the range of the arrival date of the different bookings of the data
Min. 1st Qu. Median Mean 3rd Qu. Max.
"2015-07-01" "2016-03-14" "2016-09-07" "2016-08-29" "2017-03-19" "2017-08-31"
We can now observe that the arrival dates now lie between the July 2015 - August 2017. Also, the lead time will help us in understanding when the total hotel reservation has been started and we can also calculate the booking date by subtracting the lead time from the arrival date.
# A tibble: 118,902 × 32
hotel is_ca…¹ lead_…² arriv…³ stays…⁴ stays…⁵ adults child…⁶ babies meal
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Resort H… 0 342 27 0 0 2 0 0 BB
2 Resort H… 0 737 27 0 0 2 0 0 BB
3 Resort H… 0 7 27 0 1 1 0 0 BB
4 Resort H… 0 13 27 0 1 1 0 0 BB
5 Resort H… 0 14 27 0 2 2 0 0 BB
6 Resort H… 0 14 27 0 2 2 0 0 BB
7 Resort H… 0 0 27 0 2 2 0 0 BB
8 Resort H… 0 9 27 0 2 2 0 0 FB
9 Resort H… 1 85 27 0 3 2 0 0 BB
10 Resort H… 1 75 27 0 3 2 0 0 HB
# … with 118,892 more rows, 22 more variables: country <chr>,
# market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
# total_of_special_requests <dbl>, reservation_status <chr>, …
#We can find the number of days after booking when the status was changed to cancelled
hotel_booking_cancellation <- hotels_booking_mutate %>%
filter(reservation_status == 'Canceled') %>%
mutate(canc_time = booking_date - reservation_status_date)
hotel_booking_cancellation %>%
summarise(min = min(canc_time), max = max(canc_time), mean = mean(canc_time), median = median(canc_time))
# A tibble: 1 × 4
min max mean median
<drtn> <drtn> <drtn> <drtn>
1 -584 days 0 days -58.9916 days -26 days
Now we can change the class of the column of the agent and the company from character to numeric.
[1] TRUE
---
title: "Challenge 4"
author: "Neeharika Karanam"
desription: "More data wrangling: pivoting"
date: "12/02/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_4
- abc_poll
- eggs
- fed_rates
- hotel_bookings
- debt
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(lubridate)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Read in data
- hotel_bookings.csv⭐⭐⭐⭐
```{r}
# Reading in the hotels booking data
hotels_booking <- read_csv("_data/hotel_bookings.csv", show_col_types = FALSE)
hotels_booking
```
In order to check the different unique values in all of the columns I have used the lubridate package.
```{r}
rapply(hotels_booking,function(x)length(unique(x)))
```
### Briefly describe the data
From the above dataset I have observed that the hotel bookings dataset has 119390 rows and 31 columns also consisting of data from mainly two different hotels called the City Hotel and the Resort Hotel. Each row in the dataset corresponds to the hotel bookings made by each of the customer and it also has a lot of additional information like the arrival date, stay days, meal, type of the room, customer, type, reservation status, number of the adult/children/babies in the room, previous cancellations, if there is a car parking space and lots more information and this data is from different parts of the world.
## Tidy Data (as needed)
I would like to tidy the data in the following way.
```{r}
table(hotels_booking$country)
```
From the above we can observe that there is a NULL value in the country column and we can remove this value because this information will not help us in futher analysis.
```{r}
hotels_booking <- hotels_booking %>%
filter(!(country == "NULL"))
```
```{r}
#To understand the class for each of the columns in the dataset.
sapply(hotels_booking, class)
```
We can now understand that there are 2 different fields called the agents and the company which have numerical values in them but has their class described as character. Therefore, these NULL values can be modified from NULL to NA and we can change the type of the column to numeric. We can also combine the arrival date in year, month and the date into one single field and name it as the arrival date.
## Identify variables that need to be mutated
I have identified the various variables that can be mutated and have performed the mutation.
```{r}
#Mutating the arrival date into a single field and also mutate the adults, babies and children in order to get the total guests in the hotel.
hotels_booking_mutate <- hotels_booking %>%
mutate(arrival_date = str_c(arrival_date_day_of_month,
arrival_date_month,
arrival_date_year, sep="/"),
arrival_date = dmy(arrival_date),
total_guests = adults + children + babies) %>%
select(-c(arrival_date_day_of_month,arrival_date_month,arrival_date_year))
hotels_booking_mutate
```
After the mutation now lets find the range of the arrival date of the different bookings of the data
```{r}
summary(hotels_booking_mutate$arrival_date)
```
We can now observe that the arrival dates now lie between the July 2015 - August 2017. Also, the lead time will help us in understanding when the total hotel reservation has been started and we can also calculate the booking date by subtracting the lead time from the arrival date.
```{r}
hotels_booking_mutate <- hotels_booking_mutate %>%
mutate(booking_date = arrival_date - lead_time)
hotels_booking_mutate
```
```{r}
unique(hotels_booking$reservation_status)
```
```{r}
#We can find the number of days after booking when the status was changed to cancelled
hotel_booking_cancellation <- hotels_booking_mutate %>%
filter(reservation_status == 'Canceled') %>%
mutate(canc_time = booking_date - reservation_status_date)
hotel_booking_cancellation %>%
summarise(min = min(canc_time), max = max(canc_time), mean = mean(canc_time), median = median(canc_time))
```
Now we can change the class of the column of the agent and the company from character to numeric.
```{r}
#Mutating the class of the agent and company field from character to numeric
hotels_booking_mutate <- hotels_booking_mutate %>%
mutate(across(c(agent, company),~ replace(.,str_detect(., "NULL"), NA))) %>% mutate_at(vars(agent, company),as.numeric)
is.numeric(hotels_booking_mutate$agent)
```
```{r}
is.numeric(hotels_booking_mutate$company)
```