library(tidyverse)
library(lubridate)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Sai Pranav Kurly
April 22, 2023
# A tibble: 119,390 × 32
hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Resor… 0 342 2015 July 27 1 0 0 2
2 Resor… 0 737 2015 July 27 1 0 0 2
3 Resor… 0 7 2015 July 27 1 0 1 1
4 Resor… 0 13 2015 July 27 1 0 1 1
5 Resor… 0 14 2015 July 27 1 0 2 2
6 Resor… 0 14 2015 July 27 1 0 2 2
7 Resor… 0 0 2015 July 27 1 0 2 2
8 Resor… 0 9 2015 July 27 1 0 2 2
9 Resor… 1 85 2015 July 27 1 0 3 2
10 Resor… 1 75 2015 July 27 1 0 3 2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
# meal <chr>, country <chr>, market_segment <chr>,
# distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
This dataset summarizes hotel bookings with arrival dates between July 2015 and August 2017. It comprises 119,390 records and 32 columns and covers two types of hotels: resort hotels and city hotels. Customers book these hotels from all over the world; the country column holds close to 180 distinct country codes. On average, approximately 37% of bookings are cancelled and approximately 3% come from repeat guests. Bookings spend about 2.3 days on the waiting list on average and carry about 0.57 special requests each; that figure is the mean of total_of_special_requests, not the share of bookings that include a request. Customers can choose from four different meal plans at the hotels.
ABW AGO AIA ALB AND ARE ARG ARM ASM ATA ATF AUS AUT
2 362 1 12 7 51 214 8 1 2 1 426 1263
AZE BDI BEL BEN BFA BGD BGR BHR BHS BIH BLR BOL BRA
17 1 2342 3 1 12 75 5 1 13 26 10 2224
BRB BWA CAF CHE CHL CHN CIV CMR CN COL COM CPV CRI
4 1 5 1730 65 999 6 10 1279 71 2 24 19
CUB CYM CYP CZE DEU DJI DMA DNK DOM DZA ECU EGY ESP
8 1 51 171 7287 1 1 435 14 103 27 32 8568
EST ETH FIN FJI FRA FRO GAB GBR GEO GGY GHA GIB GLP
83 3 447 1 10415 5 4 12129 22 3 4 18 2
GNB GRC GTM GUY HKG HND HRV HUN IDN IMN IND IRL IRN
9 128 4 1 29 1 100 230 35 2 152 3375 83
IRQ ISL ISR ITA JAM JEY JOR JPN KAZ KEN KHM KIR KNA
14 57 669 3766 6 8 21 197 19 6 2 1 2
KOR KWT LAO LBN LBY LCA LIE LKA LTU LUX LVA MAC MAR
133 16 2 31 8 1 3 7 81 287 55 16 259
MCO MDG MDV MEX MKD MLI MLT MMR MNE MOZ MRT MUS MWI
4 1 12 85 10 1 18 1 5 67 1 7 2
MYS MYT NAM NCL NGA NIC NLD NOR NPL NULL NZL OMN PAK
28 2 1 1 34 1 2104 607 1 488 74 18 14
PAN PER PHL PLW POL PRI PRT PRY PYF QAT ROU RUS RWA
9 29 40 1 919 12 48590 4 1 15 500 632 2
SAU SDN SEN SGP SLE SLV SMR SRB STP SUR SVK SVN SWE
48 1 11 39 1 2 1 101 2 5 65 57 1024
SYC SYR TGO THA TJK TMP TUN TUR TWN TZA UGA UKR UMI
2 3 2 59 9 3 39 248 51 5 2 68 1
URY USA UZB VEN VGB VNM ZAF ZMB ZWE
32 2097 4 26 1 8 80 2 4
The table above shows 488 bookings whose country is recorded as the string "NULL". These rows carry no usable country information, so we remove them before further analysis.
A few variables can be mutated; below is one such example.
# Combine the arrival day, month and year into a single date column, and add
# adults, children and babies to get the total number of guests per booking.
hotels_booking_mutate <- hotels_booking %>%
  mutate(arrival_date = str_c(arrival_date_day_of_month,
                              arrival_date_month,
                              arrival_date_year, sep = "/"),
         arrival_date = dmy(arrival_date),
         total_guests = adults + children + babies) %>%
  select(-c(arrival_date_day_of_month, arrival_date_month, arrival_date_year))
hotels_booking_mutate
# A tibble: 118,902 × 31
hotel is_ca…¹ lead_…² arriv…³ stays…⁴ stays…⁵ adults child…⁶ babies meal
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Resort H… 0 342 27 0 0 2 0 0 BB
2 Resort H… 0 737 27 0 0 2 0 0 BB
3 Resort H… 0 7 27 0 1 1 0 0 BB
4 Resort H… 0 13 27 0 1 1 0 0 BB
5 Resort H… 0 14 27 0 2 2 0 0 BB
6 Resort H… 0 14 27 0 2 2 0 0 BB
7 Resort H… 0 0 27 0 2 2 0 0 BB
8 Resort H… 0 9 27 0 2 2 0 0 FB
9 Resort H… 1 85 27 0 3 2 0 0 BB
10 Resort H… 1 75 27 0 3 2 0 0 HB
# … with 118,892 more rows, 21 more variables: country <chr>,
# market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
# total_of_special_requests <dbl>, reservation_status <chr>, …
After the mutation, let's find the range of arrival dates across the bookings.
Min. 1st Qu. Median Mean 3rd Qu. Max.
"2015-07-01" "2016-03-14" "2016-09-07" "2016-08-29" "2017-03-19" "2017-08-31"
The arrival dates run from 1 July 2015 to 31 August 2017.
---
title: "Challenge 4"
author: "Sai Pranav Kurly"
description: "More data wrangling: pivoting"
date: "04/22/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_4
- abc_poll
- eggs
- fed_rates
- hotel_bookings
- debt
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(lubridate)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Read in data
- hotel_bookings.csv⭐⭐⭐⭐
```{r}
hotels_booking <- read_csv("_data/hotel_bookings.csv", show_col_types = FALSE)
hotels_booking
```
### Briefly describe the data
This dataset summarizes hotel bookings with arrival dates between July 2015 and August 2017. It comprises 119,390 records and 32 columns and covers two types of hotels: resort hotels and city hotels. Customers book these hotels from all over the world; the `country` column holds close to 180 distinct country codes. On average, approximately 37% of bookings are cancelled and approximately 3% come from repeat guests. Bookings spend about 2.3 days on the waiting list on average and carry about 0.57 special requests each; that figure is the mean of `total_of_special_requests`, not the share of bookings that include a request. Customers can choose from four different meal plans at the hotels.
```{r}
# Per-column summary statistics (means, quartiles, NA counts)
summary(hotels_booking)
```
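The averages quoted in the description are column means, so they can be computed directly. The sketch below uses a made-up four-row tibble (the values are invented; only the column names come from the dataset) to show that the mean of a 0/1 column is a proportion, while the mean of a count column such as `total_of_special_requests` is an average count and generally differs from the share of bookings with at least one request.

```r
library(dplyr)

# Toy stand-in for hotels_booking: invented values, real column names.
toy_bookings <- tibble(
  is_canceled               = c(0, 1, 0, 1),
  is_repeated_guest         = c(0, 0, 0, 1),
  days_in_waiting_list      = c(0, 4, 0, 0),
  total_of_special_requests = c(0, 3, 0, 0)
)

toy_summary <- toy_bookings %>%
  summarise(
    # Mean of a 0/1 indicator = share of rows equal to 1.
    cancel_rate        = mean(is_canceled),
    repeat_rate        = mean(is_repeated_guest),
    # Mean of a count column = average count per booking.
    mean_wait_days     = mean(days_in_waiting_list),
    mean_requests      = mean(total_of_special_requests),
    # Share of bookings with at least one special request.
    share_with_request = mean(total_of_special_requests > 0)
  )
toy_summary
```

In this toy data the average number of requests (0.75) and the share of bookings with a request (0.25) are clearly different quantities, which is worth keeping in mind when reading the `summary()` output for the real data.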
## Tidy Data (as needed)
```{r}
table(hotels_booking$country)
```
The table above shows 488 bookings whose country is recorded as the string "NULL". These rows carry no usable country information, so we remove them before further analysis.
```{r}
hotels_booking <- hotels_booking %>%
  filter(country != "NULL")
```
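Dropping the rows is one way to handle the "NULL" placeholder. As a hedged alternative, the placeholder can be recoded to a real missing value with `dplyr::na_if()`, which keeps the bookings available for analyses that do not need the country. The sketch below uses an invented five-row tibble; `toy_countries`, `dropped` and `recoded` are illustrative names, not objects from the analysis.

```r
library(dplyr)

# Toy data: two rows carry the literal string "NULL" as their country.
toy_countries <- tibble(
  country = c("PRT", "NULL", "GBR", "NULL", "PRT"),
  adults  = c(2, 1, 2, 2, 1)
)

# Count the placeholder rows before deciding what to do with them.
n_placeholder <- sum(toy_countries$country == "NULL")

# Option 1: drop the rows, as in the chunk above.
dropped <- toy_countries %>%
  filter(country != "NULL")

# Option 2: keep the rows but mark the country as missing.
recoded <- toy_countries %>%
  mutate(country = na_if(country, "NULL"))
```

Which option fits depends on the question: dropping is simplest when every later step groups by country, while recoding avoids discarding bookings whose other fields are complete.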
## Identify variables that need to be mutated
A few variables can be mutated; below is one such example.
```{r}
# Combine the arrival day, month and year into a single date column, and add
# adults, children and babies to get the total number of guests per booking.
hotels_booking_mutate <- hotels_booking %>%
  mutate(arrival_date = str_c(arrival_date_day_of_month,
                              arrival_date_month,
                              arrival_date_year, sep = "/"),
         arrival_date = dmy(arrival_date),
         total_guests = adults + children + babies) %>%
  select(-c(arrival_date_day_of_month, arrival_date_month, arrival_date_year))
hotels_booking_mutate
```
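To make the two mutations easier to check, here is a minimal sketch on an invented three-row tibble. It builds the date the same way as above and, as an assumed alternative, with `lubridate::make_date()` and base R's `month.name`, which avoids parsing month names from a string. It also shows that a plain sum is `NA` whenever one component is `NA`; treating a missing count as zero with `coalesce()` is one possible choice, not something the chunk above does.

```r
library(dplyr)
library(stringr)
library(lubridate)

# Toy data: invented values, column names as in the dataset.
toy_arrivals <- tibble(
  arrival_date_day_of_month = c(1, 15, 31),
  arrival_date_month        = c("July", "February", "August"),
  arrival_date_year         = c(2015, 2016, 2017),
  adults   = c(2, 1, 2),
  children = c(0, NA, 1),
  babies   = c(0, 0, 1)
)

toy_mutated <- toy_arrivals %>%
  mutate(
    # Same approach as above: paste the parts, then parse as day/month/year.
    arrival_date = dmy(str_c(arrival_date_day_of_month,
                             arrival_date_month,
                             arrival_date_year, sep = "/")),
    # Alternative: look up the month number and build the date from numbers.
    arrival_date_alt = make_date(arrival_date_year,
                                 match(arrival_date_month, month.name),
                                 arrival_date_day_of_month),
    # A plain sum is NA when any component is NA.
    total_guests = adults + children + babies,
    # One option: count a missing value as zero guests.
    total_guests_filled = adults + coalesce(children, 0) + babies
  )
toy_mutated
```

Both date columns agree on this toy data. The `make_date()` variant assumes the month names are the English ones in `month.name`, whereas `dmy()` relies on the session locale to recognise them.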
After the mutation, let's find the range of arrival dates across the bookings.
```{r}
summary(hotels_booking_mutate$arrival_date)
```
The arrival dates run from 1 July 2015 to 31 August 2017.