Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Pavan Datta Abbineni
August 16, 2022
I’ve decided to use the hotel bookings dataset for this challenge.
Let’s get a brief understanding of our dataset.
# A tibble: 6 × 32
hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Resort… 0 342 2015 July 27 1 0 0 2
2 Resort… 0 737 2015 July 27 1 0 0 2
3 Resort… 0 7 2015 July 27 1 0 1 1
4 Resort… 0 13 2015 July 27 1 0 1 1
5 Resort… 0 14 2015 July 27 1 0 2 2
6 Resort… 0 14 2015 July 27 1 0 2 2
# … with 22 more variables: children <dbl>, babies <dbl>, meal <chr>,
# country <chr>, market_segment <chr>, distribution_channel <chr>,
# is_repeated_guest <dbl>, previous_cancellations <dbl>,
# previous_bookings_not_canceled <dbl>, reserved_room_type <chr>,
# assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>,
# agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
# ℹ Use `colnames()` to see all variable names
# A tibble: 6 × 32
hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 City H… 0 21 2017 August 35 30 2 5 2
2 City H… 0 23 2017 August 35 30 2 5 2
3 City H… 0 102 2017 August 35 31 2 5 3
4 City H… 0 34 2017 August 35 31 2 5 2
5 City H… 0 109 2017 August 35 31 2 5 2
6 City H… 0 205 2017 August 35 29 2 7 2
# … with 22 more variables: children <dbl>, babies <dbl>, meal <chr>,
# country <chr>, market_segment <chr>, distribution_channel <chr>,
# is_repeated_guest <dbl>, previous_cancellations <dbl>,
# previous_bookings_not_canceled <dbl>, reserved_room_type <chr>,
# assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>,
# agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
# ℹ Use `colnames()` to see all variable names
The dataset contains a total of 119390 observations with a total of 32 columns, lets see the column names next.
[1] "hotel" "is_canceled"
[3] "lead_time" "arrival_date_year"
[5] "arrival_date_month" "arrival_date_week_number"
[7] "arrival_date_day_of_month" "stays_in_weekend_nights"
[9] "stays_in_week_nights" "adults"
[11] "children" "babies"
[13] "meal" "country"
[15] "market_segment" "distribution_channel"
[17] "is_repeated_guest" "previous_cancellations"
[19] "previous_bookings_not_canceled" "reserved_room_type"
[21] "assigned_room_type" "booking_changes"
[23] "deposit_type" "agent"
[25] "company" "days_in_waiting_list"
[27] "customer_type" "adr"
[29] "required_car_parking_spaces" "total_of_special_requests"
[31] "reservation_status" "reservation_status_date"
To get the dispersion values of the data set let’s call the summary function.
hotel is_canceled lead_time arrival_date_year
Length:119390 Min. :0.0000 Min. : 0 Min. :2015
Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
Mode :character Median :0.0000 Median : 69 Median :2016
Mean :0.3704 Mean :104 Mean :2016
3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
Max. :1.0000 Max. :737 Max. :2017
arrival_date_month arrival_date_week_number arrival_date_day_of_month
Length:119390 Min. : 1.00 Min. : 1.0
Class :character 1st Qu.:16.00 1st Qu.: 8.0
Mode :character Median :28.00 Median :16.0
Mean :27.17 Mean :15.8
3rd Qu.:38.00 3rd Qu.:23.0
Max. :53.00 Max. :31.0
stays_in_weekend_nights stays_in_week_nights adults
Min. : 0.0000 Min. : 0.0 Min. : 0.000
1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
Median : 1.0000 Median : 2.0 Median : 2.000
Mean : 0.9276 Mean : 2.5 Mean : 1.856
3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
Max. :19.0000 Max. :50.0 Max. :55.000
children babies meal country
Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
Median : 0.0000 Median : 0.000000 Mode :character Mode :character
Mean : 0.1039 Mean : 0.007949
3rd Qu.: 0.0000 3rd Qu.: 0.000000
Max. :10.0000 Max. :10.000000
NA's :4
market_segment distribution_channel is_repeated_guest
Length:119390 Length:119390 Min. :0.00000
Class :character Class :character 1st Qu.:0.00000
Mode :character Mode :character Median :0.00000
Mean :0.03191
3rd Qu.:0.00000
Max. :1.00000
previous_cancellations previous_bookings_not_canceled reserved_room_type
Min. : 0.00000 Min. : 0.0000 Length:119390
1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
Median : 0.00000 Median : 0.0000 Mode :character
Mean : 0.08712 Mean : 0.1371
3rd Qu.: 0.00000 3rd Qu.: 0.0000
Max. :26.00000 Max. :72.0000
assigned_room_type booking_changes deposit_type agent
Length:119390 Min. : 0.0000 Length:119390 Length:119390
Class :character 1st Qu.: 0.0000 Class :character Class :character
Mode :character Median : 0.0000 Mode :character Mode :character
Mean : 0.2211
3rd Qu.: 0.0000
Max. :21.0000
company days_in_waiting_list customer_type adr
Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
Mode :character Median : 0.000 Mode :character Median : 94.58
Mean : 2.321 Mean : 101.83
3rd Qu.: 0.000 3rd Qu.: 126.00
Max. :391.000 Max. :5400.00
required_car_parking_spaces total_of_special_requests reservation_status
Min. :0.00000 Min. :0.0000 Length:119390
1st Qu.:0.00000 1st Qu.:0.0000 Class :character
Median :0.00000 Median :0.0000 Mode :character
Mean :0.06252 Mean :0.5714
3rd Qu.:0.00000 3rd Qu.:1.0000
Max. :8.00000 Max. :5.0000
reservation_status_date
Min. :2014-10-17
1st Qu.:2016-02-01
Median :2016-08-07
Mean :2016-07-30
3rd Qu.:2017-02-08
Max. :2017-09-14
To find which hotels are accomodating more guests lets find the mean for both the stays_in_week_nights and stays_in_weekend_nights.
# A tibble: 2 × 2
hotel mean
<chr> <dbl>
1 City Hotel 2.18
2 Resort Hotel 3.13
# A tibble: 2 × 2
hotel mean
<chr> <dbl>
1 City Hotel 0.795
2 Resort Hotel 1.19
We can see that the resort hotel is generally more popular with guests both during the week and on the weekends, in general both hotels are more popular during the week than the weekends.
# A tibble: 2 × 2
hotel totalCancellations
<chr> <dbl>
1 City Hotel 33102
2 Resort Hotel 11122
We can see that the City Hotel had more cancellations when compared to a resort hotel.
Let’s find out which countries are more popular
# A tibble: 178 × 2
# Groups: country [178]
country hotel
<chr> <int>
1 PRT 48590
2 GBR 12129
3 FRA 10415
4 ESP 8568
5 DEU 7287
6 ITA 3766
7 IRL 3375
8 BEL 2342
9 BRA 2224
10 NLD 2104
# … with 168 more rows
# ℹ Use `print(n = ...)` to see more rows
From the above results we can find the countries with highest hotel bookings and the ones with the least hotel bookings.
But some of reservations are cancelled so let’s find the actual number of stays.
# A tibble: 178 × 2
country totalCancellations
<chr> <dbl>
1 PRT 27519
2 GBR 2453
3 ESP 2177
4 FRA 1934
5 ITA 1333
6 DEU 1218
7 IRL 832
8 BRA 830
9 USA 501
10 BEL 474
# … with 168 more rows
# ℹ Use `print(n = ...)` to see more rows
Firstly I wanted to find which of the hotels is more popular ( City Hotel or the Resort Hotel ), from the data obtained it is very clear that the Resort Hotel on an average has more reservations both during the weekends and week-days, it also has less cancellations when compared to City Hotel.
I decided to find which country has a more successfull hotel business, it is clear that PRT has the highest number of hotel bookings.
---
title: "Challenge 2"
author: "Pavan Datta Abbineni "
desription: "Data wrangling: using group() and summarise()"
date: "08/16/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Read in the Data
I've decided to use the hotel bookings dataset for this challenge.
```{r reading-data}
hotelBookingsCompleteData <- read_csv("_data/hotel_bookings.csv")
```
## Describe the data
Let's get a brief understanding of our dataset.
```{r head-visualize}
head(hotelBookingsCompleteData)
```
```{r tail-visualize}
tail(hotelBookingsCompleteData)
```
```{r dimensions}
dim(hotelBookingsCompleteData)
```
The dataset contains a total of 119390 observations with a total of 32 columns, lets see the column names next.
```{r colnames}
colnames(hotelBookingsCompleteData)
```
To get the dispersion values of the data set let's call the summary function.
```{r summary}
summary(hotelBookingsCompleteData)
```
## Provide Grouped Summary Statistics
To find which hotels are accomodating more guests lets find the mean for both the stays_in_week_nights and stays_in_weekend_nights.
```{r mean-of-week}
hotelBookingsCompleteData %>%
group_by(hotel) %>%
summarise(mean = mean(stays_in_week_nights))
```
```{r mean-of-weekend}
hotelBookingsCompleteData %>%
group_by(hotel) %>%
summarise(mean = mean(stays_in_weekend_nights))
```
We can see that the resort hotel is generally more popular with guests both during the week and on the weekends, in general both hotels are more popular during the week than the weekends.
```{r cancelled-reservations}
hotelBookingsCompleteData %>%
group_by(hotel) %>%
summarise(totalCancellations = sum(is_canceled))
```
We can see that the City Hotel had more cancellations when compared to a resort hotel.
Let's find out which countries are more popular
```{r popular-country}
countryTableData<-hotelBookingsCompleteData %>% group_by(country) %>% summarise(hotel = n())
countryTableData %>% group_by(country) %>% arrange(desc(hotel))
```
From the above results we can find the countries with highest hotel bookings and the ones with the least hotel bookings.
But some of reservations are cancelled so let's find the actual number of stays.
```{r cancelled-reservations-country}
hotelBookingsCompleteData %>%
group_by(country) %>%
summarise(totalCancellations = sum(is_canceled))%>%
arrange(desc(totalCancellations))
```
### Explain and Interpret
Firstly I wanted to find which of the hotels is more popular ( City Hotel or the Resort Hotel ), from the data obtained it is very clear that the Resort Hotel on an average has more reservations both during the weekends and week-days, it also has less cancellations when compared to City Hotel.
I decided to find which country has a more successfull hotel business, it is clear that PRT has the highest number of hotel bookings.