Code
# Import important libraries needed for functions
library(tidyverse)
library(rmarkdown)
library(summarytools)
library(dplyr)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Kavya Harlalka
September 24, 2022
This dataset represents records of hotel bookings across different countries with details like booking data, booking status, no of adults/children/babies, etc.
Data Frame Summary
data
Dimensions: 119390 x 32
Duplicates: 31994
-----------------------------------------------------------------------------------------------------------------------------------
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
---- -------------------------------- -------------------------- ---------------------- ---------------------- ---------- ---------
1 hotel 1. City Hotel 79330 (66.4%) IIIIIIIIIIIII 119390 0
[character] 2. Resort Hotel 40060 (33.6%) IIIIII (100.0%) (0.0%)
2 is_canceled Min : 0 0 : 75166 (63.0%) IIIIIIIIIIII 119390 0
[integer] Mean : 0.4 1 : 44224 (37.0%) IIIIIII (100.0%) (0.0%)
Max : 1
3 lead_time Mean (sd) : 104 (106.9) 479 distinct values : 119390 0
[integer] min < med < max: : (100.0%) (0.0%)
0 < 69 < 737 :
IQR (CV) : 142 (1) : : .
: : : . .
4 arrival_date_year Mean (sd) : 2016.2 (0.7) 2015 : 21996 (18.4%) III 119390 0
[integer] min < med < max: 2016 : 56707 (47.5%) IIIIIIIII (100.0%) (0.0%)
2015 < 2016 < 2017 2017 : 40687 (34.1%) IIIIII
IQR (CV) : 1 (0)
5 arrival_date_month 1. August 13877 (11.6%) II 119390 0
[character] 2. July 12661 (10.6%) II (100.0%) (0.0%)
3. May 11791 ( 9.9%) I
4. October 11160 ( 9.3%) I
5. April 11089 ( 9.3%) I
6. June 10939 ( 9.2%) I
7. September 10508 ( 8.8%) I
8. March 9794 ( 8.2%) I
9. February 8068 ( 6.8%) I
10. November 6794 ( 5.7%) I
[ 2 others ] 12709 (10.6%) II
6 arrival_date_week_number Mean (sd) : 27.2 (13.6) 53 distinct values . : . . . 119390 0
[integer] min < med < max: . : : : : : : (100.0%) (0.0%)
1 < 28 < 53 . : : : : : : : : :
IQR (CV) : 22 (0.5) : : : : : : : : : :
: : : : : : : : : :
7 arrival_date_day_of_month Mean (sd) : 15.8 (8.8) 31 distinct values : 119390 0
[integer] min < med < max: : : : . : : . : : (100.0%) (0.0%)
1 < 16 < 31 : : : : : : : : : :
IQR (CV) : 15 (0.6) : : : : : : : : : :
: : : : : : : : : :
8 stays_in_weekend_nights Mean (sd) : 0.9 (1) 17 distinct values : 119390 0
[integer] min < med < max: : (100.0%) (0.0%)
0 < 1 < 19 :
IQR (CV) : 2 (1.1) : :
: :
9 stays_in_week_nights Mean (sd) : 2.5 (1.9) 35 distinct values : 119390 0
[integer] min < med < max: : (100.0%) (0.0%)
0 < 2 < 50 :
IQR (CV) : 2 (0.8) :
:
10 adults Mean (sd) : 1.9 (0.6) 14 distinct values : 119390 0
[integer] min < med < max: : (100.0%) (0.0%)
0 < 2 < 55 :
IQR (CV) : 0 (0.3) :
:
11 children Mean (sd) : 0.1 (0.4) 0 : 110796 (92.8%) IIIIIIIIIIIIIIIIII 119386 4
[integer] min < med < max: 1 : 4861 ( 4.1%) (100.0%) (0.0%)
0 < 0 < 10 2 : 3652 ( 3.1%)
IQR (CV) : 0 (3.8) 3 : 76 ( 0.1%)
10 : 1 ( 0.0%)
12 babies Mean (sd) : 0 (0.1) 0 : 118473 (99.2%) IIIIIIIIIIIIIIIIIII 119390 0
[integer] min < med < max: 1 : 900 ( 0.8%) (100.0%) (0.0%)
0 < 0 < 10 2 : 15 ( 0.0%)
IQR (CV) : 0 (12.3) 9 : 1 ( 0.0%)
10 : 1 ( 0.0%)
13 meal 1. BB 92310 (77.3%) IIIIIIIIIIIIIII 119390 0
[character] 2. FB 798 ( 0.7%) (100.0%) (0.0%)
3. HB 14463 (12.1%) II
4. SC 10650 ( 8.9%) I
5. Undefined 1169 ( 1.0%)
14 country 1. PRT 48590 (40.7%) IIIIIIII 119390 0
[character] 2. GBR 12129 (10.2%) II (100.0%) (0.0%)
3. FRA 10415 ( 8.7%) I
4. ESP 8568 ( 7.2%) I
5. DEU 7287 ( 6.1%) I
6. ITA 3766 ( 3.2%)
7. IRL 3375 ( 2.8%)
8. BEL 2342 ( 2.0%)
9. BRA 2224 ( 1.9%)
10. NLD 2104 ( 1.8%)
[ 168 others ] 18590 (15.6%) III
15 market_segment 1. Aviation 237 ( 0.2%) 119390 0
[character] 2. Complementary 743 ( 0.6%) (100.0%) (0.0%)
3. Corporate 5295 ( 4.4%)
4. Direct 12606 (10.6%) II
5. Groups 19811 (16.6%) III
6. Offline TA/TO 24219 (20.3%) IIII
7. Online TA 56477 (47.3%) IIIIIIIII
8. Undefined 2 ( 0.0%)
16 distribution_channel 1. Corporate 6677 ( 5.6%) I 119390 0
[character] 2. Direct 14645 (12.3%) II (100.0%) (0.0%)
3. GDS 193 ( 0.2%)
4. TA/TO 97870 (82.0%) IIIIIIIIIIIIIIII
5. Undefined 5 ( 0.0%)
17 is_repeated_guest Min : 0 0 : 115580 (96.8%) IIIIIIIIIIIIIIIIIII 119390 0
[integer] Mean : 0 1 : 3810 ( 3.2%) (100.0%) (0.0%)
Max : 1
18 previous_cancellations Mean (sd) : 0.1 (0.8) 15 distinct values : 119390 0
[integer] min < med < max: : (100.0%) (0.0%)
0 < 0 < 26 :
IQR (CV) : 0 (9.7) :
:
19 previous_bookings_not_canceled Mean (sd) : 0.1 (1.5) 73 distinct values : 119390 0
[integer] min < med < max: : (100.0%) (0.0%)
0 < 0 < 72 :
IQR (CV) : 0 (10.9) :
:
20 reserved_room_type 1. A 85994 (72.0%) IIIIIIIIIIIIII 119390 0
[character] 2. B 1118 ( 0.9%) (100.0%) (0.0%)
3. C 932 ( 0.8%)
4. D 19201 (16.1%) III
5. E 6535 ( 5.5%) I
6. F 2897 ( 2.4%)
7. G 2094 ( 1.8%)
8. H 601 ( 0.5%)
9. L 6 ( 0.0%)
10. P 12 ( 0.0%)
21 assigned_room_type 1. A 74053 (62.0%) IIIIIIIIIIII 119390 0
[character] 2. D 25322 (21.2%) IIII (100.0%) (0.0%)
3. E 7806 ( 6.5%) I
4. F 3751 ( 3.1%)
5. G 2553 ( 2.1%)
6. C 2375 ( 2.0%)
7. B 2163 ( 1.8%)
8. H 712 ( 0.6%)
9. I 363 ( 0.3%)
10. K 279 ( 0.2%)
[ 2 others ] 13 ( 0.0%)
22 booking_changes Mean (sd) : 0.2 (0.7) 21 distinct values : 119390 0
[integer] min < med < max: : (100.0%) (0.0%)
0 < 0 < 21 :
IQR (CV) : 0 (2.9) :
:
23 deposit_type 1. No Deposit 104641 (87.6%) IIIIIIIIIIIIIIIII 119390 0
[character] 2. Non Refund 14587 (12.2%) II (100.0%) (0.0%)
3. Refundable 162 ( 0.1%)
24 agent 1. 9 31961 (26.8%) IIIII 119390 0
[character] 2. NULL 16340 (13.7%) II (100.0%) (0.0%)
3. 240 13922 (11.7%) II
4. 1 7191 ( 6.0%) I
5. 14 3640 ( 3.0%)
6. 7 3539 ( 3.0%)
7. 6 3290 ( 2.8%)
8. 250 2870 ( 2.4%)
9. 241 1721 ( 1.4%)
10. 28 1666 ( 1.4%)
[ 324 others ] 33250 (27.8%) IIIII
25 company 1. NULL 112593 (94.3%) IIIIIIIIIIIIIIIIII 119390 0
[character] 2. 40 927 ( 0.8%) (100.0%) (0.0%)
3. 223 784 ( 0.7%)
4. 67 267 ( 0.2%)
5. 45 250 ( 0.2%)
6. 153 215 ( 0.2%)
7. 174 149 ( 0.1%)
8. 219 141 ( 0.1%)
9. 281 138 ( 0.1%)
10. 154 133 ( 0.1%)
[ 343 others ] 3793 ( 3.2%)
26 days_in_waiting_list Mean (sd) : 2.3 (17.6) 128 distinct values : 119390 0
[integer] min < med < max: : (100.0%) (0.0%)
0 < 0 < 391 :
IQR (CV) : 0 (7.6) :
:
27 customer_type 1. Contract 4076 ( 3.4%) 119390 0
[character] 2. Group 577 ( 0.5%) (100.0%) (0.0%)
3. Transient 89613 (75.1%) IIIIIIIIIIIIIII
4. Transient-Party 25124 (21.0%) IIII
28 adr Mean (sd) : 101.8 (50.5) 8879 distinct values : 119390 0
[numeric] min < med < max: : (100.0%) (0.0%)
-6.4 < 94.6 < 5400 :
IQR (CV) : 56.7 (0.5) :
:
29 required_car_parking_spaces Mean (sd) : 0.1 (0.2) 0 : 111974 (93.8%) IIIIIIIIIIIIIIIIII 119390 0
[integer] min < med < max: 1 : 7383 ( 6.2%) I (100.0%) (0.0%)
0 < 0 < 8 2 : 28 ( 0.0%)
IQR (CV) : 0 (3.9) 3 : 3 ( 0.0%)
8 : 2 ( 0.0%)
30 total_of_special_requests Mean (sd) : 0.6 (0.8) 0 : 70318 (58.9%) IIIIIIIIIII 119390 0
[integer] min < med < max: 1 : 33226 (27.8%) IIIII (100.0%) (0.0%)
0 < 0 < 5 2 : 12969 (10.9%) II
IQR (CV) : 1 (1.4) 3 : 2497 ( 2.1%)
4 : 340 ( 0.3%)
5 : 40 ( 0.0%)
31 reservation_status 1. Canceled 43017 (36.0%) IIIIIII 119390 0
[character] 2. Check-Out 75166 (63.0%) IIIIIIIIIIII (100.0%) (0.0%)
3. No-Show 1207 ( 1.0%)
32 reservation_status_date 1. 2015-10-21 1461 ( 1.2%) 119390 0
[character] 2. 2015-07-06 805 ( 0.7%) (100.0%) (0.0%)
3. 2016-11-25 790 ( 0.7%)
4. 2015-01-01 763 ( 0.6%)
5. 2016-01-18 625 ( 0.5%)
6. 2015-07-02 469 ( 0.4%)
7. 2016-12-07 450 ( 0.4%)
8. 2015-12-18 423 ( 0.4%)
9. 2016-02-09 412 ( 0.3%)
10. 2016-04-04 382 ( 0.3%)
[ 916 others ] 112810 (94.5%) IIIIIIIIIIIIIIIIII
-----------------------------------------------------------------------------------------------------------------------------------
As can be seen in the above summary, hotel has only two values - Resort and City. The dataset has data from the year 2015 to the year 2017. It can be observed that about 37% of the bookings are canceled from the is_canceled column summary. We can see a large number of columns with categorical values like country, market_segment, distribution_channel, customer_type, etc. and how the percentage of these categorical values are distributed in the data set. This gives us insight on what analysis can be done on the data.
I chose the market_segment column to group the data. The market segment would give us insight into the behavior of each segment. This can be further understood by looking at the tables below :-
# Group dataset by Market Segment
dataGroupedByMS <- data %>%
group_by(market_segment)
# Analyze the grouped segment's stays in weekdays and weekends using mean of the number of nights.
dataGroupedByMS %>%
summarise(
avg_stays_in_week_nights = mean(stays_in_week_nights, nr.rm = TRUE),
avg_stays_in_weekend_nights = mean(stays_in_weekend_nights, nr.rm = TRUE)
)
# A tibble: 8 × 3
market_segment avg_stays_in_week_nights avg_stays_in_weekend_nights
<chr> <dbl> <dbl>
1 Aviation 2.51 1.09
2 Complementary 1.30 0.351
3 Corporate 1.65 0.438
4 Direct 2.35 0.856
5 Groups 2.20 0.789
6 Offline TA/TO 2.85 1.05
7 Online TA 2.58 0.991
8 Undefined 1 0.5
reservation_status
market_segment Canceled Check-Out No-Show
Aviation 0.03 0.15 0.01
Complementary 0.07 0.54 0.01
Corporate 0.77 3.60 0.06
Direct 1.44 8.94 0.18
Groups 10.07 6.46 0.06
Offline TA/TO 6.77 13.32 0.19
Online TA 16.88 29.93 0.50
Undefined 0.00 0.00 0.00
The first table gives insight into which market segment bookings come for weekdays and which ones come during weekends. This could help the hotels decide which events to organize or how to plan activities targeted to a certain segment during those days.
The second table gives percentage of bookings of each type of segment that were cancelled, no-shows or checked-out. This could help the hotels analyze which market segments gets the most cancellations and which ones get the least, and find ways to reduce the number of cancellations by targeted schemes or penalties towards certain segments.
---
title: "Challenge 2"
author: "Kavya Harlalka"
desription: "Data wrangling: using group() and summarise()"
date: "09/24/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- railroads
- faostat
- hotel_bookings
---
```{r Libraries}
#| label: setup
#| warning: false
#| message: false
# Import important libraries needed for functions
library(tidyverse)
library(rmarkdown)
library(summarytools)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Read in the Data
```{r Read In Data}
# Read data from the csv file
data <- read.csv('_data/hotel_bookings.csv')
# Printing the read data as a paged table
paged_table(data)
```
This dataset represents records of hotel bookings across different countries with details like booking data, booking status, no of adults/children/babies, etc.
## Describe the data
```{r Summary of Data}
# Get and print summary of the dataset to help us analyze it.
print(dfSummary(data))
```
As can be seen in the above summary, hotel has only two values - Resort and City. The dataset has data from the year 2015 to the year 2017. It can be observed that about 37% of the bookings are canceled from the is_canceled column summary. We can see a large number of columns with categorical values like country, market_segment, distribution_channel, customer_type, etc. and how the percentage of these categorical values are distributed in the data set. This gives us insight on what analysis can be done on the data.
## Provide Grouped Summary Statistics
I chose the market_segment column to group the data. The market segment would give us insight into the behavior of each segment. This can be further understood by looking at the tables below :-
```{r Analysis of Data}
# Group dataset by Market Segment
dataGroupedByMS <- data %>%
group_by(market_segment)
# Analyze the grouped segment's stays in weekdays and weekends using mean of the number of nights.
dataGroupedByMS %>%
summarise(
avg_stays_in_week_nights = mean(stays_in_week_nights, nr.rm = TRUE),
avg_stays_in_weekend_nights = mean(stays_in_weekend_nights, nr.rm = TRUE)
)
# Analyze the grouped segment's booking status using percentage of distribution for each categorical value in reservation_status.
prop.table(table(select(dataGroupedByMS, reservation_status))) %>%
`*`(100) %>%
round(2)
```
### Explain and Interpret
The first table gives insight into which market segment bookings come for weekdays and which ones come during weekends. This could help the hotels decide which events to organize or how to plan activities targeted to a certain segment during those days.
The second table gives percentage of bookings of each type of segment that were cancelled, no-shows or checked-out. This could help the hotels analyze which market segments gets the most cancellations and which ones get the least, and find ways to reduce the number of cancellations by targeted schemes or penalties towards certain segments.