hw3
hotel_booking
Author

Priya Marla

Published

January 22, 2023

Code
library(epiDisplay)
library(dbplyr)
library(tidyverse)
library(here)
library(lubridate)
library(summarytools)
library(ggplot2)

knitr::opts_chunk$set(echo = TRUE)

Instructions

The emphasis in this homework is on exploratory data analysis using both graphics and statistics. You should build on your prior homework - incorporating any feedback and adjusting the code and text as needed. These homeworks are intended to be cumulative. Therefore, while it is fine to switch datasets, you will need to include all of the information from HW2 for your new (or old) dataset in this hw submission as well.

Include descriptive statistics (e.g, mean, median, and standard deviation for numerical variables, and frequencies and/or mode for categorical variables Include relevant visualizations using ggplot2 to complement these descriptive statistics. Be sure to use faceting, coloring, and titles as needed. Each visualization should be accompanied by descriptive text that highlights: the variable(s) used what questions might be answered with the visualizations what conclusions you can draw Use group_by() and summarise() to compute descriptive stats and/or visualizations for any relevant groupings. For example, if you were interested in how average income varies by state, you might compute mean income for all states combined, and then compare this to the range and distribution of mean income for each individual state in the US. Identify limitations of your visualization, such as: What questions are left unanswered with your visualizations What about the visualizations may be unclear to a naive viewer How could you improve the visualizations for the final project

Read in data

Code
dataset <- here("posts","_data","hotel_bookings.csv") %>%
  read_csv()
Rows: 119390 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
dbl  (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
date  (1): reservation_status_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
dataset
# A tibble: 119,390 × 32
   hotel  is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
   <chr>    <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 Resor…       0     342    2015 July         27       1       0       0      2
 2 Resor…       0     737    2015 July         27       1       0       0      2
 3 Resor…       0       7    2015 July         27       1       0       1      1
 4 Resor…       0      13    2015 July         27       1       0       1      1
 5 Resor…       0      14    2015 July         27       1       0       2      2
 6 Resor…       0      14    2015 July         27       1       0       2      2
 7 Resor…       0       0    2015 July         27       1       0       2      2
 8 Resor…       0       9    2015 July         27       1       0       2      2
 9 Resor…       1      85    2015 July         27       1       0       3      2
10 Resor…       1      75    2015 July         27       1       0       3      2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …

Briefly describe the data

This dataset describes the information of the reservations made in Resort Hotel and City Hotel. There are 119390 rows and 32 columns. Each columns represents various data such as for what dates and what hotel the booking was made, whether reservation is called or not and if the payment is made or not etc.

Tidy Data (as needed)

The data needs to be cleaned. The date is spread across in multiple columns. If is there in a single column it;’ll will be easy to calculate stats. Can also calculate the date with lead time and date of arrival columns.

The country column is placed in the 14th place, it should be moved up to 2nd place i.e after the hotel column. Created a new column arrival_date by getting the date from “arrival_date_day_of_month”, “arrival_date_month”,“arrival_date_year” columns. Removing these 3 columns and moving the arrival date column to column after “lead_time”. Creating the new column booking_date to know the date of booking with the information from columns lead_time and arrival_date. Created a new column “child” to get the total of children plus babies to replace the columns children and babies. After tidying up the data, total columns remaining are 28

Code
dplyr::select(dataset, 3:7 )
# A tibble: 119,390 × 5
   lead_time arrival_date_year arrival_date_month arrival_date_week_nu…¹ arriv…²
       <dbl>             <dbl> <chr>                               <dbl>   <dbl>
 1       342              2015 July                                   27       1
 2       737              2015 July                                   27       1
 3         7              2015 July                                   27       1
 4        13              2015 July                                   27       1
 5        14              2015 July                                   27       1
 6        14              2015 July                                   27       1
 7         0              2015 July                                   27       1
 8         9              2015 July                                   27       1
 9        85              2015 July                                   27       1
10        75              2015 July                                   27       1
# … with 119,380 more rows, and abbreviated variable names
#   ¹​arrival_date_week_number, ²​arrival_date_day_of_month
Code
#tidying the dataset
tidy_data <- dataset %>%
  relocate("country",.after = "hotel") %>% #relocating the country column
  mutate(arrival_date = (str_c(arrival_date_day_of_month,arrival_date_month,arrival_date_year, sep = "/")), arrival_date = dmy(arrival_date), .after = lead_time) %>% #variable for arrival date
  mutate(booking_date = arrival_date-days(lead_time), .after = lead_time) %>% #variable to know the date of booking
  mutate(childs = children + babies, .after = adults) %>%
  dplyr::select(-c("lead_time","children","babies"))
tidy_data <- tidy_data[,-6:-9] #removed columns with arrival date information
tidy_data
# A tibble: 119,390 × 28
   hotel     country is_ca…¹ booking_…² arrival_…³ stays…⁴ stays…⁵ adults childs
   <chr>     <chr>     <dbl> <date>     <date>       <dbl>   <dbl>  <dbl>  <dbl>
 1 Resort H… PRT           0 2014-07-24 2015-07-01       0       0      2      0
 2 Resort H… PRT           0 2013-06-24 2015-07-01       0       0      2      0
 3 Resort H… GBR           0 2015-06-24 2015-07-01       0       1      1      0
 4 Resort H… GBR           0 2015-06-18 2015-07-01       0       1      1      0
 5 Resort H… GBR           0 2015-06-17 2015-07-01       0       2      2      0
 6 Resort H… GBR           0 2015-06-17 2015-07-01       0       2      2      0
 7 Resort H… PRT           0 2015-07-01 2015-07-01       0       2      2      0
 8 Resort H… PRT           0 2015-06-22 2015-07-01       0       2      2      0
 9 Resort H… PRT           1 2015-04-07 2015-07-01       0       3      2      0
10 Resort H… PRT           1 2015-04-17 2015-07-01       0       3      2      0
# … with 119,380 more rows, 19 more variables: meal <chr>,
#   market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
#   total_of_special_requests <dbl>, reservation_status <chr>, …

Narrative of Dataset

From the summarytools output dataframe, we can see the stats of various values of each column. For example here we can see that there are 2 types of hotels(data type - character) and 66.4% people have their bookings in City hotel and rest in Resort hotel. Is_canceled column describes whether the booking has been canceled (numeric value 1) or not canceled (numeric value 0). 63% of the customers don’t cancel their reservations whereas 37% customers cancel their reservations.

Code
print(summarytools::dfSummary(tidy_data,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

tidy_data

Dimensions: 119390 x 28
Duplicates: 31995
Variable Stats / Values Freqs (% of Valid) Graph Missing
hotel [character]
1. City Hotel
2. Resort Hotel
79330(66.4%)
40060(33.6%)
0 (0.0%)
country [character]
1. PRT
2. GBR
3. FRA
4. ESP
5. DEU
6. ITA
7. IRL
8. BEL
9. BRA
10. NLD
[ 168 others ]
48590(40.7%)
12129(10.2%)
10415(8.7%)
8568(7.2%)
7287(6.1%)
3766(3.2%)
3375(2.8%)
2342(2.0%)
2224(1.9%)
2104(1.8%)
18590(15.6%)
0 (0.0%)
is_canceled [numeric]
Min : 0
Mean : 0.4
Max : 1
0:75166(63.0%)
1:44224(37.0%)
0 (0.0%)
booking_date [Date]
min : 2013-06-24
med : 2016-05-04
max : 2017-08-31
range : 4y 2m 7d
984 distinct values 0 (0.0%)
arrival_date [Date]
min : 2015-07-01
med : 2016-09-06
max : 2017-08-31
range : 2y 1m 30d
793 distinct values 0 (0.0%)
stays_in_weekend_nights [numeric]
Mean (sd) : 0.9 (1)
min ≤ med ≤ max:
0 ≤ 1 ≤ 19
IQR (CV) : 2 (1.1)
17 distinct values 0 (0.0%)
stays_in_week_nights [numeric]
Mean (sd) : 2.5 (1.9)
min ≤ med ≤ max:
0 ≤ 2 ≤ 50
IQR (CV) : 2 (0.8)
35 distinct values 0 (0.0%)
adults [numeric]
Mean (sd) : 1.9 (0.6)
min ≤ med ≤ max:
0 ≤ 2 ≤ 55
IQR (CV) : 0 (0.3)
14 distinct values 0 (0.0%)
childs [numeric]
Mean (sd) : 0.1 (0.4)
min ≤ med ≤ max:
0 ≤ 0 ≤ 10
IQR (CV) : 0 (3.7)
0:110054(92.2%)
1:5446(4.6%)
2:3772(3.2%)
3:111(0.1%)
9:1(0.0%)
10:2(0.0%)
4 (0.0%)
meal [character]
1. BB
2. FB
3. HB
4. SC
5. Undefined
92310(77.3%)
798(0.7%)
14463(12.1%)
10650(8.9%)
1169(1.0%)
0 (0.0%)
market_segment [character]
1. Aviation
2. Complementary
3. Corporate
4. Direct
5. Groups
6. Offline TA/TO
7. Online TA
8. Undefined
237(0.2%)
743(0.6%)
5295(4.4%)
12606(10.6%)
19811(16.6%)
24219(20.3%)
56477(47.3%)
2(0.0%)
0 (0.0%)
distribution_channel [character]
1. Corporate
2. Direct
3. GDS
4. TA/TO
5. Undefined
6677(5.6%)
14645(12.3%)
193(0.2%)
97870(82.0%)
5(0.0%)
0 (0.0%)
is_repeated_guest [numeric]
Min : 0
Mean : 0
Max : 1
0:115580(96.8%)
1:3810(3.2%)
0 (0.0%)
previous_cancellations [numeric]
Mean (sd) : 0.1 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 26
IQR (CV) : 0 (9.7)
15 distinct values 0 (0.0%)
previous_bookings_not_canceled [numeric]
Mean (sd) : 0.1 (1.5)
min ≤ med ≤ max:
0 ≤ 0 ≤ 72
IQR (CV) : 0 (10.9)
73 distinct values 0 (0.0%)
reserved_room_type [character]
1. A
2. B
3. C
4. D
5. E
6. F
7. G
8. H
9. L
10. P
85994(72.0%)
1118(0.9%)
932(0.8%)
19201(16.1%)
6535(5.5%)
2897(2.4%)
2094(1.8%)
601(0.5%)
6(0.0%)
12(0.0%)
0 (0.0%)
assigned_room_type [character]
1. A
2. D
3. E
4. F
5. G
6. C
7. B
8. H
9. I
10. K
[ 2 others ]
74053(62.0%)
25322(21.2%)
7806(6.5%)
3751(3.1%)
2553(2.1%)
2375(2.0%)
2163(1.8%)
712(0.6%)
363(0.3%)
279(0.2%)
13(0.0%)
0 (0.0%)
booking_changes [numeric]
Mean (sd) : 0.2 (0.7)
min ≤ med ≤ max:
0 ≤ 0 ≤ 21
IQR (CV) : 0 (2.9)
21 distinct values 0 (0.0%)
deposit_type [character]
1. No Deposit
2. Non Refund
3. Refundable
104641(87.6%)
14587(12.2%)
162(0.1%)
0 (0.0%)
agent [character]
1. 9
2. NULL
3. 240
4. 1
5. 14
6. 7
7. 6
8. 250
9. 241
10. 28
[ 324 others ]
31961(26.8%)
16340(13.7%)
13922(11.7%)
7191(6.0%)
3640(3.0%)
3539(3.0%)
3290(2.8%)
2870(2.4%)
1721(1.4%)
1666(1.4%)
33250(27.8%)
0 (0.0%)
company [character]
1. NULL
2. 40
3. 223
4. 67
5. 45
6. 153
7. 174
8. 219
9. 281
10. 154
[ 343 others ]
112593(94.3%)
927(0.8%)
784(0.7%)
267(0.2%)
250(0.2%)
215(0.2%)
149(0.1%)
141(0.1%)
138(0.1%)
133(0.1%)
3793(3.2%)
0 (0.0%)
days_in_waiting_list [numeric]
Mean (sd) : 2.3 (17.6)
min ≤ med ≤ max:
0 ≤ 0 ≤ 391
IQR (CV) : 0 (7.6)
128 distinct values 0 (0.0%)
customer_type [character]
1. Contract
2. Group
3. Transient
4. Transient-Party
4076(3.4%)
577(0.5%)
89613(75.1%)
25124(21.0%)
0 (0.0%)
adr [numeric]
Mean (sd) : 101.8 (50.5)
min ≤ med ≤ max:
-6.4 ≤ 94.6 ≤ 5400
IQR (CV) : 56.7 (0.5)
8879 distinct values 0 (0.0%)
required_car_parking_spaces [numeric]
Mean (sd) : 0.1 (0.2)
min ≤ med ≤ max:
0 ≤ 0 ≤ 8
IQR (CV) : 0 (3.9)
0:111974(93.8%)
1:7383(6.2%)
2:28(0.0%)
3:3(0.0%)
8:2(0.0%)
0 (0.0%)
total_of_special_requests [numeric]
Mean (sd) : 0.6 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 5
IQR (CV) : 1 (1.4)
0:70318(58.9%)
1:33226(27.8%)
2:12969(10.9%)
3:2497(2.1%)
4:340(0.3%)
5:40(0.0%)
0 (0.0%)
reservation_status [character]
1. Canceled
2. Check-Out
3. No-Show
43017(36.0%)
75166(63.0%)
1207(1.0%)
0 (0.0%)
reservation_status_date [Date]
min : 2014-10-17
med : 2016-08-07
max : 2017-09-14
range : 2y 10m 28d
926 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.0)
2023-01-23

Potential Research Question

From this dataset following research questions can be answered

  • In what location do the customers prefer to stay? It’s clear from the statistics that 40.7% of the people are staying in the location PRT. If someone is planning on building or investing in a hotel, then these statistics can give a clear idea.

  • What percent of people stay with children and babies? With this question, hotels can make sure sufficient resources are available for children and babies

  • What kind of meals do customers prefer? This question can answer things like what kind of meals do prefer when living in a hotel and also help hotels to make food arrangements.

  • Customer_type answers what type of customers are staying in hotels.

Descriptive Visualizations

Following are the visualizations for the previous Research questions

Code
CountHotels <- tidy_data %>%
 count(hotel) 
CountHotels
# A tibble: 2 × 2
  hotel            n
  <chr>        <int>
1 City Hotel   79330
2 Resort Hotel 40060
Code
ggplot(CountHotels, aes(x = hotel, y = n/sum(n))) + #plot the x-axis and y-axis
  geom_bar(fill = "purple", 
           color="black", stat = "identity") + #to get the bar graph
  geom_text(aes(label = round(n/sum(n),2)), vjust = -0.5)+ # to get the labels for each bar plot
  labs(x = "Hotels", 
       y = "Frequency of Hotels",
       title = "Frequency visualization of Hotel column") #to get the labels of axis and title of the graph

variable used: hotel With this above visualization we can answer the question of which hotel is preferred more. By referring to the above plot, we can say that the City Hotel is preferred by higher percentage of people i.e 66%.

Code
CountLocation <- tidy_data %>%
 count(country) %>%
  arrange(desc(n))
CountLocation
# A tibble: 178 × 2
   country     n
   <chr>   <int>
 1 PRT     48590
 2 GBR     12129
 3 FRA     10415
 4 ESP      8568
 5 DEU      7287
 6 ITA      3766
 7 IRL      3375
 8 BEL      2342
 9 BRA      2224
10 NLD      2104
# … with 168 more rows
Code
CountLocation <- CountLocation %>%
  slice(1:5)


ggplot(CountLocation, aes(x = country, y = n)) + #plot the x-axis and y-axis
  geom_bar(fill = "red", 
           color="black", stat = "identity") + #to get the bar graph
  geom_text(aes(label = n, vjust = -0.5))+ # to get the labels for each bar plot
  labs(x = "Locations", 
       y = "Frequency of Locations",
       title = "Frequency visualization of location column") #to get the labels of axis and title of the graph

Variable used: country This visualization answers the question of at what location are most of the hotels located? There are around 178 countries where city and resort hotel are located, out of those a bar graph has been plotted for the top 5 most visited countries. We can see that PRT country has highest visitors od 48590.

Code
CountChildren <- tidy_data %>%
  count(childs > 0) %>%
  drop_na()
colnames(CountChildren) <- c("childs","n")
CountChildren$childs[CountChildren$childs == TRUE] <- 'childs > 0'
CountChildren$childs[CountChildren$childs == FALSE] <- 'childs = 0'
CountChildren
# A tibble: 2 × 2
  childs          n
  <chr>       <int>
1 childs = 0 110054
2 childs > 0   9332
Code
ggplot(CountChildren, aes(x = childs, y = n/sum(n))) + #plot the x-axis and y-axis
  geom_bar(fill = "blue", 
           color="black", stat = "identity") + #to get the bar graph
  geom_text(aes(label = round(n/sum(n), 2), vjust = -0.5))+ # to get the labels for each bar plot
  labs(x = "Locations", 
       y = "Children count",
       title = "Frequency visualization of location column") #to get the labels of axis and title of the graph

Variables used: childs This visualization answers the question of how many people prefer to come with children and babies? From the above plot we can infer that most of the people visit hotels without children. There are 0.92% of people visiting without children and 0.08% of people visiting with 1 or more children and babies.

Code
CountMeals <- tidy_data %>%
 count(meal) 
CountMeals
# A tibble: 5 × 2
  meal          n
  <chr>     <int>
1 BB        92310
2 FB          798
3 HB        14463
4 SC        10650
5 Undefined  1169
Code
ggplot(CountMeals, aes(x = meal, y = n/sum(n))) + #plot the x-axis and y-axis
  geom_bar(fill = "green", 
           color="black", stat = "identity") + #to get the bar graph
  geom_text(aes(label = round(n/sum(n),2)), vjust = -0.5)+ # to get the labels for each bar plot
  labs(x = "Hotels", 
       y = "Percentage",
       title = "Meal Preference percentage") #to get the labels of axis and title of the graph

Variable used: meal. The above visualization answers the question what meals do people prefer? We can see that 77% of the people prefer “BB” meal. The least prefered meal is “FB”.

Code
#count for customer_type
CountType <- tidy_data %>%
  count(customer_type)
CountType
# A tibble: 4 × 2
  customer_type       n
  <chr>           <int>
1 Contract         4076
2 Group             577
3 Transient       89613
4 Transient-Party 25124
Code
ggplot(CountType, aes(x = customer_type, y = n/sum(n))) + #plot the x-axis and y-axis
  geom_bar(fill = "yellow", 
           color="black", stat = "identity") + #to get the bar graph
  geom_text(aes(label = round(n/sum(n),3)), vjust = -0.5)+ # to get the labels for each bar plot
  labs(x = "Customer Type", 
       y = "Percentage",
       title = "Various types of Customers") #to get the labels of axis and title of the graph

Variable used: customer_type This visualization answers the question what type of customers exist and their corresponding statistics. Most of the customer who visit the hotels are Transient type.

Grouping Visualizations

Following are the visualizations using group_by and summarise functions

Code
tidy_data %>%
  group_by(hotel) %>%
  summarise(sum_cancel = sum(is_canceled)) %>%
  ggplot(aes(x = hotel, y = sum_cancel)) + 
  geom_bar(fill = "black", 
           color="black", stat = "identity") + #to get the bar graph
  geom_text(aes(label = sum_cancel), vjust = -0.5)+ # to get the labels for each bar plot
  labs(x = "Hotels", 
       y = "Number of cancellations",
       title = "Cancellations in various hotels") #to get the labels of axis and title of the graph

From this visualization we can see the statistics of how many cancellations are made in various hotel. Number of cancellations in City hotel are 33102 whereas number of cancellations in Resort hotel are 11122.

Code
tidy_data %>%
  group_by(hotel) %>%
  summarise(parking = sum(required_car_parking_spaces)) %>%
  ggplot(aes(x = hotel, y = parking)) + 
  geom_bar(fill = "brown", 
           color="black", stat = "identity") + #to get the bar graph
  geom_text(aes(label = parking), vjust = -0.5)+ # to get the labels for each bar plot
  labs(x = "Hotels", 
       y = "Number of Parking Spaces",
       title = "Parking spaces required for various hotels") #to get the labels of axis and title of the graph

In this visualization we’ll know how many parking spaces are required for different hotels. From the above plot, city hotel takes up 1933 parking spaces and Resort hotel requires 5531 parking spaces.

Code
tidy_data %>%
  group_by(arrival_date) %>%
  summarise(bookings = sum(previous_bookings_not_canceled )) %>%
  arrange(arrival_date) %>%
  ggplot(aes(x=arrival_date)) + 
  geom_line(aes(y=bookings)) + 
  labs(title="Not canceled bookings over the time", 
       y="Number of bookings")   # title and caption # change to monthly ticks and labels

From this time series plot, we can observe that the in between the dates “2016-01” to “2016-07”, the hotels have seen a spike in the guest arrivals whereas they recorded the least guests between the dates “2015-07” and “2016-01”.

Future Scope

There is more scope for visualizations in this dataset like - using multiple variables to plot a single graph over the period of time. - one of the example is hotel in a particular country having a non canceled booking in a particular year. I’m planning on covering such multi variable plots in my final project.

:::