HW2

hw2

hotel_booking

Author

Priya Marla

Published

January 20, 2023

Code

library(dbplyr)
library(tidyverse)
library(here)
library(lubridate)
library(summarytools)

knitr::opts_chunk$set(echo = TRUE)

Instructions

For this homework, your goal is to read in a more complicated dataset. Please use the category tag “hw2” as well as a tag for the dataset you choose to use. 1. Read in a dataset from the _data folder in the course blog repository, or choose your own data. If you decide to use one of the datasets we have provided, please use a challenging dataset - check with us if you are not sure. 2. Clean the data as needed using dplyr and related tidyverse packages. 3. Provide a narrative about the data set (look it up if you aren’t sure what you have got) and the variables in your dataset, including what type of data each variable is. The goal of this step is to communicate in a visually appealing way to non-experts - not to replicate r-code. 4. Identify potential research questions that your dataset can help answer.

Read in data

Code

dataset <- here("posts","_data","hotel_bookings.csv") %>%
  read_csv()

Rows: 119390 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
dbl  (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
date  (1): reservation_status_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Code

dataset

# A tibble: 119,390 × 32
   hotel  is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
   <chr>    <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 Resor…       0     342    2015 July         27       1       0       0      2
 2 Resor…       0     737    2015 July         27       1       0       0      2
 3 Resor…       0       7    2015 July         27       1       0       1      1
 4 Resor…       0      13    2015 July         27       1       0       1      1
 5 Resor…       0      14    2015 July         27       1       0       2      2
 6 Resor…       0      14    2015 July         27       1       0       2      2
 7 Resor…       0       0    2015 July         27       1       0       2      2
 8 Resor…       0       9    2015 July         27       1       0       2      2
 9 Resor…       1      85    2015 July         27       1       0       3      2
10 Resor…       1      75    2015 July         27       1       0       3      2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …

Briefly describe the data

This dataset describes the information of the reservations made in Resort Hotel and City Hotel. There are 119390 rows and 32 columns. Each columns represents various data such as for what dates and what hotel the booking was made, whether reservation is called or not and if the payment is made or not etc.

Tidy Data (as needed)

The data needs to be cleaned. The date is spread across in multiple columns. If is there in a single column it;’ll will be easy to calculate stats. Can also calculate the date with lead time and date of arrival columns.

The country column is placed in the 14th place, it should be moved up to 2nd place i.e after the hotel column. Created a new column arrival_date by getting the date from “arrival_date_day_of_month”, “arrival_date_month”,“arrival_date_year” columns. Removing these 3 columns and moving the arrival date column to column after “lead_time”. Creating the new column booking_date to know the date of booking with the information from columns lead_time and arrival_date. Created a new column “child” to get the total of children plus babies to replace the columns children and babies. After tidying up the data, total columns remaining are 28

Code

dplyr::select(dataset, 3:7 )

# A tibble: 119,390 × 5
   lead_time arrival_date_year arrival_date_month arrival_date_week_nu…¹ arriv…²
       <dbl>             <dbl> <chr>                               <dbl>   <dbl>
 1       342              2015 July                                   27       1
 2       737              2015 July                                   27       1
 3         7              2015 July                                   27       1
 4        13              2015 July                                   27       1
 5        14              2015 July                                   27       1
 6        14              2015 July                                   27       1
 7         0              2015 July                                   27       1
 8         9              2015 July                                   27       1
 9        85              2015 July                                   27       1
10        75              2015 July                                   27       1
# … with 119,380 more rows, and abbreviated variable names
#   ¹arrival_date_week_number, ²arrival_date_day_of_month

Code

#tidying the dataset
tidy_data <- dataset %>%
  relocate("country",.after = "hotel") %>% #relocating the country column
  mutate(arrival_date = (str_c(arrival_date_day_of_month,arrival_date_month,arrival_date_year, sep = "/")), arrival_date = dmy(arrival_date), .after = lead_time) %>% #variable for arrival date
  mutate(booking_date = arrival_date-days(lead_time), .after = lead_time) %>% #variable to know the date of booking
  mutate(childs = children + babies, .after = adults) %>%
  dplyr::select(-c("lead_time","children","babies"))
tidy_data <- tidy_data[,-6:-9] #removed columns with arrival date information
tidy_data

# A tibble: 119,390 × 28
   hotel     country is_ca…¹ booking_…² arrival_…³ stays…⁴ stays…⁵ adults childs
   <chr>     <chr>     <dbl> <date>     <date>       <dbl>   <dbl>  <dbl>  <dbl>
 1 Resort H… PRT           0 2014-07-24 2015-07-01       0       0      2      0
 2 Resort H… PRT           0 2013-06-24 2015-07-01       0       0      2      0
 3 Resort H… GBR           0 2015-06-24 2015-07-01       0       1      1      0
 4 Resort H… GBR           0 2015-06-18 2015-07-01       0       1      1      0
 5 Resort H… GBR           0 2015-06-17 2015-07-01       0       2      2      0
 6 Resort H… GBR           0 2015-06-17 2015-07-01       0       2      2      0
 7 Resort H… PRT           0 2015-07-01 2015-07-01       0       2      2      0
 8 Resort H… PRT           0 2015-06-22 2015-07-01       0       2      2      0
 9 Resort H… PRT           1 2015-04-07 2015-07-01       0       3      2      0
10 Resort H… PRT           1 2015-04-17 2015-07-01       0       3      2      0
# … with 119,380 more rows, 19 more variables: meal <chr>,
#   market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
#   total_of_special_requests <dbl>, reservation_status <chr>, …

Narrative of Dataset

From the summarytools output dataframe, we can see the stats of various values of each column. For example here we can see that there are 2 types of hotels(data type - character) and 66.4% people have their bookings in City hotel and rest in Resort hotel. Is_canceled column describes whether the booking has been canceled (numeric value 1) or not canceled (numeric value 0). 63% of the customers don’t cancel their reservations whereas 37% customers cancel their reservations.

Code

print(summarytools::dfSummary(tidy_data,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

tidy_data

Dimensions: 119390 x 28
Duplicates: 31995

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

hotel [character]

1. City Hotel

2. Resort Hotel

79330	(	66.4%	)
40060	(	33.6%	)

0 (0.0%)

country [character]

1. PRT

2. GBR

3. FRA

4. ESP

5. DEU

6. ITA

7. IRL

8. BEL

9. BRA

10. NLD

[ 168 others ]

48590	(	40.7%	)
12129	(	10.2%	)
10415	(	8.7%	)
8568	(	7.2%	)
7287	(	6.1%	)
3766	(	3.2%	)
3375	(	2.8%	)
2342	(	2.0%	)
2224	(	1.9%	)
2104	(	1.8%	)
18590	(	15.6%	)

0 (0.0%)

is_canceled [numeric]

Min : 0

Mean : 0.4

Max : 1

0	:	75166	(	63.0%	)
1	:	44224	(	37.0%	)

0 (0.0%)

booking_date [Date]

min : 2013-06-24

med : 2016-05-04

max : 2017-08-31

range : 4y 2m 7d

984 distinct values

0 (0.0%)

arrival_date [Date]

min : 2015-07-01

med : 2016-09-06

max : 2017-08-31

range : 2y 1m 30d

793 distinct values

0 (0.0%)

stays_in_weekend_nights [numeric]

Mean (sd) : 0.9 (1)

min ≤ med ≤ max:

0 ≤ 1 ≤ 19

IQR (CV) : 2 (1.1)

17 distinct values

0 (0.0%)

stays_in_week_nights [numeric]

Mean (sd) : 2.5 (1.9)

min ≤ med ≤ max:

0 ≤ 2 ≤ 50

IQR (CV) : 2 (0.8)

35 distinct values

0 (0.0%)

adults [numeric]

Mean (sd) : 1.9 (0.6)

min ≤ med ≤ max:

0 ≤ 2 ≤ 55

IQR (CV) : 0 (0.3)

14 distinct values

0 (0.0%)

childs [numeric]

Mean (sd) : 0.1 (0.4)

min ≤ med ≤ max:

0 ≤ 0 ≤ 10

IQR (CV) : 0 (3.7)

0	:	110054	(	92.2%	)
1	:	5446	(	4.6%	)
2	:	3772	(	3.2%	)
3	:	111	(	0.1%	)
9	:	1	(	0.0%	)
10	:	2	(	0.0%	)

4 (0.0%)

meal [character]

1. BB

2. FB

3. HB

4. SC

5. Undefined

92310	(	77.3%	)
798	(	0.7%	)
14463	(	12.1%	)
10650	(	8.9%	)
1169	(	1.0%	)

0 (0.0%)

market_segment [character]

1. Aviation

2. Complementary

3. Corporate

4. Direct

5. Groups

6. Offline TA/TO

7. Online TA

8. Undefined

237	(	0.2%	)
743	(	0.6%	)
5295	(	4.4%	)
12606	(	10.6%	)
19811	(	16.6%	)
24219	(	20.3%	)
56477	(	47.3%	)
2	(	0.0%	)

0 (0.0%)

distribution_channel [character]

1. Corporate

2. Direct

3. GDS

4. TA/TO

5. Undefined

6677	(	5.6%	)
14645	(	12.3%	)
193	(	0.2%	)
97870	(	82.0%	)
5	(	0.0%	)

0 (0.0%)

is_repeated_guest [numeric]

Min : 0

Mean : 0

Max : 1

0	:	115580	(	96.8%	)
1	:	3810	(	3.2%	)

0 (0.0%)

previous_cancellations [numeric]

Mean (sd) : 0.1 (0.8)

min ≤ med ≤ max:

0 ≤ 0 ≤ 26

IQR (CV) : 0 (9.7)

15 distinct values

0 (0.0%)

previous_bookings_not_canceled [numeric]

Mean (sd) : 0.1 (1.5)

min ≤ med ≤ max:

0 ≤ 0 ≤ 72

IQR (CV) : 0 (10.9)

73 distinct values

0 (0.0%)

reserved_room_type [character]

1. A

2. B

3. C

4. D

5. E

6. F

7. G

8. H

9. L

10. P

85994	(	72.0%	)
1118	(	0.9%	)
932	(	0.8%	)
19201	(	16.1%	)
6535	(	5.5%	)
2897	(	2.4%	)
2094	(	1.8%	)
601	(	0.5%	)
6	(	0.0%	)
12	(	0.0%	)

0 (0.0%)

assigned_room_type [character]

1. A

2. D

3. E

4. F

5. G

6. C

7. B

8. H

9. I

10. K

[ 2 others ]

74053	(	62.0%	)
25322	(	21.2%	)
7806	(	6.5%	)
3751	(	3.1%	)
2553	(	2.1%	)
2375	(	2.0%	)
2163	(	1.8%	)
712	(	0.6%	)
363	(	0.3%	)
279	(	0.2%	)
13	(	0.0%	)

0 (0.0%)

booking_changes [numeric]

Mean (sd) : 0.2 (0.7)

min ≤ med ≤ max:

0 ≤ 0 ≤ 21

IQR (CV) : 0 (2.9)

21 distinct values

0 (0.0%)

deposit_type [character]

1. No Deposit

2. Non Refund

3. Refundable

104641	(	87.6%	)
14587	(	12.2%	)
162	(	0.1%	)

0 (0.0%)

agent [character]

1. 9
2. NULL
3. 240
4. 1
5. 14
6. 7
7. 6
8. 250
9. 241
10. 28
[ 324 others ]

31961	(	26.8%	)
16340	(	13.7%	)
13922	(	11.7%	)
7191	(	6.0%	)
3640	(	3.0%	)
3539	(	3.0%	)
3290	(	2.8%	)
2870	(	2.4%	)
1721	(	1.4%	)
1666	(	1.4%	)
33250	(	27.8%	)

0 (0.0%)

company [character]

1. NULL
2. 40
3. 223
4. 67
5. 45
6. 153
7. 174
8. 219
9. 281
10. 154
[ 343 others ]

112593	(	94.3%	)
927	(	0.8%	)
784	(	0.7%	)
267	(	0.2%	)
250	(	0.2%	)
215	(	0.2%	)
149	(	0.1%	)
141	(	0.1%	)
138	(	0.1%	)
133	(	0.1%	)
3793	(	3.2%	)

0 (0.0%)

days_in_waiting_list [numeric]

Mean (sd) : 2.3 (17.6)

min ≤ med ≤ max:

0 ≤ 0 ≤ 391

IQR (CV) : 0 (7.6)

128 distinct values

0 (0.0%)

customer_type [character]

1. Contract

2. Group

3. Transient

4. Transient-Party

4076	(	3.4%	)
577	(	0.5%	)
89613	(	75.1%	)
25124	(	21.0%	)

0 (0.0%)

adr [numeric]

Mean (sd) : 101.8 (50.5)

min ≤ med ≤ max:

-6.4 ≤ 94.6 ≤ 5400

IQR (CV) : 56.7 (0.5)

8879 distinct values

0 (0.0%)

required_car_parking_spaces [numeric]

Mean (sd) : 0.1 (0.2)

min ≤ med ≤ max:

0 ≤ 0 ≤ 8

IQR (CV) : 0 (3.9)

0	:	111974	(	93.8%	)
1	:	7383	(	6.2%	)
2	:	28	(	0.0%	)
3	:	3	(	0.0%	)
8	:	2	(	0.0%	)

0 (0.0%)

total_of_special_requests [numeric]

Mean (sd) : 0.6 (0.8)

min ≤ med ≤ max:

0 ≤ 0 ≤ 5

IQR (CV) : 1 (1.4)

0	:	70318	(	58.9%	)
1	:	33226	(	27.8%	)
2	:	12969	(	10.9%	)
3	:	2497	(	2.1%	)
4	:	340	(	0.3%	)
5	:	40	(	0.0%	)

0 (0.0%)

reservation_status [character]

1. Canceled

2. Check-Out

3. No-Show

43017	(	36.0%	)
75166	(	63.0%	)
1207	(	1.0%	)

0 (0.0%)

reservation_status_date [Date]

min : 2014-10-17

med : 2016-08-07

max : 2017-09-14

range : 2y 10m 28d

926 distinct values

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.0)
2023-01-23

Potential Research Question

From this dataset following research questions can be answered

In what location do the customers prefer to stay? It’s clear from the statistics that 40.7% of the people are staying in the location PRT. If someone is planning on building or investing in a hotel, then these statistics can give a clear idea.
What percent of people stay with children and babies? With this question, hotels can make sure sufficient resources are available for children and babies
What kind of meals do customers prefer? This question can answer things like what kind of meals do prefer when living in a hotel and also help hotels to make food arrangements.
Customer_type answers what type of customers are staying in hotels.

:::