DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Homework-2

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Read Data
  • Describe Data
  • Some analysis
  • Potential research questions

Homework-2

  • Show All Code
  • Hide All Code

  • View Source
hw2
hotel_bookings
Author

Said Arslan

Published

October 20, 2022

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
Warning: package 'ggplot2' was built under R version 4.2.2
Warning: package 'stringr' was built under R version 4.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(psych)

Attaching package: 'psych'

The following objects are masked from 'package:ggplot2':

    %+%, alpha
Code
library(summarytools)

Attaching package: 'summarytools'

The following object is masked from 'package:tibble':

    view
Code
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read Data

Code
hotel.bookings <- read.csv("_data/hotel_bookings.csv")

Describe Data

Code
dim(hotel.bookings)
[1] 119390     32
Code
colnames(hotel.bookings)
 [1] "hotel"                          "is_canceled"                   
 [3] "lead_time"                      "arrival_date_year"             
 [5] "arrival_date_month"             "arrival_date_week_number"      
 [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
 [9] "stays_in_week_nights"           "adults"                        
[11] "children"                       "babies"                        
[13] "meal"                           "country"                       
[15] "market_segment"                 "distribution_channel"          
[17] "is_repeated_guest"              "previous_cancellations"        
[19] "previous_bookings_not_canceled" "reserved_room_type"            
[21] "assigned_room_type"             "booking_changes"               
[23] "deposit_type"                   "agent"                         
[25] "company"                        "days_in_waiting_list"          
[27] "customer_type"                  "adr"                           
[29] "required_car_parking_spaces"    "total_of_special_requests"     
[31] "reservation_status"             "reservation_status_date"       
Code
colnames(hotel.bookings)[28] <- "average_daily_rate"

head(hotel.bookings)
         hotel is_canceled lead_time arrival_date_year arrival_date_month
1 Resort Hotel           0       342              2015               July
2 Resort Hotel           0       737              2015               July
3 Resort Hotel           0         7              2015               July
4 Resort Hotel           0        13              2015               July
5 Resort Hotel           0        14              2015               July
6 Resort Hotel           0        14              2015               July
  arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
1                       27                         1                       0
2                       27                         1                       0
3                       27                         1                       0
4                       27                         1                       0
5                       27                         1                       0
6                       27                         1                       0
  stays_in_week_nights adults children babies meal country market_segment
1                    0      2        0      0   BB     PRT         Direct
2                    0      2        0      0   BB     PRT         Direct
3                    1      1        0      0   BB     GBR         Direct
4                    1      1        0      0   BB     GBR      Corporate
5                    2      2        0      0   BB     GBR      Online TA
6                    2      2        0      0   BB     GBR      Online TA
  distribution_channel is_repeated_guest previous_cancellations
1               Direct                 0                      0
2               Direct                 0                      0
3               Direct                 0                      0
4            Corporate                 0                      0
5                TA/TO                 0                      0
6                TA/TO                 0                      0
  previous_bookings_not_canceled reserved_room_type assigned_room_type
1                              0                  C                  C
2                              0                  C                  C
3                              0                  A                  C
4                              0                  A                  A
5                              0                  A                  A
6                              0                  A                  A
  booking_changes deposit_type agent company days_in_waiting_list customer_type
1               3   No Deposit  NULL    NULL                    0     Transient
2               4   No Deposit  NULL    NULL                    0     Transient
3               0   No Deposit  NULL    NULL                    0     Transient
4               0   No Deposit   304    NULL                    0     Transient
5               0   No Deposit   240    NULL                    0     Transient
6               0   No Deposit   240    NULL                    0     Transient
  average_daily_rate required_car_parking_spaces total_of_special_requests
1                  0                           0                         0
2                  0                           0                         0
3                 75                           0                         0
4                 75                           0                         0
5                 98                           0                         1
6                 98                           0                         1
  reservation_status reservation_status_date
1          Check-Out              2015-07-01
2          Check-Out              2015-07-01
3          Check-Out              2015-07-02
4          Check-Out              2015-07-02
5          Check-Out              2015-07-03
6          Check-Out              2015-07-03
Code
tail(hotel.bookings)
            hotel is_canceled lead_time arrival_date_year arrival_date_month
119385 City Hotel           0        21              2017             August
119386 City Hotel           0        23              2017             August
119387 City Hotel           0       102              2017             August
119388 City Hotel           0        34              2017             August
119389 City Hotel           0       109              2017             August
119390 City Hotel           0       205              2017             August
       arrival_date_week_number arrival_date_day_of_month
119385                       35                        30
119386                       35                        30
119387                       35                        31
119388                       35                        31
119389                       35                        31
119390                       35                        29
       stays_in_weekend_nights stays_in_week_nights adults children babies meal
119385                       2                    5      2        0      0   BB
119386                       2                    5      2        0      0   BB
119387                       2                    5      3        0      0   BB
119388                       2                    5      2        0      0   BB
119389                       2                    5      2        0      0   BB
119390                       2                    7      2        0      0   HB
       country market_segment distribution_channel is_repeated_guest
119385     BEL  Offline TA/TO                TA/TO                 0
119386     BEL  Offline TA/TO                TA/TO                 0
119387     FRA      Online TA                TA/TO                 0
119388     DEU      Online TA                TA/TO                 0
119389     GBR      Online TA                TA/TO                 0
119390     DEU      Online TA                TA/TO                 0
       previous_cancellations previous_bookings_not_canceled reserved_room_type
119385                      0                              0                  A
119386                      0                              0                  A
119387                      0                              0                  E
119388                      0                              0                  D
119389                      0                              0                  A
119390                      0                              0                  A
       assigned_room_type booking_changes deposit_type agent company
119385                  A               0   No Deposit   394    NULL
119386                  A               0   No Deposit   394    NULL
119387                  E               0   No Deposit     9    NULL
119388                  D               0   No Deposit     9    NULL
119389                  A               0   No Deposit    89    NULL
119390                  A               0   No Deposit     9    NULL
       days_in_waiting_list customer_type average_daily_rate
119385                    0     Transient              96.14
119386                    0     Transient              96.14
119387                    0     Transient             225.43
119388                    0     Transient             157.71
119389                    0     Transient             104.40
119390                    0     Transient             151.20
       required_car_parking_spaces total_of_special_requests reservation_status
119385                           0                         2          Check-Out
119386                           0                         0          Check-Out
119387                           0                         2          Check-Out
119388                           0                         4          Check-Out
119389                           0                         0          Check-Out
119390                           0                         2          Check-Out
       reservation_status_date
119385              2017-09-06
119386              2017-09-06
119387              2017-09-07
119388              2017-09-07
119389              2017-09-07
119390              2017-09-07
Code
str(hotel.bookings)
'data.frame':   119390 obs. of  32 variables:
 $ hotel                         : chr  "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
 $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
 $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
 $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
 $ arrival_date_month            : chr  "July" "July" "July" "July" ...
 $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
 $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
 $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
 $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
 $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ meal                          : chr  "BB" "BB" "BB" "BB" ...
 $ country                       : chr  "PRT" "PRT" "GBR" "GBR" ...
 $ market_segment                : chr  "Direct" "Direct" "Direct" "Corporate" ...
 $ distribution_channel          : chr  "Direct" "Direct" "Direct" "Corporate" ...
 $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
 $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
 $ reserved_room_type            : chr  "C" "C" "A" "A" ...
 $ assigned_room_type            : chr  "C" "C" "C" "A" ...
 $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
 $ deposit_type                  : chr  "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
 $ agent                         : chr  "NULL" "NULL" "NULL" "304" ...
 $ company                       : chr  "NULL" "NULL" "NULL" "NULL" ...
 $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
 $ customer_type                 : chr  "Transient" "Transient" "Transient" "Transient" ...
 $ average_daily_rate            : num  0 0 75 75 98 ...
 $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
 $ reservation_status            : chr  "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
 $ reservation_status_date       : chr  "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
Code
table(sapply(hotel.bookings, function(x) typeof(x)))

character    double   integer 
       14         1        17 
Code
sapply(
  hotel.bookings, function(x) n_distinct(x)
  )
                         hotel                    is_canceled 
                             2                              2 
                     lead_time              arrival_date_year 
                           479                              3 
            arrival_date_month       arrival_date_week_number 
                            12                             53 
     arrival_date_day_of_month        stays_in_weekend_nights 
                            31                             17 
          stays_in_week_nights                         adults 
                            35                             14 
                      children                         babies 
                             6                              5 
                          meal                        country 
                             5                            178 
                market_segment           distribution_channel 
                             8                              5 
             is_repeated_guest         previous_cancellations 
                             2                             15 
previous_bookings_not_canceled             reserved_room_type 
                            73                             10 
            assigned_room_type                booking_changes 
                            12                             21 
                  deposit_type                          agent 
                             3                            334 
                       company           days_in_waiting_list 
                           353                            128 
                 customer_type             average_daily_rate 
                             4                           8879 
   required_car_parking_spaces      total_of_special_requests 
                             5                              6 
            reservation_status        reservation_status_date 
                             3                            926 
Code
unique(hotel.bookings$hotel)
[1] "Resort Hotel" "City Hotel"  
Code
unique(hotel.bookings$arrival_date_year)
[1] 2015 2016 2017
Code
unique(hotel.bookings$reservation_status)
[1] "Check-Out" "Canceled"  "No-Show"  
Code
unique(hotel.bookings$distribution_channel)
[1] "Direct"    "Corporate" "TA/TO"     "Undefined" "GDS"      
Code
unique(hotel.bookings$customer_type)
[1] "Transient"       "Contract"        "Transient-Party" "Group"          

In this dataset there are 32 variables(columns) and 119,390 observations(rows). By looking at the variable names, we can say that the it consists of reservation data of some hotels. 14 of the variables compose of character values, 18 of the variables store numerical values. That being said, some of the character variables and some of the numeric variables are actually indicate dates.

All of the variable names are descriptive enough except adr. “adr” should mean “average daily rate”. Therefore, I changed original column name adr with average_daily_rate.

There are two types of hotels in the dataset: Resort Hotel and City Hotel. Reservations are made in 2015, 2016 and 2017. The data covers reservations from 178 countries. So the data should belong to a big worldwide hotels chain. Both canceled and completed reservations are stored in the dataset as well as no-shows. Thus, each observation includes information about a reservation such as what type of hotel and in which country the reservation is for, number of visitors, dates, daily rates, stay durations and some categorical info about customer and the channel of reservation.

In the dataset, each column repsresent a variable and each row is an observation. Thus, no pivoting is needed.

Code
hotel.bookings <- mutate(
  hotel.bookings, 
  number_of_guests = adults + children + babies,
  total_stay = stays_in_weekend_nights + stays_in_week_nights) # adding two new variables

print(dfSummary(hotel.bookings, 
                varnumbers= FALSE, 
                plain.ascii= FALSE, 
                style= "grid", 
                graph.magnif= 0.80, 
                valid.col= TRUE),
      method= 'render', 
      table.classes= 'table-condensed')

Data Frame Summary

hotel.bookings

Dimensions: 119390 x 34
Duplicates: 31994
Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
hotel [character]
1. City Hotel
2. Resort Hotel
79330(66.4%)
40060(33.6%)
119390 (100.0%) 0 (0.0%)
is_canceled [integer]
Min : 0
Mean : 0.4
Max : 1
0:75166(63.0%)
1:44224(37.0%)
119390 (100.0%) 0 (0.0%)
lead_time [integer]
Mean (sd) : 104 (106.9)
min ≤ med ≤ max:
0 ≤ 69 ≤ 737
IQR (CV) : 142 (1)
479 distinct values 119390 (100.0%) 0 (0.0%)
arrival_date_year [integer]
Mean (sd) : 2016.2 (0.7)
min ≤ med ≤ max:
2015 ≤ 2016 ≤ 2017
IQR (CV) : 1 (0)
2015:21996(18.4%)
2016:56707(47.5%)
2017:40687(34.1%)
119390 (100.0%) 0 (0.0%)
arrival_date_month [character]
1. August
2. July
3. May
4. October
5. April
6. June
7. September
8. March
9. February
10. November
[ 2 others ]
13877(11.6%)
12661(10.6%)
11791(9.9%)
11160(9.3%)
11089(9.3%)
10939(9.2%)
10508(8.8%)
9794(8.2%)
8068(6.8%)
6794(5.7%)
12709(10.6%)
119390 (100.0%) 0 (0.0%)
arrival_date_week_number [integer]
Mean (sd) : 27.2 (13.6)
min ≤ med ≤ max:
1 ≤ 28 ≤ 53
IQR (CV) : 22 (0.5)
53 distinct values 119390 (100.0%) 0 (0.0%)
arrival_date_day_of_month [integer]
Mean (sd) : 15.8 (8.8)
min ≤ med ≤ max:
1 ≤ 16 ≤ 31
IQR (CV) : 15 (0.6)
31 distinct values 119390 (100.0%) 0 (0.0%)
stays_in_weekend_nights [integer]
Mean (sd) : 0.9 (1)
min ≤ med ≤ max:
0 ≤ 1 ≤ 19
IQR (CV) : 2 (1.1)
17 distinct values 119390 (100.0%) 0 (0.0%)
stays_in_week_nights [integer]
Mean (sd) : 2.5 (1.9)
min ≤ med ≤ max:
0 ≤ 2 ≤ 50
IQR (CV) : 2 (0.8)
35 distinct values 119390 (100.0%) 0 (0.0%)
adults [integer]
Mean (sd) : 1.9 (0.6)
min ≤ med ≤ max:
0 ≤ 2 ≤ 55
IQR (CV) : 0 (0.3)
14 distinct values 119390 (100.0%) 0 (0.0%)
children [integer]
Mean (sd) : 0.1 (0.4)
min ≤ med ≤ max:
0 ≤ 0 ≤ 10
IQR (CV) : 0 (3.8)
0:110796(92.8%)
1:4861(4.1%)
2:3652(3.1%)
3:76(0.1%)
10:1(0.0%)
119386 (100.0%) 4 (0.0%)
babies [integer]
Mean (sd) : 0 (0.1)
min ≤ med ≤ max:
0 ≤ 0 ≤ 10
IQR (CV) : 0 (12.3)
0:118473(99.2%)
1:900(0.8%)
2:15(0.0%)
9:1(0.0%)
10:1(0.0%)
119390 (100.0%) 0 (0.0%)
meal [character]
1. BB
2. FB
3. HB
4. SC
5. Undefined
92310(77.3%)
798(0.7%)
14463(12.1%)
10650(8.9%)
1169(1.0%)
119390 (100.0%) 0 (0.0%)
country [character]
1. PRT
2. GBR
3. FRA
4. ESP
5. DEU
6. ITA
7. IRL
8. BEL
9. BRA
10. NLD
[ 168 others ]
48590(40.7%)
12129(10.2%)
10415(8.7%)
8568(7.2%)
7287(6.1%)
3766(3.2%)
3375(2.8%)
2342(2.0%)
2224(1.9%)
2104(1.8%)
18590(15.6%)
119390 (100.0%) 0 (0.0%)
market_segment [character]
1. Aviation
2. Complementary
3. Corporate
4. Direct
5. Groups
6. Offline TA/TO
7. Online TA
8. Undefined
237(0.2%)
743(0.6%)
5295(4.4%)
12606(10.6%)
19811(16.6%)
24219(20.3%)
56477(47.3%)
2(0.0%)
119390 (100.0%) 0 (0.0%)
distribution_channel [character]
1. Corporate
2. Direct
3. GDS
4. TA/TO
5. Undefined
6677(5.6%)
14645(12.3%)
193(0.2%)
97870(82.0%)
5(0.0%)
119390 (100.0%) 0 (0.0%)
is_repeated_guest [integer]
Min : 0
Mean : 0
Max : 1
0:115580(96.8%)
1:3810(3.2%)
119390 (100.0%) 0 (0.0%)
previous_cancellations [integer]
Mean (sd) : 0.1 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 26
IQR (CV) : 0 (9.7)
15 distinct values 119390 (100.0%) 0 (0.0%)
previous_bookings_not_canceled [integer]
Mean (sd) : 0.1 (1.5)
min ≤ med ≤ max:
0 ≤ 0 ≤ 72
IQR (CV) : 0 (10.9)
73 distinct values 119390 (100.0%) 0 (0.0%)
reserved_room_type [character]
1. A
2. B
3. C
4. D
5. E
6. F
7. G
8. H
9. L
10. P
85994(72.0%)
1118(0.9%)
932(0.8%)
19201(16.1%)
6535(5.5%)
2897(2.4%)
2094(1.8%)
601(0.5%)
6(0.0%)
12(0.0%)
119390 (100.0%) 0 (0.0%)
assigned_room_type [character]
1. A
2. D
3. E
4. F
5. G
6. C
7. B
8. H
9. I
10. K
[ 2 others ]
74053(62.0%)
25322(21.2%)
7806(6.5%)
3751(3.1%)
2553(2.1%)
2375(2.0%)
2163(1.8%)
712(0.6%)
363(0.3%)
279(0.2%)
13(0.0%)
119390 (100.0%) 0 (0.0%)
booking_changes [integer]
Mean (sd) : 0.2 (0.7)
min ≤ med ≤ max:
0 ≤ 0 ≤ 21
IQR (CV) : 0 (2.9)
21 distinct values 119390 (100.0%) 0 (0.0%)
deposit_type [character]
1. No Deposit
2. Non Refund
3. Refundable
104641(87.6%)
14587(12.2%)
162(0.1%)
119390 (100.0%) 0 (0.0%)
agent [character]
1. 9
2. NULL
3. 240
4. 1
5. 14
6. 7
7. 6
8. 250
9. 241
10. 28
[ 324 others ]
31961(26.8%)
16340(13.7%)
13922(11.7%)
7191(6.0%)
3640(3.0%)
3539(3.0%)
3290(2.8%)
2870(2.4%)
1721(1.4%)
1666(1.4%)
33250(27.8%)
119390 (100.0%) 0 (0.0%)
company [character]
1. NULL
2. 40
3. 223
4. 67
5. 45
6. 153
7. 174
8. 219
9. 281
10. 154
[ 343 others ]
112593(94.3%)
927(0.8%)
784(0.7%)
267(0.2%)
250(0.2%)
215(0.2%)
149(0.1%)
141(0.1%)
138(0.1%)
133(0.1%)
3793(3.2%)
119390 (100.0%) 0 (0.0%)
days_in_waiting_list [integer]
Mean (sd) : 2.3 (17.6)
min ≤ med ≤ max:
0 ≤ 0 ≤ 391
IQR (CV) : 0 (7.6)
128 distinct values 119390 (100.0%) 0 (0.0%)
customer_type [character]
1. Contract
2. Group
3. Transient
4. Transient-Party
4076(3.4%)
577(0.5%)
89613(75.1%)
25124(21.0%)
119390 (100.0%) 0 (0.0%)
average_daily_rate [numeric]
Mean (sd) : 101.8 (50.5)
min ≤ med ≤ max:
-6.4 ≤ 94.6 ≤ 5400
IQR (CV) : 56.7 (0.5)
8879 distinct values 119390 (100.0%) 0 (0.0%)
required_car_parking_spaces [integer]
Mean (sd) : 0.1 (0.2)
min ≤ med ≤ max:
0 ≤ 0 ≤ 8
IQR (CV) : 0 (3.9)
0:111974(93.8%)
1:7383(6.2%)
2:28(0.0%)
3:3(0.0%)
8:2(0.0%)
119390 (100.0%) 0 (0.0%)
total_of_special_requests [integer]
Mean (sd) : 0.6 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 5
IQR (CV) : 1 (1.4)
0:70318(58.9%)
1:33226(27.8%)
2:12969(10.9%)
3:2497(2.1%)
4:340(0.3%)
5:40(0.0%)
119390 (100.0%) 0 (0.0%)
reservation_status [character]
1. Canceled
2. Check-Out
3. No-Show
43017(36.0%)
75166(63.0%)
1207(1.0%)
119390 (100.0%) 0 (0.0%)
reservation_status_date [character]
1. 2015-10-21
2. 2015-07-06
3. 2016-11-25
4. 2015-01-01
5. 2016-01-18
6. 2015-07-02
7. 2016-12-07
8. 2015-12-18
9. 2016-02-09
10. 2016-04-04
[ 916 others ]
1461(1.2%)
805(0.7%)
790(0.7%)
763(0.6%)
625(0.5%)
469(0.4%)
450(0.4%)
423(0.4%)
412(0.3%)
382(0.3%)
112810(94.5%)
119390 (100.0%) 0 (0.0%)
number_of_guests [integer]
Mean (sd) : 2 (0.7)
min ≤ med ≤ max:
0 ≤ 2 ≤ 55
IQR (CV) : 0 (0.4)
15 distinct values 119386 (100.0%) 4 (0.0%)
total_stay [integer]
Mean (sd) : 3.4 (2.6)
min ≤ med ≤ max:
0 ≤ 3 ≤ 69
IQR (CV) : 2 (0.7)
45 distinct values 119390 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-20

In this table, we can see some descriptive statistics for numeric variables in the data. For example, 37% of the reservations are actually canceled. On average, reservations are made 104 days before date of stay. On average, every reservation is made for 1.97 people. Approximately one out of 10 reservations involves a child or a baby. Average stay duration is 3.43 days. 22% of the booking is changed afterwards. Lastly, average daily rate of hotels is $101.

Code
colSums(is.na(hotel.bookings))
                         hotel                    is_canceled 
                             0                              0 
                     lead_time              arrival_date_year 
                             0                              0 
            arrival_date_month       arrival_date_week_number 
                             0                              0 
     arrival_date_day_of_month        stays_in_weekend_nights 
                             0                              0 
          stays_in_week_nights                         adults 
                             0                              0 
                      children                         babies 
                             4                              0 
                          meal                        country 
                             0                              0 
                market_segment           distribution_channel 
                             0                              0 
             is_repeated_guest         previous_cancellations 
                             0                              0 
previous_bookings_not_canceled             reserved_room_type 
                             0                              0 
            assigned_room_type                booking_changes 
                             0                              0 
                  deposit_type                          agent 
                             0                              0 
                       company           days_in_waiting_list 
                             0                              0 
                 customer_type             average_daily_rate 
                             0                              0 
   required_car_parking_spaces      total_of_special_requests 
                             0                              0 
            reservation_status        reservation_status_date 
                             0                              0 
              number_of_guests                     total_stay 
                             4                              0 

In terms of numerical variables, only 4 value is missing. However, when we examine the summary table above and the dataset itself we can see that there are some “NULL” entries that shows up as string. From summary table, agent and company variables have “NULL” values. As a sanity check, I will check each column separately if they have “NULL” value or not.

Code
nulls <- sapply(hotel.bookings, function(x) table(grepl("NULL", x)))

for (i in 1:32) {
  if (!is.na(nulls[[i]][2])) {
    print(nulls[i])
    }
}
$country

 FALSE   TRUE 
118902    488 

$agent

 FALSE   TRUE 
103050  16340 

$company

 FALSE   TRUE 
  6797 112593 

So, actully 3 variables, country, agent and company have “NULL” values.

Code
round(100*prop.table(table(grepl("NULL", hotel.bookings$country))), 2)

FALSE  TRUE 
99.59  0.41 
Code
round(100*prop.table(table(grepl("NULL", hotel.bookings$agent))), 2)

FALSE  TRUE 
86.31 13.69 
Code
round(100*prop.table(table(grepl("NULL", hotel.bookings$company))), 2)

FALSE  TRUE 
 5.69 94.31 

0.41% of country data, 13.89% of agent data and 94.31% of company data of reservations are missing.

Some analysis

According to the summary table, daily rate of a city hotel may go up to 5400 dollars while it is only 508 dollars for resort hotels. I suspect it is the true case.

Code
hotel.bookings %>% 
  arrange(desc(average_daily_rate)) %>% 
  slice_head(n=10) %>% 
  select(hotel, arrival_date_year, 
         country, 
         agent, 
         number_of_guests, 
         total_stay, 
         reservation_status, 
         average_daily_rate)
          hotel arrival_date_year country agent number_of_guests total_stay
1    City Hotel              2016     PRT    12                2          1
2    City Hotel              2017     ITA   159                1          1
3  Resort Hotel              2015     PRT  NULL                2          1
4    City Hotel              2016     PRT  NULL                4          2
5  Resort Hotel              2017     PRT   314                2         14
6  Resort Hotel              2017     PRT   250                4          6
7  Resort Hotel              2017     PRT   250                4          8
8  Resort Hotel              2017     ESP   240                4          5
9  Resort Hotel              2017     MAR   250                5          8
10 Resort Hotel              2017     PRT   250                3         10
   reservation_status average_daily_rate
1            Canceled            5400.00
2           Check-Out             510.00
3           Check-Out             508.00
4           Check-Out             451.50
5            Canceled             450.00
6            Canceled             437.00
7           Check-Out             426.25
8           Check-Out             402.00
9           Check-Out             397.38
10           Canceled             392.00
Code
hotel.bookings %>% 
  arrange(average_daily_rate) %>% 
  slice_head(n=10) %>% 
  select(hotel, arrival_date_year, 
         country, 
         agent, 
         number_of_guests, 
         total_stay, 
         reservation_status, 
         average_daily_rate)
          hotel arrival_date_year country agent number_of_guests total_stay
1  Resort Hotel              2017     GBR   273                2         10
2  Resort Hotel              2015     PRT  NULL                2          0
3  Resort Hotel              2015     PRT  NULL                2          0
4  Resort Hotel              2015     PRT  NULL                4          1
5  Resort Hotel              2015     PRT   240                2          0
6  Resort Hotel              2015     PRT   250                1          0
7  Resort Hotel              2015     PRT  NULL                2          0
8  Resort Hotel              2015     PRT   240                2          0
9  Resort Hotel              2015     PRT   305                2          2
10 Resort Hotel              2015     PRT   305                1          2
   reservation_status average_daily_rate
1           Check-Out              -6.38
2           Check-Out               0.00
3           Check-Out               0.00
4           Check-Out               0.00
5           Check-Out               0.00
6           Check-Out               0.00
7           Check-Out               0.00
8           Check-Out               0.00
9            Canceled               0.00
10          Check-Out               0.00

So, it looks that the row with 5,400 dollars average daily rate is a wrong entry. Also there is a row with negative average daily rate. I will remove both of them.

Code
hotel.bookings <- hotel.bookings %>% filter(average_daily_rate>=0 & average_daily_rate<=510)
Code
hotel.bookings %>% 
  select(hotel, average_daily_rate) %>% 
  group_by(hotel) %>% 
  summarise_if(is.numeric, list(min = min, max = max, mean = mean, std_dev = sd, median = median), na.rm = TRUE)
# A tibble: 2 × 6
  hotel          min   max  mean std_dev median
  <chr>        <dbl> <dbl> <dbl>   <dbl>  <dbl>
1 City Hotel       0   510 105.     39.3   99.9
2 Resort Hotel     0   508  95.0    61.4   75  
Code
hotel.bookings %>% 
  select(country) %>% 
  group_by(country) %>% 
  count() %>% 
  arrange(desc(n)) %>% 
  head(n=10)
# A tibble: 10 × 2
# Groups:   country [10]
   country     n
   <chr>   <int>
 1 PRT     48589
 2 GBR     12128
 3 FRA     10415
 4 ESP      8568
 5 DEU      7287
 6 ITA      3766
 7 IRL      3375
 8 BEL      2342
 9 BRA      2224
10 NLD      2104
Code
hotel.bookings %>% 
  select(country, is_canceled) %>% 
  group_by(country) %>% 
  summarise_if(is.numeric, sum, na.rm = TRUE) %>% 
  arrange(desc(is_canceled)) %>% 
  head(n=10)
# A tibble: 10 × 2
   country is_canceled
   <chr>         <int>
 1 PRT           27518
 2 GBR            2453
 3 ESP            2177
 4 FRA            1934
 5 ITA            1333
 6 DEU            1218
 7 IRL             832
 8 BRA             830
 9 USA             501
10 BEL             474
Code
hotel.bookings %>% 
  filter(country %in% c("PRT", "GBR", "ESP", "FRA", "ITA")) %>% 
  select(country,is_canceled) %>% 
  group_by(country) %>% 
  summarise_if(is.numeric, mean, na.rm = TRUE) %>% 
  arrange(desc(is_canceled))
# A tibble: 5 × 2
  country is_canceled
  <chr>         <dbl>
1 PRT           0.566
2 ITA           0.354
3 ESP           0.254
4 GBR           0.202
5 FRA           0.186
Code
hotel.bookings %>% 
  filter(reservation_status == "Check-Out") %>% 
  select(country, number_of_guests) %>% 
  group_by(country) %>% 
  summarise_if(is.numeric, sum, na.rm = TRUE) %>% 
  arrange(desc(number_of_guests)) %>% 
  head(n=10)
# A tibble: 10 × 2
   country number_of_guests
   <chr>              <int>
 1 PRT                37670
 2 GBR                19256
 3 FRA                17299
 4 ESP                13213
 5 DEU                11659
 6 IRL                 5102
 7 ITA                 4894
 8 BEL                 3855
 9 NLD                 3374
10 USA                 3212
Code
hotel.bookings %>% 
  filter(reservation_status == "Check-Out") %>% 
  select(country, number_of_guests) %>% 
  group_by(country) %>% 
  summarise_if(is.numeric, sum, na.rm = TRUE) %>% 
  arrange(desc(number_of_guests)) %>% 
  head(n=10)
# A tibble: 10 × 2
   country number_of_guests
   <chr>              <int>
 1 PRT                37670
 2 GBR                19256
 3 FRA                17299
 4 ESP                13213
 5 DEU                11659
 6 IRL                 5102
 7 ITA                 4894
 8 BEL                 3855
 9 NLD                 3374
10 USA                 3212
Code
hotel.bookings %>% 
  select(country, average_daily_rate) %>% 
  group_by(country) %>% 
  summarise_if(is.numeric, mean, na.rm = TRUE) %>% 
  arrange(desc(average_daily_rate)) %>% 
  head(n=10)
# A tibble: 10 × 2
   country average_daily_rate
   <chr>                <dbl>
 1 DJI                   273 
 2 AIA                   265 
 3 AND                   203.
 4 UMI                   200 
 5 LAO                   182.
 6 MYT                   178.
 7 NCL                   176.
 8 GEO                   169.
 9 COM                   165.
10 FRO                   155.
Code
table(hotel.bookings$arrival_date_month)

    April    August  December  February   January      July      June     March 
    11089     13877      6780      8068      5929     12661     10939      9792 
      May  November   October September 
    11791      6794     11160     10508 

Average daily rate for City Hotels is 11 dollars higher than Resort Hotels. On the other hand, variation in price for Resort Hotels is bigger than the price variation of City Hotels.

The most popular 10 countries in terms of total reservations are Portugal, Great Britain, France, Spain, Germany, Italy, Ireland, Belgium, Brasil and Netherlands. However, we can see that 56% reservations made for Portugal hotels are actually canceled. This ratio is 35% for Italy and 25% for Spain. Nevertheless, among all of them, the country that hosts the highest number of guests is Portugal with total of 37,670 guests in 3 years.

Interestingly, in terms of average daily rate, the most expensive hotels are in Djibouti, Anguilla, Andorra, United States Minor Outlying Islands, Laos and so on. It looks that hotels in small countries that host a small number of guests are much more expensive.

August, July and May, respectively, are the months when hotels are the busiest throughout the year.

Lastly, let me check how many rows with zero daily rate.

Code
hotel.bookings %>% 
  filter(average_daily_rate == 0) %>% 
  count()
     n
1 1959

There are 1959 reservations with zero daily rate.

Code
hotel.bookings %>% 
  filter(average_daily_rate == 0) %>% 
  group_by(country) %>% 
  count() %>% 
  arrange(desc(n)) %>% 
  head()
# A tibble: 6 × 2
# Groups:   country [6]
  country     n
  <chr>   <int>
1 PRT      1550
2 ESP        79
3 GBR        73
4 FRA        55
5 DEU        41
6 NULL       21

Wow! Most of the zero values are from Portugal. So, accuracy of data about hotels in Portugal should need further investigation.

Potential research questions

  1. How do hotel room rates change seasonally?
  2. Are the room rates change according to the length of stay or not?
  3. How is the performance of agents in terms of number of reservations and length of stay?
  4. How the preferences of families with children differ from other visitors?
  5. What are the differences between city hotel reservations and resort hotel reservations
  6. Is the cancellation of a reservation predictable? Which variables are more correlated with cancellation?
Source Code
---
title: "Homework-2"
author: "Said Arslan"
date: "10/20/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - hw2
  - hotel_bookings
  
---

```{r setup}

library(tidyverse)
library(psych)
library(summarytools)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

```


## Read Data


```{r read_data}


hotel.bookings <- read.csv("_data/hotel_bookings.csv")

```


## Describe Data


```{r data_description_1}

dim(hotel.bookings)
colnames(hotel.bookings)
colnames(hotel.bookings)[28] <- "average_daily_rate"

head(hotel.bookings)
tail(hotel.bookings)

str(hotel.bookings)
table(sapply(hotel.bookings, function(x) typeof(x)))

sapply(
  hotel.bookings, function(x) n_distinct(x)
  )

unique(hotel.bookings$hotel)
unique(hotel.bookings$arrival_date_year)
unique(hotel.bookings$reservation_status)
unique(hotel.bookings$distribution_channel)
unique(hotel.bookings$customer_type)
       
```


In this dataset there are 32 variables(columns) and 119,390 observations(rows). By looking at the variable names, we can say that the it consists of reservation data of some hotels. 14 of the variables compose of character values, 18 of the variables store numerical values. That being said, some of the character variables and some of the numeric variables are actually indicate dates. 

All of the variable names are descriptive enough except `adr`. "adr" should mean "average daily rate". Therefore, I changed original column name `adr` with `average_daily_rate`.

There are two types of hotels in the dataset: Resort Hotel and City Hotel. Reservations are made in 2015, 2016 and 2017. The data covers reservations from 178 countries. So the data should belong to a big worldwide hotels chain. Both canceled and completed reservations are stored in the dataset as well as no-shows. Thus, each observation includes information about a reservation such as what type of hotel and in which country the reservation is for, number of visitors, dates, daily rates, stay durations and some categorical info about customer and the channel of reservation.

In the dataset, each column repsresent a variable and each row is an observation. Thus, no pivoting is needed.




```{r data_description_2}

hotel.bookings <- mutate(
  hotel.bookings, 
  number_of_guests = adults + children + babies,
  total_stay = stays_in_weekend_nights + stays_in_week_nights) # adding two new variables

print(dfSummary(hotel.bookings, 
                varnumbers= FALSE, 
                plain.ascii= FALSE, 
                style= "grid", 
                graph.magnif= 0.80, 
                valid.col= TRUE),
      method= 'render', 
      table.classes= 'table-condensed')

```


In this table, we can see some descriptive statistics for numeric variables in the data. For example, 37% of the reservations are actually canceled. On average, reservations are made 104 days before date of stay. On average, every reservation is made for 1.97 people. Approximately one out of 10 reservations involves a child or a baby. Average stay duration is 3.43 days. 22% of the booking is changed afterwards. Lastly, average daily rate of hotels is $101.



```{r missing values}

colSums(is.na(hotel.bookings))

```


In terms of numerical variables, only 4 value is missing. However, when we examine the summary table above and the dataset itself we can see that there are some "NULL" entries that shows up as string. From summary table, `agent` and `company` variables have "NULL" values. As a sanity check, I will check each column separately if they have "NULL" value or not.



```{r}
nulls <- sapply(hotel.bookings, function(x) table(grepl("NULL", x)))

for (i in 1:32) {
  if (!is.na(nulls[[i]][2])) {
    print(nulls[i])
    }
}
  
```

So, actully 3 variables, `country`, `agent` and `company` have "NULL" values. 

```{r}
round(100*prop.table(table(grepl("NULL", hotel.bookings$country))), 2)
round(100*prop.table(table(grepl("NULL", hotel.bookings$agent))), 2)
round(100*prop.table(table(grepl("NULL", hotel.bookings$company))), 2)



```


0.41% of `country` data, 13.89% of `agent` data and 94.31% of `company` data of reservations are missing.



## Some analysis

According to the summary table, daily rate of a city hotel may go up to 5400 dollars while it is only 508 dollars for resort hotels. I suspect it is the true case.


```{r}
hotel.bookings %>% 
  arrange(desc(average_daily_rate)) %>% 
  slice_head(n=10) %>% 
  select(hotel, arrival_date_year, 
         country, 
         agent, 
         number_of_guests, 
         total_stay, 
         reservation_status, 
         average_daily_rate)


hotel.bookings %>% 
  arrange(average_daily_rate) %>% 
  slice_head(n=10) %>% 
  select(hotel, arrival_date_year, 
         country, 
         agent, 
         number_of_guests, 
         total_stay, 
         reservation_status, 
         average_daily_rate)

```

So, it looks that the row with 5,400 dollars average daily rate is a wrong entry. Also there is a row with negative average daily rate. I will remove both of them. 


```{r}
hotel.bookings <- hotel.bookings %>% filter(average_daily_rate>=0 & average_daily_rate<=510)

```


```{r}
hotel.bookings %>% 
  select(hotel, average_daily_rate) %>% 
  group_by(hotel) %>% 
  summarise_if(is.numeric, list(min = min, max = max, mean = mean, std_dev = sd, median = median), na.rm = TRUE)


hotel.bookings %>% 
  select(country) %>% 
  group_by(country) %>% 
  count() %>% 
  arrange(desc(n)) %>% 
  head(n=10)


hotel.bookings %>% 
  select(country, is_canceled) %>% 
  group_by(country) %>% 
  summarise_if(is.numeric, sum, na.rm = TRUE) %>% 
  arrange(desc(is_canceled)) %>% 
  head(n=10)

hotel.bookings %>% 
  filter(country %in% c("PRT", "GBR", "ESP", "FRA", "ITA")) %>% 
  select(country,is_canceled) %>% 
  group_by(country) %>% 
  summarise_if(is.numeric, mean, na.rm = TRUE) %>% 
  arrange(desc(is_canceled))

hotel.bookings %>% 
  filter(reservation_status == "Check-Out") %>% 
  select(country, number_of_guests) %>% 
  group_by(country) %>% 
  summarise_if(is.numeric, sum, na.rm = TRUE) %>% 
  arrange(desc(number_of_guests)) %>% 
  head(n=10)


hotel.bookings %>% 
  filter(reservation_status == "Check-Out") %>% 
  select(country, number_of_guests) %>% 
  group_by(country) %>% 
  summarise_if(is.numeric, sum, na.rm = TRUE) %>% 
  arrange(desc(number_of_guests)) %>% 
  head(n=10)


hotel.bookings %>% 
  select(country, average_daily_rate) %>% 
  group_by(country) %>% 
  summarise_if(is.numeric, mean, na.rm = TRUE) %>% 
  arrange(desc(average_daily_rate)) %>% 
  head(n=10)

table(hotel.bookings$arrival_date_month)

```

Average daily rate for City Hotels is 11 dollars higher than Resort Hotels. On the other hand, variation in price for Resort Hotels is bigger than the price variation of City Hotels.

The most popular 10 countries in terms of total reservations are Portugal, Great Britain, France, Spain, Germany, Italy, Ireland, Belgium, Brasil and Netherlands. However, we can see that 56% reservations made for Portugal hotels are actually canceled. This ratio is 35% for Italy and 25% for Spain. 
Nevertheless, among all of them, the country that hosts the highest number of guests is Portugal with total of 37,670 guests in 3 years.

Interestingly, in terms of average daily rate, the most expensive hotels are in Djibouti, Anguilla, Andorra, United States Minor Outlying Islands, Laos and so on. It looks that hotels in small countries that host a small number of guests are much more expensive.

August, July and May, respectively, are the months when hotels are the busiest throughout the year.


Lastly, let me check how many rows with zero daily rate.

```{r}
hotel.bookings %>% 
  filter(average_daily_rate == 0) %>% 
  count()


```

There are 1959 reservations with zero daily rate.


```{r}
hotel.bookings %>% 
  filter(average_daily_rate == 0) %>% 
  group_by(country) %>% 
  count() %>% 
  arrange(desc(n)) %>% 
  head()

```


Wow! Most of the zero values are from Portugal. So, accuracy of data about hotels in Portugal should need further investigation.



## Potential research questions

1. How do hotel room rates change seasonally?
2. Are the room rates change according to the length of stay or not?
3. How is the performance of agents in terms of number of reservations and length of stay?
4. How the preferences of families with children differ from other visitors?
5. What are the differences between city hotel reservations and resort hotel reservations
6. Is the cancellation of a reservation predictable? Which variables are more correlated with cancellation?