challenge_2
railroads
faostat
hotel_bookings
Author

Khadijat Adeleye

Published

March 14, 2023

Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.0     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.1     ✔ tibble    3.1.8
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(readxl)
knitr::opts_chunk$set(echo=TRUE,warning=FALSE,message=FALSE)
Code
bookings<-read_csv("_data/hotel_bookings.csv")
view(bookings)

###Description of Dataset The data set contains 119,390 hotel bookings from two hotels (“City Hotel” and “Resort Hotel”) with an arrival date between July 2015 and August 201. Each observation represents a hotel booking.The dataset contains room type, date of arrival, meals, distribution channel, nationalities , transaction and length of stay of every booking.

Code
print(summarytools::dfSummary(bookings,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',table.classes = 'table-condensed')

Data Frame Summary

bookings

Dimensions: 119390 x 32
Duplicates: 31994
Variable Stats / Values Freqs (% of Valid) Graph Missing
hotel [character]
1. City Hotel
2. Resort Hotel
79330 ( 66.4% )
40060 ( 33.6% )
0 (0.0%)
is_canceled [numeric]
Min : 0
Mean : 0.4
Max : 1
0 : 75166 ( 63.0% )
1 : 44224 ( 37.0% )
0 (0.0%)
lead_time [numeric]
Mean (sd) : 104 (106.9)
min ≤ med ≤ max:
0 ≤ 69 ≤ 737
IQR (CV) : 142 (1)
479 distinct values 0 (0.0%)
arrival_date_year [numeric]
Mean (sd) : 2016.2 (0.7)
min ≤ med ≤ max:
2015 ≤ 2016 ≤ 2017
IQR (CV) : 1 (0)
2015 : 21996 ( 18.4% )
2016 : 56707 ( 47.5% )
2017 : 40687 ( 34.1% )
0 (0.0%)
arrival_date_month [character]
1. August
2. July
3. May
4. October
5. April
6. June
7. September
8. March
9. February
10. November
[ 2 others ]
13877 ( 11.6% )
12661 ( 10.6% )
11791 ( 9.9% )
11160 ( 9.3% )
11089 ( 9.3% )
10939 ( 9.2% )
10508 ( 8.8% )
9794 ( 8.2% )
8068 ( 6.8% )
6794 ( 5.7% )
12709 ( 10.6% )
0 (0.0%)
arrival_date_week_number [numeric]
Mean (sd) : 27.2 (13.6)
min ≤ med ≤ max:
1 ≤ 28 ≤ 53
IQR (CV) : 22 (0.5)
53 distinct values 0 (0.0%)
arrival_date_day_of_month [numeric]
Mean (sd) : 15.8 (8.8)
min ≤ med ≤ max:
1 ≤ 16 ≤ 31
IQR (CV) : 15 (0.6)
31 distinct values 0 (0.0%)
stays_in_weekend_nights [numeric]
Mean (sd) : 0.9 (1)
min ≤ med ≤ max:
0 ≤ 1 ≤ 19
IQR (CV) : 2 (1.1)
17 distinct values 0 (0.0%)
stays_in_week_nights [numeric]
Mean (sd) : 2.5 (1.9)
min ≤ med ≤ max:
0 ≤ 2 ≤ 50
IQR (CV) : 2 (0.8)
35 distinct values 0 (0.0%)
adults [numeric]
Mean (sd) : 1.9 (0.6)
min ≤ med ≤ max:
0 ≤ 2 ≤ 55
IQR (CV) : 0 (0.3)
14 distinct values 0 (0.0%)
children [numeric]
Mean (sd) : 0.1 (0.4)
min ≤ med ≤ max:
0 ≤ 0 ≤ 10
IQR (CV) : 0 (3.8)
0 : 110796 ( 92.8% )
1 : 4861 ( 4.1% )
2 : 3652 ( 3.1% )
3 : 76 ( 0.1% )
10 : 1 ( 0.0% )
4 (0.0%)
babies [numeric]
Mean (sd) : 0 (0.1)
min ≤ med ≤ max:
0 ≤ 0 ≤ 10
IQR (CV) : 0 (12.3)
0 : 118473 ( 99.2% )
1 : 900 ( 0.8% )
2 : 15 ( 0.0% )
9 : 1 ( 0.0% )
10 : 1 ( 0.0% )
0 (0.0%)
meal [character]
1. BB
2. FB
3. HB
4. SC
5. Undefined
92310 ( 77.3% )
798 ( 0.7% )
14463 ( 12.1% )
10650 ( 8.9% )
1169 ( 1.0% )
0 (0.0%)
country [character]
1. PRT
2. GBR
3. FRA
4. ESP
5. DEU
6. ITA
7. IRL
8. BEL
9. BRA
10. NLD
[ 168 others ]
48590 ( 40.7% )
12129 ( 10.2% )
10415 ( 8.7% )
8568 ( 7.2% )
7287 ( 6.1% )
3766 ( 3.2% )
3375 ( 2.8% )
2342 ( 2.0% )
2224 ( 1.9% )
2104 ( 1.8% )
18590 ( 15.6% )
0 (0.0%)
market_segment [character]
1. Aviation
2. Complementary
3. Corporate
4. Direct
5. Groups
6. Offline TA/TO
7. Online TA
8. Undefined
237 ( 0.2% )
743 ( 0.6% )
5295 ( 4.4% )
12606 ( 10.6% )
19811 ( 16.6% )
24219 ( 20.3% )
56477 ( 47.3% )
2 ( 0.0% )
0 (0.0%)
distribution_channel [character]
1. Corporate
2. Direct
3. GDS
4. TA/TO
5. Undefined
6677 ( 5.6% )
14645 ( 12.3% )
193 ( 0.2% )
97870 ( 82.0% )
5 ( 0.0% )
0 (0.0%)
is_repeated_guest [numeric]
Min : 0
Mean : 0
Max : 1
0 : 115580 ( 96.8% )
1 : 3810 ( 3.2% )
0 (0.0%)
previous_cancellations [numeric]
Mean (sd) : 0.1 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 26
IQR (CV) : 0 (9.7)
15 distinct values 0 (0.0%)
previous_bookings_not_canceled [numeric]
Mean (sd) : 0.1 (1.5)
min ≤ med ≤ max:
0 ≤ 0 ≤ 72
IQR (CV) : 0 (10.9)
73 distinct values 0 (0.0%)
reserved_room_type [character]
1. A
2. B
3. C
4. D
5. E
6. F
7. G
8. H
9. L
10. P
85994 ( 72.0% )
1118 ( 0.9% )
932 ( 0.8% )
19201 ( 16.1% )
6535 ( 5.5% )
2897 ( 2.4% )
2094 ( 1.8% )
601 ( 0.5% )
6 ( 0.0% )
12 ( 0.0% )
0 (0.0%)
assigned_room_type [character]
1. A
2. D
3. E
4. F
5. G
6. C
7. B
8. H
9. I
10. K
[ 2 others ]
74053 ( 62.0% )
25322 ( 21.2% )
7806 ( 6.5% )
3751 ( 3.1% )
2553 ( 2.1% )
2375 ( 2.0% )
2163 ( 1.8% )
712 ( 0.6% )
363 ( 0.3% )
279 ( 0.2% )
13 ( 0.0% )
0 (0.0%)
booking_changes [numeric]
Mean (sd) : 0.2 (0.7)
min ≤ med ≤ max:
0 ≤ 0 ≤ 21
IQR (CV) : 0 (2.9)
21 distinct values 0 (0.0%)
deposit_type [character]
1. No Deposit
2. Non Refund
3. Refundable
104641 ( 87.6% )
14587 ( 12.2% )
162 ( 0.1% )
0 (0.0%)
agent [character]
1. 9
2. NULL
3. 240
4. 1
5. 14
6. 7
7. 6
8. 250
9. 241
10. 28
[ 324 others ]
31961 ( 26.8% )
16340 ( 13.7% )
13922 ( 11.7% )
7191 ( 6.0% )
3640 ( 3.0% )
3539 ( 3.0% )
3290 ( 2.8% )
2870 ( 2.4% )
1721 ( 1.4% )
1666 ( 1.4% )
33250 ( 27.8% )
0 (0.0%)
company [character]
1. NULL
2. 40
3. 223
4. 67
5. 45
6. 153
7. 174
8. 219
9. 281
10. 154
[ 343 others ]
112593 ( 94.3% )
927 ( 0.8% )
784 ( 0.7% )
267 ( 0.2% )
250 ( 0.2% )
215 ( 0.2% )
149 ( 0.1% )
141 ( 0.1% )
138 ( 0.1% )
133 ( 0.1% )
3793 ( 3.2% )
0 (0.0%)
days_in_waiting_list [numeric]
Mean (sd) : 2.3 (17.6)
min ≤ med ≤ max:
0 ≤ 0 ≤ 391
IQR (CV) : 0 (7.6)
128 distinct values 0 (0.0%)
customer_type [character]
1. Contract
2. Group
3. Transient
4. Transient-Party
4076 ( 3.4% )
577 ( 0.5% )
89613 ( 75.1% )
25124 ( 21.0% )
0 (0.0%)
adr [numeric]
Mean (sd) : 101.8 (50.5)
min ≤ med ≤ max:
-6.4 ≤ 94.6 ≤ 5400
IQR (CV) : 56.7 (0.5)
8879 distinct values 0 (0.0%)
required_car_parking_spaces [numeric]
Mean (sd) : 0.1 (0.2)
min ≤ med ≤ max:
0 ≤ 0 ≤ 8
IQR (CV) : 0 (3.9)
0 : 111974 ( 93.8% )
1 : 7383 ( 6.2% )
2 : 28 ( 0.0% )
3 : 3 ( 0.0% )
8 : 2 ( 0.0% )
0 (0.0%)
total_of_special_requests [numeric]
Mean (sd) : 0.6 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 5
IQR (CV) : 1 (1.4)
0 : 70318 ( 58.9% )
1 : 33226 ( 27.8% )
2 : 12969 ( 10.9% )
3 : 2497 ( 2.1% )
4 : 340 ( 0.3% )
5 : 40 ( 0.0% )
0 (0.0%)
reservation_status [character]
1. Canceled
2. Check-Out
3. No-Show
43017 ( 36.0% )
75166 ( 63.0% )
1207 ( 1.0% )
0 (0.0%)
reservation_status_date [Date]
min : 2014-10-17
med : 2016-08-07
max : 2017-09-14
range : 2y 10m 28d
926 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-03-15

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

Code
bookings%>% select(arrival_date_year, arrival_date_month)%>% distinct()
# A tibble: 26 × 2
   arrival_date_year arrival_date_month
               <dbl> <chr>             
 1              2015 July              
 2              2015 August            
 3              2015 September         
 4              2015 October           
 5              2015 November          
 6              2015 December          
 7              2016 January           
 8              2016 February          
 9              2016 March             
10              2016 April             
# … with 16 more rows
Code
d<-bookings%>%
  select(arrival_date_year, arrival_date_month)%>%
  n_distinct

bookings%>%
  select(hotel, arrival_date_year, arrival_date_month)%>%
  distinct()%>%
  slice(c(1, d, d+1, d*2))
# A tibble: 4 × 3
  hotel        arrival_date_year arrival_date_month
  <chr>                    <dbl> <chr>             
1 Resort Hotel              2015 July              
2 Resort Hotel              2017 August            
3 City Hotel                2015 July              
4 City Hotel                2017 August            
Code
bookings%>%
  group_by(hotel)%>%
  count(reserved_room_type)
# A tibble: 18 × 3
# Groups:   hotel [2]
   hotel        reserved_room_type     n
   <chr>        <chr>              <int>
 1 City Hotel   A                  62595
 2 City Hotel   B                   1115
 3 City Hotel   C                     14
 4 City Hotel   D                  11768
 5 City Hotel   E                   1553
 6 City Hotel   F                   1791
 7 City Hotel   G                    484
 8 City Hotel   P                     10
 9 Resort Hotel A                  23399
10 Resort Hotel B                      3
11 Resort Hotel C                    918
12 Resort Hotel D                   7433
13 Resort Hotel E                   4982
14 Resort Hotel F                   1106
15 Resort Hotel G                   1610
16 Resort Hotel H                    601
17 Resort Hotel L                      6
18 Resort Hotel P                      2
Code
bookings%>%
  group_by(hotel)%>%
  count(reserved_room_type)%>%
  pivot_wider(names_from= hotel, values_from = n)
# A tibble: 10 × 3
   reserved_room_type `City Hotel` `Resort Hotel`
   <chr>                     <int>          <int>
 1 A                         62595          23399
 2 B                          1115              3
 3 C                            14            918
 4 D                         11768           7433
 5 E                          1553           4982
 6 F                          1791           1106
 7 G                           484           1610
 8 P                            10              2
 9 H                            NA            601
10 L                            NA              6
Code
t1<-bookings%>%
  group_by(hotel, reserved_room_type)%>%
  summarise(price = mean(adr),
            adults = mean(adults),
            children = mean(children+babies, na.rm=TRUE)
            )%>%
  pivot_wider(names_from= hotel, 
              values_from = c(price, adults, children))

knitr::kable(t1,
             digits=1,
             col.names = c("Type", "City", "Resort",
                           "City", "Resort", "City", "Resort"))%>%
  kableExtra::kable_styling(htmltable_class = "lightable-minimal")%>%
  kableExtra::add_header_above(c("Room" = 1, "Price" = 2,
                                 "Adults" = 2, "Children & Babies" = 2))
Room
Price
Adults
Children & Babies
Type City Resort City Resort City Resort
A 96.2 76.2 1.8 1.8 0.0 0.0
B 90.3 104.7 1.6 2.0 0.6 0.0
C 85.5 161.4 1.5 2.0 0.1 1.4
D 131.5 103.6 2.2 2.0 0.0 0.1
E 156.8 114.5 2.1 2.0 0.3 0.0
F 189.3 132.8 2.0 2.0 1.6 0.1
G 201.8 168.2 2.3 2.0 1.1 1.4
P 0.0 0.0 0.0 0.0 0.0 0.0
H NA 188.2 NA 2.7 NA 1.0
L NA 124.7 NA 2.2 NA 0.0

###Explain and Interpret Understanding the type of hotelbooking according to the type of rooms.Also identifying the difference in price of room between resort and City hotel. From the table above,both resort and city hotel has no booking from adult and children for Type P.
In both Resort and city hotel, Room type G has the highest mean. Although Room type A has the least mean price, but it is not accommodated by children.