DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 2 Instructions

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Read in the Data
  • Describe the data
  • Provide Grouped Summary Statistics
    • Explain and Interpret

Challenge 2 Instructions

  • Show All Code
  • Hide All Code

  • View Source
hotel_bookings
Author

Shoshana Buck

Published

September 21, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in the Data

Code
hotel_bookings <- read_csv("_data/hotel_bookings.csv")
hotel_bookings

The ’Hotel_bookings’ data set is comparing two types of hotels, City Hotel and Resort Hotel. The data has 119,390 observations with 32 variables, identifying important information when booking like arrival, length of stay, location of hotel and assigned room type. Data for ‘Hotel_bookings’ was collected data starting in August of 2015 and ending in April of 2017. The data is originally from Hotel booking demand data set by Nuno Antonio, Ana de Almeida, and Luis Nunes.

Describe the data

In order to get a great breakdown and visualization of the data set I used the R package, summary tools. Summary tools gives the variable, stats within that variable, the frequency, and a graph of each variable. From the summary it shows that the average year to stay in either a resort or city hotel was in 2016. Additionally, the top month to arrive/stay at the hotel was August and the two least popular month to stay in a hotel was December and January.

Code
colnames(hotel_bookings)
 [1] "hotel"                          "is_canceled"                   
 [3] "lead_time"                      "arrival_date_year"             
 [5] "arrival_date_month"             "arrival_date_week_number"      
 [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
 [9] "stays_in_week_nights"           "adults"                        
[11] "children"                       "babies"                        
[13] "meal"                           "country"                       
[15] "market_segment"                 "distribution_channel"          
[17] "is_repeated_guest"              "previous_cancellations"        
[19] "previous_bookings_not_canceled" "reserved_room_type"            
[21] "assigned_room_type"             "booking_changes"               
[23] "deposit_type"                   "agent"                         
[25] "company"                        "days_in_waiting_list"          
[27] "customer_type"                  "adr"                           
[29] "required_car_parking_spaces"    "total_of_special_requests"     
[31] "reservation_status"             "reservation_status_date"       
Code
print(summarytools::dfSummary(hotel_bookings,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

hotel_bookings

Dimensions: 119390 x 32
Duplicates: 31994
Variable Stats / Values Freqs (% of Valid) Graph Missing
hotel [character]
1. City Hotel
2. Resort Hotel
79330(66.4%)
40060(33.6%)
0 (0.0%)
is_canceled [numeric]
Min : 0
Mean : 0.4
Max : 1
0:75166(63.0%)
1:44224(37.0%)
0 (0.0%)
lead_time [numeric]
Mean (sd) : 104 (106.9)
min ≤ med ≤ max:
0 ≤ 69 ≤ 737
IQR (CV) : 142 (1)
479 distinct values 0 (0.0%)
arrival_date_year [numeric]
Mean (sd) : 2016.2 (0.7)
min ≤ med ≤ max:
2015 ≤ 2016 ≤ 2017
IQR (CV) : 1 (0)
2015:21996(18.4%)
2016:56707(47.5%)
2017:40687(34.1%)
0 (0.0%)
arrival_date_month [character]
1. August
2. July
3. May
4. October
5. April
6. June
7. September
8. March
9. February
10. November
[ 2 others ]
13877(11.6%)
12661(10.6%)
11791(9.9%)
11160(9.3%)
11089(9.3%)
10939(9.2%)
10508(8.8%)
9794(8.2%)
8068(6.8%)
6794(5.7%)
12709(10.6%)
0 (0.0%)
arrival_date_week_number [numeric]
Mean (sd) : 27.2 (13.6)
min ≤ med ≤ max:
1 ≤ 28 ≤ 53
IQR (CV) : 22 (0.5)
53 distinct values 0 (0.0%)
arrival_date_day_of_month [numeric]
Mean (sd) : 15.8 (8.8)
min ≤ med ≤ max:
1 ≤ 16 ≤ 31
IQR (CV) : 15 (0.6)
31 distinct values 0 (0.0%)
stays_in_weekend_nights [numeric]
Mean (sd) : 0.9 (1)
min ≤ med ≤ max:
0 ≤ 1 ≤ 19
IQR (CV) : 2 (1.1)
17 distinct values 0 (0.0%)
stays_in_week_nights [numeric]
Mean (sd) : 2.5 (1.9)
min ≤ med ≤ max:
0 ≤ 2 ≤ 50
IQR (CV) : 2 (0.8)
35 distinct values 0 (0.0%)
adults [numeric]
Mean (sd) : 1.9 (0.6)
min ≤ med ≤ max:
0 ≤ 2 ≤ 55
IQR (CV) : 0 (0.3)
14 distinct values 0 (0.0%)
children [numeric]
Mean (sd) : 0.1 (0.4)
min ≤ med ≤ max:
0 ≤ 0 ≤ 10
IQR (CV) : 0 (3.8)
0:110796(92.8%)
1:4861(4.1%)
2:3652(3.1%)
3:76(0.1%)
10:1(0.0%)
4 (0.0%)
babies [numeric]
Mean (sd) : 0 (0.1)
min ≤ med ≤ max:
0 ≤ 0 ≤ 10
IQR (CV) : 0 (12.3)
0:118473(99.2%)
1:900(0.8%)
2:15(0.0%)
9:1(0.0%)
10:1(0.0%)
0 (0.0%)
meal [character]
1. BB
2. FB
3. HB
4. SC
5. Undefined
92310(77.3%)
798(0.7%)
14463(12.1%)
10650(8.9%)
1169(1.0%)
0 (0.0%)
country [character]
1. PRT
2. GBR
3. FRA
4. ESP
5. DEU
6. ITA
7. IRL
8. BEL
9. BRA
10. NLD
[ 168 others ]
48590(40.7%)
12129(10.2%)
10415(8.7%)
8568(7.2%)
7287(6.1%)
3766(3.2%)
3375(2.8%)
2342(2.0%)
2224(1.9%)
2104(1.8%)
18590(15.6%)
0 (0.0%)
market_segment [character]
1. Aviation
2. Complementary
3. Corporate
4. Direct
5. Groups
6. Offline TA/TO
7. Online TA
8. Undefined
237(0.2%)
743(0.6%)
5295(4.4%)
12606(10.6%)
19811(16.6%)
24219(20.3%)
56477(47.3%)
2(0.0%)
0 (0.0%)
distribution_channel [character]
1. Corporate
2. Direct
3. GDS
4. TA/TO
5. Undefined
6677(5.6%)
14645(12.3%)
193(0.2%)
97870(82.0%)
5(0.0%)
0 (0.0%)
is_repeated_guest [numeric]
Min : 0
Mean : 0
Max : 1
0:115580(96.8%)
1:3810(3.2%)
0 (0.0%)
previous_cancellations [numeric]
Mean (sd) : 0.1 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 26
IQR (CV) : 0 (9.7)
15 distinct values 0 (0.0%)
previous_bookings_not_canceled [numeric]
Mean (sd) : 0.1 (1.5)
min ≤ med ≤ max:
0 ≤ 0 ≤ 72
IQR (CV) : 0 (10.9)
73 distinct values 0 (0.0%)
reserved_room_type [character]
1. A
2. B
3. C
4. D
5. E
6. F
7. G
8. H
9. L
10. P
85994(72.0%)
1118(0.9%)
932(0.8%)
19201(16.1%)
6535(5.5%)
2897(2.4%)
2094(1.8%)
601(0.5%)
6(0.0%)
12(0.0%)
0 (0.0%)
assigned_room_type [character]
1. A
2. D
3. E
4. F
5. G
6. C
7. B
8. H
9. I
10. K
[ 2 others ]
74053(62.0%)
25322(21.2%)
7806(6.5%)
3751(3.1%)
2553(2.1%)
2375(2.0%)
2163(1.8%)
712(0.6%)
363(0.3%)
279(0.2%)
13(0.0%)
0 (0.0%)
booking_changes [numeric]
Mean (sd) : 0.2 (0.7)
min ≤ med ≤ max:
0 ≤ 0 ≤ 21
IQR (CV) : 0 (2.9)
21 distinct values 0 (0.0%)
deposit_type [character]
1. No Deposit
2. Non Refund
3. Refundable
104641(87.6%)
14587(12.2%)
162(0.1%)
0 (0.0%)
agent [character]
1. 9
2. NULL
3. 240
4. 1
5. 14
6. 7
7. 6
8. 250
9. 241
10. 28
[ 324 others ]
31961(26.8%)
16340(13.7%)
13922(11.7%)
7191(6.0%)
3640(3.0%)
3539(3.0%)
3290(2.8%)
2870(2.4%)
1721(1.4%)
1666(1.4%)
33250(27.8%)
0 (0.0%)
company [character]
1. NULL
2. 40
3. 223
4. 67
5. 45
6. 153
7. 174
8. 219
9. 281
10. 154
[ 343 others ]
112593(94.3%)
927(0.8%)
784(0.7%)
267(0.2%)
250(0.2%)
215(0.2%)
149(0.1%)
141(0.1%)
138(0.1%)
133(0.1%)
3793(3.2%)
0 (0.0%)
days_in_waiting_list [numeric]
Mean (sd) : 2.3 (17.6)
min ≤ med ≤ max:
0 ≤ 0 ≤ 391
IQR (CV) : 0 (7.6)
128 distinct values 0 (0.0%)
customer_type [character]
1. Contract
2. Group
3. Transient
4. Transient-Party
4076(3.4%)
577(0.5%)
89613(75.1%)
25124(21.0%)
0 (0.0%)
adr [numeric]
Mean (sd) : 101.8 (50.5)
min ≤ med ≤ max:
-6.4 ≤ 94.6 ≤ 5400
IQR (CV) : 56.7 (0.5)
8879 distinct values 0 (0.0%)
required_car_parking_spaces [numeric]
Mean (sd) : 0.1 (0.2)
min ≤ med ≤ max:
0 ≤ 0 ≤ 8
IQR (CV) : 0 (3.9)
0:111974(93.8%)
1:7383(6.2%)
2:28(0.0%)
3:3(0.0%)
8:2(0.0%)
0 (0.0%)
total_of_special_requests [numeric]
Mean (sd) : 0.6 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 5
IQR (CV) : 1 (1.4)
0:70318(58.9%)
1:33226(27.8%)
2:12969(10.9%)
3:2497(2.1%)
4:340(0.3%)
5:40(0.0%)
0 (0.0%)
reservation_status [character]
1. Canceled
2. Check-Out
3. No-Show
43017(36.0%)
75166(63.0%)
1207(1.0%)
0 (0.0%)
reservation_status_date [Date]
min : 2014-10-17
med : 2016-08-07
max : 2017-09-14
range : 2y 10m 28d
926 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-21

Provide Grouped Summary Statistics

Code
hotel_bookings %>% 
  summarise(across(c(country,arrival_date_year,adr, stays_in_week_nights, stays_in_weekend_nights), n_distinct))

I used the function summarize() to look the variables across the data set. From the table we can see that there were 178 unique countries that came from the data. Additionally, on average a room was booked for 35 week nights and 17 weekend nights. Showing that people normally stayed in a hotel on week days rather than weekends.

I Know that the 3 for arrival year makes sense because the data was taken from 2015-2017. However, I am a little confused about the numerical value for the average daily rate. I don’t see 8879 reflected in the original data set when I put it in descending order.

Code
hotel_bookings %>% 
  group_by(hotel,country,adr, stays_in_week_nights, stays_in_weekend_nights,arrival_date_year) %>% 
  summarize(total_adults = sum(adults)) %>% 
  arrange(desc(adr))

For the second code chunk I used the function group_by() to focus in on variables that I thought were interesting. I think piped and used the function summarize() to find ‘total_adults.’ I then piped again to arrange my date set in descending order based off of the ’average daily rate.’ I thought it was crazy that the most expensive city hotel was in Portugal in 2016 that had 2 adults staying for one week night at a rate of $5400.

Since the chart is arranged in descending order based off the average daily rate, Portugal has seven out of the ten most expensive hotels with five being a resort style hotel. Other countries within the top ten are Italy, Spain, and Morocco.

Code
bookings<- hotel_bookings %>% 
  group_by(hotel,reserved_room_type) %>% 
  summarize(price = mean(adr),
            adults = mean(adults),
            children = mean(children),
            babies = mean(babies), na.rm = TRUE) %>% 
  arrange(reserved_room_type)
bookings

Explain and Interpret

For the third code chunk I thought it would be interesting to use the group_by() function of hotel and room_type to see if there was a correlation between reserved_room_types and the type of hotel. Though I do not know what the reserved_room_types mean it can be seen that a resort hotel is cheaper to stay in than a city hotel.

Source Code
---
title: "Challenge 2 Instructions"
author: "Shoshana Buck"
desription: "Data wrangling: using group() and summarise()"
date: "09/21/2022"
format:
  html:
    df-print: paged
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - hotel_bookings
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Read in the Data

```{r}
hotel_bookings <- read_csv("_data/hotel_bookings.csv")
hotel_bookings
```

The '***Hotel_bookings'*** data set is comparing two types of hotels, *City Hotel* and *Resort Hotel*. The data has 119,390 observations with 32 variables, identifying important information when booking like arrival, length of stay, location of hotel and assigned room type. Data for ***'Hotel_bookings'*** was collected data starting in August of 2015 and ending in April of 2017. The data is originally from Hotel booking demand data set by Nuno Antonio, Ana de Almeida, and Luis Nunes.

## Describe the data

In order to get a great breakdown and visualization of the data set I used the R package, *summary tools***.** *Summary tools* gives the variable, stats within that variable, the frequency, and a graph of each variable. From the summary it shows that the average year to stay in either a resort or city hotel was in 2016. Additionally, the top month to arrive/stay at the hotel was August and the two least popular month to stay in a hotel was December and January.

```{r}
#| label: summary
colnames(hotel_bookings)

print(summarytools::dfSummary(hotel_bookings,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

```

## Provide Grouped Summary Statistics

```{r}
hotel_bookings %>% 
  summarise(across(c(country,arrival_date_year,adr, stays_in_week_nights, stays_in_weekend_nights), n_distinct))
```

I used the function summarize() to look the variables across the data set. From the table we can see that there were 178 unique countries that came from the data. Additionally, on average a room was booked for 35 week nights and 17 weekend nights. Showing that people normally stayed in a hotel on week days rather than weekends.

I Know that the 3 for arrival year makes sense because the data was taken from 2015-2017. However, I am a little confused about the numerical value for the average daily rate. I don't see 8879 reflected in the original data set when I put it in descending order.

```{r}
hotel_bookings %>% 
  group_by(hotel,country,adr, stays_in_week_nights, stays_in_weekend_nights,arrival_date_year) %>% 
  summarize(total_adults = sum(adults)) %>% 
  arrange(desc(adr))
```

For the second code chunk I used the function **group_by()** to focus in on variables that I thought were interesting. I think piped and used the function **summarize()** to find '*total_adults*.' I then piped again to arrange my date set in descending order based off of the *'average daily rate.*' I thought it was crazy that the most expensive city hotel was in Portugal in 2016 that had 2 adults staying for one week night at a rate of $5400.

Since the chart is arranged in descending order based off the average daily rate, Portugal has seven out of the ten most expensive hotels with five being a resort style hotel. Other countries within the top ten are Italy, Spain, and Morocco.

```{r}
bookings<- hotel_bookings %>% 
  group_by(hotel,reserved_room_type) %>% 
  summarize(price = mean(adr),
            adults = mean(adults),
            children = mean(children),
            babies = mean(babies), na.rm = TRUE) %>% 
  arrange(reserved_room_type)
bookings
```

### Explain and Interpret

For the third code chunk I thought it would be interesting to use the group_by() function of hotel and room_type to see if there was a correlation between reserved_room_types and the type of hotel. Though I do not know what the reserved_room_types mean it can be seen that a resort hotel is cheaper to stay in than a city hotel.