DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 2

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Read in the Data
  • Describe the data
  • Provide Grouped Summary Statistics
    • Explain and Interpret

Challenge 2

  • Show All Code
  • Hide All Code

  • View Source
challenge_2
railroads
faostat
hotel_bookings
Author

Kavya Harlalka

Published

September 24, 2022

Code
# Import important libraries needed for functions
library(tidyverse)
library(rmarkdown)
library(summarytools)
library(dplyr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in the Data

Code
# Read data from the csv file
data <- read.csv('_data/hotel_bookings.csv')

# Printing the read data as a paged table
paged_table(data)

This dataset represents records of hotel bookings across different countries with details like booking data, booking status, no of adults/children/babies, etc.

Describe the data

Code
# Get and print summary of the dataset to help us analyze it.
print(dfSummary(data))
Data Frame Summary  
data  
Dimensions: 119390 x 32  
Duplicates: 31994  

-----------------------------------------------------------------------------------------------------------------------------------
No   Variable                         Stats / Values             Freqs (% of Valid)     Graph                  Valid      Missing  
---- -------------------------------- -------------------------- ---------------------- ---------------------- ---------- ---------
1    hotel                            1. City Hotel              79330 (66.4%)          IIIIIIIIIIIII          119390     0        
     [character]                      2. Resort Hotel            40060 (33.6%)          IIIIII                 (100.0%)   (0.0%)   

2    is_canceled                      Min  : 0                   0 : 75166 (63.0%)      IIIIIIIIIIII           119390     0        
     [integer]                        Mean : 0.4                 1 : 44224 (37.0%)      IIIIIII                (100.0%)   (0.0%)   
                                      Max  : 1                                                                                     

3    lead_time                        Mean (sd) : 104 (106.9)    479 distinct values    :                      119390     0        
     [integer]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 69 < 737                                      :                                          
                                      IQR (CV) : 142 (1)                                : : .                                      
                                                                                        : : : . .                                  

4    arrival_date_year                Mean (sd) : 2016.2 (0.7)   2015 : 21996 (18.4%)   III                    119390     0        
     [integer]                        min < med < max:           2016 : 56707 (47.5%)   IIIIIIIII              (100.0%)   (0.0%)   
                                      2015 < 2016 < 2017         2017 : 40687 (34.1%)   IIIIII                                     
                                      IQR (CV) : 1 (0)                                                                             

5    arrival_date_month               1. August                  13877 (11.6%)          II                     119390     0        
     [character]                      2. July                    12661 (10.6%)          II                     (100.0%)   (0.0%)   
                                      3. May                     11791 ( 9.9%)          I                                          
                                      4. October                 11160 ( 9.3%)          I                                          
                                      5. April                   11089 ( 9.3%)          I                                          
                                      6. June                    10939 ( 9.2%)          I                                          
                                      7. September               10508 ( 8.8%)          I                                          
                                      8. March                    9794 ( 8.2%)          I                                          
                                      9. February                 8068 ( 6.8%)          I                                          
                                      10. November                6794 ( 5.7%)          I                                          
                                      [ 2 others ]               12709 (10.6%)          II                                         

6    arrival_date_week_number         Mean (sd) : 27.2 (13.6)    53 distinct values           . : . . .        119390     0        
     [integer]                        min < med < max:                                    . : : : : : :        (100.0%)   (0.0%)   
                                      1 < 28 < 53                                       . : : : : : : : : :                        
                                      IQR (CV) : 22 (0.5)                               : : : : : : : : : :                        
                                                                                        : : : : : : : : : :                        

7    arrival_date_day_of_month        Mean (sd) : 15.8 (8.8)     31 distinct values     :                      119390     0        
     [integer]                        min < med < max:                                  : : : . : : . : :      (100.0%)   (0.0%)   
                                      1 < 16 < 31                                       : : : : : : : : : :                        
                                      IQR (CV) : 15 (0.6)                               : : : : : : : : : :                        
                                                                                        : : : : : : : : : :                        

8    stays_in_weekend_nights          Mean (sd) : 0.9 (1)        17 distinct values     :                      119390     0        
     [integer]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 1 < 19                                        :                                          
                                      IQR (CV) : 2 (1.1)                                : :                                        
                                                                                        : :                                        

9    stays_in_week_nights             Mean (sd) : 2.5 (1.9)      35 distinct values     :                      119390     0        
     [integer]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 2 < 50                                        :                                          
                                      IQR (CV) : 2 (0.8)                                :                                          
                                                                                        :                                          

10   adults                           Mean (sd) : 1.9 (0.6)      14 distinct values     :                      119390     0        
     [integer]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 2 < 55                                        :                                          
                                      IQR (CV) : 0 (0.3)                                :                                          
                                                                                        :                                          

11   children                         Mean (sd) : 0.1 (0.4)      0 : 110796 (92.8%)     IIIIIIIIIIIIIIIIII     119386     4        
     [integer]                        min < med < max:           1 :   4861 ( 4.1%)                            (100.0%)   (0.0%)   
                                      0 < 0 < 10                 2 :   3652 ( 3.1%)                                                
                                      IQR (CV) : 0 (3.8)         3 :     76 ( 0.1%)                                                
                                                                 10 :      1 ( 0.0%)                                               

12   babies                           Mean (sd) : 0 (0.1)        0 : 118473 (99.2%)     IIIIIIIIIIIIIIIIIII    119390     0        
     [integer]                        min < med < max:           1 :    900 ( 0.8%)                            (100.0%)   (0.0%)   
                                      0 < 0 < 10                 2 :     15 ( 0.0%)                                                
                                      IQR (CV) : 0 (12.3)        9 :      1 ( 0.0%)                                                
                                                                 10 :      1 ( 0.0%)                                               

13   meal                             1. BB                      92310 (77.3%)          IIIIIIIIIIIIIII        119390     0        
     [character]                      2. FB                        798 ( 0.7%)                                 (100.0%)   (0.0%)   
                                      3. HB                      14463 (12.1%)          II                                         
                                      4. SC                      10650 ( 8.9%)          I                                          
                                      5. Undefined                1169 ( 1.0%)                                                     

14   country                          1. PRT                     48590 (40.7%)          IIIIIIII               119390     0        
     [character]                      2. GBR                     12129 (10.2%)          II                     (100.0%)   (0.0%)   
                                      3. FRA                     10415 ( 8.7%)          I                                          
                                      4. ESP                      8568 ( 7.2%)          I                                          
                                      5. DEU                      7287 ( 6.1%)          I                                          
                                      6. ITA                      3766 ( 3.2%)                                                     
                                      7. IRL                      3375 ( 2.8%)                                                     
                                      8. BEL                      2342 ( 2.0%)                                                     
                                      9. BRA                      2224 ( 1.9%)                                                     
                                      10. NLD                     2104 ( 1.8%)                                                     
                                      [ 168 others ]             18590 (15.6%)          III                                        

15   market_segment                   1. Aviation                  237 ( 0.2%)                                 119390     0        
     [character]                      2. Complementary             743 ( 0.6%)                                 (100.0%)   (0.0%)   
                                      3. Corporate                5295 ( 4.4%)                                                     
                                      4. Direct                  12606 (10.6%)          II                                         
                                      5. Groups                  19811 (16.6%)          III                                        
                                      6. Offline TA/TO           24219 (20.3%)          IIII                                       
                                      7. Online TA               56477 (47.3%)          IIIIIIIII                                  
                                      8. Undefined                   2 ( 0.0%)                                                     

16   distribution_channel             1. Corporate                6677 ( 5.6%)          I                      119390     0        
     [character]                      2. Direct                  14645 (12.3%)          II                     (100.0%)   (0.0%)   
                                      3. GDS                       193 ( 0.2%)                                                     
                                      4. TA/TO                   97870 (82.0%)          IIIIIIIIIIIIIIII                           
                                      5. Undefined                   5 ( 0.0%)                                                     

17   is_repeated_guest                Min  : 0                   0 : 115580 (96.8%)     IIIIIIIIIIIIIIIIIII    119390     0        
     [integer]                        Mean : 0                   1 :   3810 ( 3.2%)                            (100.0%)   (0.0%)   
                                      Max  : 1                                                                                     

18   previous_cancellations           Mean (sd) : 0.1 (0.8)      15 distinct values     :                      119390     0        
     [integer]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 0 < 26                                        :                                          
                                      IQR (CV) : 0 (9.7)                                :                                          
                                                                                        :                                          

19   previous_bookings_not_canceled   Mean (sd) : 0.1 (1.5)      73 distinct values     :                      119390     0        
     [integer]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 0 < 72                                        :                                          
                                      IQR (CV) : 0 (10.9)                               :                                          
                                                                                        :                                          

20   reserved_room_type               1. A                       85994 (72.0%)          IIIIIIIIIIIIII         119390     0        
     [character]                      2. B                        1118 ( 0.9%)                                 (100.0%)   (0.0%)   
                                      3. C                         932 ( 0.8%)                                                     
                                      4. D                       19201 (16.1%)          III                                        
                                      5. E                        6535 ( 5.5%)          I                                          
                                      6. F                        2897 ( 2.4%)                                                     
                                      7. G                        2094 ( 1.8%)                                                     
                                      8. H                         601 ( 0.5%)                                                     
                                      9. L                           6 ( 0.0%)                                                     
                                      10. P                         12 ( 0.0%)                                                     

21   assigned_room_type               1. A                       74053 (62.0%)          IIIIIIIIIIII           119390     0        
     [character]                      2. D                       25322 (21.2%)          IIII                   (100.0%)   (0.0%)   
                                      3. E                        7806 ( 6.5%)          I                                          
                                      4. F                        3751 ( 3.1%)                                                     
                                      5. G                        2553 ( 2.1%)                                                     
                                      6. C                        2375 ( 2.0%)                                                     
                                      7. B                        2163 ( 1.8%)                                                     
                                      8. H                         712 ( 0.6%)                                                     
                                      9. I                         363 ( 0.3%)                                                     
                                      10. K                        279 ( 0.2%)                                                     
                                      [ 2 others ]                  13 ( 0.0%)                                                     

22   booking_changes                  Mean (sd) : 0.2 (0.7)      21 distinct values     :                      119390     0        
     [integer]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 0 < 21                                        :                                          
                                      IQR (CV) : 0 (2.9)                                :                                          
                                                                                        :                                          

23   deposit_type                     1. No Deposit              104641 (87.6%)         IIIIIIIIIIIIIIIII      119390     0        
     [character]                      2. Non Refund               14587 (12.2%)         II                     (100.0%)   (0.0%)   
                                      3. Refundable                 162 ( 0.1%)                                                    

24   agent                            1. 9                       31961 (26.8%)          IIIII                  119390     0        
     [character]                      2. NULL                    16340 (13.7%)          II                     (100.0%)   (0.0%)   
                                      3. 240                     13922 (11.7%)          II                                         
                                      4. 1                        7191 ( 6.0%)          I                                          
                                      5. 14                       3640 ( 3.0%)                                                     
                                      6. 7                        3539 ( 3.0%)                                                     
                                      7. 6                        3290 ( 2.8%)                                                     
                                      8. 250                      2870 ( 2.4%)                                                     
                                      9. 241                      1721 ( 1.4%)                                                     
                                      10. 28                      1666 ( 1.4%)                                                     
                                      [ 324 others ]             33250 (27.8%)          IIIII                                      

25   company                          1. NULL                    112593 (94.3%)         IIIIIIIIIIIIIIIIII     119390     0        
     [character]                      2. 40                         927 ( 0.8%)                                (100.0%)   (0.0%)   
                                      3. 223                        784 ( 0.7%)                                                    
                                      4. 67                         267 ( 0.2%)                                                    
                                      5. 45                         250 ( 0.2%)                                                    
                                      6. 153                        215 ( 0.2%)                                                    
                                      7. 174                        149 ( 0.1%)                                                    
                                      8. 219                        141 ( 0.1%)                                                    
                                      9. 281                        138 ( 0.1%)                                                    
                                      10. 154                       133 ( 0.1%)                                                    
                                      [ 343 others ]               3793 ( 3.2%)                                                    

26   days_in_waiting_list             Mean (sd) : 2.3 (17.6)     128 distinct values    :                      119390     0        
     [integer]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 0 < 391                                       :                                          
                                      IQR (CV) : 0 (7.6)                                :                                          
                                                                                        :                                          

27   customer_type                    1. Contract                 4076 ( 3.4%)                                 119390     0        
     [character]                      2. Group                     577 ( 0.5%)                                 (100.0%)   (0.0%)   
                                      3. Transient               89613 (75.1%)          IIIIIIIIIIIIIII                            
                                      4. Transient-Party         25124 (21.0%)          IIII                                       

28   adr                              Mean (sd) : 101.8 (50.5)   8879 distinct values   :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      -6.4 < 94.6 < 5400                                :                                          
                                      IQR (CV) : 56.7 (0.5)                             :                                          
                                                                                        :                                          

29   required_car_parking_spaces      Mean (sd) : 0.1 (0.2)      0 : 111974 (93.8%)     IIIIIIIIIIIIIIIIII     119390     0        
     [integer]                        min < med < max:           1 :   7383 ( 6.2%)     I                      (100.0%)   (0.0%)   
                                      0 < 0 < 8                  2 :     28 ( 0.0%)                                                
                                      IQR (CV) : 0 (3.9)         3 :      3 ( 0.0%)                                                
                                                                 8 :      2 ( 0.0%)                                                

30   total_of_special_requests        Mean (sd) : 0.6 (0.8)      0 : 70318 (58.9%)      IIIIIIIIIII            119390     0        
     [integer]                        min < med < max:           1 : 33226 (27.8%)      IIIII                  (100.0%)   (0.0%)   
                                      0 < 0 < 5                  2 : 12969 (10.9%)      II                                         
                                      IQR (CV) : 1 (1.4)         3 :  2497 ( 2.1%)                                                 
                                                                 4 :   340 ( 0.3%)                                                 
                                                                 5 :    40 ( 0.0%)                                                 

31   reservation_status               1. Canceled                43017 (36.0%)          IIIIIII                119390     0        
     [character]                      2. Check-Out               75166 (63.0%)          IIIIIIIIIIII           (100.0%)   (0.0%)   
                                      3. No-Show                  1207 ( 1.0%)                                                     

32   reservation_status_date          1. 2015-10-21                1461 ( 1.2%)                                119390     0        
     [character]                      2. 2015-07-06                 805 ( 0.7%)                                (100.0%)   (0.0%)   
                                      3. 2016-11-25                 790 ( 0.7%)                                                    
                                      4. 2015-01-01                 763 ( 0.6%)                                                    
                                      5. 2016-01-18                 625 ( 0.5%)                                                    
                                      6. 2015-07-02                 469 ( 0.4%)                                                    
                                      7. 2016-12-07                 450 ( 0.4%)                                                    
                                      8. 2015-12-18                 423 ( 0.4%)                                                    
                                      9. 2016-02-09                 412 ( 0.3%)                                                    
                                      10. 2016-04-04                382 ( 0.3%)                                                    
                                      [ 916 others ]             112810 (94.5%)         IIIIIIIIIIIIIIIIII                         
-----------------------------------------------------------------------------------------------------------------------------------

As can be seen in the above summary, hotel has only two values - Resort and City. The dataset has data from the year 2015 to the year 2017. It can be observed that about 37% of the bookings are canceled from the is_canceled column summary. We can see a large number of columns with categorical values like country, market_segment, distribution_channel, customer_type, etc. and how the percentage of these categorical values are distributed in the data set. This gives us insight on what analysis can be done on the data.

Provide Grouped Summary Statistics

I chose the market_segment column to group the data. The market segment would give us insight into the behavior of each segment. This can be further understood by looking at the tables below :-

Code
# Group dataset by Market Segment
dataGroupedByMS <- data %>%
  group_by(market_segment)

# Analyze the grouped segment's stays in weekdays and weekends using mean of the number of nights.
dataGroupedByMS %>%
  summarise(
    avg_stays_in_week_nights = mean(stays_in_week_nights, nr.rm = TRUE),
    avg_stays_in_weekend_nights = mean(stays_in_weekend_nights, nr.rm = TRUE)
  )
# A tibble: 8 × 3
  market_segment avg_stays_in_week_nights avg_stays_in_weekend_nights
  <chr>                             <dbl>                       <dbl>
1 Aviation                           2.51                       1.09 
2 Complementary                      1.30                       0.351
3 Corporate                          1.65                       0.438
4 Direct                             2.35                       0.856
5 Groups                             2.20                       0.789
6 Offline TA/TO                      2.85                       1.05 
7 Online TA                          2.58                       0.991
8 Undefined                          1                          0.5  
Code
# Analyze the grouped segment's booking status using percentage of distribution for each categorical value in reservation_status.
prop.table(table(select(dataGroupedByMS, reservation_status))) %>% 
  `*`(100) %>% 
  round(2)
               reservation_status
market_segment  Canceled Check-Out No-Show
  Aviation          0.03      0.15    0.01
  Complementary     0.07      0.54    0.01
  Corporate         0.77      3.60    0.06
  Direct            1.44      8.94    0.18
  Groups           10.07      6.46    0.06
  Offline TA/TO     6.77     13.32    0.19
  Online TA        16.88     29.93    0.50
  Undefined         0.00      0.00    0.00

Explain and Interpret

The first table gives insight into which market segment bookings come for weekdays and which ones come during weekends. This could help the hotels decide which events to organize or how to plan activities targeted to a certain segment during those days.

The second table gives percentage of bookings of each type of segment that were cancelled, no-shows or checked-out. This could help the hotels analyze which market segments gets the most cancellations and which ones get the least, and find ways to reduce the number of cancellations by targeted schemes or penalties towards certain segments.

Source Code
---
title: "Challenge 2"
author: "Kavya Harlalka"
desription: "Data wrangling: using group() and summarise()"
date: "09/24/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
  - railroads
  - faostat
  - hotel_bookings
---

```{r Libraries}
#| label: setup
#| warning: false
#| message: false

# Import important libraries needed for functions
library(tidyverse)
library(rmarkdown)
library(summarytools)
library(dplyr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Read in the Data


```{r Read In Data}
# Read data from the csv file
data <- read.csv('_data/hotel_bookings.csv')

# Printing the read data as a paged table
paged_table(data)
```

This dataset represents records of hotel bookings across different countries with details like booking data, booking status, no of adults/children/babies, etc.

## Describe the data

```{r Summary of Data}
# Get and print summary of the dataset to help us analyze it.
print(dfSummary(data))
```

As can be seen in the above summary, hotel has only two values - Resort and City. The dataset has data from the year 2015 to the year 2017. It can be observed that about 37% of the bookings are canceled from the is_canceled column summary. We can see a large number of columns with categorical values like country, market_segment, distribution_channel, customer_type, etc. and how the percentage of these categorical values are distributed in the data set. This gives us insight on what analysis can be done on the data.


## Provide Grouped Summary Statistics

I chose the market_segment column to group the data. The market segment would give us insight into the behavior of each segment. This can be further understood by looking at the tables below :-

```{r Analysis of Data}
# Group dataset by Market Segment
dataGroupedByMS <- data %>%
  group_by(market_segment)

# Analyze the grouped segment's stays in weekdays and weekends using mean of the number of nights.
dataGroupedByMS %>%
  summarise(
    avg_stays_in_week_nights = mean(stays_in_week_nights, nr.rm = TRUE),
    avg_stays_in_weekend_nights = mean(stays_in_weekend_nights, nr.rm = TRUE)
  )

# Analyze the grouped segment's booking status using percentage of distribution for each categorical value in reservation_status.
prop.table(table(select(dataGroupedByMS, reservation_status))) %>% 
  `*`(100) %>% 
  round(2)

```

### Explain and Interpret
The first table gives insight into which market segment bookings come for weekdays and which ones come during weekends. This could help the hotels decide which events to organize or how to plan activities targeted to a certain segment during those days.

The second table gives percentage of bookings of each type of segment that were cancelled, no-shows or checked-out. This could help the hotels analyze which market segments gets the most cancellations and which ones get the least, and find ways to reduce the number of cancellations by targeted schemes or penalties towards certain segments.