Challenge 2 Solution

challenge_2
hotel_bookings
Author

Tim Shores

Published

March 2, 2023

Code
my_packages <- c("tidyverse", "magrittr", "readxl", "summarytools", "knitr") # create vector of packages
invisible(lapply(my_packages, require, character.only = TRUE)) # load multiple packages

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

My content

Challenge 2 includes two tasks:

  1. read in a dataset and describe it, and

  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Let’s dig in!

Task 1) Read in and Describe the Data

I chose to read in hotel_bookings.csv. One of my oldest friends has worked in hospitality for years. We have fallen out of touch. This post is dedicated to him.

Code
hotelbks <- read.csv(file = "../posts/_data/hotel_bookings.csv") # read in data
hbrows <- prettyNum(nrow(hotelbks), big.mark = ",", scientific = FALSE)  # Apply comma-separated format
hbcols <- prettyNum(ncol(hotelbks), big.mark = ",", scientific = FALSE)
numtoprint <- 12
monthtoprint <- "January"
yeartoprint <- 2017

The hotel bookings set includes 119,390 observations under 32 variables. It shows operational business data from multiple hotels in multiple markets and countries, managed by multiple companies. The data is likely the product of market research produced by a third-party analyst or management consultant.

Here’s the first 12 rows from the January, 2017 subset of the hotel bookings data set, with a selection of columns to display market and demographic info.

Code
hotelsub <- subset(hotelbks, arrival_date_year == yeartoprint & arrival_date_month == monthtoprint, select = c(hotel, country, market_segment, adults, children, babies, is_canceled))
hotelsub[1:numtoprint,]
            hotel country market_segment adults children babies is_canceled
9776 Resort Hotel     PRT      Online TA      2        0      0           1
9777 Resort Hotel     AUT      Online TA      2        0      0           1
9778 Resort Hotel     AUT      Online TA      2        0      0           1
9779 Resort Hotel     AUT      Online TA      2        0      0           1
9780 Resort Hotel     PRT      Online TA      1        0      0           1
9781 Resort Hotel     ITA      Online TA      2        0      0           1
9782 Resort Hotel     BEL      Online TA      2        0      0           1
9783 Resort Hotel     GBR      Online TA      2        0      0           1
9784 Resort Hotel     DEU      Online TA      1        0      0           1
9785 Resort Hotel     PRT  Offline TA/TO      1        0      0           1
9786 Resort Hotel     PRT      Online TA      2        0      0           1
9787 Resort Hotel     PRT      Online TA      2        0      0           1

Task 2) Summary statistics and interpretations of groups within the data

The adr column lists average daily rate for a single booking. Since this is data for a business, it makes sense to look at the daily rate a hotel can expect to earn from other grouped variables. I’ve done this below by grouping number of adults and number of children by average daily rate.

Code
hotelbks %>% 
  group_by(adults) %>% 
  summarise(meanADR = mean(adr, na.rm = TRUE), count = n())
# A tibble: 14 × 3
   adults meanADR count
    <int>   <dbl> <int>
 1      0    49.6   403
 2      1    82.5 23027
 3      2   103.  89680
 4      3   157.   6202
 5      4   198.     62
 6      5     0       2
 7      6     0       1
 8     10     0       1
 9     20     0       2
10     26     0       5
11     27     0       2
12     40     0       1
13     50     0       1
14     55     0       1
Code
hotelbks %>% 
  group_by(children) %>% 
  summarise(meanADR = mean(adr, na.rm = TRUE), count = n())
# A tibble: 6 × 3
  children meanADR  count
     <int>   <dbl>  <int>
1        0    97.5 110796
2        1   142.    4861
3        2   179.    3652
4        3   199.      76
5       10   133.       1
6       NA    29.6      4

This analysis could help a hotel decide how much to invest in rooms and services appropriate for small or large numbers of adults or children. I can filter out groups with a small number of observations to make it easier to read:

Code
hotelbks %>% 
  group_by(adults) %>% 
  filter(n() > 50) %>% 
  summarise(meanADR = mean(adr, na.rm = TRUE), count = n())
# A tibble: 5 × 3
  adults meanADR count
   <int>   <dbl> <int>
1      0    49.6   403
2      1    82.5 23027
3      2   103.  89680
4      3   157.   6202
5      4   198.     62
Code
hotelbks %>% 
  group_by(children) %>% 
  filter(n() > 50) %>% 
  summarise(meanADR = mean(adr, na.rm = TRUE), count = n())
# A tibble: 4 × 3
  children meanADR  count
     <int>   <dbl>  <int>
1        0    97.5 110796
2        1   142.    4861
3        2   179.    3652
4        3   199.      76

I can do more work to explore the central tendency and dispersion of this data. Mean and median are built in. Finding mode takes a little more effort to create a function, but I’m fortunate that others who have encountered this problem before took the time to write up their solution on the interwebs.

Code
#define function to calculate mode
find_mode <- function(x) {
  u <- unique(x) # unique list as an index
  tab <- tabulate(match(x, u))  # count how many times each index member occurs
  u[tab == max(tab)] #  the max occurrence is the mode
  mean(u) # return mean in case the data is multimodal
}

hotelbks %>% 
  group_by(adults) %>% 
  filter(n() > 50) %>% 
  summarise(meanADR = mean(adr, na.rm = TRUE), medianADR = median(adr,na.rm = TRUE), modeADR = find_mode(adr), fiveNumADR = fivenum(adr, na.rm = TRUE), count = n()) %>%
  tbl_df %>% 
  print(n=40)
# A tibble: 25 × 6
   adults meanADR medianADR modeADR fiveNumADR count
    <int>   <dbl>     <dbl>   <dbl>      <dbl> <int>
 1      0    49.6      56.3    85.6       0      403
 2      0    49.6      56.3    85.6       0      403
 3      0    49.6      56.3    85.6      56.3    403
 4      0    49.6      56.3    85.6      91.8    403
 5      0    49.6      56.3    85.6     200      403
 6      1    82.5      79.2   100.        0    23027
 7      1    82.5      79.2   100.       56.5  23027
 8      1    82.5      79.2   100.       79.2  23027
 9      1    82.5      79.2   100.      107.   23027
10      1    82.5      79.2   100.      510    23027
11      2   103.       95     125.       -6.38 89680
12      2   103.       95     125.       72    89680
13      2   103.       95     125.       95    89680
14      2   103.       95     125.      125.   89680
15      2   103.       95     125.     5400    89680
16      3   157.      152.    162.        0     6202
17      3   157.      152.    162.      127.    6202
18      3   157.      152.    162.      152.    6202
19      3   157.      152.    162.      184.    6202
20      3   157.      152.    162.      402     6202
21      4   198.      204.    200.        0       62
22      4   198.      204.    200.      162.      62
23      4   198.      204.    200.      204.      62
24      4   198.      204.    200.      251       62
25      4   198.      204.    200.      357       62
Code
hotelbks %>% 
  group_by(children) %>% 
  filter(n() > 50) %>% 
  summarise(meanADR = mean(adr, na.rm = TRUE), medianADR = median(adr, na.rm = TRUE), modeADR = find_mode(adr), fiveNumADR = fivenum(adr, na.rm = TRUE), count = n()) %>%
  tbl_df %>% 
  print(n=40)
# A tibble: 20 × 6
   children meanADR medianADR modeADR fiveNumADR  count
      <int>   <dbl>     <dbl>   <dbl>      <dbl>  <int>
 1        0    97.5      90.9    117.      -6.38 110796
 2        0    97.5      90.9    117.      67    110796
 3        0    97.5      90.9    117.      90.9  110796
 4        0    97.5      90.9    117.     120    110796
 5        0    97.5      90.9    117.    5400    110796
 6        1   142.      134.     147.       0      4861
 7        1   142.      134.     147.     106.     4861
 8        1   142.      134.     147.     134.     4861
 9        1   142.      134.     147.     170.     4861
10        1   142.      134.     147.     402      4861
11        2   179.      180      181.       0      3652
12        2   179.      180      181.     140.     3652
13        2   179.      180      181.     180      3652
14        2   179.      180      181.     222      3652
15        2   179.      180      181.     452.     3652
16        3   199.      221.     202.       0        76
17        3   199.      221.     202.     145        76
18        3   199.      221.     202.     221.       76
19        3   199.      221.     202.     264.       76
20        3   199.      221.     202.     352.       76

The Tukey five number summary (minimum, lower quartile, median, upper quartile, and maximum) is interesting and easy to code with just one function, but fivenum() fits awkwardly into the grouped tibble format. It forces five rows for each group and it repeats the median value that’s already displayed under the third variable.

Since fivenum() returns a vector, I can rewrite my code to present each fivenum() vector element as its own variable. This creates a nicer table.

Code
hotelbks %>% 
  group_by(adults) %>% 
  filter(n() > 50) %>% 
  summarise(meanADR = mean(adr, na.rm = TRUE), modeADR = find_mode(adr), minADR = fivenum(adr, na.rm = TRUE)[1], lowHingeADR = fivenum(adr, na.rm = TRUE)[2], medianADR = median(adr,na.rm = TRUE), upHingeADR = fivenum(adr, na.rm = TRUE)[4], maxADR = fivenum(adr, na.rm = TRUE)[5], count = n())
# A tibble: 5 × 9
  adults meanADR modeADR minADR lowHingeADR medianADR upHingeADR maxADR count
   <int>   <dbl>   <dbl>  <dbl>       <dbl>     <dbl>      <dbl>  <dbl> <int>
1      0    49.6    85.6   0            0        56.3       91.8    200   403
2      1    82.5   100.    0           56.5      79.2      107.     510 23027
3      2   103.    125.   -6.38        72        95        125.    5400 89680
4      3   157.    162.    0          127.      152.       184.     402  6202
5      4   198.    200.    0          162.      204.       251      357    62
Code
hotelbks %>% 
  group_by(children) %>% 
  filter(n() > 50) %>% 
  summarise(meanADR = mean(adr, na.rm = TRUE), modeADR = find_mode(adr), minADR = fivenum(adr, na.rm = TRUE)[1], lowHingeADR = fivenum(adr, na.rm = TRUE)[2], medianADR = median(adr,na.rm = TRUE), upHingeADR = fivenum(adr, na.rm = TRUE)[4], maxADR = fivenum(adr, na.rm = TRUE)[5], count = n())
# A tibble: 4 × 9
  children meanADR modeADR minADR lowHingeADR medianADR upHingeADR maxADR  count
     <int>   <dbl>   <dbl>  <dbl>       <dbl>     <dbl>      <dbl>  <dbl>  <int>
1        0    97.5    117.  -6.38         67       90.9       120   5400  110796
2        1   142.     147.   0           106.     134.        170.   402    4861
3        2   179.     181.   0           140.     180         222    452.   3652
4        3   199.     202.   0           145      221.        264.   352.     76

Finally, I can combine these tables, grouping both by adults and by children, and arranging by mean ADR to see which combinations of adult and children hotel guests bring in the highest and lowest daily rate.

Code
hotelbks %>% 
  group_by(adults, children) %>% 
  filter(n() > 50) %>% filter(adults != 0 || children != 0) %>% 
  summarise(meanADR = mean(adr, na.rm = TRUE), modeADR = find_mode(adr), minADR = fivenum(adr, na.rm = TRUE)[1], lowHingeADR = fivenum(adr, na.rm = TRUE)[2], medianADR = median(adr,na.rm = TRUE), upHingeADR = fivenum(adr, na.rm = TRUE)[4], maxADR = fivenum(adr, na.rm = TRUE)[5], count = n()) %>% 
  arrange(meanADR)
# A tibble: 11 × 10
# Groups:   adults [5]
   adults children meanADR modeADR minADR lowHing…¹ media…² upHin…³ maxADR count
    <int>    <int>   <dbl>   <dbl>  <dbl>     <dbl>   <dbl>   <dbl>  <dbl> <int>
 1      1        0    81.6    96.1   0         56      79      105.   510  22587
 2      0        2    83.2    92.3   0         77.1    90.0    106.   156.   208
 3      2        0    98.1   114.   -6.38      70      91.7    119   5400  82278
 4      1        1   113.    113.    0         78.2   105.     136.   283    279
 5      2        1   137.    143.    0        107.    132.     163.   392   4089
 6      3        0   153.    156.    0        126     150.     177    376.  5675
 7      1        2   158.    160.    0        109.    161.     197.   339.   157
 8      2        2   186.    188.    0        151.    186.     225    452.  3248
 9      4        0   198.    200.    0        167.    204.     248.   357     60
10      3        1   198.    201.    0        142.    194      249    402    487
11      2        3   228.    225.   79.3      199.    240.     267    352.    61
# … with abbreviated variable names ¹​lowHingeADR, ²​medianADR, ³​upHingeADR