Challenge 6 Solutions

challenge_6

hotel_bookings

Visualizing Time and Relationships

Author

Vishnupriya Varadharaju

Published

November 30, 2022

library(tidyverse)
library(lubridate)
library(ggplot2)
library(treemap)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Working with the Hotel Bookings Dataset

Read in data

# Reading in the CSV data
h_book <- read_csv("_data/hotel_bookings.csv", show_col_types = FALSE)

h_book

# A tibble: 119,390 × 32
   hotel  is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
   <chr>    <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 Resor…       0     342    2015 July         27       1       0       0      2
 2 Resor…       0     737    2015 July         27       1       0       0      2
 3 Resor…       0       7    2015 July         27       1       0       1      1
 4 Resor…       0      13    2015 July         27       1       0       1      1
 5 Resor…       0      14    2015 July         27       1       0       2      2
 6 Resor…       0      14    2015 July         27       1       0       2      2
 7 Resor…       0       0    2015 July         27       1       0       2      2
 8 Resor…       0       9    2015 July         27       1       0       2      2
 9 Resor…       1      85    2015 July         27       1       0       3      2
10 Resor…       1      75    2015 July         27       1       0       3      2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …

Briefly describe the data

The following dataset has 119390 observations and 32 different fields. This is basically the data from two hotels - City Hotel & Resort Hotel. Each entry corresponds to a booking made by a customer. There are various information that are recorded with each booking. It includes the arrival date, number of days of stay, meal, type of room reserved, customer type, reservation status, number of adult/children and many more. The data consists of entries from countries all over the world.

Tidy Data (as needed)

# To check the number of unique values in all the columns
rapply(h_book,function(x)length(unique(x)))

                         hotel                    is_canceled 
                             2                              2 
                     lead_time              arrival_date_year 
                           479                              3 
            arrival_date_month       arrival_date_week_number 
                            12                             53 
     arrival_date_day_of_month        stays_in_weekend_nights 
                            31                             17 
          stays_in_week_nights                         adults 
                            35                             14 
                      children                         babies 
                             6                              5 
                          meal                        country 
                             5                            178 
                market_segment           distribution_channel 
                             8                              5 
             is_repeated_guest         previous_cancellations 
                             2                             15 
previous_bookings_not_canceled             reserved_room_type 
                            73                             10 
            assigned_room_type                booking_changes 
                            12                             21 
                  deposit_type                          agent 
                             3                            334 
                       company           days_in_waiting_list 
                           353                            128 
                 customer_type                            adr 
                             4                           8879 
   required_car_parking_spaces      total_of_special_requests 
                             5                              6 
            reservation_status        reservation_status_date 
                             3                            926

# To check the unique values of hotel
unique(h_book$hotel)

[1] "Resort Hotel" "City Hotel"

# in country we can see that there is a field call NULL that can be removed
table(h_book$country)


  ABW   AGO   AIA   ALB   AND   ARE   ARG   ARM   ASM   ATA   ATF   AUS   AUT 
    2   362     1    12     7    51   214     8     1     2     1   426  1263 
  AZE   BDI   BEL   BEN   BFA   BGD   BGR   BHR   BHS   BIH   BLR   BOL   BRA 
   17     1  2342     3     1    12    75     5     1    13    26    10  2224 
  BRB   BWA   CAF   CHE   CHL   CHN   CIV   CMR    CN   COL   COM   CPV   CRI 
    4     1     5  1730    65   999     6    10  1279    71     2    24    19 
  CUB   CYM   CYP   CZE   DEU   DJI   DMA   DNK   DOM   DZA   ECU   EGY   ESP 
    8     1    51   171  7287     1     1   435    14   103    27    32  8568 
  EST   ETH   FIN   FJI   FRA   FRO   GAB   GBR   GEO   GGY   GHA   GIB   GLP 
   83     3   447     1 10415     5     4 12129    22     3     4    18     2 
  GNB   GRC   GTM   GUY   HKG   HND   HRV   HUN   IDN   IMN   IND   IRL   IRN 
    9   128     4     1    29     1   100   230    35     2   152  3375    83 
  IRQ   ISL   ISR   ITA   JAM   JEY   JOR   JPN   KAZ   KEN   KHM   KIR   KNA 
   14    57   669  3766     6     8    21   197    19     6     2     1     2 
  KOR   KWT   LAO   LBN   LBY   LCA   LIE   LKA   LTU   LUX   LVA   MAC   MAR 
  133    16     2    31     8     1     3     7    81   287    55    16   259 
  MCO   MDG   MDV   MEX   MKD   MLI   MLT   MMR   MNE   MOZ   MRT   MUS   MWI 
    4     1    12    85    10     1    18     1     5    67     1     7     2 
  MYS   MYT   NAM   NCL   NGA   NIC   NLD   NOR   NPL  NULL   NZL   OMN   PAK 
   28     2     1     1    34     1  2104   607     1   488    74    18    14 
  PAN   PER   PHL   PLW   POL   PRI   PRT   PRY   PYF   QAT   ROU   RUS   RWA 
    9    29    40     1   919    12 48590     4     1    15   500   632     2 
  SAU   SDN   SEN   SGP   SLE   SLV   SMR   SRB   STP   SUR   SVK   SVN   SWE 
   48     1    11    39     1     2     1   101     2     5    65    57  1024 
  SYC   SYR   TGO   THA   TJK   TMP   TUN   TUR   TWN   TZA   UGA   UKR   UMI 
    2     3     2    59     9     3    39   248    51     5     2    68     1 
  URY   USA   UZB   VEN   VGB   VNM   ZAF   ZMB   ZWE 
   32  2097     4    26     1     8    80     2     4

h_book <- h_book %>% 
  filter(!(country == "NULL"))

In country, there are entries called NULL, which can be removed as it may not be useful for the analysis.

# to check the different types of the fields
head(h_book)

# A tibble: 6 × 32
  hotel   is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
  <chr>     <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
1 Resort…       0     342    2015 July         27       1       0       0      2
2 Resort…       0     737    2015 July         27       1       0       0      2
3 Resort…       0       7    2015 July         27       1       0       1      1
4 Resort…       0      13    2015 July         27       1       0       1      1
5 Resort…       0      14    2015 July         27       1       0       2      2
6 Resort…       0      14    2015 July         27       1       0       2      2
# … with 22 more variables: children <dbl>, babies <dbl>, meal <chr>,
#   country <chr>, market_segment <chr>, distribution_channel <chr>,
#   is_repeated_guest <dbl>, previous_cancellations <dbl>,
#   previous_bookings_not_canceled <dbl>, reserved_room_type <chr>,
#   assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>,
#   agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …

sapply(h_book, class)

                         hotel                    is_canceled 
                   "character"                      "numeric" 
                     lead_time              arrival_date_year 
                     "numeric"                      "numeric" 
            arrival_date_month       arrival_date_week_number 
                   "character"                      "numeric" 
     arrival_date_day_of_month        stays_in_weekend_nights 
                     "numeric"                      "numeric" 
          stays_in_week_nights                         adults 
                     "numeric"                      "numeric" 
                      children                         babies 
                     "numeric"                      "numeric" 
                          meal                        country 
                   "character"                    "character" 
                market_segment           distribution_channel 
                   "character"                    "character" 
             is_repeated_guest         previous_cancellations 
                     "numeric"                      "numeric" 
previous_bookings_not_canceled             reserved_room_type 
                     "numeric"                    "character" 
            assigned_room_type                booking_changes 
                   "character"                      "numeric" 
                  deposit_type                          agent 
                   "character"                    "character" 
                       company           days_in_waiting_list 
                   "character"                      "numeric" 
                 customer_type                            adr 
                   "character"                      "numeric" 
   required_car_parking_spaces      total_of_special_requests 
                     "numeric"                      "numeric" 
            reservation_status        reservation_status_date 
                   "character"                         "Date"

From the above analysis, we can see that two fields, Agents and Company have numerical values in them, but have the datatype marked as character. These NULL entries can be changed to NA and the datatype can be changed to numeric. Also, the arrival date in year, month and date can be combined into a single field called as arrival date.

# combining the arrival date into a single field
# to find the total number of guests in the hotel - add adults, children and babies

h_book_mut <- h_book %>% 
  mutate(arrival_date = str_c(arrival_date_day_of_month,
                              arrival_date_month,
                              arrival_date_year, sep="/"),
         arrival_date = dmy(arrival_date),
         total_guests = adults + children + babies) %>% 
  select(-c(arrival_date_day_of_month,arrival_date_month,arrival_date_year))

h_book_mut

# A tibble: 118,902 × 31
   hotel     is_ca…¹ lead_…² arriv…³ stays…⁴ stays…⁵ adults child…⁶ babies meal 
   <chr>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl> <chr>
 1 Resort H…       0     342      27       0       0      2       0      0 BB   
 2 Resort H…       0     737      27       0       0      2       0      0 BB   
 3 Resort H…       0       7      27       0       1      1       0      0 BB   
 4 Resort H…       0      13      27       0       1      1       0      0 BB   
 5 Resort H…       0      14      27       0       2      2       0      0 BB   
 6 Resort H…       0      14      27       0       2      2       0      0 BB   
 7 Resort H…       0       0      27       0       2      2       0      0 BB   
 8 Resort H…       0       9      27       0       2      2       0      0 FB   
 9 Resort H…       1      85      27       0       3      2       0      0 BB   
10 Resort H…       1      75      27       0       3      2       0      0 HB   
# … with 118,892 more rows, 21 more variables: country <chr>,
#   market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
#   total_of_special_requests <dbl>, reservation_status <chr>, …

Changing the datatype of agent and company from char to numeric

# mutating the datatype of the agent and company field from char to numeric

h_book_mut <- h_book_mut %>%
  mutate(across(c(agent, company),~ replace(.,str_detect(., "NULL"), NA))) %>% mutate_at(vars(agent, company),as.numeric)

is.numeric(h_book_mut$agent)

[1] TRUE

is.numeric(h_book_mut$company)

[1] TRUE

Time Dependent Visualization

For the time dependent visualization I want to see how many guests are arriving over a period of time on a daily basis.

plot1_data <- h_book_mut %>% select(total_guests, arrival_date) %>%
  group_by(arrival_date) %>%
  summarise(net_guests = sum(total_guests, na.rm=TRUE))

plot1_data

# A tibble: 793 × 2
   arrival_date net_guests
   <date>            <dbl>
 1 2015-07-01          220
 2 2015-07-02          190
 3 2015-07-03          116
 4 2015-07-04          181
 5 2015-07-05          113
 6 2015-07-06          152
 7 2015-07-07          114
 8 2015-07-08          139
 9 2015-07-09          166
10 2015-07-10          118
# … with 783 more rows

# finding the date ranges of the arrival date of this data

summary(plot1_data$arrival_date)

        Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
"2015-07-01" "2016-01-15" "2016-07-31" "2016-07-31" "2017-02-14" "2017-08-31"

From above, we can see that the arrival dates lie between July 2015 - August 2017.

p <- ggplot(plot1_data, aes(x=arrival_date, y=net_guests)) + ggtitle("Guest Count Vs. Arrival Date Trend: Jul 2015 - Jun 2016") +
  geom_line(color="maroon") + 
  xlab("") + theme(axis.text.x=element_text(angle=60, hjust=1)) +
  scale_x_date(date_labels = "%Y %b %d", date_minor_breaks = "1 month", limit=c(as.Date("2015-07-01"),as.Date("2016-07-01")))

p

Here, I am visualizing the trends for a period of one year from Jul 2015 - Jul 2016 by using a line graph as it can precisely be used to see trends for time series data. In the above visualization we can see the total guests who have arrived on a particular day. Furthermore, we can see the monthly breaks as well. The influx of guests peaked during the first week of December 2015 and dipped sharply the following week. During the months of summer i.e. April, May, June, the number of incoming guests seem to be consistent. This can be because of the vacation that many families might take during the summer break.

Visualizing Part-Whole Relationships

For visualizing part-whole relationship, I want to see the proportion of guests who have chosen different categories for meal. Firstly, I want need to group-by to get the count of each meal category.

unique(h_book_mut$meal)

[1] "BB"        "FB"        "HB"        "SC"        "Undefined"

There are 5 meal categories here. BB - ‘Bed & Breakfast’ FB - ‘Full Board’ - Breakfast, Lunch, Dinner HB - ‘Half Board’ - Breakfast & Dinner SC - ‘Self-Catering’ - No meals included Undefined - which can be ignored

I want to gather arrival date data of one year from Jul 2015 to Jun 2016 and exclude reservation status which says canceled.

plot2_data <- h_book_mut %>% 
  filter(reservation_status != 'Canceled', arrival_date >= as.Date("2015-07-01") & arrival_date < as.Date("2016-07-01")) %>%
  select(meal) %>%
  group_by(meal) %>%
  summarise(total_count = n(), .groups = 'drop') %>%
  filter(meal != "Undefined")

plot2_data

# A tibble: 4 × 2
  meal  total_count
  <chr>       <int>
1 BB          25353
2 FB            210
3 HB           4170
4 SC           1827

Plotting the above data using a tree map.

treemap(plot2_data,
            index="meal",
            vSize="total_count",
            type="index",
            title = "Meal Cateogries :Jul 2015 - Jun 2016",
            palette = "Set3",
            border.lwds = 1.5
            )

The most preferred meal scheme was BB, followed by HB, SC and finally FB. The reason for choosing a treemap is to give an easy overview of which category is the most preferred in the meal type. This can help with the initial analysis and we can further dive deep into why BB is preferred most, in which countries and by which demography. Maybe guests who have children tend to choose that option or guests who are just staying only for a night, etc.