Challenge 7

challenge_7
Priyanka Perumalla
hotel_bookings.csv
Visualizing Multiple Dimensions
Author

Priyanka Perumalla

Published

May 16, 2023

library(tidyverse)
library(ggplot2)
library(lubridate)
library(dplyr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. mutate variables as needed (including sanity checks)
  4. Recreate at least two graphs from previous exercises, but introduce at least one additional dimension that you omitted before using ggplot functionality (color, shape, line, facet, etc.). The goal is not to create unneeded chart ink (Tufte), but to concisely capture variation in additional dimensions that were collapsed in your earlier 2- or 3-dimensional graphs.
     • Explain why you chose the specific graph type
  5. If you haven’t tried in previous weeks, work this week to make your graphs “publication” ready with titles, captions, and pretty axis labels and other viewer-friendly features

R Graph Gallery is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code. And anyone not familiar with Edward Tufte should check out his fantastic books and courses on data visualization.

(be sure to only include the category tags for the data you use!)

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • eggs ⭐
  • abc_poll ⭐⭐
  • australian_marriage ⭐⭐
  • hotel_bookings ⭐⭐⭐
  • air_bnb ⭐⭐⭐
  • us_hh ⭐⭐⭐⭐
  • faostat ⭐⭐⭐⭐⭐
hotel_data <- read.csv("_data/hotel_bookings.csv")
# Keep an unfiltered copy; hotel_data is narrowed down later on
data <- hotel_data
dim(hotel_data)
[1] 119390     32
head(hotel_data)
         hotel is_canceled lead_time arrival_date_year arrival_date_month
1 Resort Hotel           0       342              2015               July
2 Resort Hotel           0       737              2015               July
3 Resort Hotel           0         7              2015               July
4 Resort Hotel           0        13              2015               July
5 Resort Hotel           0        14              2015               July
6 Resort Hotel           0        14              2015               July
  arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
1                       27                         1                       0
2                       27                         1                       0
3                       27                         1                       0
4                       27                         1                       0
5                       27                         1                       0
6                       27                         1                       0
  stays_in_week_nights adults children babies meal country market_segment
1                    0      2        0      0   BB     PRT         Direct
2                    0      2        0      0   BB     PRT         Direct
3                    1      1        0      0   BB     GBR         Direct
4                    1      1        0      0   BB     GBR      Corporate
5                    2      2        0      0   BB     GBR      Online TA
6                    2      2        0      0   BB     GBR      Online TA
  distribution_channel is_repeated_guest previous_cancellations
1               Direct                 0                      0
2               Direct                 0                      0
3               Direct                 0                      0
4            Corporate                 0                      0
5                TA/TO                 0                      0
6                TA/TO                 0                      0
  previous_bookings_not_canceled reserved_room_type assigned_room_type
1                              0                  C                  C
2                              0                  C                  C
3                              0                  A                  C
4                              0                  A                  A
5                              0                  A                  A
6                              0                  A                  A
  booking_changes deposit_type agent company days_in_waiting_list customer_type
1               3   No Deposit  NULL    NULL                    0     Transient
2               4   No Deposit  NULL    NULL                    0     Transient
3               0   No Deposit  NULL    NULL                    0     Transient
4               0   No Deposit   304    NULL                    0     Transient
5               0   No Deposit   240    NULL                    0     Transient
6               0   No Deposit   240    NULL                    0     Transient
  adr required_car_parking_spaces total_of_special_requests reservation_status
1   0                           0                         0          Check-Out
2   0                           0                         0          Check-Out
3  75                           0                         0          Check-Out
4  75                           0                         0          Check-Out
5  98                           0                         1          Check-Out
6  98                           0                         1          Check-Out
  reservation_status_date
1              2015-07-01
2              2015-07-01
3              2015-07-02
4              2015-07-02
5              2015-07-03
6              2015-07-03

Briefly describe the data

This dataset contains information about hotel bookings: 119,390 rows and 32 columns. The attributes include the hotel type, booking cancellation status, lead time (number of days between booking and arrival), arrival date details (year, month, week number, and day of the month), duration of stay (weekend nights and week nights), number of adults, children, and babies in the booking, meal type, and country of origin, among others. Together they cover guest demographics, booking patterns, cancellation behavior, and reservation details. The data can be used to study booking trends, predict cancellations, analyze guest preferences, and explore relationships between variables.

str(hotel_data)
'data.frame':   119390 obs. of  32 variables:
 $ hotel                         : chr  "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
 $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
 $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
 $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
 $ arrival_date_month            : chr  "July" "July" "July" "July" ...
 $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
 $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
 $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
 $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
 $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ meal                          : chr  "BB" "BB" "BB" "BB" ...
 $ country                       : chr  "PRT" "PRT" "GBR" "GBR" ...
 $ market_segment                : chr  "Direct" "Direct" "Direct" "Corporate" ...
 $ distribution_channel          : chr  "Direct" "Direct" "Direct" "Corporate" ...
 $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
 $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
 $ reserved_room_type            : chr  "C" "C" "A" "A" ...
 $ assigned_room_type            : chr  "C" "C" "C" "A" ...
 $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
 $ deposit_type                  : chr  "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
 $ agent                         : chr  "NULL" "NULL" "NULL" "304" ...
 $ company                       : chr  "NULL" "NULL" "NULL" "NULL" ...
 $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
 $ customer_type                 : chr  "Transient" "Transient" "Transient" "Transient" ...
 $ adr                           : num  0 0 75 75 98 ...
 $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
 $ reservation_status            : chr  "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
 $ reservation_status_date       : chr  "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...

Distinct country codes present in the data (missing countries appear as the literal string "NULL"):

unique_country <- unique(hotel_data$country)
cat(paste(unique_country, collapse = ", "))
PRT, GBR, USA, ESP, IRL, FRA, NULL, ROU, NOR, OMN, ARG, POL, DEU, BEL, CHE, CN, GRC, ITA, NLD, DNK, RUS, SWE, AUS, EST, CZE, BRA, FIN, MOZ, BWA, LUX, SVN, ALB, IND, CHN, MEX, MAR, UKR, SMR, LVA, PRI, SRB, CHL, AUT, BLR, LTU, TUR, ZAF, AGO, ISR, CYM, ZMB, CPV, ZWE, DZA, KOR, CRI, HUN, ARE, TUN, JAM, HRV, HKG, IRN, GEO, AND, GIB, URY, JEY, CAF, CYP, COL, GGY, KWT, NGA, MDV, VEN, SVK, FJI, KAZ, PAK, IDN, LBN, PHL, SEN, SYC, AZE, BHR, NZL, THA, DOM, MKD, MYS, ARM, JPN, LKA, CUB, CMR, BIH, MUS, COM, SUR, UGA, BGR, CIV, JOR, SYR, SGP, BDI, SAU, VNM, PLW, QAT, EGY, PER, MLT, MWI, ECU, MDG, ISL, UZB, NPL, BHS, MAC, TGO, TWN, DJI, STP, KNA, ETH, IRQ, HND, RWA, KHM, MCO, BGD, IMN, TJK, NIC, BEN, VGB, TZA, GAB, GHA, TMP, GLP, KEN, LIE, GNB, MNE, UMI, MYT, FRO, MMR, PAN, BFA, LBY, MLI, NAM, BOL, PRY, BRB, ABW, AIA, SLV, DMA, PYF, GUY, LCA, ATA, GTM, ASM, MRT, NCL, KIR, SDN, ATF, SLE, LAO
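
The list above contains the text "NULL", and the agent and company columns in the head() output show the same placeholder, which suggests that missing values in this file are stored as strings rather than as NA. A minimal sketch of how the na.strings argument of read.csv() could convert them while reading, assuming that interpretation is right. The inline CSV text is made-up illustration data, not rows from hotel_bookings.csv.

```r
# Toy CSV text (hypothetical rows) that mimics the "NULL" placeholder
csv_text <- "country,agent,lead_time
PRT,NULL,342
NULL,304,7
GBR,240,13"

# Default read: "NULL" stays an ordinary string, so is.na() finds nothing
raw <- read.csv(text = csv_text)
sum(is.na(raw$country))

# Treat both "NULL" and "NA" as missing while reading
clean <- read.csv(text = csv_text, na.strings = c("NULL", "NA"))
sum(is.na(clean$country))

# With the placeholder gone, agent is parsed as a number instead of text
str(clean$agent)
```

Applied to the real file, the same argument would let drop_na() and is.na() see these rows as missing.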

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

The data is first reduced to the three columns needed for the plot: country, lead_time, and meal. It is then filtered to five countries (USA, BEL, GBR, DEU, FRA), and the number of bookings is counted for each combination of country, lead_time, and meal type.

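# Keep only the columns needed for the plot and drop rows with missing values.
# Note: drop_na() removes true NA values only; "NULL" strings are left in place.
tidy_data <- hotel_data %>%
  select(country, lead_time, meal) %>%
  drop_na()
# Restrict to the five countries of interest
hotel_data <- tidy_data %>%
  filter(country %in% c("USA", "BEL", "GBR", "DEU", "FRA"))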
head(hotel_data, n = 5)
  country lead_time meal
1     GBR         7   BB
2     GBR        13   BB
3     GBR        14   BB
4     GBR        14   BB
5     USA        68   BB
# Group by country, lead_time, and meal, and count the bookings in each group
bookings_data <- hotel_data %>%
  group_by(country, lead_time, meal) %>%
  summarise(bookings = n(), .groups = "drop")
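
The prompt above asks for a sanity check on the end result. One simple check, sketched here on a small made-up data frame (toy and toy_counts are hypothetical names, and the values are not taken from the hotel data), is that the per-group counts add back up to the number of input rows.

```r
library(dplyr)

# Toy stand-in for the filtered booking rows (hypothetical values)
toy <- data.frame(
  country   = c("GBR", "GBR", "GBR", "USA", "USA"),
  lead_time = c(7, 7, 14, 68, 68),
  meal      = c("BB", "BB", "BB", "BB", "HB")
)

# Same grouping pattern as the real aggregation
toy_counts <- toy %>%
  group_by(country, lead_time, meal) %>%
  summarise(bookings = n(), .groups = "drop")

# Sanity check: no booking is lost or double counted by the grouping
stopifnot(sum(toy_counts$bookings) == nrow(toy))
toy_counts
```

The same stopifnot() line could be run with bookings_data and the filtered hotel_data in place of the toy objects.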

Are there any variables that require mutation to be usable in your analysis stream? For example, do you need to calculate new values in order to graph them? Can string values be represented numerically? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

Two new columns are added to the full dataset (data, the copy made before filtering). The first, total_stays, is the sum of stays_in_weekend_nights and stays_in_week_nights. The second, arrival_month_num, converts arrival_date_month from a month name into a number from 1 to 12 by looking it up in the built-in month.name vector with match().

# Mutate columns to create a new column for total stays
data <- data %>%
  mutate(total_stays = stays_in_weekend_nights + stays_in_week_nights)

# Mutate columns to convert arrival_date_month into a numerical value
data <- data %>%
  mutate(arrival_month_num = match(arrival_date_month, month.name))
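
As a small illustration of how the match() lookup behaves, the sketch below runs it on a hand-written vector (months is a hypothetical example, not a column from the data). Month names that do not match month.name exactly, including differences in capitalisation, come back as NA, so anyNA() makes a quick sanity check.

```r
# Hand-written example values; "january" is deliberately lower case
months <- c("July", "August", "january", "December")

# Position of each name in the built-in month.name vector
month_num <- match(months, month.name)
month_num
# → 7 8 NA 12

# TRUE here signals that at least one name failed to match
anyNA(month_num)
```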

Visualization with Multiple Dimensions

This scatter plot shows the number of bookings at each lead time, with meal type mapped to color and one panel per country. A scatter plot suits the two numeric variables (lead time and booking count), while color and facets add the two categorical dimensions without extra chart ink. The plot can help identify how booking patterns differ across countries and meal types, which may inform pricing, promotion, and marketing decisions.

# Plot the data using ggplot2
ggplot(bookings_data, aes(x = lead_time, y = bookings, color = meal)) +
  geom_point(size = 2) +
  facet_wrap(~country, ncol = 2) +
  labs(title = "Bookings by Lead Time, Meal, and Country",
       x = "Lead Time (days)",
       y = "Bookings",
       color = "Meal Type")
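
The challenge also asks for publication-ready graphs. A hedged sketch of a few options that could be layered onto the plot above: semi-transparent points to show overlap, a caption, and a lighter theme. It runs on a made-up data frame (toy_bookings is hypothetical), and the caption text is a placeholder.

```r
library(ggplot2)

# Toy stand-in for bookings_data (hypothetical values)
toy_bookings <- data.frame(
  country   = rep(c("GBR", "USA"), each = 4),
  lead_time = rep(c(0, 7, 30, 90), times = 2),
  meal      = rep(c("BB", "HB"), times = 4),
  bookings  = c(40, 12, 6, 2, 15, 5, 3, 1)
)

p <- ggplot(toy_bookings, aes(x = lead_time, y = bookings, color = meal)) +
  geom_point(size = 2, alpha = 0.5) +  # semi-transparent points reveal overlap
  facet_wrap(~country, ncol = 2) +
  labs(title = "Bookings by Lead Time, Meal, and Country",
       x = "Lead Time (days)",
       y = "Number of Bookings",
       color = "Meal Type",
       caption = "Toy data for illustration only") +
  theme_minimal()
p
```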