HW3

Homework 3

Author

Prajakti Kapade

Published

December 3, 2022

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(ggplot2)

Warning: package 'ggplot2' was built under R version 4.2.2

#Read Dataset We will load the hotel_bookings dataset.

hotel_data <- read_csv('_data/hotel_bookings.csv')

Error in read_csv("_data/hotel_bookings.csv"): could not find function "read_csv"

hotel_data

Error in eval(expr, envir, enclos): object 'hotel_data' not found

colnames(hotel_data)

Error in is.data.frame(x): object 'hotel_data' not found

The dataset contains 32 columns and 119,390 observations. The columns are shown above. Some of the categorical columns are : hotel, deposit_type, company, customer_type The columns related to time are : “lead_time” ,“arrival_date_year” , “arrival_date_month” “arrival_date_week_number” ,“arrival_date_day_of_month” ,“stays_in_weekend_nights”, “stays_in_week_nights” The target column according to this dataset is : “is_cancelled” which is binary 1 for YES and 0 for NO.

#Dataset Statistics

summary(hotel_data)

Error in summary(hotel_data): object 'hotel_data' not found

Summary provides the statistics for each column.As, we can see this dataset is well-defined in terms of rows and columns, thus, they need no pivoting.

unique(hotel_data$hotel)

Error in unique(hotel_data$hotel): object 'hotel_data' not found

unique(hotel_data$deposit_type)

Error in unique(hotel_data$deposit_type): object 'hotel_data' not found

unique(hotel_data$customer_type)

Error in unique(hotel_data$customer_type): object 'hotel_data' not found

There are two types of hotels : “Resort Hotel”, “City Hotel”, three types of Deposits : “No Deposit” ,“Refundable” ,“Non Refund”, and four types of customers in this dataset : “Transient”,“Contract”,“Transient-Party”,“Group”.

We will look at the frequency for categorical columns.

hotel_data%>%
  group_by(customer_type)%>%
  summarise(num=n())%>%
  arrange(desc(num))

Error in group_by(., customer_type): object 'hotel_data' not found

hotel_data%>%
  group_by(market_segment)%>%
  summarise(num=n())%>%
  arrange(desc(num))

Error in group_by(., market_segment): object 'hotel_data' not found

hotel_data%>%
  group_by(hotel)%>%
  summarise(num=n())%>%
  arrange(desc(num))

Error in group_by(., hotel): object 'hotel_data' not found

hotel_data%>%
  group_by(distribution_channel)%>%
  summarise(num=n())%>%
  arrange(desc(num))

Error in group_by(., distribution_channel): object 'hotel_data' not found

#Dataset Cleaning Using the code below to see the null values

sapply(hotel_data, function(x) sum(is.na(x)))

Error in lapply(X = X, FUN = FUN, ...): object 'hotel_data' not found

As, we just have null values in one column i.e. Children. We can substitue those 4 values with the values in babies columns, as they mean the same or are giving same context.

n <- length(hotel_data$children)

Error in eval(expr, envir, enclos): object 'hotel_data' not found

for (i in 1:n) {
  if (is.na(hotel_data$children[i]))
    hotel_data$children[i] <- hotel_data$babies[i]
}

Error in 1:n: NA/NaN argument

Now, we will look at one column which does not clearly define itself : “adr”. It should be “average daily rate”. According to the summary, the maximum value is very high while the mean lies in 100s. We will look at how many records are actually in the high range.

hotel_data%>%
  filter(adr>2000)

Error in filter(., adr > 2000): object 'hotel_data' not found

hotel_data%>%
  filter(adr>1000)

Error in filter(., adr > 1000): object 'hotel_data' not found

We can see, there is only one record which can be an outlier, so we can replace the value by mean of the column “adr”.

hotel_data = hotel_data%>%
  mutate(adr = replace(adr, adr>1000, mean(adr)))

Error in mutate(., adr = replace(adr, adr > 1000, mean(adr))): object 'hotel_data' not found

#Data Analysis

ggplot(data = hotel_data, aes(x = hotel)) +
  geom_bar(stat = "count",fill='lightblue') +
  labs(title = "Booking by Hotel type",
       x = "Hotel type",
       y = "No. of bookings")

Error in ggplot(data = hotel_data, aes(x = hotel)): object 'hotel_data' not found

The visualization shows that City Hotels have higher booking than Resorts.

ggplot(data = hotel_data, aes(x = customer_type)) +
  geom_bar(stat = "count",fill='purple') +
  labs(title = "Booking by customer type",
       x = "Customer type",
       y = "No. of bookings")

Error in ggplot(data = hotel_data, aes(x = customer_type)): object 'hotel_data' not found

The visualization suggests that the Customers having maximum bookings are Transient, followed by Transient-Party, then Contract and lastly Groups.

ggplot(data = hotel_data, aes(x = market_segment)) +
  geom_bar(stat = "count",fill='lightblue') +
  labs(title = "Booking by Market Segment",
       x = "Market Segment",
       y = "No. of bookings")

Error in ggplot(data = hotel_data, aes(x = market_segment)): object 'hotel_data' not found

The visualization suggests that the Market Segment with highest booking is Online TA, while the ones with least bookings are in Aviation Segment ( If we ignore Undefined)

ggplot(data = hotel_data, aes(x = distribution_channel)) +
  geom_bar(stat = "count",fill='purple') +
  labs(title = "Booking by Distribution channel",
       x = "Distribution channel",
       y = "No. of bookings")

Error in ggplot(data = hotel_data, aes(x = distribution_channel)): object 'hotel_data' not found

The visualization suggests that the Distribution channel with maximum bookings is TA/TO while the one with least bookings is GDS.

#Visualizations

Below are visualizing deposit type, customer type and market segment for each Hotel type.

#plot distribution of deposit type
hotel_data %>% select(hotel, deposit_type) %>% 
                group_by( hotel, deposit_type) %>% 
                summarise(n = n()) %>% 
                ggplot(aes(reorder(deposit_type, n), n, fill = hotel)) + 
                geom_bar(stat = "identity", position = position_dodge()) + xlab("") + 
                ylab("# of bookings") + labs(fill = "Hotel type")

Error in select(., hotel, deposit_type): object 'hotel_data' not found

#plot distribution of deposit type
hotel_data %>% select(hotel, customer_type) %>% 
                group_by( hotel, customer_type) %>% 
                summarise(n = n()) %>% 
                ggplot(aes(reorder(customer_type, n), n, fill = hotel)) + 
                geom_bar(stat = "identity", position = position_dodge()) + xlab("") + 
                ylab("# of bookings") + labs(fill = "Hotel type")

Error in select(., hotel, customer_type): object 'hotel_data' not found

#plot distribution of deposit type
hotel_data %>% select(hotel, market_segment) %>% 
                group_by( hotel, market_segment) %>% 
                summarise(n = n()) %>% 
                ggplot(aes(reorder(market_segment, n), n, fill = hotel)) + 
                geom_bar(stat = "identity", position = position_dodge()) + xlab("") + 
                ylab("# of bookings") + labs(fill = "Hotel type")

Error in select(., hotel, market_segment): object 'hotel_data' not found

Below are the visualizations related to our Target Variable “is_cancelled” ( 1 for YES and 0 for NO) These visualizations show the distribution of cancelled bookings per hotel type and customer type.

#plot distribution of deposit type
hotel_data %>% select(is_canceled, hotel) %>% 
                group_by( is_canceled, hotel) %>% 
                summarise(n = n()) %>% 
                ggplot(aes(reorder(hotel, n), n, fill = is_canceled)) + 
                geom_bar(stat = "identity", position = position_dodge()) + xlab("") + 
                ylab("# of bookings") + labs(fill = "Hotel type")

Error in select(., is_canceled, hotel): object 'hotel_data' not found

#plot distribution of deposit type
hotel_data %>% select(is_canceled, customer_type) %>% 
                group_by( is_canceled, customer_type) %>% 
                summarise(n = n()) %>% 
                ggplot(aes(reorder(customer_type, n), n, fill = is_canceled)) + 
                geom_bar(stat = "identity", position = position_dodge()) + xlab("") + 
                ylab("# of bookings") + labs(fill = "Hotel type")

Error in select(., is_canceled, customer_type): object 'hotel_data' not found

#plot distribution of deposit type
hotel_data %>% select(is_canceled, is_repeated_guest) %>% 
                group_by( is_canceled, is_repeated_guest) %>% 
                summarise(n = n()) %>% 
                ggplot(aes(reorder(is_repeated_guest, n), n, fill = is_canceled)) + 
                geom_bar(stat = "identity", position = position_dodge()) + xlab("") + 
                ylab("# of bookings") + labs(fill = "Hotel type")

Error in select(., is_canceled, is_repeated_guest): object 'hotel_data' not found

We can summarize the following from the plots above: 1. The number of bookings for Resorts are half the number of bookings for the City Hotels. 2. Majority of the City Hotels have No Deposit for the booking, while there are 0 City Hotels asking for Refundable Deposits. 3. Majority of customers booking the City Hotels are Transient, followed by Transient-Party. 4. Most cancellations were made for City Hotels. 5. Most cancellations were made by Transient Customers. 6. Repeated guests have very low percentage of cancellations.

Now, we will look at the distribution of hotels per country

hotel_data%>%
  group_by(country)%>%
  summarise(num=n())%>%
  arrange(desc(num))

Error in group_by(., country): object 'hotel_data' not found

As, “is_canceled” is our target variable, let us see the cancelations per country.

hotel_data %>% 
  select(country, is_canceled) %>% 
  group_by(country) %>% 
  summarise_if(is.numeric, sum, na.rm = TRUE) %>% 
  arrange(desc(is_canceled)) %>% 
  head(n=10)

Error in select(., country, is_canceled): object 'hotel_data' not found

The maximum bookings are made in : PRT,GBR,FRA,ESP,DEU The maximum cancelations are made in : PRT,GBR,ESP,FRA,ITA

As seen, FRA has higher bookings than ESP, and the cancelations are less as compared the ESP. Thus, the hit rate i.e. the probability for booking to happen is high in FRA.

hotel_data %>% 
  select(country, stays_in_week_nights) %>% 
  group_by(country) %>% 
  summarise_if(is.numeric, mean, na.rm = TRUE) %>% 
  arrange(desc(stays_in_week_nights)) %>% 
  head(n=10)

Error in select(., country, stays_in_week_nights): object 'hotel_data' not found

hotel_data %>% 
  select(country, stays_in_weekend_nights) %>% 
  group_by(country) %>% 
  summarise_if(is.numeric, mean, na.rm = TRUE) %>% 
  arrange(desc(stays_in_weekend_nights)) %>% 
  head(n=10)

Error in select(., country, stays_in_weekend_nights): object 'hotel_data' not found

Also, we can see the stay in week nights and weekend nights follow almost similar pattern in terms of the countries, with FRO and SEN being the top 2.

The visualizations fail to answer: - The difference between Resort and City Hotels? - How does arrival date/month/year affect the bookings and the cancellations? - How is the “adr” field affecting the bookings? Do they depict something?