HW2

Homework 2

Author

Prajakti Kapade

Published

October 26, 2022

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(ggplot2)

Warning: package 'ggplot2' was built under R version 4.2.2

#Read Dataset We will load the hotel_bookings dataset.

hotel_data <- read_csv('_data/hotel_bookings.csv')

Error in read_csv("_data/hotel_bookings.csv"): could not find function "read_csv"

hotel_data

Error in eval(expr, envir, enclos): object 'hotel_data' not found

colnames(hotel_data)

Error in is.data.frame(x): object 'hotel_data' not found

The dataset contains 32 columns and 119,390 observations. The columns are shown above. Some of the categorical columns are : hotel, deposit_type, company, customer_type The columns related to time are : “lead_time” ,“arrival_date_year” , “arrival_date_month” “arrival_date_week_number” ,“arrival_date_day_of_month” ,“stays_in_weekend_nights”, “stays_in_week_nights” The target column according to this dataset is : “is_cancelled”

#Summary of the dataset

summary(hotel_data)

Error in summary(hotel_data): object 'hotel_data' not found

As, we can see this dataset is well-defined in terms of rows and columns, thus, they need no pivoting.

unique(hotel_data$hotel)

Error in unique(hotel_data$hotel): object 'hotel_data' not found

unique(hotel_data$deposit_type)

Error in unique(hotel_data$deposit_type): object 'hotel_data' not found

unique(hotel_data$customer_type)

Error in unique(hotel_data$customer_type): object 'hotel_data' not found

There are two types of hotels : “Resort Hotel”, “City Hotel”, three types of Deposits : “No Deposit” ,“Refundable” ,“Non Refund”, and four types of customers in this dataset : “Transient”,“Contract”,“Transient-Party”,“Group”.

#Dataset Cleaning Using the code below to see the null values

sapply(hotel_data, function(x) sum(is.na(x)))

Error in lapply(X = X, FUN = FUN, ...): object 'hotel_data' not found

As, we just have null values in one column i.e. Children. We can substitue those 4 values with the values in babies columns, as they mean the same or are giving same context.

n <- length(hotel_data$children)

Error in eval(expr, envir, enclos): object 'hotel_data' not found

for (i in 1:n) {
  if (is.na(hotel_data$children[i]))
    hotel_data$children[i] <- hotel_data$babies[i]
}

Error in 1:n: NA/NaN argument

Now, we will look at one column which does not clearly define itself : “adr”. It should be “average daily rate”. According to the summary, the maximum value is very high while the mean lies in 100s. We will look at how many records are actually in the high range.

hotel_data%>%
  filter(adr>2000)

Error in filter(., adr > 2000): object 'hotel_data' not found

hotel_data%>%
  filter(adr>1000)

Error in filter(., adr > 1000): object 'hotel_data' not found

We can see, there is only one record which can be an outlier, so we can replace the value by mean of the column “adr”.

hotel_data = hotel_data%>%
  mutate(adr = replace(adr, adr>1000, mean(adr)))

Error in mutate(., adr = replace(adr, adr > 1000, mean(adr))): object 'hotel_data' not found

#Dataset Analysis

ggplot(data = hotel_data, aes(x = hotel)) +
  geom_bar(stat = "count",fill='lightblue') +
  labs(title = "Booking by Hotel type",
       x = "Hotel type",
       y = "No. of bookings")

Error in ggplot(data = hotel_data, aes(x = hotel)): object 'hotel_data' not found

#plot distribution of deposit type
hotel_data %>% select(hotel, deposit_type) %>% 
                group_by( hotel, deposit_type) %>% 
                summarise(n = n()) %>% 
                ggplot(aes(reorder(deposit_type, n), n, fill = hotel)) + 
                geom_bar(stat = "identity", position = position_dodge()) + xlab("") + 
                ylab("# of bookings") + labs(fill = "Hotel type")

Error in select(., hotel, deposit_type): object 'hotel_data' not found

#plot distribution of deposit type
hotel_data %>% select(hotel, customer_type) %>% 
                group_by( hotel, customer_type) %>% 
                summarise(n = n()) %>% 
                ggplot(aes(reorder(customer_type, n), n, fill = hotel)) + 
                geom_bar(stat = "identity", position = position_dodge()) + xlab("") + 
                ylab("# of bookings") + labs(fill = "Hotel type")

Error in select(., hotel, customer_type): object 'hotel_data' not found

We can summarize the following from the plots above: 1. The number of bookings for Resorts are half the number of bookings for the City Hotels. 2. Majority of the City Hotels have No Deposit for the booking, while there are 0 City Hotels asking for Refundable Deposits. 3. Majority of customers booking the City Hotels are Transient, followed by Transient-Party.

Now, we will look at the distribution of hotels per country

hotel_data%>%
  group_by(country)%>%
  summarise(num=n())%>%
  arrange(desc(num))

Error in group_by(., country): object 'hotel_data' not found

As, “is_canceled” is our target variable, let us see the cancelations per country.

hotel_data %>% 
  select(country, is_canceled) %>% 
  group_by(country) %>% 
  summarise_if(is.numeric, sum, na.rm = TRUE) %>% 
  arrange(desc(is_canceled)) %>% 
  head(n=10)

Error in select(., country, is_canceled): object 'hotel_data' not found

The maximum bookings are made in : PRT,GBR,FRA,ESP,DEU The maximum cancelations are made in : PRT,GBR,ESP,FRA,ITA

As seen, FRA has higher bookings than ESP, and the cancelations are less as compared the ESP. Thus, the hit rate i.e. the probability for booking to happen is high in FRA.

hotel_data %>% 
  select(country, stays_in_week_nights) %>% 
  group_by(country) %>% 
  summarise_if(is.numeric, mean, na.rm = TRUE) %>% 
  arrange(desc(stays_in_week_nights)) %>% 
  head(n=10)

Error in select(., country, stays_in_week_nights): object 'hotel_data' not found

hotel_data %>% 
  select(country, stays_in_weekend_nights) %>% 
  group_by(country) %>% 
  summarise_if(is.numeric, mean, na.rm = TRUE) %>% 
  arrange(desc(stays_in_weekend_nights)) %>% 
  head(n=10)

Error in select(., country, stays_in_weekend_nights): object 'hotel_data' not found

Also, we can see the stay in week nights and weekend nights follow almost similar pattern in terms of the countries, with FRO and SEN being the top 2.

Research Question: 1. Does the cancellation of booking depend upon the dates of stays? 2. Do the number of adults and number of children affect the stay length? 3. Does the deposit type have any impact on the cancellation rate? 4. Does parking space affect the number of bookings made in the hotel? 5. How does the market segment affect the bookings?