hw2
hotel_bookings.csv
Author

Siddharth Goel

Published

January 16, 2023

Importing the libraries

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(ggplot2)
library(dplyr)

Reading the dataset

Code
data = read_csv("_data/hotel_bookings.csv")
Rows: 119390 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
dbl  (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
date  (1): reservation_status_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Taking an initial look into the data

Code
# looking at the schema
spec(data)
cols(
  hotel = col_character(),
  is_canceled = col_double(),
  lead_time = col_double(),
  arrival_date_year = col_double(),
  arrival_date_month = col_character(),
  arrival_date_week_number = col_double(),
  arrival_date_day_of_month = col_double(),
  stays_in_weekend_nights = col_double(),
  stays_in_week_nights = col_double(),
  adults = col_double(),
  children = col_double(),
  babies = col_double(),
  meal = col_character(),
  country = col_character(),
  market_segment = col_character(),
  distribution_channel = col_character(),
  is_repeated_guest = col_double(),
  previous_cancellations = col_double(),
  previous_bookings_not_canceled = col_double(),
  reserved_room_type = col_character(),
  assigned_room_type = col_character(),
  booking_changes = col_double(),
  deposit_type = col_character(),
  agent = col_character(),
  company = col_character(),
  days_in_waiting_list = col_double(),
  customer_type = col_character(),
  adr = col_double(),
  required_car_parking_spaces = col_double(),
  total_of_special_requests = col_double(),
  reservation_status = col_character(),
  reservation_status_date = col_date(format = "")
)
Code
# looking at the data values
head(data)

We got the descriptors of each column by using the spec method. The complete schema of the data was shown in this command. The noticeable points were the varying data types of the columns.
Columns like hotel and meal have the data type col_character which indicates a text data type. lead_time is of the type col_double which is used for numerical values and the column reservation_status_date is of the type col_date which is used for date type values.
We can also see that the dataset has 119390 total rows and 32 columns.

Cleaning the data

By looking at the top values in the dataset, character type columns with values such as NULL can be noticed. Also, 0 valued numerical columns can be seen in the dataset as well. It will be great to find out the percentage of these values in the columns and remove these columns if the percentages are high.

Code
# get zero percentage in data
zero_percent <- (colSums(data == 0) / nrow(data)) * 100
# get null percent in data
null_percent <- sapply(data, function(x) sum(str_detect(x, "NULL")) / length(x))

aggregated_df <- data.frame(null_percent = null_percent, zero_percent = zero_percent)

arrange(aggregated_df, desc(null_percent), desc(zero_percent))

As we can see from the stats above, it is safe to remove columns company and babies due to the high percentage of insignificant "NULL" and 0 values.

Code
filtered_data = select(data, -company, -babies)
spec(filtered_data)
NULL

As we can see now, the 2 columns have been removed from the data and the data is ready for interpretations . ## Research opportunities within the data

Following are the columns and the corresponding research questions in the data:

  • children and adults: These columns will help determine the ratio between adults and children. If the ratio points out that more children stay in the hotel, then those hotels can be made more kids friendly to increase ratings
  • stays_in_week_nights and stays_in_weekend_nights: These columns can be used to calculate the footfall of customers during weekdays and weekends and together with the customer_type column, we can find out the demographic which tye of customers come in during what time of the week. This can be used to determine pricing
  • meal: Meal column can be used to anticipate the type of meals the guests prefer
  • is_repeated_guest: This column can be used to provide special offers to repeat guests
  • previous_cancellations and previous_bookings_not_canceled: These columns can be used to anticipate which type of guests are more likely to cancel their booking_changes
  • agent: This column can be used to calculate agent performance for bonus and commission related purposes