hw2
hotel_bookings.csv
Author

Tanmay Agrawal

Published

January 6, 2023

We first load the neccessary libraries

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   1.0.0 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(ggplot2)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE)

Now we load the data. I sorted the datasets by size and picked the largest one which is “hotel_bookings.csv”. I will use the read_csv() function as it’s the recommend way to load files.

Code
data = read_csv("_data/hotel_bookings.csv")
Rows: 119390 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
dbl  (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
date  (1): reservation_status_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
# We spec the columns in this dataset using the spec command
spec(data)
cols(
  hotel = col_character(),
  is_canceled = col_double(),
  lead_time = col_double(),
  arrival_date_year = col_double(),
  arrival_date_month = col_character(),
  arrival_date_week_number = col_double(),
  arrival_date_day_of_month = col_double(),
  stays_in_weekend_nights = col_double(),
  stays_in_week_nights = col_double(),
  adults = col_double(),
  children = col_double(),
  babies = col_double(),
  meal = col_character(),
  country = col_character(),
  market_segment = col_character(),
  distribution_channel = col_character(),
  is_repeated_guest = col_double(),
  previous_cancellations = col_double(),
  previous_bookings_not_canceled = col_double(),
  reserved_room_type = col_character(),
  assigned_room_type = col_character(),
  booking_changes = col_double(),
  deposit_type = col_character(),
  agent = col_character(),
  company = col_character(),
  days_in_waiting_list = col_double(),
  customer_type = col_character(),
  adr = col_double(),
  required_car_parking_spaces = col_double(),
  total_of_special_requests = col_double(),
  reservation_status = col_character(),
  reservation_status_date = col_date(format = "")
)

Using the spec command we get the descriptions of the data types of each column.

Here all the column names are followed by their data type. We see col_double which is a double precision number typically used for continuous values but can be used for rankings or categories described using numbers like the years in a date range or month in a year. There is also a col_character type which is a character data type used for non-numerical values (generally categorical) or col_date which is a reserved data type for dates.

We look at the first few rows of the data to get a visual outline of the dataset.

Code
head(data)

Cleaning the data

We observe that some columns like agent and company have “NULL” values, the challenge here is that these values are still character data types and have to be detected using different methods. We can simply look at each column and calculate the percentage of “NULL” values it has and get rid of the ones with more that 10% of these values.

Code
# we use the stringr package for this
library(stringr)

# get the null percentages
null_percentages <- sapply(data, function(x) sum(str_detect(x, "NULL"))/length(x))

# put it into a dataframe so that it's visually appealing

null_percentages_df <- data.frame(column = names(data), null_percentage = null_percentages)

null_percentages_df %>%
  arrange(desc(null_percentage))

We can get rid of the columns company and agent since they have a high percentage of “NULL” values.

Code
tidy_data = select(data, -company, -agent)
head(tidy_data)

Our data is cleaner now.

Potential research questions

We can see that it describes hotel bookings across different countries and records some interesting columns like lead_time, market_segment, distribution_channel, is_repeated_guest, previous_cancellations, booking_changes which could be used as predictive signals to determine the potential value of a customer. For instance, if a customer is making too many changes, have a history of cancelling, and from a particular distribution channel that might lead us to think that they are likely to make a last minute cancellation. This could be used to make predictions about availability and dynamic pricing of rooms – akin to what the airline industry does.