library(tidyverse)
library(ggplot2)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Challenge 9
Challenge Overview
Today’s challenge is simple. Create a function, and use it to perform a data analysis / cleaning / visualization task:
Examples of such functions are:
1) A function that reads in and cleans a dataset.
2) A function that computes summary statistics (e.g., computes the z score for a variable).
3) A function that plots a histogram.
That’s it!
clean_data <- function(file_path, drop_columns = NULL) {
  # Read the CSV file
  data <- read_csv(file_path)

  # Replace NA values in the 'children' column with 0
  data$children[is.na(data$children)] <- 0

  # Combine year, month, and day of month into a single date column,
  # then remove the original date-related columns
  data <- data %>%
    mutate(arrival_date = as.Date(paste(arrival_date_year, arrival_date_month, arrival_date_day_of_month, sep = "-"), format = "%Y-%B-%d")) %>%
    select(-arrival_date_year, -arrival_date_month, -arrival_date_week_number, -arrival_date_day_of_month)

  # Convert the variables 'is_canceled' and 'is_repeated_guest' to logical (boolean) values
  data <- data %>%
    mutate(is_canceled = as.logical(is_canceled), is_repeated_guest = as.logical(is_repeated_guest))

  # Convert categorical variables into factors
  cat_cols <- c('hotel', 'meal', 'country', 'market_segment', 'distribution_channel', 'reserved_room_type',
                'assigned_room_type', 'deposit_type', 'customer_type', 'reservation_status')
  data[cat_cols] <- lapply(data[cat_cols], factor)

  # Drop specified columns if provided
  if (!is.null(drop_columns)) {
    data <- data %>% select(-all_of(drop_columns))
  }

  return(data)
}
# Use the function to read and clean the data
hotel_bookings <- clean_data("_data/hotel_bookings.csv", drop_columns = "reservation_status_date")
head(hotel_bookings)
This script defines a function, clean_data, for loading and preprocessing a dataset from a CSV file. The function takes as input the file path and an optional vector of column names to drop. It reads the file, replaces missing values in the ‘children’ column with 0, combines the date-related columns into a single ‘arrival_date’ column, converts ‘is_canceled’ and ‘is_repeated_guest’ to logical type, and transforms the categorical variables into factors. If column names to drop are supplied, it removes those columns as well. The cleaned data is then returned. The script applies this function to the ‘hotel_bookings.csv’ file and displays the first few rows of the cleaned data.
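The cleaning steps described above can be tried on a tiny in-memory data frame without reading a file. This is only a sketch: the `toy` data frame and its three rows are made up for illustration and are not the hotel bookings data, and `%B` matches full month names in the current locale, so the date step assumes an English locale.

```r
library(dplyr)

# Hypothetical three-row stand-in for the hotel bookings data
toy <- data.frame(
  arrival_date_year = c(2015, 2015, 2016),
  arrival_date_month = c("July", "August", "January"),
  arrival_date_day_of_month = c(1, 15, 31),
  children = c(0, NA, 2),
  is_canceled = c(0, 1, 0)
)

toy_clean <- toy %>%
  mutate(
    # Replace missing child counts with 0
    children = ifelse(is.na(children), 0, children),
    # Build one Date column from the year, month name, and day
    arrival_date = as.Date(
      paste(arrival_date_year, arrival_date_month, arrival_date_day_of_month, sep = "-"),
      format = "%Y-%B-%d"
    ),
    # 0/1 indicators become FALSE/TRUE
    is_canceled = as.logical(is_canceled)
  ) %>%
  # Drop the original date parts; 'arrival_date' itself has no trailing underscore, so it is kept
  select(-starts_with("arrival_date_"))

str(toy_clean)
```

If the month names do not match the session locale, `as.Date()` returns NA for those rows rather than raising an error, so it is worth checking the new column for unexpected missing values.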
calculate_z_score <- function(data, column_name) {
  # Check if the column is numeric
  if (!is.numeric(data[[column_name]])) {
    message(paste0("The column '", column_name, "' is not numeric. Please choose a numeric column."))
    return(data) # return the original data frame without changes
  }

  # Calculate the mean of the column
  mean_value <- mean(data[[column_name]], na.rm = TRUE)

  # Calculate the standard deviation of the column
  sd_value <- sd(data[[column_name]], na.rm = TRUE)

  # Calculate the z scores
  z_scores <- (data[[column_name]] - mean_value) / sd_value

  # Create a new data frame with the z scores
  data_z_scores <- data %>%
    select(!!column_name) %>% # select the column for which z score is calculated
    mutate(!!paste0(column_name, "_z_score") := z_scores) # add the z scores as a new column

  return(data_z_scores)
}

lead_time_z_scores <- calculate_z_score(hotel_bookings, "lead_time")
lead_time_z_scores

hotel_bookings <- calculate_z_score(hotel_bookings, "hotel")
This script defines a function named calculate_z_score that calculates the z-score of a numeric column in a data frame. The function takes as input a data frame and a column name. It first checks if the column is numeric. If not, it prints a message and returns the original data frame without changes. If the column is numeric, it calculates the mean and standard deviation of the column, and then calculates the z-scores. It then creates a new data frame containing the original column and a new column with the z-scores.
The script then applies this function to the ‘lead_time’ column of the ‘hotel_bookings’ data frame and assigns the resulting data frame to ‘lead_time_z_scores’.
Lastly, the script tries to apply the calculate_z_score function to the ‘hotel’ column of the ‘hotel_bookings’ data frame. However, as ‘hotel’ is not a numeric column, the function will print a message stating “The column ‘hotel’ is not numeric. Please choose a numeric column.” and will return the original ‘hotel_bookings’ data frame without changes.
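The z-score arithmetic itself can be checked on a small vector. The sketch below uses made-up numbers (not the ‘lead_time’ data) and compares the manual calculation with base R’s `scale()`, which by default centers on the mean and divides by the sample standard deviation, the same quantities that `mean()` and `sd()` return.

```r
# Small made-up numeric vector; its mean is 5
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Manual z-scores: subtract the mean, then divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix, so flatten it to a plain vector
z_scaled <- as.numeric(scale(x))

# The two approaches should agree up to floating-point tolerance
all.equal(z_manual, z_scaled)

# Standardized values have mean ~0 and standard deviation ~1
mean(z_manual)
sd(z_manual)
```

One caveat worth keeping in mind: if every value in the column is identical, `sd()` is 0 and the division produces NaN values, so a constant column may need separate handling.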
plot_histogram <- function(data, column_name) {
  # Check if the column is numeric
  if (!is.numeric(data[[column_name]])) {
    message(paste0("The column '", column_name, "' is not numeric. Please choose a numeric column."))
    return(NULL) # return NULL because a histogram can't be plotted for non-numeric data
  }

  # Create a histogram of the column
  # (aes_string() is deprecated in current ggplot2; the .data pronoun looks the column up by name)
  plot <- ggplot(data, aes(x = .data[[column_name]])) +
    geom_histogram(binwidth = 30, fill = "#69b3a2", color = "#e9ecef", alpha = 0.9) + # change binwidth as needed
    theme_minimal() +
    labs(title = paste0("Histogram of ", column_name),
         x = column_name,
         y = "Count")

  return(plot)
}

plot_histogram(hotel_bookings, "lead_time")
This code defines plot_histogram, a function that generates a histogram for a given numeric column of a data frame. If the column isn’t numeric, the function prints a message asking for a numeric column and returns NULL. Otherwise it builds the histogram with ggplot2’s geom_histogram function, using a fixed bin width of 30 and the specified fill and outline colors, and returns the plot. Finally, the function is used to create a histogram for the ‘lead_time’ column of the ‘hotel_bookings’ data frame.