library(tidyverse)
library(ggplot2)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Challenge 9 solution
Challenge Overview
Today’s challenge is simple: create a function and use it to perform a data analysis, cleaning, or visualization task.
Examples of such functions are:
1) A function that reads in and cleans a dataset.
2) A function that computes summary statistics (e.g., computes the z score for a variable).
3) A function that plots a histogram.
That’s it!
# Specify the file path
file_path <- "_data/eggs_tidy.csv"
# 1. Read and clean a dataset
read_and_clean <- function(file_path) {
  # Read the data (header = FALSE: every line, including any header line, is read as data)
  data <- read.csv(file_path, header = FALSE)

  # Set column names
  colnames(data) <- c('Month', 'Year', 'Val1', 'Val2', 'Val3', 'Val4')

  # Convert numeric columns to numeric data type (non-numeric entries become NA)
  data$Val1 <- as.numeric(as.character(data$Val1))
  data$Val2 <- as.numeric(as.character(data$Val2))
  data$Val3 <- as.numeric(as.character(data$Val3))
  data$Val4 <- as.numeric(as.character(data$Val4))

  # Handle potential non-numeric values: replace each NA with its column mean
  data$Val1[is.na(data$Val1)] <- mean(data$Val1, na.rm = TRUE)
  data$Val2[is.na(data$Val2)] <- mean(data$Val2, na.rm = TRUE)
  data$Val3[is.na(data$Val3)] <- mean(data$Val3, na.rm = TRUE)
  data$Val4[is.na(data$Val4)] <- mean(data$Val4, na.rm = TRUE)

  # Return the clean data
  return(data)
}
# Read and clean the data
clean_data <- read_and_clean(file_path)
# 2. Compute summary statistics (e.g., computes the z score for a variable)
compute_z_score <- function(data, column_name) {
  column <- data[[column_name]]
  z_scores <- (column - mean(column)) / sd(column)
  return(z_scores)
}
# Compute z-scores for 'Val1'
val1_z_scores <- compute_z_score(clean_data, 'Val1')
# 3. Plot a histogram
plot_histogram <- function(data, column_name) {
  hist(data[[column_name]], main = paste("Histogram of", column_name),
       xlab = column_name)
}
# Plot histogram for 'Val1'
plot_histogram(clean_data, 'Val1')
Let’s break this down. The code does three main things to analyze a dataset about egg prices.
First up, the read_and_clean function. Think of it like a bouncer at a club. It reads the guest list (our data), checks that everyone’s age (data type) is correct, and fills any gap (a value that could not be converted to a number) with the column average so that nothing is left missing.
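To see the cleaning step in isolation, here is a minimal sketch on a made-up vector. The values in `raw` are hypothetical and are not taken from the eggs dataset; the point is only to show how a text entry becomes NA on conversion and is then replaced by the mean of the remaining values.

```r
# Hypothetical raw column: numbers stored as text, plus one non-numeric entry
raw <- c("10", "20", "not_a_number", "30")

# as.numeric() turns the non-numeric entry into NA
# (suppressWarnings() hides the coercion warning R would otherwise print)
vals <- suppressWarnings(as.numeric(raw))

# Replace the NA with the mean of the values that did convert: (10 + 20 + 30) / 3
vals[is.na(vals)] <- mean(vals, na.rm = TRUE)

vals
# 10 20 20 30
```

Mean imputation keeps the row but invents a value for it, so whether this is preferable to dropping the row depends on the analysis.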
Next, the compute_z_score function. This is like a school teacher who marks your test and tells you how far you were from the class average, measured in standard deviations rather than raw points. But here, we’re looking at things like egg prices instead of test scores.
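As a small worked example, the z-score of a value x is (x - mean) / sd, where R's `sd()` is the sample standard deviation. The sketch below repeats the function from the solution and applies it to a hypothetical data frame `toy` whose `price` column is invented so that the arithmetic is easy to check by hand.

```r
# Same logic as compute_z_score in the solution
compute_z_score <- function(data, column_name) {
  column <- data[[column_name]]
  (column - mean(column)) / sd(column)
}

# Hypothetical toy data: mean is 4, sample standard deviation is 2
toy <- data.frame(price = c(2, 4, 6))

z <- compute_z_score(toy, "price")
z
# -1 0 1
```

A z-score of -1 means the value sits one standard deviation below the mean, 0 means it equals the mean, and 1 means one standard deviation above.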
Last, but not least, the plot_histogram function. This is like sorting all the students in your class into height ranges and drawing a bar for how many fall into each range. But instead of student heights, we’re looking at whatever variable we’re interested in - could be egg prices, could be something else.
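To make the binning idea concrete, the sketch below asks `hist()` for the bin counts without drawing anything (`plot = FALSE`). The `heights` vector and the 10 cm bin width are made-up values chosen for illustration only.

```r
# Hypothetical student heights in cm
heights <- c(151, 152, 158, 161, 163, 167, 171)

# Bin edges at 150, 160, 170, 180; plot = FALSE returns the histogram data only
h <- hist(heights, breaks = seq(150, 180, by = 10), plot = FALSE)

h$breaks
# 150 160 170 180
h$counts
# 3 3 1
```

Each count is the height of one bar: three students between 150 and 160, three between 160 and 170, and one between 170 and 180. Calling `hist()` without `plot = FALSE`, as plot_histogram does, draws those same bars.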