Challenge 10 Instructions

challenge_10

purrr

Author

Xinpeng Liu

Published

July 6, 2023

Challenge Overview

The purrr package is a powerful tool for functional programming. It allows the user to apply a single function across multiple objects. It can replace for loops with a more readable (and often faster) simple function call.

For example, we can draw n random samples from 10 different distributions using a vector of 10 means.

n <- 100 # sample size
m <- seq(1,10) # means 
samps <- map(m,rnorm,n=n)

We can then use map_dbl to verify that this worked correctly by computing the mean for each sample.

samps %>%
  map_dbl(mean)

 [1] 1.114790 2.016048 3.056087 3.906096 5.023842 5.968324 6.968416 8.094015
 [9] 8.871233 9.912330

purrr is tricky to learn (but beyond useful once you get a handle on it). Therefore, it’s imperative that you complete the purr and map readings before attempting this challenge.

The challenge

Use purrr with a function to perform some data science task. What this task is is up to you. It could involve computing summary statistics, reading in multiple datasets, running a random process multiple times, or anything else you might need to do in your work as a data analyst. You might consider using purrr with a function you wrote for challenge 9.

we choose - egg_tidy.csv ⭐⭐

calculate_mean <- function(column) {
    return(mean(column, na.rm = TRUE))
}
calculate_means <- function(data) {
    numeric_cols <- select_if(data, is.numeric)
    means <- map_dbl(numeric_cols, calculate_mean)
    return(means)
}
data <- read.csv("_data/eggs_tidy.csv")
means <- calculate_means(data)
print(means)

                  year       large_half_dozen            large_dozen 
             2008.5000               155.1656               254.1979 
extra_large_half_dozen      extra_large_dozen 
              164.2235               266.7958

Analysis

1. The calculate_mean function: This function simply takes a column of numbers (like a list of your grades) and finds the average (mean) value. It ignores any missing values, because we tell it to do so with na.rm = TRUE. Imagine you have 5 test scores: 80, 90, 85, NA, 95. The NA means you didn’t take one test. The average would then be (80+90+85+95)/4 = 87.5.
1. The calculate_means function: This function uses our earlier calculate_mean function, but it applies it to many columns at once. First, it identifies which columns are numeric (like finding out which of your subjects are graded). Then, it calculates the average grade for each subject.

Now, let’s understand our data:

1. Year: The mean here (2008.5) isn’t that meaningful because ‘year’ is not a typical numerical variable. It’s more like a category. But technically, this means the middle year in our data is 2008.
1. Large_half_dozen, Large_dozen, Extra_large_half_dozen, Extra_large_dozen: These are the average prices of different egg packages over all the years and months in the data. For instance, the average price of a half-dozen large eggs was around 155.16 units (maybe cents or dollars, depending on the data).

That’s it! We’ve used our functions to get a quick sense of the average prices in our egg data.

--- title: "Challenge 10 Instructions" author: "Xinpeng Liu" description: "purrr" date: "7/6/2023" format: html: toc: true code-copy: true code-tools: true categories: - challenge_10 --- ```{r} #| label: setup #| warning: false #| message: false #| include: false library(tidyverse) library(ggplot2) library(purrr) library(dplyr) knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) ``` ## Challenge Overview The [purrr](https://purrr.tidyverse.org/) package is a powerful tool for functional programming. It allows the user to apply a single function across multiple objects. It can replace for loops with a more readable (and often faster) simple function call. For example, we can draw `n` random samples from 10 different distributions using a vector of 10 means. ```{r} n <- 100 # sample size m <- seq(1,10) # means samps <- map(m,rnorm,n=n) ``` We can then use `map_dbl` to verify that this worked correctly by computing the mean for each sample. ```{r} samps %>% map_dbl(mean) ``` `purrr` is tricky to learn (but beyond useful once you get a handle on it). Therefore, it's imperative that you complete the `purr` and `map` readings before attempting this challenge. ## The challenge Use `purrr` with a function to perform *some* data science task. What this task is is up to you. It could involve computing summary statistics, reading in multiple datasets, running a random process multiple times, or anything else you might need to do in your work as a data analyst. You might consider using `purrr` with a function you wrote for challenge 9. we choose - egg_tidy.csv ⭐⭐ ```{r} calculate_mean <- function(column) { return(mean(column, na.rm = TRUE)) } calculate_means <- function(data) { numeric_cols <- select_if(data, is.numeric) means <- map_dbl(numeric_cols, calculate_mean) return(means) } data <- read.csv("_data/eggs_tidy.csv") means <- calculate_means(data) print(means) ``` ## Analysis - 1. The calculate_mean function: This function simply takes a column of numbers (like a list of your grades) and finds the average (mean) value. It ignores any missing values, because we tell it to do so with na.rm = TRUE. Imagine you have 5 test scores: 80, 90, 85, NA, 95. The NA means you didn't take one test. The average would then be (80+90+85+95)/4 = 87.5. - 2. The calculate_means function: This function uses our earlier calculate_mean function, but it applies it to many columns at once. First, it identifies which columns are numeric (like finding out which of your subjects are graded). Then, it calculates the average grade for each subject. Now, let's understand our data: - 1. `Year`: The mean here (2008.5) isn't that meaningful because 'year' is not a typical numerical variable. It's more like a category. But technically, this means the middle year in our data is 2008. - 2. `Large_half_dozen`, `Large_dozen`, `Extra_large_half_dozen`, `Extra_large_dozen`: These are the average prices of different egg packages over all the years and months in the data. For instance, the average price of a half-dozen large eggs was around 155.16 units (maybe cents or dollars, depending on the data). That's it! We've used our functions to get a quick sense of the average prices in our egg data.