Challenge 10

challenge_10

purrr

Author

Henry Mitrano

Published

January 31, 2023

library(tidyverse)
library(ggplot2)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

The purrr package is a powerful tool for functional programming. It allows the user to apply a single function across multiple objects. It can replace for loops with a more readable (and often faster) simple function call.

For example, we can draw n random samples from 10 different distributions using a vector of 10 means.

n <- 100 # sample size
m <- seq(1,10) # means 
samps <- map(m,rnorm,n=n)

We can then use map_dbl to verify that this worked correctly by computing the mean for each sample.

samps %>%
  map_dbl(mean)

 [1]  0.8445498  2.0705627  2.7996149  4.0622924  4.9630348  6.0612032
 [7]  7.0852765  7.9385544  9.0805295 10.0614436

purrr is tricky to learn (but beyond useful once you get a handle on it). Therefore, it’s imperative that you complete the purr and map readings before attempting this challenge.

The challenge

Use purrr with a function to perform some data science task. What this task is is up to you. It could involve computing summary statistics, reading in multiple datasets, running a random process multiple times, or anything else you might need to do in your work as a data analyst. You might consider using purrr with a function you wrote for challenge 9.

library(purrr)


  get_summary_statistics <- function(x) {
    mean <- mean(x)
    median <- median(x)
    std_dev <- sd(x)
    min_val <- min(x)
    max_val <- max(x)
    result <- c(mean, median, std_dev, min_val, max_val)
    names(result) <- c("Mean", "Median", "Standard Deviation", "Minimum", "Maximum")
  return(result)
}
test_scores <- c(90, 85, 77, 69, 41, 71, 80, 65, 83, 29, 53, 95, 94, 90, 87, 71, 70, 65, 58, 59, 98, 90, 87, 76)
get_summary_statistics(test_scores)

              Mean             Median Standard Deviation            Minimum 
          74.29167           76.50000           17.48162           29.00000 
           Maximum 
          98.00000

remove_outliers <- function(x) {
  result <- keep(x, x >= 50)
  return(result)
}
updated_scores <- remove_outliers(test_scores) 

get_summary_statistics(updated_scores)

              Mean             Median Standard Deviation            Minimum 
          77.86364           78.50000           13.07231           53.00000 
           Maximum 
          98.00000

Above, we use purr and one of its key functions, “keep” to mainpulate a similar data object as we did in challenge 9 where we had to create a function. We are using that same function as we did in 9 to calculate the summary statistics of a group of test scores, but this time we are manipulating the test scores vector before we run the summary statistics function on it. We want to disregard all outliers in our data set, so we can assess the test on its meri without taking into consideration the really low scores some students who didn’t study got. We want to see if our test was fair to those that WERE adequately prepared, so we us the purrr keep function that’s used to manipulate lists to remove all test score values that were substantially worse than the mean- any test score below 50.

After, we run our summary statistics function on our updated data, and we see a more fair test- the mean and median scores went up and the standard dev went down!

--- title: "Challenge 10" author: "Henry Mitrano" description: "purrr" date: "1/31/2023" format: html: toc: true code-copy: true code-tools: true categories: - challenge_10 --- ```{r} #| label: setup #| warning: false #| message: false library(tidyverse) library(ggplot2) knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) ``` ## Challenge Overview The [purrr](https://purrr.tidyverse.org/) package is a powerful tool for functional programming. It allows the user to apply a single function across multiple objects. It can replace for loops with a more readable (and often faster) simple function call. For example, we can draw `n` random samples from 10 different distributions using a vector of 10 means. ```{r} n <- 100 # sample size m <- seq(1,10) # means samps <- map(m,rnorm,n=n) ``` We can then use `map_dbl` to verify that this worked correctly by computing the mean for each sample. ```{r} samps %>% map_dbl(mean) ``` `purrr` is tricky to learn (but beyond useful once you get a handle on it). Therefore, it's imperative that you complete the `purr` and `map` readings before attempting this challenge. ## The challenge Use `purrr` with a function to perform *some* data science task. What this task is is up to you. It could involve computing summary statistics, reading in multiple datasets, running a random process multiple times, or anything else you might need to do in your work as a data analyst. You might consider using `purrr` with a function you wrote for challenge 9. ```{r} library(purrr) get_summary_statistics <- function(x) { mean <- mean(x) median <- median(x) std_dev <- sd(x) min_val <- min(x) max_val <- max(x) result <- c(mean, median, std_dev, min_val, max_val) names(result) <- c("Mean", "Median", "Standard Deviation", "Minimum", "Maximum") return(result) } test_scores <- c(90, 85, 77, 69, 41, 71, 80, 65, 83, 29, 53, 95, 94, 90, 87, 71, 70, 65, 58, 59, 98, 90, 87, 76) get_summary_statistics(test_scores) remove_outliers <- function(x) { result <- keep(x, x >= 50) return(result) } updated_scores <- remove_outliers(test_scores) get_summary_statistics(updated_scores) ``` Above, we use purr and one of its key functions, "keep" to mainpulate a similar data object as we did in challenge 9 where we had to create a function. We are using that same function as we did in 9 to calculate the summary statistics of a group of test scores, but this time we are manipulating the test scores vector before we run the summary statistics function on it. We want to disregard all outliers in our data set, so we can assess the test on its meri without taking into consideration the really low scores some students who didn't study got. We want to see if our test was fair to those that WERE adequately prepared, so we us the purrr keep function that's used to manipulate lists to remove all test score values that were substantially worse than the mean- any test score below 50. After, we run our summary statistics function on our updated data, and we see a more fair test- the mean and median scores went up and the standard dev went down!