purrr
Author

Audrey Bertin

Published

July 1, 2023

For this challenge, I’ll take the function I wrote in challenge 9, which calculates z-scores, and apply it multiple times.

A common use of z-scores is anomaly detection: we compare the most recent value in a sequence to all the values that came before it to see whether that value is an anomaly.
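
To make that comparison concrete, here is a minimal sketch in base R. The numbers are made up for illustration, and the cutoff of 3 is an assumed rule of thumb rather than anything this post prescribes:

```r
# Toy series (made-up numbers): four ordinary readings, then a spike.
readings <- c(10, 12, 11, 13, 30)

baseline <- head(readings, -1)  # everything except the last value
latest   <- tail(readings, 1)   # the most recent value

# Absolute z-score of the latest value relative to the baseline.
z <- abs((latest - mean(baseline)) / sd(baseline))

# Hypothetical cutoff; 3 is a common convention, not a value from this post.
threshold <- 3
is_anomaly <- z > threshold
is_anomaly
```

Here the last reading sits far from the baseline mean, so it is flagged. A different cutoff would make the check stricter or looser.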

We can use a built-in dataset for this, called airquality, which stores daily air quality measurements as a time series:

library(tidyverse)  # provides tibble(), %>%, and map_dfr(), all used below

data(airquality)
head(airquality)
   Ozone Solar.R  Wind  Temp Month   Day
   <int>   <int> <dbl> <int> <int> <int>
1     41     190   7.4    67     5     1
2     36     118   8.0    72     5     2
3     12     149  12.6    74     5     3
4     18     313  11.5    62     5     4
5     NA      NA  14.3    56     5     5
6     28      NA  14.9    66     5     6

Our original function looks as follows:

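z_score <- function(baseline, value) {
  # Summarize the baseline; names avoid shadowing base mean() and sd()
  baseline_mean <- mean(baseline)
  baseline_sd <- sd(baseline)

  # Absolute distance of `value` from the baseline mean, in standard deviations
  z <- abs((value - baseline_mean) / baseline_sd)

  results <- tibble(mean = baseline_mean, sd = baseline_sd, input_value = value, z_score = z)
  return(results)
}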

We can rewrite this so that it takes a single vector as input and determines the baseline and value itself. The baseline is every element except the last, and the value is the last element:

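z_score <- function(vec) {
  baseline <- vec %>% head(-1)  # all elements except the last
  value <- vec %>% tail(1)      # the most recent element

  # na.rm = TRUE skips missing values in the baseline
  baseline_mean <- mean(baseline, na.rm = TRUE)
  baseline_sd <- sd(baseline, na.rm = TRUE)
  z <- abs((value - baseline_mean) / baseline_sd)

  results <- tibble(baseline_mean = baseline_mean, baseline_sd = baseline_sd, most_recent_value = value, z_score = z)
  return(results)
}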
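
The na.rm = TRUE arguments matter because columns such as Ozone contain missing values. The short sketch below uses a made-up vector to show what happens with and without them. It also shows one case they do not cover: when the most recent value itself is missing, the z-score is NA as well.

```r
# Made-up vector with a missing value in the baseline.
x <- c(4, NA, 6, 8, 5)
baseline <- head(x, -1)  # 4, NA, 6, 8
latest <- tail(x, 1)     # 5

mean(baseline)                # NA, because one baseline value is missing
mean(baseline, na.rm = TRUE)  # 6
sd(baseline, na.rm = TRUE)    # 2

z <- abs((latest - mean(baseline, na.rm = TRUE)) / sd(baseline, na.rm = TRUE))
z                             # 0.5

# na.rm only cleans the baseline; a missing latest value still gives NA.
z_missing <- abs((NA_real_ - 6) / 2)
z_missing                     # NA
```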

Running this on a single column we get:

z_score(airquality$Temp)
baseline_mean  baseline_sd  most_recent_value   z_score
        <dbl>        <dbl>              <int>     <dbl>
     77.94737     9.462221                 68  1.051272

We can use purrr::map_dfr to compute this for multiple columns and bind the results row-wise into a single data frame:

cols = list(airquality$Ozone, airquality$Wind, airquality$Temp)

map_dfr(cols, z_score)
baseline_mean  baseline_sd  most_recent_value    z_score
        <dbl>        <dbl>              <dbl>      <dbl>
    42.321739    33.066798               20.0  0.6750499
     9.947368     3.532403               11.5  0.4395397
    77.947368     9.462221               68.0  1.0512720
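
The rows above follow the order of the input list (Ozone, Wind, Temp) but carry no label. One possible variation, sketched here rather than taken from the original challenge, is to name the list elements and pass map_dfr's .id argument so each row records its source column. The column name "variable" is an arbitrary choice, and z_score is repeated (without the pipe) only so the block runs on its own:

```r
library(purrr)
library(tibble)

data(airquality)

# Same logic as the vector-based z_score() in this post.
z_score <- function(vec) {
  baseline <- head(vec, -1)
  value <- tail(vec, 1)
  baseline_mean <- mean(baseline, na.rm = TRUE)
  baseline_sd <- sd(baseline, na.rm = TRUE)
  tibble(
    baseline_mean = baseline_mean,
    baseline_sd = baseline_sd,
    most_recent_value = value,
    z_score = abs((value - baseline_mean) / baseline_sd)
  )
}

# Naming the list elements lets .id copy those names into a new column.
cols <- list(
  Ozone = airquality$Ozone,
  Wind = airquality$Wind,
  Temp = airquality$Temp
)

labelled <- map_dfr(cols, z_score, .id = "variable")
labelled
```

The result has the same three rows as before, plus a leading variable column identifying which series each z-score belongs to.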