Challenge 9 Solution - Functions

challenge_9
Linus Jen
wildbirds
Creating a function
Author

Linus Jen

Published

July 4, 2023

library(tidyverse)  # attaches dplyr, readr, and ggplot2, among others
library(here)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

For this challenge, I will create a function that quickly summarizes a single numeric column with seven statistics: the five-number summary (minimum, 1st quartile, median, 3rd quartile, maximum) plus the mean and standard deviation. The function also takes an optional group_by_var argument, in case the user wants summaries for each level of a specific category.

numeric_summary <- function(dataframe, num_var, group_by_var) {
  # Summarise one numeric column of a dataframe: min, Q1, mean, median, Q3, max, and st. dev.
  # group_by_var is optional: pass one column, or several wrapped in c(), to summarise per group
  dataframe %>%
    group_by(pick({{ group_by_var }})) %>%
    # "stat_{{ num_var }}" := ... names each output column after the summarised column
    summarise("min_{{ num_var }}" := min({{ num_var }}, na.rm = TRUE),
              "q1_{{ num_var }}" := quantile({{ num_var }}, 0.25, na.rm = TRUE),
              "mean_{{ num_var }}" := round(mean({{ num_var }}, na.rm = TRUE), 4),
              "med_{{ num_var }}" := median({{ num_var }}, na.rm = TRUE),
              "q3_{{ num_var }}" := quantile({{ num_var }}, 0.75, na.rm = TRUE),
              "max_{{ num_var }}" := max({{ num_var }}, na.rm = TRUE),
              "sd_{{ num_var }}" := sd({{ num_var }}, na.rm = TRUE))
}

My function, numeric_summary, accepts three arguments: the dataframe, the numeric variable to summarize, and an optional grouping variable (or several) for when the user wants summaries by category. It returns the minimum, 1st quartile, mean, median, 3rd quartile, maximum, and standard deviation of the numeric variable. I chose to write this function because, while working on my project, I found it really annoying to copy and paste the same code over and over again, and figured it would be worthwhile to wrap that code in a function.
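To make the two tidy-evaluation pieces in numeric_summary easier to see, here is a minimal sketch on made-up data. The toy tibble and the stripped-down mean_by helper are hypothetical and exist only for illustration; the sketch assumes dplyr 1.1.0 or later, where pick() is available. Embracing an argument with {{ }} lets the caller pass a bare column name, and the "mean_{{ num_var }}" := form builds the output column name from that same argument.

```r
library(dplyr)

# Hypothetical toy data (not taken from birds.csv)
toy <- tibble(
  grp = c("a", "a", "b", "b"),
  x   = c(1, 3, 10, NA)
)

# Hypothetical one-statistic version of the same pattern
mean_by <- function(dataframe, num_var, group_by_var) {
  dataframe %>%
    # {{ }} forwards the caller's bare column name into pick()
    group_by(pick({{ group_by_var }})) %>%
    # The string on the left of := is a glue template, so the
    # output column is named after whatever num_var refers to
    summarise("mean_{{ num_var }}" := mean({{ num_var }}, na.rm = TRUE))
}

res <- mean_by(toy, num_var = x, group_by_var = grp)
res
```

The result should have one row per value of grp and a summary column named mean_x. The NA in group b is dropped by na.rm = TRUE, so that group's mean is computed from the single remaining value.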

Dataset

To test this function, I used the birds.csv file, which is a collection of information about specific birds and their populations around the world at various points in time.

# Pull in data
birds <- read_csv(here("posts", "_data", "birds.csv"))

# View the data
birds
# A tibble: 30,977 × 14
   `Domain Code` Domain     `Area Code` Area  `Element Code` Element `Item Code`
   <chr>         <chr>            <dbl> <chr>          <dbl> <chr>         <dbl>
 1 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 2 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 3 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 4 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 5 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 6 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 7 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 8 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 9 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
10 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
# ℹ 30,967 more rows
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
#   Value <dbl>, Flag <chr>, `Flag Description` <chr>

First, let’s see how our function performs with only a numeric variable.

# Test this out without a group_by
birds %>% numeric_summary(num_var = Value)
# A tibble: 1 × 7
  min_Value q1_Value mean_Value med_Value q3_Value max_Value sd_Value
      <dbl>    <dbl>      <dbl>     <dbl>    <dbl>     <dbl>    <dbl>
1         0      171     99411.      1800    15404  23707134  720611.

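The function works as expected! It gives us a quick glance at the distribution of the Value column in this dataset. For example, the mean (about 99,411) sits far above the median (1,800), which points to a strongly right-skewed distribution.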
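The quartiles in the table come from quantile() with its default settings, which interpolate between neighbouring observations (R's default, type 7). As a rough sketch of what that means, the snippet below reproduces the third quartile by hand on a small made-up vector; the values in x are hypothetical and chosen only so the arithmetic is easy to follow.

```r
# Hypothetical values (not taken from birds.csv), including one NA
x <- c(2, 4, 4, 5, 7, 9, NA)

# What the function relies on: default (type 7) quantiles, ignoring NA
q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)

# Manual type-7 calculation of Q3 on the sorted, non-missing values
s  <- sort(x)                       # sort() drops NA by default
h  <- (length(s) - 1) * 0.75 + 1    # fractional position in the sorted vector
lo <- floor(h)
q3_manual <- s[lo] + (h - lo) * (s[lo + 1] - s[lo])

q
q3_manual
```

Because the position h usually falls between two observations, the reported quartile can be a value that never appears in the data, which is why some quartiles in the summaries have decimals even though the raw counts are whole numbers.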

Next, let’s see how this function performs when given one or more variables to group by.

# Test this with a group_by
birds %>% numeric_summary(num_var = Value, group_by_var = Area)
# A tibble: 248 × 8
   Area      min_Value q1_Value mean_Value med_Value q3_Value max_Value sd_Value
   <chr>         <dbl>    <dbl>      <dbl>     <dbl>    <dbl>     <dbl>    <dbl>
 1 Afghanis…      4700   6222.      8099.     6700      10565     14414   2.82e3
 2 Africa         1213   5926.    196561.    12910.     25395   1897326   4.36e5
 3 Albania         200    520       2278.     1300       3752      9494   2.27e3
 4 Algeria          10     24      17621.       42.5     2072    136078   3.88e4
 5 American…        24     35         41.4      38         40        80   1.40e1
 6 Americas        553   7497     856356.    66924.    552649   5796289   1.54e6
 7 Angola         3400   4925       9453.     6075       6930     36500   8.93e3
 8 Antigua …        43     60.2       93.6      85        130       160   3.97e1
 9 Argentina        75    530      18844.     2355      10350    118300   3.36e4
10 Armenia         120    209.      2062.     1528.      3865      8934   2.04e3
# ℹ 238 more rows

# Add in another variable
birds %>% numeric_summary(num_var = Value, group_by_var = c(Area, Item))
# A tibble: 601 × 9
# Groups:   Area [248]
   Area        Item   min_Value q1_Value mean_Value med_Value q3_Value max_Value
   <chr>       <chr>      <dbl>    <dbl>      <dbl>     <dbl>    <dbl>     <dbl>
 1 Afghanistan Chick…      4700    6222.      8099.     6700    10565      14414
 2 Africa      Chick…    274201  470654.    936779.   865156. 1331023.   1897326
 3 Africa      Ducks       6231    8126      13639.    12557    17740.     25428
 4 Africa      Geese…      3882    4777.     12164.     8192.   18207.     29158
 5 Africa      Pigeo…      2168    4650      11222.     9946.   13308.     36963
 6 Africa      Turke…      1213    2156.      9004.     5496    11993      27341
 7 Albania     Chick…      1580    2412.      4055.     3820.    4939       9494
 8 Albania     Ducks        290     352.       558.      410.     732       1100
 9 Albania     Geese…       200     241        396.      278.     463.       800
10 Albania     Turke…       403     570        750.      674      879       1300
# ℹ 591 more rows
# ℹ 1 more variable: sd_Value <dbl>

The tables above show how flexible functions can be. Given any number of grouping variables (wrapped in c() when there is more than one), the function groups by all of them and produces the same numeric summaries for each group.
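One detail in the two-variable output is the "# Groups: Area [248]" line: by default, summarise() peels off only the last grouping variable, so the result is still grouped by Area. If an ungrouped result is preferred, one option is to pass .groups = "drop" to summarise(). The sketch below illustrates this on made-up data; the toy tibble, the max_by helper, and its groups_out parameter are hypothetical names introduced for this example, not part of numeric_summary.

```r
library(dplyr)

# Hypothetical toy data (not taken from birds.csv)
toy <- tibble(
  region = c("north", "north", "south", "south"),
  item   = c("ducks", "geese", "ducks", "ducks"),
  value  = c(10, 20, 30, 50)
)

# Hypothetical one-statistic helper; groups_out is forwarded to
# summarise()'s .groups argument
max_by <- function(dataframe, num_var, group_by_var, groups_out = "drop_last") {
  dataframe %>%
    group_by(pick({{ group_by_var }})) %>%
    summarise("max_{{ num_var }}" := max({{ num_var }}, na.rm = TRUE),
              .groups = groups_out)
}

kept    <- max_by(toy, value, c(region, item))
dropped <- max_by(toy, value, c(region, item), groups_out = "drop")

group_vars(kept)     # still grouped by the first variable
group_vars(dropped)  # no grouping left
```

Both calls return the same rows and values; the only difference is whether later dplyr verbs applied to the result would continue to operate per region.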