library(tidyverse)
library(ggplot2)
library(here)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Challenge 9 Solution - Functions
Challenge Overview
For this challenge, I will create a function that quickly summarizes a single numeric column with a five-number summary plus the mean and standard deviation. The function also takes an optional group_by argument in case the user wants summaries for each level of a specific category.
numeric_summary <- function(dataframe, num_var, group_by_var) {
  # Given the column of a dataframe, find the min, Q1, mean, median, Q3, max, and st. dev. for that column
  # Optional: allow for group_bys (group_by_var may be omitted)
  dataframe %>%
    group_by(pick({{ group_by_var }})) %>%
    summarise("min_{{ num_var }}" := min({{ num_var }}, na.rm = TRUE),
              "q1_{{ num_var }}" := quantile({{ num_var }}, 0.25, na.rm = TRUE),
              "mean_{{ num_var }}" := round(mean({{ num_var }}, na.rm = TRUE), 4),
              "med_{{ num_var }}" := median({{ num_var }}, na.rm = TRUE),
              "q3_{{ num_var }}" := quantile({{ num_var }}, 0.75, na.rm = TRUE),
              "max_{{ num_var }}" := max({{ num_var }}, na.rm = TRUE),
              "sd_{{ num_var }}" := sd({{ num_var }}, na.rm = TRUE))
}
The numeric_summary function accepts three arguments: the dataframe, a numeric variable to summarize, and an optional group-by variable for users who want summaries by category. It returns the minimum, 1st quartile, mean, median, 3rd quartile, maximum, and standard deviation of the numeric variable. I chose to write this function because, while working on my project, I found it tedious to copy and paste the same summary code over and over, and figured it would be interesting to quickly wrap it in a function.
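The pattern the function relies on can be seen in isolation with a smaller example. The sketch below is only an illustration: `mean_by` is a hypothetical helper that is not part of the original solution, and it uses the built-in `mtcars` data instead of the challenge dataset. It assumes dplyr 1.1.0 or later, since it uses `pick()`.

```r
library(dplyr)

# Hypothetical helper (not from the original post): same pattern, one statistic.
mean_by <- function(dataframe, num_var, group_by_var) {
  dataframe %>%
    # {{ }} forwards the unquoted column name(s) the caller typed
    group_by(pick({{ group_by_var }})) %>%
    # "mean_{{ num_var }}" := ... builds the output column name from the argument
    summarise("mean_{{ num_var }}" := mean({{ num_var }}, na.rm = TRUE))
}

overall <- mean_by(mtcars, mpg)       # no grouping variable supplied
by_cyl  <- mean_by(mtcars, mpg, cyl)  # one row per value of cyl

names(overall)
names(by_cyl)
```

The embrace operator passes the caller's column through to dplyr, and the `:=` operator is what allows the left-hand side to be a string that is filled in from the argument name.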
Dataset
To test this function, I used the birds.csv file, which is a collection of information about specific birds and their populations around the world at various points in time.
# Pull in data
birds <- read_csv(here("posts", "_data", "birds.csv"))
# View the data
birds
# A tibble: 30,977 × 14
`Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 QA Live Anim… 2 Afgh… 5112 Stocks 1057
2 QA Live Anim… 2 Afgh… 5112 Stocks 1057
3 QA Live Anim… 2 Afgh… 5112 Stocks 1057
4 QA Live Anim… 2 Afgh… 5112 Stocks 1057
5 QA Live Anim… 2 Afgh… 5112 Stocks 1057
6 QA Live Anim… 2 Afgh… 5112 Stocks 1057
7 QA Live Anim… 2 Afgh… 5112 Stocks 1057
8 QA Live Anim… 2 Afgh… 5112 Stocks 1057
9 QA Live Anim… 2 Afgh… 5112 Stocks 1057
10 QA Live Anim… 2 Afgh… 5112 Stocks 1057
# ℹ 30,967 more rows
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
# Value <dbl>, Flag <chr>, `Flag Description` <chr>
First, let’s see how our function performs with only a numeric variable.
# Test this out without a group_by
birds %>% numeric_summary(num_var = Value)
# A tibble: 1 × 7
min_Value q1_Value mean_Value med_Value q3_Value max_Value sd_Value
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 171 99411. 1800 15404 23707134 720611.
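The function works without any concerns! This now provides us a quick glance at the distribution of the Value column in this dataset. Here the mean (about 99,411) is far above the median (1,800), which points to a strongly right-skewed distribution.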
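One way to sanity-check a summary like this is to compare the same statistics against base R's `summary()`. The sketch below uses a small made-up vector `x` (an assumption for illustration, not data from birds.csv) and computes each statistic with the same base functions and `na.rm = TRUE` setting that the function above uses.

```r
# Made-up illustrative values, including one missing value
x <- c(1, 2, 3, 4, 100, NA)

# Same statistics as in the function, in the order summary() reports them
manual <- c(
  min  = min(x, na.rm = TRUE),
  q1   = unname(quantile(x, 0.25, na.rm = TRUE)),
  med  = median(x, na.rm = TRUE),
  mean = mean(x, na.rm = TRUE),
  q3   = unname(quantile(x, 0.75, na.rm = TRUE)),
  max  = max(x, na.rm = TRUE)
)

# summary() drops NAs itself and reports their count as a seventh element
base_summary <- summary(x)

# TRUE when the six shared statistics agree
agrees <- isTRUE(all.equal(unname(manual), as.numeric(base_summary)[1:6]))
agrees
```

Both approaches use R's default quantile definition, so the quartiles should match; `summary()` does not report the standard deviation, which is why the function above adds `sd()` separately.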
Next, let’s see how this function performs when given one or more variables to group by.
# Test this with a group_by
birds %>% numeric_summary(num_var = Value, group_by_var = Area)
# A tibble: 248 × 8
Area min_Value q1_Value mean_Value med_Value q3_Value max_Value sd_Value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanis… 4700 6222. 8099. 6700 10565 14414 2.82e3
2 Africa 1213 5926. 196561. 12910. 25395 1897326 4.36e5
3 Albania 200 520 2278. 1300 3752 9494 2.27e3
4 Algeria 10 24 17621. 42.5 2072 136078 3.88e4
5 American… 24 35 41.4 38 40 80 1.40e1
6 Americas 553 7497 856356. 66924. 552649 5796289 1.54e6
7 Angola 3400 4925 9453. 6075 6930 36500 8.93e3
8 Antigua … 43 60.2 93.6 85 130 160 3.97e1
9 Argentina 75 530 18844. 2355 10350 118300 3.36e4
10 Armenia 120 209. 2062. 1528. 3865 8934 2.04e3
# ℹ 238 more rows
# Add in another variable
birds %>% numeric_summary(num_var = Value, group_by_var = c(Area, Item))
# A tibble: 601 × 9
# Groups: Area [248]
Area Item min_Value q1_Value mean_Value med_Value q3_Value max_Value
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Chick… 4700 6222. 8099. 6700 10565 14414
2 Africa Chick… 274201 470654. 936779. 865156. 1331023. 1897326
3 Africa Ducks 6231 8126 13639. 12557 17740. 25428
4 Africa Geese… 3882 4777. 12164. 8192. 18207. 29158
5 Africa Pigeo… 2168 4650 11222. 9946. 13308. 36963
6 Africa Turke… 1213 2156. 9004. 5496 11993 27341
7 Albania Chick… 1580 2412. 4055. 3820. 4939 9494
8 Albania Ducks 290 352. 558. 410. 732 1100
9 Albania Geese… 200 241 396. 278. 463. 800
10 Albania Turke… 403 570 750. 674 879 1300
# ℹ 591 more rows
# ℹ 1 more variable: sd_Value <dbl>
The tables above show how flexible functions can be. Given any number of grouping variables, the function groups by all of them and produces the same numeric summaries for each group.
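The "# Groups: Area [248]" line in the last output reflects dplyr's default behavior: when a data frame grouped by several variables is summarised, only the last grouping level is dropped. The sketch below illustrates this on the built-in `mtcars` data (an assumption for illustration, not the birds data) and shows `ungroup()` as one option for getting a flat table.

```r
library(dplyr)

# Group by two columns, then summarise; "drop_last" is dplyr's default
# for this case, written out here to make the behavior explicit.
grouped <- mtcars %>%
  group_by(pick(c(cyl, gear))) %>%
  summarise(mean_mpg = mean(mpg), .groups = "drop_last")

# Still grouped by the first variable only
remaining_groups <- group_vars(grouped)

# Remove the leftover grouping so later verbs treat the table as a whole
flat <- ungroup(grouped)
remaining_after <- group_vars(flat)

remaining_groups
remaining_after
```

Whether to ungroup depends on what comes next: leaving the grouping in place is convenient for a further per-group step, while ungrouping avoids surprises in later calculations.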