For this challenge, I will use the function I created in Challenge 9 with map from purrr to get summary statistics for several numeric variables at once. The previous function had an obvious shortcoming: it only returned summary statistics for one numeric variable per function call. Using map addresses this concern.
Function
numeric_summary <- function(dataframe, num_var) {
  # Given the column of a dataframe, find the min, Q1, median, mean, Q3, max, and st. dev. for that column
  # Change variable to match dplyr
  dataframe %>%
    summarise(min = min(get(num_var), na.rm=TRUE),
              #q1 = quantile(get(num_var), 0.25, na.rm=TRUE),
              mean = round(mean(get(num_var), na.rm=TRUE), 4),
              med = median(get(num_var), na.rm=TRUE),
              #q3 = quantile(get(num_var), 0.75, na.rm=TRUE),
              max_num_var = max(get(num_var), na.rm=TRUE),
              sd = sd(get(num_var), na.rm=TRUE))
}
My function numeric_summary accepts 2 arguments: the dataframe, and the name of a numeric variable to summarize over. This function gives the minimum, mean, median, maximum, and standard deviation of a numeric variable. The 1st and 3rd quartile calculations are commented out in the function body, so they do not appear in the output. I chose to make this function because while working on my project, I found it really annoying to copy/paste and reuse the same code over and over again, and figured it would be interesting to quickly create a function for this.
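As a rough sketch of what the full seven-number version might look like, the block below re-enables the two quartiles. It is not the function from the post: the name `numeric_summary_q`, the toy data frame `toy`, and the switch from `get()` to the `.data[[ ]]` pronoun are all assumptions made for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for the birds dataset (not from the post)
toy <- tibble(
  Value = c(10, 20, NA, 40, 50),
  Year  = c(1961, 1962, 1963, 1964, 1965)
)

# Illustrative variant of numeric_summary with the quartiles restored.
# .data[[num_var]] looks the column up by its string name inside summarise().
numeric_summary_q <- function(dataframe, num_var) {
  dataframe %>%
    summarise(
      min = min(.data[[num_var]], na.rm = TRUE),
      q1 = quantile(.data[[num_var]], 0.25, na.rm = TRUE, names = FALSE),
      mean = round(mean(.data[[num_var]], na.rm = TRUE), 4),
      med = median(.data[[num_var]], na.rm = TRUE),
      q3 = quantile(.data[[num_var]], 0.75, na.rm = TRUE, names = FALSE),
      max_num_var = max(.data[[num_var]], na.rm = TRUE),
      sd = sd(.data[[num_var]], na.rm = TRUE)
    )
}

# One row with seven summary columns for the Value column
value_stats <- numeric_summary_q(toy, "Value")
value_stats
```

Passing `names = FALSE` to `quantile()` drops the "25%" and "75%" labels so each quartile is stored as a plain number.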
One thing to note is that I removed the optional group_by argument. I spent way too long trying to add grouping as an optional third argument, but the pmap function kept returning odd errors. As a result, I decided to remove that ability from the function.
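For reference, here is one possible way an optional grouping argument could be sketched. This is not what the post ended up doing, and it sidesteps pmap entirely by using map with an anonymous function. The names `numeric_summary_by`, `group_var`, and the toy data frame are assumptions for illustration only.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data standing in for the birds dataset (not from the post)
toy <- tibble(
  Item  = c("Chickens", "Chickens", "Ducks", "Ducks"),
  Year  = c(1961, 1962, 1961, 1962),
  Value = c(10, 30, 5, NA)
)

# Illustrative variant: group_var defaults to NULL, meaning "no grouping"
numeric_summary_by <- function(dataframe, num_var, group_var = NULL) {
  # Only group when a grouping column name was actually supplied
  if (!is.null(group_var)) {
    dataframe <- dataframe %>% group_by(across(all_of(group_var)))
  }
  dataframe %>%
    summarise(
      min = min(.data[[num_var]], na.rm = TRUE),
      mean = round(mean(.data[[num_var]], na.rm = TRUE), 4),
      med = median(.data[[num_var]], na.rm = TRUE),
      max_num_var = max(.data[[num_var]], na.rm = TRUE),
      sd = sd(.data[[num_var]], na.rm = TRUE),
      .groups = "drop"
    )
}

# Without grouping: one row for the whole data frame
ungrouped <- numeric_summary_by(toy, "Value")

# With grouping: map over the column names and fix group_var for every call
grouped <- map(
  c("Value", "Year"),
  function(col) numeric_summary_by(toy, col, group_var = "Item")
)
```

Keeping the grouping column fixed inside the anonymous function means only one input varies, so plain map is enough and the three-way iteration that pmap would need is avoided.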
Application
Again, I will use the birds.csv file, which is a collection of information about specific birds and their populations around the world at various points in time.
# Pull in data
birds <- read_csv(here("posts", "_data", "birds.csv"))
# View the data
birds
# A tibble: 30,977 × 14
   `Domain Code` Domain     `Area Code` Area  `Element Code` Element `Item Code`
   <chr>         <chr>            <dbl> <chr>          <dbl> <chr>         <dbl>
 1 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 2 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 3 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 4 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 5 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 6 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 7 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 8 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 9 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
10 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
# ℹ 30,967 more rows
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
# Value <dbl>, Flag <chr>, `Flag Description` <chr>
# use dfSummary
print(summarytools::dfSummary(birds,
                              varnumbers = FALSE,
                              plain.ascii = FALSE,
                              style = "grid",
                              graph.magnif = 0.70,
                              valid.col = FALSE),
      method = "render",
      table.classes = "table-condensed")
Data Frame Summary
birds
Dimensions: 30977 x 14
Duplicates: 0

| Variable | Stats / Values | Freqs (% of Valid) | Missing |
|---|---|---|---|
| Domain Code [character] | 1. QA | 30977 (100.0%) | 0 (0.0%) |
| Domain [character] | 1. Live Animals | 30977 (100.0%) | 0 (0.0%) |
| Area Code [numeric] | Mean (sd) : 1201.7 (2099.4); min ≤ med ≤ max: 1 ≤ 156 ≤ 5504; IQR (CV) : 152 (1.7) | 248 distinct values | 0 (0.0%) |
| Area [character] | 1. Africa; 2. Asia; 3. Eastern Asia; 4. Egypt; 5. Europe; 6. France; 7. Greece; 8. Myanmar; 9. Northern Africa; 10. South-eastern Asia; [ 238 others ] | 290 (0.9%); 290 (0.9%); 290 (0.9%); 290 (0.9%); 290 (0.9%); 290 (0.9%); 290 (0.9%); 290 (0.9%); 290 (0.9%); 290 (0.9%); 28077 (90.6%) | 0 (0.0%) |
| Element Code [numeric] | 1 distinct value | 5112 : 30977 (100.0%) | 0 (0.0%) |
| Element [character] | 1. Stocks | 30977 (100.0%) | 0 (0.0%) |
| Item Code [numeric] | Mean (sd) : 1066.5 (9); min ≤ med ≤ max: 1057 ≤ 1068 ≤ 1083; IQR (CV) : 15 (0) | 1057 : 13074 (42.2%); 1068 : 6909 (22.3%); 1072 : 4136 (13.4%); 1079 : 5693 (18.4%); 1083 : 1165 (3.8%) | 0 (0.0%) |
| Item [character] | 1. Chickens; 2. Ducks; 3. Geese and guinea fowls; 4. Pigeons, other birds; 5. Turkeys | 13074 (42.2%); 6909 (22.3%); 4136 (13.4%); 1165 (3.8%); 5693 (18.4%) | 0 (0.0%) |
| Year Code [numeric] | Mean (sd) : 1990.6 (16.7); min ≤ med ≤ max: 1961 ≤ 1992 ≤ 2018; IQR (CV) : 29 (0) | 58 distinct values | 0 (0.0%) |
| Year [numeric] | Mean (sd) : 1990.6 (16.7); min ≤ med ≤ max: 1961 ≤ 1992 ≤ 2018; IQR (CV) : 29 (0) | 58 distinct values | 0 (0.0%) |
| Unit [character] | 1. 1000 Head | 30977 (100.0%) | 0 (0.0%) |
| Value [numeric] | Mean (sd) : 99410.6 (720611.4); min ≤ med ≤ max: 0 ≤ 1800 ≤ 23707134; IQR (CV) : 15233 (7.2) | 11495 distinct values | 1036 (3.3%) |
| Flag [character] | 1. *; 2. A; 3. F; 4. Im; 5. M | 1494 (7.4%); 6488 (32.1%); 10007 (49.5%); 1213 (6.0%); 1002 (5.0%) | 10773 (34.8%) |
| Flag Description [character] | 1. Aggregate, may include of; 2. Data not available; 3. FAO data based on imputat; 4. FAO estimate; 5. Official data; 6. Unofficial figure | 6488 (20.9%); 1002 (3.2%); 1213 (3.9%); 10007 (32.3%); 10773 (34.8%); 1494 (4.8%) | 0 (0.0%) |

Generated by summarytools 1.0.1 (R version 4.3.0) 2023-07-04
Given our dataset and function, let’s say we’re interested in seeing summary statistics for the Value, Year, and Item Code columns.
# Apply map
map2(list(birds, birds, birds), list("Value", "Year", "Item Code"), numeric_summary)
[[1]]
# A tibble: 1 × 5
min mean med max_num_var sd
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0 99411. 1800 23707134 720611.
[[2]]
# A tibble: 1 × 5
min mean med max_num_var sd
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1961 1991. 1992 2018 16.7
[[3]]
# A tibble: 1 × 5
min mean med max_num_var sd
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1057 1066. 1068 1083 9.03
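The output above is an unnamed list, so it is not obvious which tibble belongs to which column. One possible refinement, sketched under assumptions, is to name the inputs and bind the results into a single table. The toy data frame and the `summary_table` name are hypothetical; `numeric_summary` is repeated here only so the block runs on its own.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data standing in for the birds dataset (not from the post)
toy <- tibble(
  Value = c(0, 1800, NA),
  Year = c(1961, 1992, 2018),
  `Item Code` = c(1057, 1068, 1083)
)

# Same function as in the post, repeated so this sketch is self-contained
numeric_summary <- function(dataframe, num_var) {
  dataframe %>%
    summarise(min = min(get(num_var), na.rm = TRUE),
              mean = round(mean(get(num_var), na.rm = TRUE), 4),
              med = median(get(num_var), na.rm = TRUE),
              max_num_var = max(get(num_var), na.rm = TRUE),
              sd = sd(get(num_var), na.rm = TRUE))
}

cols <- c("Value", "Year", "Item Code")

# set_names() names each element after itself, so bind_rows() can record
# which column every row of statistics came from in a "variable" column
summary_table <- cols %>%
  set_names() %>%
  map(function(col) numeric_summary(toy, col)) %>%
  bind_rows(.id = "variable")

summary_table
```

Because the data frame is the same for every call, plain map over the column names is enough here, and there is no need to repeat the data frame three times as the map2 call does.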
Source Code
---
title: "Challenge 10 Solution - Purrr"
author: "Linus Jen"
description: "purrr"
date: "7/6/2023"
format:
  html:
    toc: true
    code-copy: true
    code-tools: true
categories:
  - challenge_10
  - Linus Jen
  - wildbirds
---

```{r}
#| label: setup
#| warning: false
#| message: false
#| include: false

library(tidyverse)
library(ggplot2)
library(here)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

For this challenge, I will use the function I created in Challenge 9 with `map` from `purrr` to get summary statistics for various numeric variables. That was an obvious shortcoming of the previous function (only getting summary statistics for one numeric variable per function call), and `map` would address this concern.

### Function

```{r}
numeric_summary <- function(dataframe, num_var) {
  # Given the column of a dataframe, find the min, Q1, median, mean, Q3, max, and st. dev. for that column
  # Change variable to match dplyr
  dataframe %>%
    summarise(min = min(get(num_var), na.rm=TRUE),
              #q1 = quantile(get(num_var), 0.25, na.rm=TRUE),
              mean = round(mean(get(num_var), na.rm=TRUE), 4),
              med = median(get(num_var), na.rm=TRUE),
              #q3 = quantile(get(num_var), 0.75, na.rm=TRUE),
              max_num_var = max(get(num_var), na.rm=TRUE),
              sd = sd(get(num_var), na.rm=TRUE))
}
```

My function `numeric_summary` accepts 2 arguments: the dataframe, and a numeric variable to summarize over. This function gives the minimum, mean, median, maximum, and standard deviation of a numeric variable. The 1st and 3rd quartile calculations are commented out in the function body, so they do not appear in the output. I chose to make this function because while working on my project, I found it really annoying to copy/paste and reuse the same code over and over again, and figured it would be interesting to quickly create a function for this.

One thing to note is that I removed the optional `group_by` argument. I spent way too long trying to add grouping as an optional third argument, but the `pmap` function kept returning odd errors. As a result, I decided to remove that ability from the function.

## Application

Again, I will use the `birds.csv` file, which is a collection of information about specific birds and their populations around the world at various points in time.

```{r}
# Pull in data
birds <- read_csv(here("posts", "_data", "birds.csv"))

# View the data
birds

# use dfSummary
print(summarytools::dfSummary(birds,
                              varnumbers = FALSE,
                              plain.ascii = FALSE,
                              style = "grid",
                              graph.magnif = 0.70,
                              valid.col = FALSE),
      method = "render",
      table.classes = "table-condensed")
```

Given our dataset and function, let's say we're interested in seeing summary statistics for the `Value`, `Year`, and `Item Code` columns.

```{r}
# Apply map
map2(list(birds, birds, birds), list("Value", "Year", "Item Code"), numeric_summary)
```