Challenge 10 Solution - Purrr

challenge_10

Linus Jen

wildbirds

purrr

Author

Linus Jen

Published

July 6, 2023

Challenge Overview

For this challenge, I will combine the function I created in Challenge 9 with purrr's map family (here, map2) to get summary statistics for several numeric variables. The previous function had an obvious shortcoming: it returned summary statistics for only one numeric variable per call. Mapping over the column names addresses that.

Function

library(tidyverse)  # provides dplyr, purrr, and readr
library(here)       # builds the file path used when reading the data

numeric_summary <- function(dataframe, num_var) {
  # Given a dataframe and a column name (as a string), find the min, mean, median, max, and st. dev. for that column
  # The Q1 and Q3 lines are commented out, so they are not part of the output
  # get(num_var) looks up the column by its name inside summarise()
  
  dataframe %>%
    summarise(min = min(get(num_var), na.rm=TRUE),
              #q1 = quantile(get(num_var), 0.25, na.rm=TRUE),
              mean = round(mean(get(num_var), na.rm=TRUE), 4),
              med = median(get(num_var), na.rm=TRUE),
              #q3 = quantile(get(num_var), 0.75, na.rm=TRUE),
              max_num_var = max(get(num_var), na.rm=TRUE),
              sd = sd(get(num_var), na.rm=TRUE))
}

My function numeric_summary accepts two arguments: the dataframe and the name of a numeric variable to summarize. It returns the minimum, mean, median, maximum, and standard deviation of that variable. The first and third quartiles are written into the function but commented out, so they do not appear in the output. I chose to make this function because, while working on my project, I found it really annoying to copy, paste, and reuse the same code over and over again, and figured it would be interesting to quickly create a function for this.
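As a side note on looking up a column by its string name, here is a minimal sketch of the same idea written with dplyr's `.data` pronoun instead of `get()`. The function name `numeric_summary_data`, the output column name `max`, and the toy tibble are illustrative assumptions, not part of the original post.

```r
library(dplyr)

# Hypothetical variant of numeric_summary: .data[[num_var]] looks the column up
# only inside the data frame, never in the surrounding environment
numeric_summary_data <- function(dataframe, num_var) {
  dataframe %>%
    summarise(min  = min(.data[[num_var]], na.rm = TRUE),
              mean = round(mean(.data[[num_var]], na.rm = TRUE), 4),
              med  = median(.data[[num_var]], na.rm = TRUE),
              max  = max(.data[[num_var]], na.rm = TRUE),
              sd   = sd(.data[[num_var]], na.rm = TRUE))
}

# Toy data (not the birds dataset), including a column name with a space
toy <- tibble(x = c(1, 2, 3, NA), `Item Code` = c(10, 20, 30, 40))

toy_summary <- numeric_summary_data(toy, "x")
toy_summary
```

One reason to consider this form: with `.data[[num_var]]`, a misspelled column name raises an error, whereas `get()` can fall back to an object of the same name defined outside the data frame.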

One thing to note is that I removed the optional group_by argument. I spent way too long trying to add an optional third argument for grouping, but the pmap function kept returning odd errors. As a result, I decided to remove that ability from the function.
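For reference, one way an optional grouping argument could be sketched is shown below. This is not the version that was removed; the function name `grouped_summary`, the `group_var` parameter, and the toy data are all hypothetical. It assumes that `group_var = NULL` should mean "no grouping" and it uses a lambda with map rather than pmap.

```r
library(dplyr)
library(purrr)

# Hypothetical variant: group_var is optional, and NULL means "do not group"
grouped_summary <- function(dataframe, num_var, group_var = NULL) {
  if (!is.null(group_var)) {
    # all_of() selects the grouping column from its string name
    dataframe <- group_by(dataframe, across(all_of(group_var)))
  }
  dataframe %>%
    summarise(min  = min(.data[[num_var]], na.rm = TRUE),
              mean = mean(.data[[num_var]], na.rm = TRUE),
              max  = max(.data[[num_var]], na.rm = TRUE))
}

# Toy data (not the birds dataset)
toy <- tibble(grp = c("a", "a", "b"), x = c(1, 3, 10), y = c(2, 4, 6))

overall <- grouped_summary(toy, "x")          # one row, no grouping
by_grp  <- grouped_summary(toy, "x", "grp")   # one row per value of grp

# Summarise several variables with the same grouping
all_by_grp <- map(c("x", "y"), ~ grouped_summary(toy, .x, group_var = "grp"))
```

Keeping the grouping decision inside an `if` means the mapped call only has to vary one argument, which sidesteps the need to line up three parallel lists for pmap.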

Application

Again, I will use the birds.csv file, which is a collection of information about specific birds and their populations around the world at various points in time.

# Pull in data
birds <- read_csv(here("posts", "_data", "birds.csv"))

# View the data
birds

# A tibble: 30,977 × 14
   `Domain Code` Domain     `Area Code` Area  `Element Code` Element `Item Code`
   <chr>         <chr>            <dbl> <chr>          <dbl> <chr>         <dbl>
 1 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 2 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 3 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 4 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 5 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 6 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 7 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 8 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 9 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
10 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
# ℹ 30,967 more rows
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
#   Value <dbl>, Flag <chr>, `Flag Description` <chr>

# use dfSummary
print(summarytools::dfSummary(birds,
                              varnumbers=FALSE,
                              plain.ascii=FALSE,
                              style="grid",
                              graph.magnif = 0.70,
                              valid.col=FALSE),
      method="render",
      table.classes="table-condensed")

Data Frame Summary

birds

Dimensions: 30977 x 14
Duplicates: 0

Columns: Variable, Stats / Values, Freqs (% of Valid), Missing

Domain Code [character]
1. QA: 30977 (100.0%)
Missing: 0 (0.0%)

Domain [character]
1. Live Animals: 30977 (100.0%)
Missing: 0 (0.0%)

Area Code [numeric]
Mean (sd): 1201.7 (2099.4)
min ≤ med ≤ max: 1 ≤ 156 ≤ 5504
IQR (CV): 152 (1.7)
248 distinct values
Missing: 0 (0.0%)

Area [character]
1. Africa: 290 (0.9%)
2. Asia: 290 (0.9%)
3. Eastern Asia: 290 (0.9%)
4. Egypt: 290 (0.9%)
5. Europe: 290 (0.9%)
6. France: 290 (0.9%)
7. Greece: 290 (0.9%)
8. Myanmar: 290 (0.9%)
9. Northern Africa: 290 (0.9%)
10. South-eastern Asia: 290 (0.9%)
[ 238 others ]: 28077 (90.6%)
Missing: 0 (0.0%)

Element Code [numeric]
1 distinct value
5112: 30977 (100.0%)
Missing: 0 (0.0%)

Element [character]
1. Stocks: 30977 (100.0%)
Missing: 0 (0.0%)

Item Code [numeric]
Mean (sd): 1066.5 (9)
min ≤ med ≤ max: 1057 ≤ 1068 ≤ 1083
IQR (CV): 15 (0)
1057: 13074 (42.2%)
1068: 6909 (22.3%)
1072: 4136 (13.4%)
1079: 5693 (18.4%)
1083: 1165 (3.8%)
Missing: 0 (0.0%)

Item [character]
1. Chickens: 13074 (42.2%)
2. Ducks: 6909 (22.3%)
3. Geese and guinea fowls: 4136 (13.4%)
4. Pigeons, other birds: 1165 (3.8%)
5. Turkeys: 5693 (18.4%)
Missing: 0 (0.0%)

Year Code [numeric]
Mean (sd): 1990.6 (16.7)
min ≤ med ≤ max: 1961 ≤ 1992 ≤ 2018
IQR (CV): 29 (0)
58 distinct values
Missing: 0 (0.0%)

Year [numeric]
Mean (sd): 1990.6 (16.7)
min ≤ med ≤ max: 1961 ≤ 1992 ≤ 2018
IQR (CV): 29 (0)
58 distinct values
Missing: 0 (0.0%)

Unit [character]
1. 1000 Head: 30977 (100.0%)
Missing: 0 (0.0%)

Value [numeric]
Mean (sd): 99410.6 (720611.4)
min ≤ med ≤ max: 0 ≤ 1800 ≤ 23707134
IQR (CV): 15233 (7.2)
11495 distinct values
Missing: 1036 (3.3%)

Flag [character]
1. *: 1494 (7.4%)
2. A: 6488 (32.1%)
3. F: 10007 (49.5%)
4. Im: 1213 (6.0%)
5. M: 1002 (5.0%)
Missing: 10773 (34.8%)

Flag Description [character]
1. Aggregate, may include of: 6488 (20.9%)
2. Data not available: 1002 (3.2%)
3. FAO data based on imputat: 1213 (3.9%)
4. FAO estimate: 10007 (32.3%)
5. Official data: 10773 (34.8%)
6. Unofficial figure: 1494 (4.8%)
Missing: 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.0)
2023-07-04

Given our dataset and function, let’s say we’re interested in seeing summary statistics for the Value, Year, and Item Code columns.

# Apply map2: pair each copy of birds with one column name
map2(list(birds, birds, birds),
     list("Value", "Year", "Item Code"),
     numeric_summary)

[[1]]
# A tibble: 1 × 5
    min   mean   med max_num_var      sd
  <dbl>  <dbl> <dbl>       <dbl>   <dbl>
1     0 99411.  1800    23707134 720611.

[[2]]
# A tibble: 1 × 5
    min  mean   med max_num_var    sd
  <dbl> <dbl> <dbl>       <dbl> <dbl>
1  1961 1991.  1992        2018  16.7

[[3]]
# A tibble: 1 × 5
    min  mean   med max_num_var    sd
  <dbl> <dbl> <dbl>       <dbl> <dbl>
1  1057 1066.  1068        1083  9.03
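Since the data frame is the same in every call, the same result could plausibly be reached by mapping over the column names alone and then stacking the pieces into one table. The sketch below illustrates that idea on toy data; `summarise_column` is a hypothetical stand-in with the same (dataframe, num_var) argument order as numeric_summary, and the toy tibble is not the birds dataset.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for numeric_summary, kept short for the example
summarise_column <- function(dataframe, num_var) {
  dataframe %>%
    summarise(min  = min(.data[[num_var]], na.rm = TRUE),
              mean = mean(.data[[num_var]], na.rm = TRUE),
              max  = max(.data[[num_var]], na.rm = TRUE))
}

# Toy data (not the birds dataset)
toy <- tibble(x = c(1, 2, 3, NA), `Item Code` = c(10, 20, 30, 40))

vars <- c("x", "Item Code")

summary_table <- vars %>%
  set_names() %>%                          # name each element after itself
  map(~ summarise_column(toy, .x)) %>%     # one single-row tibble per column name
  bind_rows(.id = "variable")              # stack rows, keeping the names as a column

summary_table
```

Naming the list before mapping is what lets `bind_rows(.id = "variable")` label each row, so the output no longer depends on remembering which list position belongs to which column.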