library(tidyverse)
library(here)
library(readr)
library(stringr)
library(purrr)
library(glue)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Challenge 10
Challenge Overview
Use purrr with a function to perform some data science task. What this task is is up to you. It could involve computing summary statistics, reading in multiple datasets, running a random process multiple times, or anything else you might need to do in your work as a data analyst. You might consider using purrr with a function you wrote for challenge 9.
Implementation
For this challenge, I will use the SNL dataset that was used in Challenge 8. The main focus will be to use the purrr package to accomplish two tasks:
Import multiple datasets at once
Fill in missing rows
Goal 1: Read in multiple data files
To begin, I am interested in learning how purrr can be used to read in multiple data files at once. To do this, we will begin by creating a vector of the file names corresponding to the SNL dataset. This can be done by using list.files() to list all files in the _data folder and using purrr::keep() with stringr::str_detect() to keep only the file names containing 'snl_'.
# list all files in the _data folder
dir_path <- here('posts', '_data')
# use purrr::keep() to only keep the file names containing snl_
snl_files <- keep(list.files(dir_path), ~str_detect(.x, 'snl_'))
snl_files
[1] "snl_actors.csv" "snl_casts.csv" "snl_seasons.csv"
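To see what keep() is doing in isolation, here is a minimal sketch on a made-up vector of file names (the non-SNL names and the temporary folder are hypothetical, not the real contents of the _data folder). It also shows two alternatives that I believe give the same result: stringr::str_subset(), and the pattern argument of list.files().

```r
library(purrr)
library(stringr)

# hypothetical file names, for illustration only
files <- c("snl_actors.csv", "cereal.csv", "snl_casts.csv", "birds.csv")

# keep() returns the elements for which the predicate is TRUE
snl_only <- keep(files, ~ str_detect(.x, "snl_"))

# str_subset() is a vectorised shortcut for the same filter
snl_only_alt <- str_subset(files, "snl_")

# list.files() can also filter directly through its `pattern` argument;
# empty files are created in a temporary folder so this runs anywhere
tmp_dir <- tempfile("data_")
dir.create(tmp_dir)
invisible(file.create(file.path(tmp_dir, files)))
snl_from_disk <- list.files(tmp_dir, pattern = "snl_")

snl_only
```

For a single, simple pattern the list.files() version is probably the shortest; keep() becomes more useful when the condition is more involved than one regular expression.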
Next, we will import all of the data frames at once. To do this, we will use the purrr::map() function along with readr::read_csv() to save each data frame as an element of an R list. As a pre-processing step, we will also use purrr::map() to convert the individual file names into full file paths.
# first we will use purrr::map() to prepend the `_data` directory to each file name
snl_files <- map(snl_files, ~glue("{dir_path}/{.x}"))
# then, we read in each df to a list
snl_df_list <- snl_files %>%
  map(read_csv)
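One variation I find handy is naming the list elements, so each data frame can later be pulled out by name rather than by position. The sketch below is self-contained: it writes two tiny, made-up CSV files (demo_actors.csv and demo_seasons.csv are hypothetical, not the real SNL files) to a temporary folder and reads them back with map().

```r
library(purrr)
library(readr)

# write two tiny hypothetical CSV files to a temporary folder
tmp_dir <- tempfile("snl_demo_")
dir.create(tmp_dir)
write_csv(data.frame(aid = c("a1", "a2"), gender = c("male", "female")),
          file.path(tmp_dir, "demo_actors.csv"))
write_csv(data.frame(sid = 1:3, year = 1975:1977),
          file.path(tmp_dir, "demo_seasons.csv"))

# full.names = TRUE returns complete paths, so no separate glue step is needed
paths <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# name each path after its file (minus the extension); map() keeps the names
named_paths <- set_names(paths, tools::file_path_sans_ext(basename(paths)))
demo_list <- map(named_paths, ~ read_csv(.x, show_col_types = FALSE))

names(demo_list)
demo_list$demo_seasons
```

With named elements, something like demo_list$demo_seasons does not depend on the order in which list.files() happened to return the files.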
We can then easily index into the list to extract the separate data frames, which can be joined in the same way as in Challenge 8.
actors <- snl_df_list[[1]]
cast <- snl_df_list[[2]]
seasons <- snl_df_list[[3]]

joined <- cast %>%
  left_join(actors, by = 'aid') %>%
  left_join(seasons, by = 'sid', suffix = c('_actor', '_season'))
joined
Goal 2: Fill in “Missing” Rows
When working on the analysis portion of Challenge 8, I noticed that, when computing summary statistics, if a combination of year and cast_type did not appear in the raw data table, the point would be missing from the graph. In the figure below, this is illustrated by the line for Part-Season missing several points on the graph.
joined %>%
  # limit to cast members
  filter(type == 'cast') %>%
  # create a full-time cast indicator var
  mutate(full_time_cast = factor(ifelse(season_fraction == 1, 'Full-Season', 'Part-Season'))) %>%
  # get the count of actors for each year and cast type
  distinct(year, aid, full_time_cast) %>%
  group_by(year, full_time_cast) %>%
  summarize(count = n()) %>%
  # plot the change over time
  ggplot(aes(x = year, y = count, color = full_time_cast)) +
  geom_line() +
  geom_point() +
  theme_minimal() +
  labs(title = 'SNL Cast Size by Year (1975-2020)',
       x = 'Year',
       y = 'Cast Size',
       color = 'Cast Type') +
  scale_color_manual(values = c('#2B598E', 'darkgrey'))
When computing the summary statistics (where we are essentially counting the number of rows when grouping by year and cast_type), if there are 0 rows for a combination, we would want the count to be 0 rather than having the combination absent from the data. We can use purrr to help us fill in these rows.
First, we use purrr::cross_df() to create a data frame containing all possible combinations of year and cast_type. Because there are 46 years covered by the dataset and 2 cast types, we should expect to see 92 rows in the resulting data frame, which we do!
# create a vector of distinct years
years <- joined %>%
  distinct(year) %>%
  pull(year)
# create a vector of distinct cast types
cast_types <- c('Full-Season', 'Part-Season')
# use purrr::cross_df() to create a df containing all possible combos of year and cast type.
combos <- list(year = years, cast_type = cast_types)
year_cast_combos <- cross_df(combos) %>%
  arrange(year, cast_type)
year_cast_combos
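One caveat: as far as I am aware, cross_df() is deprecated as of purrr 1.0.0, with tidyr::expand_grid() suggested as the replacement. Below is a minimal sketch of the same step using expand_grid(); the three-year years vector is a hypothetical stand-in for the 46 years pulled from joined.

```r
library(tidyr)

# hypothetical inputs standing in for the vectors built above
years <- 1975:1977
cast_types <- c("Full-Season", "Part-Season")

# expand_grid() returns one row per combination of its inputs,
# varying the first column slowest, so rows are already ordered by year
year_cast_combos <- expand_grid(year = years, cast_type = cast_types)

year_cast_combos
```

Because the first column varies slowest, the separate arrange() call should not be needed here as long as the input vectors are already sorted.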
Now, we can construct a data frame (counts) that contains the counts of cast members by year and cast_type. The counts data frame can then be left-joined to the year_cast_combos data frame. If a year-cast_type combination was missing from the original data frame, it will have an NA count.
# get summary counts df
counts <- joined %>%
  # limit to cast members
  filter(type == 'cast') %>%
  # create a full-time cast indicator var
  mutate(cast_type = factor(ifelse(season_fraction == 1, 'Full-Season', 'Part-Season'))) %>%
  # get the count of actors for each year and cast type
  distinct(year, aid, cast_type) %>%
  group_by(year, cast_type) %>%
  summarize(count = n())
# start with all possible year and cast type combinations
full_counts <- year_cast_combos %>%
  # join the observed counts; combinations with no rows get an NA count
  left_join(counts, by = c('year', 'cast_type'))
full_counts
The NA values can then be filled with 0. Now, when we re-plot the figure, we see that the same number of points appear for both Full-Season and Part-Season cast.
full_counts %>%
  mutate(count = replace_na(count, 0)) %>%
  ggplot(aes(x = year, y = count, color = cast_type)) +
  geom_line() +
  geom_point() +
  theme_minimal() +
  labs(title = 'SNL Cast Size by Year (1975-2020)',
       x = 'Year',
       y = 'Cast Size',
       color = 'Cast Type') +
  scale_color_manual(values = c('#2B598E', 'darkgrey'))
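For comparison, the grid-and-join approach above can probably be collapsed into a single call to tidyr::complete(), which adds the missing combinations and fills them in one step. This is only a sketch on a small made-up counts table (the numbers are hypothetical), not a rerun of the SNL analysis.

```r
library(dplyr)
library(tidyr)

# hypothetical counts with one missing combination (1976, Part-Season)
counts <- tibble(
  year = c(1975, 1975, 1976),
  cast_type = c("Full-Season", "Part-Season", "Full-Season"),
  count = c(7L, 2L, 9L)
)

# complete() adds the missing year/cast_type combinations;
# `fill` sets the count for those new rows to 0 instead of NA
full_counts <- counts %>%
  complete(year, cast_type, fill = list(count = 0L))

full_counts
```

Two things to keep in mind: complete() only builds combinations from values that already appear in the data, so a year with no rows at all would still be absent; and the counts data frame built earlier is still grouped by year after summarize(), so I would expect to need ungroup() before calling complete() on it.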