library(tidyverse)
library(ggplot2)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Challenge 8
I am choosing the “SNL” dataset.
- snl ⭐⭐⭐⭐⭐
<- read.csv("_data/snl_actors.csv")
actors <- read.csv("_data/snl_casts.csv")
casts <- read.csv("_data/snl_seasons.csv") seasons
Seasons
This table shows sid
(season ID), year, first and last episode IDs, and number of episodes.
Let’s bookmark the existence of epid
as a uniquie identifiers for episodes. It looks like they are secretly dates!
E.g., 19751011 was 10/11/1975.
I will keep them as epid
values for now, at least until the join.
seasons
Casts
This includes the cast members, with one row per season they appeared. The epid
s have returned.
Between first_epid
and last_epid
, we unfortunately can’t see every episode the castmember was in. But we do see n_episodes
- the number of episodes they were in.
last_epid
is not listed if the castmember is active. The update anchor
row tells us if they played a news anchor on “Weekend Update”, the famous satirical news show.
casts
print(summarytools::dfSummary(casts,
varnumbers = FALSE,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.70,
valid.col = FALSE),
method = 'render',
table.classes = 'table-condensed')
Data Frame Summary
casts
Dimensions: 614 x 8Duplicates: 0
Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
aid [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
sid [integer] |
|
46 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
featured [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
first_epid [numeric] |
|
35 distinct values | 564 (91.9%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
last_epid [numeric] |
|
17 distinct values | 597 (97.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
update_anchor [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
n_episodes [integer] |
|
22 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
season_fraction [numeric] |
|
36 distinct values | 0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-26
Interestingly, there are more last_epid
s than first_epid
s. Harry Shearer is an example of this phenomenon:
%>% filter(str_detect(aid, "Harry Shearer")) casts
Actors
*Actors** includes stage names, band names, a URL column, and unreliable gender information.
actors
The Casts dataset seems to be a subset of Actors. It also includes crewmembers.
First, here is a summary table:
print(summarytools::dfSummary(actors,
varnumbers = FALSE,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.70,
valid.col = FALSE),
method = 'render',
table.classes = 'table-condensed')
Data Frame Summary
actors
Dimensions: 2306 x 4Duplicates: 0
Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
aid [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
url [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
type [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gender [character] |
|
|
0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-26
In the data, the confusing “andy” gender stands out:
%>% distinct(gender) actors
<- actors %>% filter(str_detect(gender, "andy"))
actors actors
It seems like all these people are guests. I can’t just rule out all gender data as guests as inaccurate. 83% of the rows are guests.
There are only 21 “andy”-gendered people so I am just going to write NAs to those observations.
I am not replacing “andy” with “unknown”. I feel there is a difference between “unknown to the publisher of the dataset” and “unknown to me, the person reviewing it”.
<- actors %>% mutate(gender = na_if(gender, "andy"))
actors actors
Joining Data
This joins casts (614 values) and seasons (46 values) based on season ID:
<- left_join(casts, seasons,
merged_casts_seasons by = "sid",
suffix = c("_castmember", "_full_season"),)
merged_casts_seasons
As expected, it returns all 614 rows from casts, with extra season information at the end.
After that, this merges the result with actors, adding their gender and type (e.g. “cast”)
<- left_join(merged_casts_seasons,actors,
all_joined by = "aid")
all_joined
No castmembers are listed as actors, so anti_join
ing them confirms that:
<- anti_join(merged_casts_seasons,actors,
anti_joined by = "aid")
anti_joined
As far as I know, this is the most “joined” the data can be.