library(tidyverse)
library(ggplot2)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)Challenge 8
I am choosing the “SNL” dataset.
- snl ⭐⭐⭐⭐⭐
actors <- read.csv("_data/snl_actors.csv")
casts <- read.csv("_data/snl_casts.csv")
seasons <- read.csv("_data/snl_seasons.csv")Seasons
This table shows sid (season ID), year, first and last episode IDs, and number of episodes.
Let’s bookmark the existence of epid as a uniquie identifiers for episodes. It looks like they are secretly dates!
E.g., 19751011 was 10/11/1975.
I will keep them as epid values for now, at least until the join.
seasonsCasts
This includes the cast members, with one row per season they appeared. The epids have returned.
Between first_epid and last_epid, we unfortunately can’t see every episode the castmember was in. But we do see n_episodes - the number of episodes they were in.
last_epid is not listed if the castmember is active. The update anchor row tells us if they played a news anchor on “Weekend Update”, the famous satirical news show.
castsprint(summarytools::dfSummary(casts,
varnumbers = FALSE,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.70,
valid.col = FALSE),
method = 'render',
table.classes = 'table-condensed')Data Frame Summary
casts
Dimensions: 614 x 8Duplicates: 0
| Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| aid [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| sid [integer] |
|
46 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| featured [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| first_epid [numeric] |
|
35 distinct values | 564 (91.9%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| last_epid [numeric] |
|
17 distinct values | 597 (97.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| update_anchor [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| n_episodes [integer] |
|
22 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| season_fraction [numeric] |
|
36 distinct values | 0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-26
Interestingly, there are more last_epids than first_epids. Harry Shearer is an example of this phenomenon:
casts %>% filter(str_detect(aid, "Harry Shearer"))Actors
*Actors** includes stage names, band names, a URL column, and unreliable gender information.
actorsThe Casts dataset seems to be a subset of Actors. It also includes crewmembers.
First, here is a summary table:
print(summarytools::dfSummary(actors,
varnumbers = FALSE,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.70,
valid.col = FALSE),
method = 'render',
table.classes = 'table-condensed')Data Frame Summary
actors
Dimensions: 2306 x 4Duplicates: 0
| Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| aid [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| url [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| type [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| gender [character] |
|
|
0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-26
In the data, the confusing “andy” gender stands out:
actors %>% distinct(gender)actors <- actors %>% filter(str_detect(gender, "andy"))
actorsIt seems like all these people are guests. I can’t just rule out all gender data as guests as inaccurate. 83% of the rows are guests.
There are only 21 “andy”-gendered people so I am just going to write NAs to those observations.
I am not replacing “andy” with “unknown”. I feel there is a difference between “unknown to the publisher of the dataset” and “unknown to me, the person reviewing it”.
actors <- actors %>% mutate(gender = na_if(gender, "andy"))
actorsJoining Data
This joins casts (614 values) and seasons (46 values) based on season ID:
merged_casts_seasons <- left_join(casts, seasons,
by = "sid",
suffix = c("_castmember", "_full_season"),)
merged_casts_seasonsAs expected, it returns all 614 rows from casts, with extra season information at the end.
After that, this merges the result with actors, adding their gender and type (e.g. “cast”)
all_joined <- left_join(merged_casts_seasons,actors,
by = "aid")
all_joinedNo castmembers are listed as actors, so anti_joining them confirms that:
anti_joined <- anti_join(merged_casts_seasons,actors,
by = "aid")
anti_joinedAs far as I know, this is the most “joined” the data can be.