Challenge 8

challenge_8
snl
Joining Data
Author

Steve O’Neill

Published

August 25, 2022

library(tidyverse)
library(ggplot2)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

I am choosing the “SNL” dataset.

actors <- read.csv("_data/snl_actors.csv")
casts <- read.csv("_data/snl_casts.csv")
seasons <- read.csv("_data/snl_seasons.csv")

Seasons

This table shows sid (season ID), year, first and last episode IDs, and number of episodes.

Let’s bookmark the existence of epid as a uniquie identifiers for episodes. It looks like they are secretly dates!

E.g., 19751011 was 10/11/1975.

I will keep them as epid values for now, at least until the join.

seasons

Casts

This includes the cast members, with one row per season they appeared. The epids have returned.

Between first_epid and last_epid, we unfortunately can’t see every episode the castmember was in. But we do see n_episodes - the number of episodes they were in.

last_epid is not listed if the castmember is active. The update anchor row tells us if they played a news anchor on “Weekend Update”, the famous satirical news show.

casts
print(summarytools::dfSummary(casts,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

casts

Dimensions: 614 x 8
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
aid [character]
1. Kenan Thompson
2. Darrell Hammond
3. Seth Meyers
4. Al Franken
5. Fred Armisen
6. Kate McKinnon
7. Tim Meadows
8. Aidy Bryant
9. Bobby Moynihan
10. Cecily Strong
[ 146 others ]
18(2.9%)
14(2.3%)
13(2.1%)
11(1.8%)
11(1.8%)
10(1.6%)
10(1.6%)
9(1.5%)
9(1.5%)
9(1.5%)
500(81.4%)
0 (0.0%)
sid [integer]
Mean (sd) : 25.5 (13.1)
min ≤ med ≤ max:
1 ≤ 26 ≤ 46
IQR (CV) : 22 (0.5)
46 distinct values 0 (0.0%)
featured [character]
1. False
2. True
451(73.5%)
163(26.5%)
0 (0.0%)
first_epid [numeric]
Mean (sd) : 19909635 (111264.8)
min ≤ med ≤ max:
19770115 ≤ 19901110 ≤ 20141025
IQR (CV) : 156624.2 (0)
35 distinct values 564 (91.9%)
last_epid [numeric]
Mean (sd) : 19944038 (126122.2)
min ≤ med ≤ max:
19751011 ≤ 19950225 ≤ 20140201
IQR (CV) : 190005 (0)
17 distinct values 597 (97.2%)
update_anchor [character]
1. False
2. True
541(88.1%)
73(11.9%)
0 (0.0%)
n_episodes [integer]
Mean (sd) : 18.7 (4)
min ≤ med ≤ max:
1 ≤ 20 ≤ 24
IQR (CV) : 2 (0.2)
22 distinct values 0 (0.0%)
season_fraction [numeric]
Mean (sd) : 0.9 (0.2)
min ≤ med ≤ max:
0 ≤ 1 ≤ 1
IQR (CV) : 0 (0.2)
36 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-26

Interestingly, there are more last_epids than first_epids. Harry Shearer is an example of this phenomenon:

casts %>% filter(str_detect(aid, "Harry Shearer"))

Actors

*Actors** includes stage names, band names, a URL column, and unreliable gender information.

actors

The Casts dataset seems to be a subset of Actors. It also includes crewmembers.

First, here is a summary table:

print(summarytools::dfSummary(actors,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

actors

Dimensions: 2306 x 4
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
aid [character]
1. 'N Sync
2. 070 Shake
3. 10,000 Maniacs
4. 14 Karat Soul
5. 2 Chainz
6. 3-D
7. 3RDEYEGIRL
8. 50 Cent
9. A Tribe Called Quest
10. A$AP Rocky
[ 2296 others ]
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
2296(99.6%)
0 (0.0%)
url [character]
1. (Empty string)
2. /Cast/?AbEl
3. /Cast/?AdSa
4. /Cast/?AiBr
5. /Cast/?AlFr
6. /Cast/?AlMo
7. /Cast/?AlZw
8. /Cast/?AmPo
9. /Cast/?AnDi
10. /Cast/?AnGa
[ 2240 others ]
57(2.5%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
2240(97.1%)
0 (0.0%)
type [character]
1. cast
2. crew
3. guest
4. unknown
154(6.7%)
170(7.4%)
1926(83.5%)
56(2.4%)
0 (0.0%)
gender [character]
1. andy
2. female
3. male
4. unknown
21(0.9%)
671(29.1%)
1226(53.2%)
388(16.8%)
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-26

In the data, the confusing “andy” gender stands out:

actors %>% distinct(gender)
actors <- actors %>% filter(str_detect(gender, "andy"))
actors

It seems like all these people are guests. I can’t just rule out all gender data as guests as inaccurate. 83% of the rows are guests.

There are only 21 “andy”-gendered people so I am just going to write NAs to those observations.

I am not replacing “andy” with “unknown”. I feel there is a difference between “unknown to the publisher of the dataset” and “unknown to me, the person reviewing it”.

actors <- actors %>% mutate(gender = na_if(gender, "andy"))
actors

Joining Data

This joins casts (614 values) and seasons (46 values) based on season ID:

merged_casts_seasons <- left_join(casts, seasons,
          by = "sid",
          suffix = c("_castmember", "_full_season"),)
merged_casts_seasons

As expected, it returns all 614 rows from casts, with extra season information at the end.

After that, this merges the result with actors, adding their gender and type (e.g. “cast”)

all_joined <- left_join(merged_casts_seasons,actors,
          by = "aid")
all_joined

No castmembers are listed as actors, so anti_joining them confirms that:

anti_joined <- anti_join(merged_casts_seasons,actors,
          by = "aid")
anti_joined

As far as I know, this is the most “joined” the data can be.