Challenge 8 - SNL

challenge_8
Joseph Vincent
snl
Joining Data
Author

Joseph Vincent

Published

April 26, 2023

library(tidyverse)
library(ggplot2)
library(lubridate)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in snl data

  • snl ⭐⭐⭐⭐⭐
snl_actors <- read.csv("_data/snl_actors.csv")
dim(snl_actors)
[1] 2306    4
head(snl_actors)
             aid           url  type  gender
1  Kate McKinnon   /Cast/?KaMc  cast  female
2    Alex Moffat   /Cast/?AlMo  cast    male
3     Ego Nwodim   /Cast/?EgNw  cast unknown
4     Chris Redd   /Cast/?ChRe  cast    male
5 Kenan Thompson   /Cast/?KeTh  cast    male
6 Carey Mulligan /Guests/?3677 guest    andy
snl_casts <- read.csv("_data/snl_casts.csv")
dim(snl_casts)
[1] 614   8
head(snl_casts)
               aid sid featured first_epid last_epid update_anchor n_episodes
1 A. Whitney Brown  11     True   19860222        NA         False          8
2 A. Whitney Brown  12     True         NA        NA         False         20
3 A. Whitney Brown  13     True         NA        NA         False         13
4 A. Whitney Brown  14     True         NA        NA         False         20
5 A. Whitney Brown  15     True         NA        NA         False         20
6 A. Whitney Brown  16     True         NA        NA         False         20
  season_fraction
1       0.4444444
2       1.0000000
3       1.0000000
4       1.0000000
5       1.0000000
6       1.0000000
snl_seasons <-read.csv("_data/snl_seasons.csv")
dim(snl_seasons)
[1] 46  5
head(snl_seasons)
  sid year first_epid last_epid n_episodes
1   1 1975   19751011  19760731         24
2   2 1976   19760918  19770521         22
3   3 1977   19770924  19780520         20
4   4 1978   19781007  19790526         20
5   5 1979   19791013  19800524         20
6   6 1980   19801115  19810411         13

Briefly describe the data

The SNL data consists of three data sets:

SNL actors is a list of all cast and guest members who have been on SNL (2306). There is a link out to more information, what type of actor they were (cast vs guest) and their gender. Each row is an actor.

SNL casts, on the other hand, is a much more robust data set in terms of variables. While it only includes the cast members (614) (not guests), it includes information on the seasons they were on the show, how many episodes in each season, and the dates of the first and last episodes in a given season that they were in. Most values for first and last episode are NA, as they are only included if they differ from the first or last episode of that season. This will be dealt with later on.

The SNL casts data has the general format we would like to use for analysis. Each row represents an “actor-year”. This is what each case should be in our final data set.

Finally, the SNL seasons data contains information on the seasons (46 total), including year, dates of the first and last episode, and episode count per season. Each row is a season.

SNL actors is a list of all cast and guest members who have been on SNL (2306). There is a link out to more information, what type of actor they were (cast vs guest) and their gender. Each row is an actor.

SNL casts, on the other hand, is a much more robust data set in terms of variables. While it only includes the cast members (614) (not guests), it includes information on the seasons they were on the show, how many episodes in each season, and the dates of the first and last episodes in a given season that they were in. Most values for first and last episode are NA, as they are only included if they differ from the first or last episode of that season. This will be dealt with later on.

The SNL casts data has the general format we would like to use for analysis. Each row represents an “actor-year”. This is what each case should be in our final data set.

Finally, the SNL seasons data contains information on the seasons (46 total), including year, dates of the first and last episode, and episode count per season. Each row is a season.

Tidy and mutation before joining

I will do enough tidy-ing to join the data, and then work on some additional mutations after to clean the final data set.

Before moving on, I’m going to filter out guests from the actors data set, as we don’t have any information about them except for gender to use in an analysis.

#filtering out guests from actors data
snl_actors <- snl_actors %>% 
  filter(type == "cast") %>%
  select(aid, gender)

As you can see, their are two columns (whether the cast member was an update anchor, and whether they were featured) which are currently characters, but should be logical. This has been changed below.

#showing character class before mutation
class(snl_casts$update_anchor)
[1] "character"
class(snl_casts$featured)
[1] "character"
#mutating to change to logical
snl_casts <- snl_casts %>%
  mutate(`update_anchor` = case_when(
    `update_anchor` == "True" ~ TRUE,
    `update_anchor` == "False" ~ FALSE)) %>%
  mutate(`featured` = case_when(
    `featured` == "True" ~ TRUE,
    `featured` == "False" ~ FALSE))

#showing logical class after mutation
class(snl_casts$update_anchor)
[1] "logical"
class(snl_casts$featured)
[1] "logical"

Join Data

As we join the three data sets, I will use the SNL casts data as the primary data set, and add information from the other data sets into this. So my case count at the end should be 614 and each case will represent an “actor-season”.

First, I will add the data about each season into the casts data. This includes first and last date of the season, number of episodes, and year.

Next, I use the SNL actors data to add gender of the cast member into my SNL casts data.

#combining seasons data INTO casts data
snl_castsandseasons <- left_join(snl_casts, snl_seasons, by = "sid")

#combining actors data INTO casts and seasons data
snl_castsseasonsandactors <- left_join(snl_castsandseasons, snl_actors, "aid")

head(snl_castsseasonsandactors)
               aid sid featured first_epid.x last_epid.x update_anchor
1 A. Whitney Brown  11     TRUE     19860222          NA         FALSE
2 A. Whitney Brown  12     TRUE           NA          NA         FALSE
3 A. Whitney Brown  13     TRUE           NA          NA         FALSE
4 A. Whitney Brown  14     TRUE           NA          NA         FALSE
5 A. Whitney Brown  15     TRUE           NA          NA         FALSE
6 A. Whitney Brown  16     TRUE           NA          NA         FALSE
  n_episodes.x season_fraction year first_epid.y last_epid.y n_episodes.y
1            8       0.4444444 1985     19851109    19860524           18
2           20       1.0000000 1986     19861011    19870523           20
3           13       1.0000000 1987     19871017    19880227           13
4           20       1.0000000 1988     19881008    19890520           20
5           20       1.0000000 1989     19890930    19900519           20
6           20       1.0000000 1990     19900929    19910518           20
  gender
1   male
2   male
3   male
4   male
5   male
6   male
dim(snl_castsseasonsandactors)
[1] 614  13

My data now consists of all variables and is the correct number of rows. There are still some things to do in terms of tidy-ing.

Currently, their are 4 dates in the data. Two “first episodes” and two “last episodes”. Since our cases are “actor-seasons”, I’m going to combine these in a way where it uses the dates of the first and last episode of the season, unless the actor was only there for a partial season. In those cases, the date will reflect the first or last episode they were involved in. This will reduce column count down to 11, as you can see below.

I also needed to turn the numeric date columns into actual dates.

Finally, there are two episode counts, one for the number of episodes an actor was involved in and one for the number of episodes in a season. I’ve renamed these to be clearer.

#creating final combined dataset
snl_all <- snl_castsseasonsandactors %>%
  
  #combining multiple first and last episode date columns to reflect dates participated by actors
  mutate(first_episode = coalesce(first_epid.x, first_epid.y),
         last_episode = coalesce(last_epid.x, last_epid.y)) %>%
  
  #changing numeric values to be dates
  mutate(first_episode = ymd(first_episode),
         last_episode = ymd(last_episode)) %>%
  
  #removing unused date columns
  select(-c(first_epid.x, first_epid.y, last_epid.x, last_epid.y)) %>%
  
  #renaming for clarity
  rename("actor_episodes" = n_episodes.x) %>%
  rename("season_episodes" = n_episodes.y) 

#printing dimensions and summary
dim(snl_casts)
[1] 614   8
dim(snl_all)
[1] 614  11
head(snl_all)
               aid sid featured update_anchor actor_episodes season_fraction
1 A. Whitney Brown  11     TRUE         FALSE              8       0.4444444
2 A. Whitney Brown  12     TRUE         FALSE             20       1.0000000
3 A. Whitney Brown  13     TRUE         FALSE             13       1.0000000
4 A. Whitney Brown  14     TRUE         FALSE             20       1.0000000
5 A. Whitney Brown  15     TRUE         FALSE             20       1.0000000
6 A. Whitney Brown  16     TRUE         FALSE             20       1.0000000
  year season_episodes gender first_episode last_episode
1 1985              18   male    1986-02-22   1986-05-24
2 1986              20   male    1986-10-11   1987-05-23
3 1987              13   male    1987-10-17   1988-02-27
4 1988              20   male    1988-10-08   1989-05-20
5 1989              20   male    1989-09-30   1990-05-19
6 1990              20   male    1990-09-29   1991-05-18
print(summarytools::dfSummary(snl_all,
                              valid.col=FALSE), 
      method = 'render')

Data Frame Summary

snl_all

Dimensions: 614 x 11
Duplicates: 0
No Variable Stats / Values Freqs (% of Valid) Graph Missing
1 aid [character]
1. Kenan Thompson
2. Darrell Hammond
3. Seth Meyers
4. Al Franken
5. Fred Armisen
6. Kate McKinnon
7. Tim Meadows
8. Aidy Bryant
9. Bobby Moynihan
10. Cecily Strong
[ 146 others ]
18 ( 2.9% )
14 ( 2.3% )
13 ( 2.1% )
11 ( 1.8% )
11 ( 1.8% )
10 ( 1.6% )
10 ( 1.6% )
9 ( 1.5% )
9 ( 1.5% )
9 ( 1.5% )
500 ( 81.4% )
0 (0.0%)
2 sid [integer]
Mean (sd) : 25.5 (13.1)
min ≤ med ≤ max:
1 ≤ 26 ≤ 46
IQR (CV) : 22 (0.5)
46 distinct values 0 (0.0%)
3 featured [logical]
1. FALSE
2. TRUE
451 ( 73.5% )
163 ( 26.5% )
0 (0.0%)
4 update_anchor [logical]
1. FALSE
2. TRUE
541 ( 88.1% )
73 ( 11.9% )
0 (0.0%)
5 actor_episodes [integer]
Mean (sd) : 18.7 (4)
min ≤ med ≤ max:
1 ≤ 20 ≤ 24
IQR (CV) : 2 (0.2)
22 distinct values 0 (0.0%)
6 season_fraction [numeric]
Mean (sd) : 0.9 (0.2)
min ≤ med ≤ max:
0 ≤ 1 ≤ 1
IQR (CV) : 0 (0.2)
36 distinct values 0 (0.0%)
7 year [integer]
Mean (sd) : 1999.5 (13.1)
min ≤ med ≤ max:
1975 ≤ 2000 ≤ 2020
IQR (CV) : 22 (0)
46 distinct values 0 (0.0%)
8 season_episodes [integer]
Mean (sd) : 19.7 (2.1)
min ≤ med ≤ max:
12 ≤ 20 ≤ 24
IQR (CV) : 1 (0.1)
12 : 12 ( 2.0% )
13 : 24 ( 3.9% )
17 : 30 ( 4.9% )
18 : 31 ( 5.0% )
19 : 37 ( 6.0% )
20 : 297 ( 48.4% )
21 : 111 ( 18.1% )
22 : 63 ( 10.3% )
24 : 9 ( 1.5% )
0 (0.0%)
9 gender [character]
1. female
2. male
3. unknown
194 ( 32.0% )
409 ( 67.4% )
4 ( 0.7% )
7 (1.1%)
10 first_episode [Date]
min : 1975-10-11
med : 2000-10-07
max : 2020-10-03
range : 44y 11m 22d
80 distinct values 0 (0.0%)
11 last_episode [Date]
min : 1975-10-11
med : 2001-05-19
max : 2021-04-10
range : 45y 5m 30d
63 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-04-26