challenge_8
Author

Thrishul

Published

May 5, 2023

Before we read in the data, we’ll need to load the dplyr, tidyr, ggplot2, and readr packages.

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(ggplot2)
library(lubridate)

Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union
Code
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

Code
snl_actors <- read.csv("_data/snl_actors.csv")
dim(snl_actors)
[1] 2306    4
Code
head(snl_actors)  
             aid           url  type  gender
1  Kate McKinnon   /Cast/?KaMc  cast  female
2    Alex Moffat   /Cast/?AlMo  cast    male
3     Ego Nwodim   /Cast/?EgNw  cast unknown
4     Chris Redd   /Cast/?ChRe  cast    male
5 Kenan Thompson   /Cast/?KeTh  cast    male
6 Carey Mulligan /Guests/?3677 guest    andy
Code
snl_casts <- read.csv("_data/snl_casts.csv")
dim(snl_casts)
[1] 614   8
Code
head(snl_casts)  
               aid sid featured first_epid last_epid update_anchor n_episodes
1 A. Whitney Brown  11     True   19860222        NA         False          8
2 A. Whitney Brown  12     True         NA        NA         False         20
3 A. Whitney Brown  13     True         NA        NA         False         13
4 A. Whitney Brown  14     True         NA        NA         False         20
5 A. Whitney Brown  15     True         NA        NA         False         20
6 A. Whitney Brown  16     True         NA        NA         False         20
  season_fraction
1       0.4444444
2       1.0000000
3       1.0000000
4       1.0000000
5       1.0000000
6       1.0000000
Code
snl_seasons <-read.csv("_data/snl_seasons.csv")
dim(snl_seasons)
[1] 46  5
Code
head(snl_seasons)
  sid year first_epid last_epid n_episodes
1   1 1975   19751011  19760731         24
2   2 1976   19760918  19770521         22
3   3 1977   19770924  19780520         20
4   4 1978   19781007  19790526         20
5   5 1979   19791013  19800524         20
6   6 1980   19801115  19810411         13

The SNL dataset is comprised of three separate data sources: SNL actors, SNL casts, and SNL seasons.

The SNL actors dataset consists of a list of all cast and guest members who have appeared on SNL (2306). Each actor is linked to additional information, such as their gender and whether they were a cast member or a guest. Each row in the dataset represents a single actor.

The SNL casts dataset, on the other hand, is much more comprehensive in terms of variables. It includes information on the cast members (614) rather than guests, and provides details on the seasons in which they appeared, the number of episodes in each season, and the dates of the first and last episodes of each season. Most of the values for the first and last episodes are NA, only being included if they differ from the first or last episode of that season. We will deal with this issue later in our analysis.

The SNL casts data has the desired format for our analysis, where each row represents an “actor-year.” This is what we will use as our final data set.

Lastly, the SNL seasons dataset contains information on the 46 seasons of SNL, including the year, dates of the first and last episodes, and the episode count per season. Each row in the dataset represents a single season.

Tidy & Mutate Data

To join the data sets, I will perform some tidying operations. After the join, I will conduct additional mutations to clean the final data set.

To start with, I will filter out the guest actors from the SNL actors data set, as we lack sufficient information about them, except for gender, to use in an analysis.

Code
#filtering out guests from actors data
snl_actors <- snl_actors %>% 
  filter(type == "cast") %>%
  select(aid, gender)

As you can see, there are two columns in the SNL casts data set (whether the cast member was an update anchor, and whether they were featured) which are currently of character data type, but they should be of logical data type. I have updated these columns to logical data type below.

Code
#showing character class before mutation
class(snl_casts$update_anchor)
[1] "character"
Code
#mutating to change to logical
snl_casts <- snl_casts %>%
  mutate(`update_anchor` = case_when(
    `update_anchor` == "True" ~ TRUE,
    `update_anchor` == "False" ~ FALSE)) %>%
  mutate(`featured` = case_when(
    `featured` == "True" ~ TRUE,
    `featured` == "False" ~ FALSE))

#showing logical class after mutation
class(snl_casts$update_anchor)
[1] "logical"
Code
class(snl_casts$featured)
[1] "logical"

Joining Data

As we join the three data sets, we will use the SNL casts data as the primary data set and add information from the other data sets into this. The final data set should have 614 cases, with each case representing an “actor-season”.

To do this, we will first add the information about each season into the SNL casts data set. This includes the first and last date of the season, the number of episodes, and the year of the season.

Next, we will use the SNL actors data set to add the gender of the cast member into our SNL casts data set.

Code
#combining seasons data INTO casts data
snl_castsandseasons <- left_join(snl_casts, snl_seasons, by = "sid")

#combining actors data INTO casts and seasons data
snl_castsseasonsandactors <- left_join(snl_castsandseasons, snl_actors, "aid")

head(snl_castsseasonsandactors)
               aid sid featured first_epid.x last_epid.x update_anchor
1 A. Whitney Brown  11     TRUE     19860222          NA         FALSE
2 A. Whitney Brown  12     TRUE           NA          NA         FALSE
3 A. Whitney Brown  13     TRUE           NA          NA         FALSE
4 A. Whitney Brown  14     TRUE           NA          NA         FALSE
5 A. Whitney Brown  15     TRUE           NA          NA         FALSE
6 A. Whitney Brown  16     TRUE           NA          NA         FALSE
  n_episodes.x season_fraction year first_epid.y last_epid.y n_episodes.y
1            8       0.4444444 1985     19851109    19860524           18
2           20       1.0000000 1986     19861011    19870523           20
3           13       1.0000000 1987     19871017    19880227           13
4           20       1.0000000 1988     19881008    19890520           20
5           20       1.0000000 1989     19890930    19900519           20
6           20       1.0000000 1990     19900929    19910518           20
  gender
1   male
2   male
3   male
4   male
5   male
6   male
Code
dim(snl_castsseasonsandactors)
[1] 614  13

After joining the three data sets, I have a data set with all the relevant variables, but it still needs some tidying up.

Currently, there are four date columns in the data, two for the first episode and two for the last episode. Since our cases are “actor-seasons,” I will combine these in a way that reflects the first and last episode dates of the season, unless the actor was only present for part of the season. In those cases, the dates will reflect the first or last episode in which they were involved. This will reduce the column count to 11.

Additionally, I have converted the numeric date columns into actual date format.

Finally, there are two episode count columns: one for the number of episodes a cast member was involved in and one for the number of episodes in a season. To make these column names clearer, I have renamed them.

Code
#creating final combined dataset
snl_all <- snl_castsseasonsandactors %>%
  
  #combining multiple first and last episode date columns to reflect dates participated by actors
  mutate(first_episode = coalesce(first_epid.x, first_epid.y),
         last_episode = coalesce(last_epid.x, last_epid.y)) %>%
  
  #changing numeric values to be dates
  mutate(first_episode = ymd(first_episode),
         last_episode = ymd(last_episode)) %>%
  
  #removing unused date columns
  select(-c(first_epid.x, first_epid.y, last_epid.x, last_epid.y)) %>%
  
  #renaming for clarity
  rename("actor_episodes" = n_episodes.x) %>%
  rename("season_episodes" = n_episodes.y) 

#printing dimensions and summary
dim(snl_casts)
[1] 614   8
Code
dim(snl_all)
[1] 614  11
Code
head(snl_all)
               aid sid featured update_anchor actor_episodes season_fraction
1 A. Whitney Brown  11     TRUE         FALSE              8       0.4444444
2 A. Whitney Brown  12     TRUE         FALSE             20       1.0000000
3 A. Whitney Brown  13     TRUE         FALSE             13       1.0000000
4 A. Whitney Brown  14     TRUE         FALSE             20       1.0000000
5 A. Whitney Brown  15     TRUE         FALSE             20       1.0000000
6 A. Whitney Brown  16     TRUE         FALSE             20       1.0000000
  year season_episodes gender first_episode last_episode
1 1985              18   male    1986-02-22   1986-05-24
2 1986              20   male    1986-10-11   1987-05-23
3 1987              13   male    1987-10-17   1988-02-27
4 1988              20   male    1988-10-08   1989-05-20
5 1989              20   male    1989-09-30   1990-05-19
6 1990              20   male    1990-09-29   1991-05-18
Code
print(summarytools::dfSummary(snl_all,
                              valid.col=FALSE), 
      method = 'render')

Data Frame Summary

snl_all

Dimensions: 614 x 11
Duplicates: 0
No Variable Stats / Values Freqs (% of Valid) Graph Missing
1 aid [character]
1. Kenan Thompson
2. Darrell Hammond
3. Seth Meyers
4. Al Franken
5. Fred Armisen
6. Kate McKinnon
7. Tim Meadows
8. Aidy Bryant
9. Bobby Moynihan
10. Cecily Strong
[ 146 others ]
18 ( 2.9% )
14 ( 2.3% )
13 ( 2.1% )
11 ( 1.8% )
11 ( 1.8% )
10 ( 1.6% )
10 ( 1.6% )
9 ( 1.5% )
9 ( 1.5% )
9 ( 1.5% )
500 ( 81.4% )
0 (0.0%)
2 sid [integer]
Mean (sd) : 25.5 (13.1)
min ≤ med ≤ max:
1 ≤ 26 ≤ 46
IQR (CV) : 22 (0.5)
46 distinct values 0 (0.0%)
3 featured [logical]
1. FALSE
2. TRUE
451 ( 73.5% )
163 ( 26.5% )
0 (0.0%)
4 update_anchor [logical]
1. FALSE
2. TRUE
541 ( 88.1% )
73 ( 11.9% )
0 (0.0%)
5 actor_episodes [integer]
Mean (sd) : 18.7 (4)
min ≤ med ≤ max:
1 ≤ 20 ≤ 24
IQR (CV) : 2 (0.2)
22 distinct values 0 (0.0%)
6 season_fraction [numeric]
Mean (sd) : 0.9 (0.2)
min ≤ med ≤ max:
0 ≤ 1 ≤ 1
IQR (CV) : 0 (0.2)
36 distinct values 0 (0.0%)
7 year [integer]
Mean (sd) : 1999.5 (13.1)
min ≤ med ≤ max:
1975 ≤ 2000 ≤ 2020
IQR (CV) : 22 (0)
46 distinct values 0 (0.0%)
8 season_episodes [integer]
Mean (sd) : 19.7 (2.1)
min ≤ med ≤ max:
12 ≤ 20 ≤ 24
IQR (CV) : 1 (0.1)
12 : 12 ( 2.0% )
13 : 24 ( 3.9% )
17 : 30 ( 4.9% )
18 : 31 ( 5.0% )
19 : 37 ( 6.0% )
20 : 297 ( 48.4% )
21 : 111 ( 18.1% )
22 : 63 ( 10.3% )
24 : 9 ( 1.5% )
0 (0.0%)
9 gender [character]
1. female
2. male
3. unknown
194 ( 32.0% )
409 ( 67.4% )
4 ( 0.7% )
7 (1.1%)
10 first_episode [Date]
min : 1975-10-11
med : 2000-10-07
max : 2020-10-03
range : 44y 11m 22d
80 distinct values 0 (0.0%)
11 last_episode [Date]
min : 1975-10-11
med : 2001-05-19
max : 2021-04-10
range : 45y 5m 30d
63 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-05-06