challenge_8
Daniel Hannon
snl
Author

Daniel Hannon

Published

May 2, 2023

Code
library(tidyverse)
library(lubridate)

knitr::opts_chunk$set(echo = TRUE)

Read in the data

Code
actors <- read.csv("_data/snl_actors.csv")

casts <- read.csv("_data/snl_casts.csv")

seasons <- read.csv("_data/snl_seasons.csv")

head(actors)  # (2,306 x 4)
             aid           url  type  gender
1  Kate McKinnon   /Cast/?KaMc  cast  female
2    Alex Moffat   /Cast/?AlMo  cast    male
3     Ego Nwodim   /Cast/?EgNw  cast unknown
4     Chris Redd   /Cast/?ChRe  cast    male
5 Kenan Thompson   /Cast/?KeTh  cast    male
6 Carey Mulligan /Guests/?3677 guest    andy
Code
casts <- casts %>%  # (614 x 8)
        mutate(first_epid = ymd(first_epid), last_epid = ymd(last_epid))

seasons <- seasons %>% # (46 x 5)
        mutate(first_epid = ymd(first_epid), last_epid = ymd(last_epid))

Tidy the data

The data in the data frames are all independently tidy. The actors Data Frame is a list of all the actors, guests, musical guests, and crew that have appeared on the show. Each observation is for one actor and tells what type of role they played and their gender. The casts data frame has each observation be an actor on the cast during a specific season, and has information about whether they were featured, the dates of their first and last episode, whether or not the were an anchor on weekend update and the number of episodes they were in during that season. The seasons data frame has each observation as a specific season, and contains data on what year it ran, the dates of the first and last episodes and the number of episodes.

Join the data

In order to join the data we have to look at the connecting keys. Casts and seasons are connected by the sid (season ID) column, while casts and actors are connected by the aid (actor ID) column. First we will join the extra information provided in the actors df into the casts df to gain information about the actor’s genders. We can also filter out all of the guests, musical guests, and crew because we are only going to be looking at data on the main cast.

Code
actors <-actors %>%
       select(-url) %>%
       filter(type == "cast")  ## 154 cast members here, missing David Spade and Gilbert Gottfried

casts %>%
      select(aid)%>%
      n_distinct()             ##Sanity Check: 156 cast members here 
[1] 156
Code
casts <- full_join(actors, casts, by = "aid") %>%
        select(c(aid, gender, sid, featured, update_anchor))
Warning in full_join(actors, casts, by = "aid"): Each row in `x` is expected to match at most 1 row in `y`.
ℹ Row 1 of `x` matches multiple rows.
ℹ If multiple matches are expected, set `multiple = "all"` to silence this
  warning.

Now that we have joined this information, we can do analysis on gender throughout the different seasons, as well as the gender breakdown of the weekend update hosts.

Code
head(casts) 
            aid gender sid featured update_anchor
1 Kate McKinnon female  37     True         False
2 Kate McKinnon female  38     True         False
3 Kate McKinnon female  39    False         False
4 Kate McKinnon female  40    False         False
5 Kate McKinnon female  41    False         False
6 Kate McKinnon female  42    False         False
Code
casts %>%
      filter(update_anchor == "True") %>%
      select(gender)%>%
      table() %>%
      prop.table() # 74% of all weekend update hosts are male when looking at every season 
gender
  female     male 
0.260274 0.739726 
Code
casts %>%
      filter(update_anchor == "True") %>%
      group_by(aid) %>%
      slice(1) %>%
      ungroup()%>%
      select(gender)%>%
      table() %>%
      prop.table()   #69.5% of actors who host weekend update are male 
gender
   female      male 
0.3043478 0.6956522 

Breaking down the gender of the weekend update hosts we can see that across all seasons 74% of the hosts are male, while looking at the percent of people that were hosts, 69.5% of them are male. In both cases we can see that the role of weekend update host is taken more by men than women.

Code
seasons <- casts %>%     #Create a Gender Percent breakdown for each season
        group_by(sid) %>%
        add_count(name = "total") %>%
        group_by(sid, gender)%>%
        add_count(name = "gender_total") %>%
        mutate(gender_percent = gender_total/total) %>%
        select(sid, gender, gender_percent) %>%
        slice(1) %>%
        full_join(seasons, by= "sid")

seasons
# A tibble: 102 × 7
# Groups:   sid, gender [102]
     sid gender gender_percent  year first_epid last_epid  n_episodes
   <int> <chr>           <dbl> <int> <date>     <date>          <int>
 1     1 female          0.333  1975 1975-10-11 1976-07-31         24
 2     1 male            0.667  1975 1975-10-11 1976-07-31         24
 3     2 female          0.375  1976 1976-09-18 1977-05-21         22
 4     2 male            0.625  1976 1976-09-18 1977-05-21         22
 5     3 female          0.333  1977 1977-09-24 1978-05-20         20
 6     3 male            0.667  1977 1977-09-24 1978-05-20         20
 7     4 female          0.333  1978 1978-10-07 1979-05-26         20
 8     4 male            0.667  1978 1978-10-07 1979-05-26         20
 9     5 female          0.2    1979 1979-10-13 1980-05-24         20
10     5 male            0.8    1979 1979-10-13 1980-05-24         20
# … with 92 more rows

We now we have added in to the seasons a percentage breakdown for the gender of the cast and we have successfully joined all the data.