Challenge 8

challenge_8
snl
Joining Data
Author

Paarth Tandon

Published

January 18, 2023

library(tidyverse)
library(ggplot2)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

set.seed(42)
actors <- read_csv('_data/snl_actors.csv')
casts <- read_csv('_data/snl_casts.csv')
seasons <- read_csv('posts/_data/snl_seasons.csv')
Error: 'posts/_data/snl_seasons.csv' does not exist in current working directory ('/home/paarth/mega/grad/Winter 2023/DACSS601/601_Winter_2022-2023/posts').
head(actors)
aid url type gender
Kate McKinnon /Cast/?KaMc cast female
Alex Moffat /Cast/?AlMo cast male
Ego Nwodim /Cast/?EgNw cast unknown
Chris Redd /Cast/?ChRe cast male
Kenan Thompson /Cast/?KeTh cast male
Carey Mulligan /Guests/?3677 guest andy
head(casts)
aid sid featured first_epid last_epid update_anchor n_episodes season_fraction
A. Whitney Brown 11 TRUE 19860222 NA FALSE 8 0.4444444
A. Whitney Brown 12 TRUE NA NA FALSE 20 1.0000000
A. Whitney Brown 13 TRUE NA NA FALSE 13 1.0000000
A. Whitney Brown 14 TRUE NA NA FALSE 20 1.0000000
A. Whitney Brown 15 TRUE NA NA FALSE 20 1.0000000
A. Whitney Brown 16 TRUE NA NA FALSE 20 1.0000000
head(seasons)
Error in head(seasons): object 'seasons' not found

Briefly describe the data

This dataset contains three tables about the actors and seasons of SNL, a late night comedy show (that I have never seen). The first table, actors, contains the name, gender, and cast type of each actor. The second table, casts, contains the relation between an actor and which seasons they appeared in. It also contains how many episodes of the season they were in, and some analygous information. The third table, seasons, contains information about the year, number of episodes, and episode ids of each season. Essentially, actor and seasons contains rows that represent one entity, and casts describes the relation between the two tables of entities.

Tidy Data (as needed)

Data does not need to be preprocessed.

Join Data

It seems like the only data we can really work with that involves joins would be gender. Originally, I wanted to consider the ratio of guests per season, but it seems as though there is no guest data in the casts table.

Because of this, I will answer the question: What is the ratio of males to non-males per year?

First, let’s join the tables together. Then I will group by the year and a new column I created, is_male. This column is true when the actor is a male and false otherwise. Then, I count how many episodes land in each column.

gender_counts = actors %>%
    left_join(casts, by = 'aid') %>%
    left_join(seasons, by = 'sid') %>%
    mutate(is_male = `gender` == 'male') %>%
    select(sid, year, gender, is_male, n_episodes.x) %>%
    group_by(`year`, `is_male`) %>%
    summarise(ep_count = sum(n_episodes.x))
Error in is.data.frame(y): object 'seasons' not found
head(gender_counts, 10)
Error in head(gender_counts, 10): object 'gender_counts' not found

Now, I need the total counts per year. The way I did it is way too overengineerd (reused my code) but it made it easier than re-doing everything. For a one-time calculation I don’t think this is an issue.

season_counts = actors %>%
    left_join(casts, by = 'aid') %>%
    left_join(seasons, by = 'sid') %>%
    mutate(is_male = `gender` == 'male') %>%
    select(sid, year, gender, is_male, n_episodes.x, n_episodes.y) %>%
    group_by(`year`) %>%
    summarise(ep_count = sum(n_episodes.x))
Error in is.data.frame(y): object 'seasons' not found
head(season_counts, 10)
Error in head(season_counts, 10): object 'season_counts' not found

Finally, I calculate the ratio using both tables.

gender_ratio = gender_counts %>%
    left_join(season_counts, by = 'year') %>%
    mutate(ep_ratio = ep_count.x / ep_count.y)
Error in left_join(., season_counts, by = "year"): object 'gender_counts' not found
head(gender_ratio, 10)
Error in head(gender_ratio, 10): object 'gender_ratio' not found
ggplot(gender_ratio, aes(x = `year`, y = `ep_ratio`, col = `is_male`)) +
    geom_bar(position = "stack", stat = "identity") +
    ggtitle('Ratio of Males to Non-Males on SNL') +
    labs(x = 'Year', y = 'Ratio of Males to Non-Males', color = 'Is Male')
Error in ggplot(gender_ratio, aes(x = year, y = ep_ratio, col = is_male)): object 'gender_ratio' not found

As we can see, the disparity in gender is decreasing year on year. A stacked bar chart is probably a controversial choice here, but I think it helps visualize that the height of the plot is the whole cast. A line plot makes it seem like the empty space is no one or may cause other confusion. This is almost like a pie chart per year, which I like.