Challenge 8 Solution - SNL Gender Distribution Over Time

challenge_8
snl
Linus Jen
Joining Data
Author

Linus Jen

Published

June 27, 2023

library(tidyverse)
library(ggplot2)
library(here)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

For this challenge, I will be investigating the SNL related datasets (snl_actors.csv, snl_casts.csv, and snl_seasons.csv).

For my analysis, I want to look at how the gender diversity has changed over time.

actors <- read_csv(here("posts", "_data", "snl_actors.csv"))
casts <- read_csv(here("posts", "_data", "snl_casts.csv"))
seasons <- read_csv(here("posts", "_data", "snl_seasons.csv"))

# Rename some columns
casts <- casts %>%
  rename("n_eps_present" = "n_episodes",
         "first_ep_present" = "first_epid",
         "last_ep_present" = "last_epid")
seasons <- seasons %>%
  rename("s_first_epid" = "first_epid",
         "s_last_epid" = "last_epid",
         "s_n_episodes" = "n_episodes")

# Summary information
print(summarytools::dfSummary(actors,
                              varnumbers=FALSE,
                              plain.ascii=FALSE,
                              style="grid",
                              graph.magnif = 0.70,
                              valid.col=FALSE),
      method="render",
      table.classes="table-condensed")

Data Frame Summary

actors

Dimensions: 2306 x 4
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
aid [character]
1. 'N Sync
2. 070 Shake
3. 10,000 Maniacs
4. 14 Karat Soul
5. 2 Chainz
6. 3-D
7. 3RDEYEGIRL
8. 50 Cent
9. A Tribe Called Quest
10. A$AP Rocky
[ 2296 others ]
1 ( 0.0% )
1 ( 0.0% )
1 ( 0.0% )
1 ( 0.0% )
1 ( 0.0% )
1 ( 0.0% )
1 ( 0.0% )
1 ( 0.0% )
1 ( 0.0% )
1 ( 0.0% )
2296 ( 99.6% )
0 (0.0%)
url [character]
1. /Cast/?AbEl
2. /Cast/?AdSa
3. /Cast/?AiBr
4. /Cast/?AlFr
5. /Cast/?AlMo
6. /Cast/?AlZw
7. /Cast/?AmPo
8. /Cast/?AnDi
9. /Cast/?AnGa
10. /Cast/?AnMH
[ 2239 others ]
1 ( 0.0% )
1 ( 0.0% )
1 ( 0.0% )
1 ( 0.0% )
1 ( 0.0% )
1 ( 0.0% )
1 ( 0.0% )
1 ( 0.0% )
1 ( 0.0% )
1 ( 0.0% )
2239 ( 99.6% )
57 (2.5%)
type [character]
1. cast
2. crew
3. guest
4. unknown
154 ( 6.7% )
170 ( 7.4% )
1926 ( 83.5% )
56 ( 2.4% )
0 (0.0%)
gender [character]
1. andy
2. female
3. male
4. unknown
21 ( 0.9% )
671 ( 29.1% )
1226 ( 53.2% )
388 ( 16.8% )
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.0)
2023-06-18

print(summarytools::dfSummary(casts,
                              varnumbers=FALSE,
                              plain.ascii=FALSE,
                              style="grid",
                              graph.magnif = 0.70,
                              valid.col=FALSE),
      method="render",
      table.classes="table-condensed")

Data Frame Summary

casts

Dimensions: 614 x 8
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
aid [character]
1. Kenan Thompson
2. Darrell Hammond
3. Seth Meyers
4. Al Franken
5. Fred Armisen
6. Kate McKinnon
7. Tim Meadows
8. Aidy Bryant
9. Bobby Moynihan
10. Cecily Strong
[ 146 others ]
18 ( 2.9% )
14 ( 2.3% )
13 ( 2.1% )
11 ( 1.8% )
11 ( 1.8% )
10 ( 1.6% )
10 ( 1.6% )
9 ( 1.5% )
9 ( 1.5% )
9 ( 1.5% )
500 ( 81.4% )
0 (0.0%)
sid [numeric]
Mean (sd) : 25.5 (13.1)
min ≤ med ≤ max:
1 ≤ 26 ≤ 46
IQR (CV) : 22 (0.5)
46 distinct values 0 (0.0%)
featured [logical]
1. FALSE
2. TRUE
451 ( 73.5% )
163 ( 26.5% )
0 (0.0%)
first_ep_present [numeric]
Mean (sd) : 19909635 (111264.8)
min ≤ med ≤ max:
19770115 ≤ 19901110 ≤ 20141025
IQR (CV) : 156624.2 (0)
35 distinct values 564 (91.9%)
last_ep_present [numeric]
Mean (sd) : 19944038 (126122.2)
min ≤ med ≤ max:
19751011 ≤ 19950225 ≤ 20140201
IQR (CV) : 190005 (0)
17 distinct values 597 (97.2%)
update_anchor [logical]
1. FALSE
2. TRUE
541 ( 88.1% )
73 ( 11.9% )
0 (0.0%)
n_eps_present [numeric]
Mean (sd) : 18.7 (4)
min ≤ med ≤ max:
1 ≤ 20 ≤ 24
IQR (CV) : 2 (0.2)
22 distinct values 0 (0.0%)
season_fraction [numeric]
Mean (sd) : 0.9 (0.2)
min ≤ med ≤ max:
0 ≤ 1 ≤ 1
IQR (CV) : 0 (0.2)
36 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.0)
2023-06-18

print(summarytools::dfSummary(seasons,
                              varnumbers=FALSE,
                              plain.ascii=FALSE,
                              style="grid",
                              graph.magnif = 0.70,
                              valid.col=FALSE),
      method="render",
      table.classes="table-condensed")

Data Frame Summary

seasons

Dimensions: 46 x 5
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
sid [numeric]
Mean (sd) : 23.5 (13.4)
min ≤ med ≤ max:
1 ≤ 23.5 ≤ 46
IQR (CV) : 22.5 (0.6)
46 distinct values 0 (0.0%)
year [numeric]
Mean (sd) : 1997.5 (13.4)
min ≤ med ≤ max:
1975 ≤ 1997.5 ≤ 2020
IQR (CV) : 22.5 (0)
46 distinct values 0 (0.0%)
s_first_epid [numeric]
Mean (sd) : 19975965 (134209.3)
min ≤ med ≤ max:
19751011 ≤ 19975927 ≤ 20201003
IQR (CV) : 224910.2 (0)
46 distinct values 0 (0.0%)
s_last_epid [numeric]
Mean (sd) : 19985509 (134223.9)
min ≤ med ≤ max:
19760731 ≤ 19985512 ≤ 20210410
IQR (CV) : 225066.2 (0)
46 distinct values 0 (0.0%)
s_n_episodes [numeric]
Mean (sd) : 19.7 (2.3)
min ≤ med ≤ max:
12 ≤ 20 ≤ 24
IQR (CV) : 1 (0.1)
12 : 1 ( 2.2% )
13 : 2 ( 4.3% )
17 : 2 ( 4.3% )
18 : 2 ( 4.3% )
19 : 3 ( 6.5% )
20 : 23 ( 50.0% )
21 : 7 ( 15.2% )
22 : 5 ( 10.9% )
24 : 1 ( 2.2% )
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.0)
2023-06-18

Briefly describe the data

This data contains information about the casts and actors that partook in any SNL show.

snl_actors.csv contains 2,306 rows of people who were part of SNL, and holds the name (under aid), the url for the person, the type of member (cast, guest, crew, or unknown, and gender (male, female, unknown, and andy (not sure what gender andy is)). This data is already tidy.

snl_casts.csv contains information about cast members, and is broken down by the the name (under aid) and the season (sid). It also tells if the member was featured in the show or season, the first episode and last episode of SNL they were on (first_epid and last_epid, respectively), if they were part of the weekend update_anchor, the n_epsiodes they were on, and what fraction of episodes for a particular season they were featured on (season_fraction). Both first_epid and last_epid seem to be dates in the form of YYYYMMDD, but because this seems to be an identifier, I won’t be changing it to a date type. This data also seems to already be tidy.

Lastly, snl_seasons.csv gives insight about each season, and there 46 years in this dataset (from 1975 to 2020). It includes a unique identifier (sid), year, IDs for the first and last episode (first_epid and last_epid), and n_episodes per season. both first_epid and last_epid seem to be dates in the form of YYYYMMDD, but because this seems to be an identifier, I won’t be changing it to a date type.

To join our data, I will be using the aid columns from actors and casts as the join condition, and then use the sid from casts and seasons to join these two datasets. I will use an inner join, as for my analysis comparing gender diversity across the history of SNL, I need gender information (found in actors) and year that each actor was present (found in seasons)

Joining and tidying the data

Given that casts is the dataframe connecting all datasets together, we expect to see each row representing a cast member and a season. Thus, there should be at most 614 rows (matching the rows of casts), though it can be less if there aren’t any joins. We also expect 15 columns (4 columns in actors, 8 columns in casts, and 5 columns in seasons, minus 2 ID columns from joining).

# Join datasets
full_df <- casts %>%
  inner_join(actors, by="aid") %>%
  inner_join(seasons, by="sid")

# View data together
print(summarytools::dfSummary(full_df,
                              varnumbers=FALSE,
                              plain.ascii=FALSE,
                              style="grid",
                              graph.magnif = 0.70,
                              valid.col=FALSE),
      method="render",
      table.classes="table-condensed")

Data Frame Summary

full_df

Dimensions: 614 x 15
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
aid [character]
1. Kenan Thompson
2. Darrell Hammond
3. Seth Meyers
4. Al Franken
5. Fred Armisen
6. Kate McKinnon
7. Tim Meadows
8. Aidy Bryant
9. Bobby Moynihan
10. Cecily Strong
[ 146 others ]
18 ( 2.9% )
14 ( 2.3% )
13 ( 2.1% )
11 ( 1.8% )
11 ( 1.8% )
10 ( 1.6% )
10 ( 1.6% )
9 ( 1.5% )
9 ( 1.5% )
9 ( 1.5% )
500 ( 81.4% )
0 (0.0%)
sid [numeric]
Mean (sd) : 25.5 (13.1)
min ≤ med ≤ max:
1 ≤ 26 ≤ 46
IQR (CV) : 22 (0.5)
46 distinct values 0 (0.0%)
featured [logical]
1. FALSE
2. TRUE
451 ( 73.5% )
163 ( 26.5% )
0 (0.0%)
first_ep_present [numeric]
Mean (sd) : 19909635 (111264.8)
min ≤ med ≤ max:
19770115 ≤ 19901110 ≤ 20141025
IQR (CV) : 156624.2 (0)
35 distinct values 564 (91.9%)
last_ep_present [numeric]
Mean (sd) : 19944038 (126122.2)
min ≤ med ≤ max:
19751011 ≤ 19950225 ≤ 20140201
IQR (CV) : 190005 (0)
17 distinct values 597 (97.2%)
update_anchor [logical]
1. FALSE
2. TRUE
541 ( 88.1% )
73 ( 11.9% )
0 (0.0%)
n_eps_present [numeric]
Mean (sd) : 18.7 (4)
min ≤ med ≤ max:
1 ≤ 20 ≤ 24
IQR (CV) : 2 (0.2)
22 distinct values 0 (0.0%)
season_fraction [numeric]
Mean (sd) : 0.9 (0.2)
min ≤ med ≤ max:
0 ≤ 1 ≤ 1
IQR (CV) : 0 (0.2)
36 distinct values 0 (0.0%)
url [character]
1. /Cast/?KeTh
2. /Cast/?DaHa
3. /Cast/?SeMe
4. /Cast/?AlFr
5. /Cast/?FrAr
6. /Cast/?KaMc
7. /Cast/?TiMe
8. /Cast/?AiBr
9. /Cast/?BoMo
10. /Cast/?CeSt
[ 144 others ]
18 ( 3.0% )
14 ( 2.3% )
13 ( 2.1% )
11 ( 1.8% )
11 ( 1.8% )
10 ( 1.6% )
10 ( 1.6% )
9 ( 1.5% )
9 ( 1.5% )
9 ( 1.5% )
493 ( 81.2% )
7 (1.1%)
type [character]
1. cast
2. unknown
607 ( 98.9% )
7 ( 1.1% )
0 (0.0%)
gender [character]
1. female
2. male
3. unknown
194 ( 31.6% )
416 ( 67.8% )
4 ( 0.7% )
0 (0.0%)
year [numeric]
Mean (sd) : 1999.5 (13.1)
min ≤ med ≤ max:
1975 ≤ 2000 ≤ 2020
IQR (CV) : 22 (0)
46 distinct values 0 (0.0%)
s_first_epid [numeric]
Mean (sd) : 19995688 (130607)
min ≤ med ≤ max:
19751011 ≤ 20001007 ≤ 20201003
IQR (CV) : 219994 (0)
46 distinct values 0 (0.0%)
s_last_epid [numeric]
Mean (sd) : 20005232 (130621.3)
min ≤ med ≤ max:
19760731 ≤ 20010519 ≤ 20210410
IQR (CV) : 220000 (0)
46 distinct values 0 (0.0%)
s_n_episodes [numeric]
Mean (sd) : 19.7 (2.1)
min ≤ med ≤ max:
12 ≤ 20 ≤ 24
IQR (CV) : 1 (0.1)
12 : 12 ( 2.0% )
13 : 24 ( 3.9% )
17 : 30 ( 4.9% )
18 : 31 ( 5.0% )
19 : 37 ( 6.0% )
20 : 297 ( 48.4% )
21 : 111 ( 18.1% )
22 : 63 ( 10.3% )
24 : 9 ( 1.5% )
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.0)
2023-06-18

From the summary statistics above, we see that there are 614 columns rows and 15 columns, which matches our predictions.

Analysis

As mentioned previously, my goal is to see how the gender demographics change over time. To do this, I will make a line plot with the x-axis representing the year, and the y-axis representing the proportion of a specific gender who made up the cast. Each line will represent a gender category.

gender_breakdown <- full_df %>%
  group_by(year, gender) %>%
  summarise(n_gender = n_distinct(aid)) %>%
  ungroup() %>%
  group_by(year) %>%
  mutate(prop_gender = n_gender / sum(n_gender))

# Show dataste
gender_breakdown
# A tibble: 95 × 4
# Groups:   year [46]
    year gender n_gender prop_gender
   <dbl> <chr>     <int>       <dbl>
 1  1975 female        3       0.333
 2  1975 male          6       0.667
 3  1976 female        3       0.375
 4  1976 male          5       0.625
 5  1977 female        3       0.333
 6  1977 male          6       0.667
 7  1978 female        3       0.333
 8  1978 male          6       0.667
 9  1979 female        3       0.2  
10  1979 male         12       0.8  
# ℹ 85 more rows
# Visualize data
ggplot(gender_breakdown, aes(x=year, y=prop_gender, color=gender)) +
  geom_line() +
  geom_point() +
  geom_text(aes(x=year, y=prop_gender, label=n_gender), vjust=-0.5) +
  theme_minimal() +
  scale_x_continuous(breaks=min(gender_breakdown$year):max(gender_breakdown$year), 
                     labels=min(gender_breakdown$year):max(gender_breakdown$year)) +
  scale_y_continuous(labels=scales::percent_format(accuracy=1)) +
  labs(title="SNL Gender Distribution Across the Year",
       x="Year",
       y="Percentage (%)") +
  theme(axis.text.x = element_text(angle=90)) +
  guides(fill=guide_legend(title="Gender"))

The graph above shows both the percentages and the counts the genders for the casts for each show, assuming that the data is correct. It’s interesting to see that the proportion of males on the show tends to decrease over time, while the proportion of women tends to increase over time. The number of cast members also seems to increase over time, as in 1975, there were only 9 distinct people in the entire season, but by 2020, there were 20 unique people on the show.