Homework_3

hw3
snl
Author

Jyoti Rani

Published

August 22, 2022

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
Warning: package 'ggplot2' was built under R version 4.2.2
Warning: package 'tibble' was built under R version 4.2.2
Warning: package 'stringr' was built under R version 4.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(ggplot2)
library(lubridate)
Warning: package 'lubridate' was built under R version 4.2.2
Loading required package: timechange
Warning: package 'timechange' was built under R version 4.2.2

Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

I decided to build off of my last challenge and use the SNL data sets.

library(readr)
casts <- read_csv("_data/snl_casts.csv", show_col_types = FALSE)

actors <- read_csv("_data/snl_actors.csv", show_col_types = FALSE)

seasons <- read_csv("_data/snl_seasons.csv", show_col_types = FALSE)

Briefly describe the data

Each data set contains different information about the show Saturday Night Live. “Actors” and “Casts” share the column “aid”, the names of actors who have appeared on the show. “Casts” and “Seasons” share the column “sid” or “season id”, the season number. I will be joining all three sets based on these shaerd variables.

Tidy Data (as needed)

I’m going to join to do an inner join, which I think is the best choice because it will remove excess data. Then I will filter out columns that don’t seem useful to me.

#join data sets
snl <- inner_join(actors, casts, "aid")

snl <- inner_join(snl, seasons, "sid")

#remove excess columns
snl <- snl %>%
  select(-first_epid.x, -last_epid.x, -url, -type, -n_episodes.y, -year, -last_epid.y)

#mutate date
snl <- snl %>%
  mutate(first_epid.y = ymd(first_epid.y))

snl
# A tibble: 614 × 8
   aid           gender   sid featured update_anchor n_epis…¹ seaso…² first_ep…³
   <chr>         <chr>  <dbl> <lgl>    <lgl>            <dbl>   <dbl> <date>    
 1 Kate McKinnon female    37 TRUE     FALSE                5   0.227 2011-09-24
 2 Kate McKinnon female    38 TRUE     FALSE               21   1     2012-09-15
 3 Kate McKinnon female    39 FALSE    FALSE               21   1     2013-09-28
 4 Kate McKinnon female    40 FALSE    FALSE               21   1     2014-09-27
 5 Kate McKinnon female    41 FALSE    FALSE               21   1     2015-10-03
 6 Kate McKinnon female    42 FALSE    FALSE               21   1     2016-10-01
 7 Kate McKinnon female    43 FALSE    FALSE               21   1     2017-09-30
 8 Kate McKinnon female    44 FALSE    FALSE               21   1     2018-09-29
 9 Kate McKinnon female    45 FALSE    FALSE               18   1     2019-09-28
10 Kate McKinnon female    46 FALSE    FALSE               17   1     2020-10-03
# … with 604 more rows, and abbreviated variable names ¹​n_episodes.x,
#   ²​season_fraction, ³​first_epid.y

Data Frame Summary

print(summarytools::dfSummary(snl,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

snl

Dimensions: 614 x 8
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
aid [character]
1. Kenan Thompson
2. Darrell Hammond
3. Seth Meyers
4. Al Franken
5. Fred Armisen
6. Kate McKinnon
7. Tim Meadows
8. Aidy Bryant
9. Bobby Moynihan
10. Cecily Strong
[ 146 others ]
18(2.9%)
14(2.3%)
13(2.1%)
11(1.8%)
11(1.8%)
10(1.6%)
10(1.6%)
9(1.5%)
9(1.5%)
9(1.5%)
500(81.4%)
0 (0.0%)
gender [character]
1. female
2. male
3. unknown
194(31.6%)
416(67.8%)
4(0.7%)
0 (0.0%)
sid [numeric]
Mean (sd) : 25.5 (13.1)
min ≤ med ≤ max:
1 ≤ 26 ≤ 46
IQR (CV) : 22 (0.5)
46 distinct values 0 (0.0%)
featured [logical]
1. FALSE
2. TRUE
451(73.5%)
163(26.5%)
0 (0.0%)
update_anchor [logical]
1. FALSE
2. TRUE
541(88.1%)
73(11.9%)
0 (0.0%)
n_episodes.x [numeric]
Mean (sd) : 18.7 (4)
min ≤ med ≤ max:
1 ≤ 20 ≤ 24
IQR (CV) : 2 (0.2)
22 distinct values 0 (0.0%)
season_fraction [numeric]
Mean (sd) : 0.9 (0.2)
min ≤ med ≤ max:
0 ≤ 1 ≤ 1
IQR (CV) : 0 (0.2)
36 distinct values 0 (0.0%)
first_epid.y [Date]
min : 1975-10-11
med : 2000-10-07
max : 2020-10-03
range : 44y 11m 22d
46 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-23

Descriptive Stats & Vizualizations

The first thing I decided to investigate was the gender composition of each season. I decided to make a new data frame which will calculate the percentage of each gender on each season, then illustrate this using a stacked bar chart.

gender_by_season <- snl %>%
  group_by(sid) %>%
  count(gender) %>%
  summarize(percent = n/sum(n),
            gender = gender,
            .groups = 'drop')

gender_by_season
# A tibble: 95 × 3
     sid percent gender
   <dbl>   <dbl> <chr> 
 1     1   0.333 female
 2     1   0.667 male  
 3     2   0.375 female
 4     2   0.625 male  
 5     3   0.333 female
 6     3   0.667 male  
 7     4   0.333 female
 8     4   0.667 male  
 9     5   0.2   female
10     5   0.8   male  
# … with 85 more rows
#Stacked Bar Chart
ggplot(gender_by_season, aes(fill=gender, y=percent, x=sid)) +
  geom_bar(position = "fill", stat = "identity")+
  labs(x ="Season", y="Percentage of Cast", 
       title = "SNL Cast - Gender Composition")+
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))+
  scale_fill_manual(values = c("#D60270", "#0038A8", "#9B4F96"))

The variables in this visualization include gender and season. We can see that the ratio of men to women has become closer to even over the years, but not dramatically. This is also skewed by the increase in people whose gender is marked unknown- it is unclear if this is due to incomplete data or if the data failed to include other gender identities.

Alternatively, we can represent this as an area chart:

gender_by_season2 <- snl %>%
  group_by(sid) %>%
  count(gender)
  
ggplot(gender_by_season2, aes(x=sid, y=n, fill=gender)) + 
    geom_area()+
  labs(x ="Season", y="Number", 
       title = "SNL Cast - Gender Composition")+
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))+
  scale_fill_manual(values = c("#D60270", "#0038A8", "#9B4F96"))

I also thought it would be interesting to compare cast members by the amount of time they spent on the show. Since the first episode and last episode data is incomplete, I chose to use season numbers. I will also only be comparing the top 10 cast members, since there are 156 in this data set.

longest_cast <- snl %>%
  group_by(aid) %>%
  summarize(seasons_appeared = n()) %>%
  arrange(desc(seasons_appeared)) %>%
  slice(1:10)

ggplot(longest_cast, aes(x = reorder(aid, -seasons_appeared), y = seasons_appeared))+
  geom_bar(stat = "identity")+
  labs(x ="Cast Member", y="# Seasons Appeared", 
       title = "Top 10 SNL Cast Members")+
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))+
  scale_x_discrete(guide = guide_axis(n.dodge=3)) #fix overlapping labels

This bar chart is limited in that it only represents a small group of cast members. I was also unable to figure out how to plot the average number of seasons for all cast members on the chart, which I think would have better demonstrated the significance of these cast members’ tenure. While we can see that the top 10 are mostly men and the top 5 are exclusively men, it would be interesting to compare the tenure of male and female cast members.