Challenge 8

challenge_8

Ishan Bhardwaj

snl

Joining Data Tables

Author

Ishan Bhardwaj

Published

May 21, 2023

library(tidyverse)
library(ggplot2)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
tidy data (as needed, including sanity checks)
mutate variables as needed (including sanity checks)
join two or more data sets and analyze some aspect of the joined data

(be sure to only include the category tags for the data you use!)

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

military marriages ⭐⭐
faostat ⭐⭐
railroads ⭐⭐⭐
fed_rate ⭐⭐⭐
debt ⭐⭐⭐
us_hh ⭐⭐⭐⭐
snl ⭐⭐⭐⭐⭐

actors <- read_csv("_data/snl_actors.csv")
actors

# A tibble: 2,306 × 4
   aid            url           type  gender 
   <chr>          <chr>         <chr> <chr>  
 1 Kate McKinnon  /Cast/?KaMc   cast  female 
 2 Alex Moffat    /Cast/?AlMo   cast  male   
 3 Ego Nwodim     /Cast/?EgNw   cast  unknown
 4 Chris Redd     /Cast/?ChRe   cast  male   
 5 Kenan Thompson /Cast/?KeTh   cast  male   
 6 Carey Mulligan /Guests/?3677 guest andy   
 7 Marcus Mumford /Guests/?3679 guest male   
 8 Aidy Bryant    /Cast/?AiBr   cast  female 
 9 Steve Higgins  /Crew/?StHi   crew  male   
10 Mikey Day      /Cast/?MiDa   cast  male   
# ℹ 2,296 more rows

casts <- read_csv("_data/snl_casts.csv")
casts

# A tibble: 614 × 8
   aid                sid featured first_epid last_epid update_anchor n_episodes
   <chr>            <dbl> <lgl>         <dbl>     <dbl> <lgl>              <dbl>
 1 A. Whitney Brown    11 TRUE       19860222        NA FALSE                  8
 2 A. Whitney Brown    12 TRUE             NA        NA FALSE                 20
 3 A. Whitney Brown    13 TRUE             NA        NA FALSE                 13
 4 A. Whitney Brown    14 TRUE             NA        NA FALSE                 20
 5 A. Whitney Brown    15 TRUE             NA        NA FALSE                 20
 6 A. Whitney Brown    16 TRUE             NA        NA FALSE                 20
 7 Alan Zweibel         5 TRUE       19800409        NA FALSE                  5
 8 Sasheer Zamata      39 TRUE       20140118        NA FALSE                 11
 9 Sasheer Zamata      40 TRUE             NA        NA FALSE                 21
10 Sasheer Zamata      41 FALSE            NA        NA FALSE                 21
# ℹ 604 more rows
# ℹ 1 more variable: season_fraction <dbl>

seasons <- read_csv("_data/snl_seasons.csv")
seasons

# A tibble: 46 × 5
     sid  year first_epid last_epid n_episodes
   <dbl> <dbl>      <dbl>     <dbl>      <dbl>
 1     1  1975   19751011  19760731         24
 2     2  1976   19760918  19770521         22
 3     3  1977   19770924  19780520         20
 4     4  1978   19781007  19790526         20
 5     5  1979   19791013  19800524         20
 6     6  1980   19801115  19810411         13
 7     7  1981   19811003  19820522         20
 8     8  1982   19820925  19830514         20
 9     9  1983   19831008  19840512         19
10    10  1984   19841006  19850413         17
# ℹ 36 more rows

Briefly describe the data

colnames(actors)

[1] "aid"    "url"    "type"   "gender"

colnames(casts)

[1] "aid"             "sid"             "featured"        "first_epid"     
[5] "last_epid"       "update_anchor"   "n_episodes"      "season_fraction"

colnames(seasons)

[1] "sid"        "year"       "first_epid" "last_epid"  "n_episodes"

The actors dataset provides general cast and gender information for the cast on SNL. The casts dataset tells us which SNL seasons each actor participated in (marked by “sid” variable) and details about each actor’s participation in that season. The seasons dataset provides general information about different SNL seasons only, such as the year the season aired, the start and end date, and the number of episodes in that season.

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

This data is already tidy because each relevant variable has its own column, each case its own row, and each value its own cell. One change we could possibly make is collapsing the first episode and last episode dates into one column for the casts and seasons datasets, but that would make us repeat actor names and season IDs which could look untidy.

Are there any variables that require mutation to be usable in your analysis stream? For example, do you need to calculate new values in order to graph them? Can string values be represented numerically? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

Document your work here.

casts <- casts %>%
  mutate(first_epid = ymd(first_epid), last_epid = ymd(last_epid))
casts

# A tibble: 614 × 8
   aid               sid featured first_epid last_epid update_anchor n_episodes
   <chr>           <dbl> <lgl>    <date>     <date>    <lgl>              <dbl>
 1 A. Whitney Bro…    11 TRUE     1986-02-22 NA        FALSE                  8
 2 A. Whitney Bro…    12 TRUE     NA         NA        FALSE                 20
 3 A. Whitney Bro…    13 TRUE     NA         NA        FALSE                 13
 4 A. Whitney Bro…    14 TRUE     NA         NA        FALSE                 20
 5 A. Whitney Bro…    15 TRUE     NA         NA        FALSE                 20
 6 A. Whitney Bro…    16 TRUE     NA         NA        FALSE                 20
 7 Alan Zweibel        5 TRUE     1980-04-09 NA        FALSE                  5
 8 Sasheer Zamata     39 TRUE     2014-01-18 NA        FALSE                 11
 9 Sasheer Zamata     40 TRUE     NA         NA        FALSE                 21
10 Sasheer Zamata     41 FALSE    NA         NA        FALSE                 21
# ℹ 604 more rows
# ℹ 1 more variable: season_fraction <dbl>

seasons <- seasons %>%
  mutate(first_epid = ymd(first_epid), last_epid = ymd(last_epid))
seasons

# A tibble: 46 × 5
     sid  year first_epid last_epid  n_episodes
   <dbl> <dbl> <date>     <date>          <dbl>
 1     1  1975 1975-10-11 1976-07-31         24
 2     2  1976 1976-09-18 1977-05-21         22
 3     3  1977 1977-09-24 1978-05-20         20
 4     4  1978 1978-10-07 1979-05-26         20
 5     5  1979 1979-10-13 1980-05-24         20
 6     6  1980 1980-11-15 1981-04-11         13
 7     7  1981 1981-10-03 1982-05-22         20
 8     8  1982 1982-09-25 1983-05-14         20
 9     9  1983 1983-10-08 1984-05-12         19
10    10  1984 1984-10-06 1985-04-13         17
# ℹ 36 more rows

The first and last episode date columns in the casts and seasons datasets are not in the correct format. Hence, I mutated them into the lubridate ymd() format.

Join Data

merged <- actors %>%
  left_join(casts, by = "aid") %>%
  rename("actor_first_epid" = "first_epid", "actor_last_epid" = "last_epid", "n_actor_episodes" = "n_episodes") %>%
  left_join(seasons, by = "sid") %>%
  rename("season_first_epid" = "first_epid", "season_last_epid" = "last_epid") %>%
  select(1, 4, 3, 2, 5, 6, 12:14, 7:11) %>%
  arrange(season_first_epid)
merged

# A tibble: 2,764 × 14
   aid                gender type  url      sid featured  year season_first_epid
   <chr>              <chr>  <chr> <chr>  <dbl> <lgl>    <dbl> <date>           
 1 Chevy Chase        male   cast  /Cast…     1 FALSE     1975 1975-10-11       
 2 Dan Aykroyd        male   cast  /Cast…     1 FALSE     1975 1975-10-11       
 3 Garrett Morris     male   cast  /Cast…     1 FALSE     1975 1975-10-11       
 4 John Belushi       male   cast  /Cast…     1 FALSE     1975 1975-10-11       
 5 Michael O'Donoghue male   cast  /Cast…     1 FALSE     1975 1975-10-11       
 6 Jane Curtin        female cast  /Cast…     1 FALSE     1975 1975-10-11       
 7 Laraine Newman     female cast  /Cast…     1 FALSE     1975 1975-10-11       
 8 Gilda Radner       female cast  /Cast…     1 FALSE     1975 1975-10-11       
 9 George Coe         male   cast  /Cast…     1 FALSE     1975 1975-10-11       
10 Bill Murray        male   cast  /Cast…     2 FALSE     1976 1976-09-18       
# ℹ 2,754 more rows
# ℹ 6 more variables: season_last_epid <date>, actor_first_epid <date>,
#   actor_last_epid <date>, update_anchor <lgl>, n_actor_episodes <dbl>,
#   season_fraction <dbl>

We want to merge all these datasets into one. First, we match all the actor IDs in actors to those in casts and merge them using left_join(). We then rename the date and episode count columns so we can heave a clear differentiation when we make the next merge. Now we have a dataset with general actor information and their cast participation information for different SNL seasons. Finally, we match all the season IDs in this merged table to those in seasons and merge them using left_join(). We rename the newly added date columns such that they correspond to entries from the seasons dataset. Now, we have a fully joined dataset - merged - that contains general actor information, cast information for each actor, and general information for each season that each actor participated in. I have rearranged this dataset to look more coherent using select() and arranged the entries in ascending order of season start date.

Visualizations

merged %>%
  ggplot(aes(sid, fill = gender)) +
  geom_histogram(bins = 20) + 
  theme_minimal() +
  labs(title = "Bar Plot", subtitle = "Cast composition for different SNL seasons", y = "No. of cast members", x = "Season ID", fill = "Gender")

This graphs tells us the number of cast members for different SNL seasons (denoted by their season IDs) and the gender composition of the casts for each season.