Challenge 8 Instructions

challenge_8
SNL
Anirudh Lakkaraju
Joining Data
Author

Anirudh Lakkaraju

Published

May 15, 2023

library(tidyverse)
library(ggplot2)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. mutate variables as needed (including sanity checks)
  4. join two or more data sets and analyze some aspect of the joined data

(be sure to only include the category tags for the data you use!)

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • military marriages ⭐⭐
  • faostat ⭐⭐
  • railroads ⭐⭐⭐
  • fed_rate ⭐⭐⭐
  • debt ⭐⭐⭐
  • us_hh ⭐⭐⭐⭐
  • snl ⭐⭐⭐⭐⭐
seasons_df <- read.csv("_data/snl_seasons.csv")
actors_df <- read.csv("_data/snl_actors.csv")
casts_df <- read.csv("_data/snl_casts.csv")

# Displaying first few rows of the datasets
head(seasons_df) 
  sid year first_epid last_epid n_episodes
1   1 1975   19751011  19760731         24
2   2 1976   19760918  19770521         22
3   3 1977   19770924  19780520         20
4   4 1978   19781007  19790526         20
5   5 1979   19791013  19800524         20
6   6 1980   19801115  19810411         13
head(actors_df)
             aid           url  type  gender
1  Kate McKinnon   /Cast/?KaMc  cast  female
2    Alex Moffat   /Cast/?AlMo  cast    male
3     Ego Nwodim   /Cast/?EgNw  cast unknown
4     Chris Redd   /Cast/?ChRe  cast    male
5 Kenan Thompson   /Cast/?KeTh  cast    male
6 Carey Mulligan /Guests/?3677 guest    andy
head(casts_df)
               aid sid featured first_epid last_epid update_anchor n_episodes
1 A. Whitney Brown  11     True   19860222        NA         False          8
2 A. Whitney Brown  12     True         NA        NA         False         20
3 A. Whitney Brown  13     True         NA        NA         False         13
4 A. Whitney Brown  14     True         NA        NA         False         20
5 A. Whitney Brown  15     True         NA        NA         False         20
6 A. Whitney Brown  16     True         NA        NA         False         20
  season_fraction
1       0.4444444
2       1.0000000
3       1.0000000
4       1.0000000
5       1.0000000
6       1.0000000
# Dimensions of datasets
dim(seasons_df)
[1] 46  5
dim(actors_df)
[1] 2306    4
dim(casts_df)
[1] 614   8
summary(actors_df)
     aid                url                type              gender         
 Length:2306        Length:2306        Length:2306        Length:2306       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
summary(seasons_df)
      sid             year        first_epid         last_epid       
 Min.   : 1.00   Min.   :1975   Min.   :19751011   Min.   :19760731  
 1st Qu.:12.25   1st Qu.:1986   1st Qu.:19863512   1st Qu.:19872949  
 Median :23.50   Median :1998   Median :19975926   Median :19985512  
 Mean   :23.50   Mean   :1998   Mean   :19975965   Mean   :19985509  
 3rd Qu.:34.75   3rd Qu.:2009   3rd Qu.:20088423   3rd Qu.:20098015  
 Max.   :46.00   Max.   :2020   Max.   :20201003   Max.   :20210410  
   n_episodes  
 Min.   :12.0  
 1st Qu.:20.0  
 Median :20.0  
 Mean   :19.7  
 3rd Qu.:21.0  
 Max.   :24.0  
summary(casts_df)
     aid                 sid          featured           first_epid      
 Length:614         Min.   : 1.00   Length:614         Min.   :19770115  
 Class :character   1st Qu.:15.00   Class :character   1st Qu.:19801215  
 Mode  :character   Median :26.00   Mode  :character   Median :19901110  
                    Mean   :25.47                      Mean   :19909634  
                    3rd Qu.:37.00                      3rd Qu.:19957839  
                    Max.   :46.00                      Max.   :20141025  
                                                       NA's   :564       
   last_epid        update_anchor        n_episodes    season_fraction  
 Min.   :19751011   Length:614         Min.   : 1.00   Min.   :0.04167  
 1st Qu.:19850112   Class :character   1st Qu.:19.00   1st Qu.:1.00000  
 Median :19950225   Mode  :character   Median :20.00   Median :1.00000  
 Mean   :19944038                      Mean   :18.73   Mean   :0.94827  
 3rd Qu.:20040117                      3rd Qu.:21.00   3rd Qu.:1.00000  
 Max.   :20140201                      Max.   :24.00   Max.   :1.00000  
 NA's   :597                                                            

Briefly describe the data


The data frames contain organized and independent data. The actors Data Frame includes a comprehensive list of actors, guests, musical guests, and crew members from the show, with each entry representing an individual actor and providing details about their role type and gender.

The casts data frame focuses on actors who were part of a specific season’s cast. Each entry represents an actor and includes information like their featured status, the dates of their first and last episodes, whether they served as an anchor on weekend update, and the number of episodes they participated in during that season.

The seasons data frame provides information about each season individually. Each entry corresponds to a specific season and includes data such as the airing year, the dates of the first and last episodes, and the total number of episodes.

Tidy Data (as needed)

actors_df <- actors_df %>%
  drop_na()
seasons_df <- seasons_df %>%
  drop_na()
casts_df <- casts_df %>%
  drop_na()
colnames(actors_df)
[1] "aid"    "url"    "type"   "gender"
colnames(casts_df)
[1] "aid"             "sid"             "featured"        "first_epid"     
[5] "last_epid"       "update_anchor"   "n_episodes"      "season_fraction"
colnames(seasons_df)
[1] "sid"        "year"       "first_epid" "last_epid"  "n_episodes"
actors_df <- actors_df %>%
  mutate(appearances = rowSums(select(., starts_with("ep_")), na.rm = TRUE))

Join Data

df_casts_actors <- casts_df %>%
  left_join(actors_df, by = "aid") %>%
  select(sid, type, gender, featured, appearances)
df_data <- seasons_df %>%
  left_join(df_casts_actors, by = "sid")
colnames(df_data)
[1] "sid"         "year"        "first_epid"  "last_epid"   "n_episodes" 
[6] "type"        "gender"      "featured"    "appearances"
colnames(df_casts_actors)
[1] "sid"         "type"        "gender"      "featured"    "appearances"
joined_data <- casts_df %>%
  select(-update_anchor, -season_fraction, -n_episodes) %>%
  left_join(seasons_df, by="sid")

joined_data
[1] aid          sid          featured     first_epid.x last_epid.x 
[6] year         first_epid.y last_epid.y  n_episodes  
<0 rows> (or 0-length row.names)
colnames(joined_data)
[1] "aid"          "sid"          "featured"     "first_epid.x" "last_epid.x" 
[6] "year"         "first_epid.y" "last_epid.y"  "n_episodes"