Challenge 8

challenge_8

Abhinav Reddy Yadatha

snl

Joining Data

Author

Abhinav Reddy Yadatha

Published

May 15, 2023

library(tidyverse)
library(lubridate)
library(ggplot2)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
tidy data (as needed, including sanity checks)
mutate variables as needed (including sanity checks)
join two or more data sets and analyze some aspect of the joined data

(be sure to only include the category tags for the data you use!)

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

military marriages ⭐⭐
faostat ⭐⭐
railroads ⭐⭐⭐
fed_rate ⭐⭐⭐
debt ⭐⭐⭐
us_hh ⭐⭐⭐⭐
snl ⭐⭐⭐⭐⭐

seasons_data <- read.csv("_data/snl_seasons.csv")
actors_data <- read.csv("_data/snl_actors.csv")
casts_data <- read.csv("_data/snl_casts.csv")

# Displaying first few rows of the datasets
head(seasons_data)

  sid year first_epid last_epid n_episodes
1   1 1975   19751011  19760731         24
2   2 1976   19760918  19770521         22
3   3 1977   19770924  19780520         20
4   4 1978   19781007  19790526         20
5   5 1979   19791013  19800524         20
6   6 1980   19801115  19810411         13

head(actors_data)

             aid           url  type  gender
1  Kate McKinnon   /Cast/?KaMc  cast  female
2    Alex Moffat   /Cast/?AlMo  cast    male
3     Ego Nwodim   /Cast/?EgNw  cast unknown
4     Chris Redd   /Cast/?ChRe  cast    male
5 Kenan Thompson   /Cast/?KeTh  cast    male
6 Carey Mulligan /Guests/?3677 guest    andy

head(casts_data)

               aid sid featured first_epid last_epid update_anchor n_episodes
1 A. Whitney Brown  11     True   19860222        NA         False          8
2 A. Whitney Brown  12     True         NA        NA         False         20
3 A. Whitney Brown  13     True         NA        NA         False         13
4 A. Whitney Brown  14     True         NA        NA         False         20
5 A. Whitney Brown  15     True         NA        NA         False         20
6 A. Whitney Brown  16     True         NA        NA         False         20
  season_fraction
1       0.4444444
2       1.0000000
3       1.0000000
4       1.0000000
5       1.0000000
6       1.0000000

# Dimensions of datasets
dim(seasons_data)

[1] 46  5

dim(actors_data)

[1] 2306    4

dim(casts_data)

[1] 614   8

# Summaries of datasets
summary(seasons_data)

      sid             year        first_epid         last_epid       
 Min.   : 1.00   Min.   :1975   Min.   :19751011   Min.   :19760731  
 1st Qu.:12.25   1st Qu.:1986   1st Qu.:19863512   1st Qu.:19872949  
 Median :23.50   Median :1998   Median :19975926   Median :19985512  
 Mean   :23.50   Mean   :1998   Mean   :19975965   Mean   :19985509  
 3rd Qu.:34.75   3rd Qu.:2009   3rd Qu.:20088423   3rd Qu.:20098015  
 Max.   :46.00   Max.   :2020   Max.   :20201003   Max.   :20210410  
   n_episodes  
 Min.   :12.0  
 1st Qu.:20.0  
 Median :20.0  
 Mean   :19.7  
 3rd Qu.:21.0  
 Max.   :24.0

summary(actors_data)

     aid                url                type              gender         
 Length:2306        Length:2306        Length:2306        Length:2306       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character

summary(casts_data)

     aid                 sid          featured           first_epid      
 Length:614         Min.   : 1.00   Length:614         Min.   :19770115  
 Class :character   1st Qu.:15.00   Class :character   1st Qu.:19801215  
 Mode  :character   Median :26.00   Mode  :character   Median :19901110  
                    Mean   :25.47                      Mean   :19909634  
                    3rd Qu.:37.00                      3rd Qu.:19957839  
                    Max.   :46.00                      Max.   :20141025  
                                                       NA's   :564       
   last_epid        update_anchor        n_episodes    season_fraction  
 Min.   :19751011   Length:614         Min.   : 1.00   Min.   :0.04167  
 1st Qu.:19850112   Class :character   1st Qu.:19.00   1st Qu.:1.00000  
 Median :19950225   Mode  :character   Median :20.00   Median :1.00000  
 Mean   :19944038                      Mean   :18.73   Mean   :0.94827  
 3rd Qu.:20040117                      3rd Qu.:21.00   3rd Qu.:1.00000  
 Max.   :20140201                      Max.   :24.00   Max.   :1.00000  
 NA's   :597

Briefly describe the data

The data frames contain independent and well-organized data. The actors Data Frame consists of a comprehensive list of actors, guests, musical guests, and crew members who have appeared on the show. Each observation pertains to an individual actor and includes details about their role type and gender.

On the other hand, the casts data frame focuses on actors who were part of the cast during a specific season. Each observation represents an actor and includes information such as their featured status, dates of the first and last episodes they appeared in, whether they served as an anchor on weekend update, and the number of episodes they participated in during that season.

Lastly, the seasons data frame encompasses information about each specific season. Each observation corresponds to a particular season and includes data such as the year it aired, the dates of the first and last episodes, and the total number of episodes.

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

# Mutating seasons dataset 

seasons_data <- seasons_data %>% # (46 x 5)
        mutate(first_epid = ymd(first_epid), last_epid = ymd(last_epid))

# Mutating casts dataset 
casts_data <- casts_data %>%  
        mutate(first_epid = ymd(first_epid), last_epid = ymd(last_epid))

Are there any variables that require mutation to be usable in your analysis stream? For example, do you need to calculate new values in order to graph them? Can string values be represented numerically? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

Document your work here.

Join Data

Be sure to include a sanity check, and double-check that case count is correct!

# Sanity chack : cast members count :154 

actors_data <-actors_data %>%
       select(-url) %>%
       filter(type == "cast")  



# Saanity Check:  cast members count : 156
casts_data %>%
      select(aid)%>%
      n_distinct()

[1] 156

fullyjoined_data <- full_join(actors_data, casts_data, by = "aid") %>%
        select(c(aid, gender, sid, featured, update_anchor))

Join of both the tables have been completed. Let us perform some analysis.

head(fullyjoined_data)

            aid gender sid featured update_anchor
1 Kate McKinnon female  37     True         False
2 Kate McKinnon female  38     True         False
3 Kate McKinnon female  39    False         False
4 Kate McKinnon female  40    False         False
5 Kate McKinnon female  41    False         False
6 Kate McKinnon female  42    False         False

fullyjoined_data %>%
      filter(update_anchor == "True") %>%
      select(gender)%>%
      table() %>%
      prop.table()

gender
  female     male 
0.260274 0.739726

# When considering all seasons, approximately 74% of the hosts for weekend update are male.