Challenge 8

challenge_8
snl
Joining Data
Author

Xinyang Mao

Published

May 22, 2023

library(tidyverse)
library(ggplot2)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

Read in one (or more) of the following datasets, using the correct R package and command. - snl ⭐⭐⭐⭐⭐

actors <- read.csv("_data/snl_actors.csv")
seasons <- read.csv("_data/snl_seasons.csv")
casts <- read.csv("_data/snl_casts.csv")

Briefly describe the data

show dimension of actors

dim(actors)
[1] 2306    4

Show the columns of actors

colnames(actors)
[1] "aid"    "url"    "type"   "gender"
head(actors)
             aid           url  type  gender
1  Kate McKinnon   /Cast/?KaMc  cast  female
2    Alex Moffat   /Cast/?AlMo  cast    male
3     Ego Nwodim   /Cast/?EgNw  cast unknown
4     Chris Redd   /Cast/?ChRe  cast    male
5 Kenan Thompson   /Cast/?KeTh  cast    male
6 Carey Mulligan /Guests/?3677 guest    andy

By examining the dimensions of these dataframes, we can observe that “Saturday Night Live” spanned across 46 seasons and involved a total of 2306 actors.The “actors” dataset focuses on specific information related to the actors themselves.

show dimension of seasons

dim(seasons)
[1] 46  5

Show the columns of seasons

colnames(seasons)
[1] "sid"        "year"       "first_epid" "last_epid"  "n_episodes"
head(seasons)
  sid year first_epid last_epid n_episodes
1   1 1975   19751011  19760731         24
2   2 1976   19760918  19770521         22
3   3 1977   19770924  19780520         20
4   4 1978   19781007  19790526         20
5   5 1979   19791013  19800524         20
6   6 1980   19801115  19810411         13

For seasons dataset,the “seasons” dataset provides comprehensive details about each season of the show, including the premiere and finale episodes, the initial airing year of each season, and the overall number of episodes in each season`

show dimension of casts

dim(casts)
[1] 614   8

Show the columns of casts

colnames(casts)
[1] "aid"             "sid"             "featured"        "first_epid"     
[5] "last_epid"       "update_anchor"   "n_episodes"      "season_fraction"
head(casts)
               aid sid featured first_epid last_epid update_anchor n_episodes
1 A. Whitney Brown  11     True   19860222        NA         False          8
2 A. Whitney Brown  12     True         NA        NA         False         20
3 A. Whitney Brown  13     True         NA        NA         False         13
4 A. Whitney Brown  14     True         NA        NA         False         20
5 A. Whitney Brown  15     True         NA        NA         False         20
6 A. Whitney Brown  16     True         NA        NA         False         20
  season_fraction
1       0.4444444
2       1.0000000
3       1.0000000
4       1.0000000
5       1.0000000
6       1.0000000

The “casts” dataset offers insights into the cast members’ participation on the show, including details such as their notable episodes, appearance counts, and other relevant data. ## Tidy data Check if own missing values.

m_actors <- colSums(is.na(actors))
m_actors
   aid    url   type gender 
     0      0      0      0 
m_seasons <- colSums(is.na(seasons))
m_seasons
       sid       year first_epid  last_epid n_episodes 
         0          0          0          0          0 
m_casts <- colSums(is.na(casts))
m_casts
            aid             sid        featured      first_epid       last_epid 
              0               0               0             564             597 
  update_anchor      n_episodes season_fraction 
              0               0               0 

actors and seasons datasets have no missing values, casts have missing values for first_epid and last_epid, but they are not used here, so I removed these two variables.

casts <- subset(casts,select = -c(first_epid,last_epid))
colnames(casts)
[1] "aid"             "sid"             "featured"        "update_anchor"  
[5] "n_episodes"      "season_fraction"

Join Data

Inner join the 3 datasets according to aid and sid

a_c <- inner_join(actors,casts, by="aid")
a_c_s <- inner_join(a_c,seasons, by="sid")
colnames(a_c_s)
 [1] "aid"             "url"             "type"            "gender"         
 [5] "sid"             "featured"        "update_anchor"   "n_episodes.x"   
 [9] "season_fraction" "year"            "first_epid"      "last_epid"      
[13] "n_episodes.y"   
head(a_c_s)
            aid         url type gender sid featured update_anchor n_episodes.x
1 Kate McKinnon /Cast/?KaMc cast female  37     True         False            5
2 Kate McKinnon /Cast/?KaMc cast female  38     True         False           21
3 Kate McKinnon /Cast/?KaMc cast female  39    False         False           21
4 Kate McKinnon /Cast/?KaMc cast female  40    False         False           21
5 Kate McKinnon /Cast/?KaMc cast female  41    False         False           21
6 Kate McKinnon /Cast/?KaMc cast female  42    False         False           21
  season_fraction year first_epid last_epid n_episodes.y
1       0.2272727 2011   20110924  20120519           22
2       1.0000000 2012   20120915  20130518           21
3       1.0000000 2013   20130928  20140517           21
4       1.0000000 2014   20140927  20150516           21
5       1.0000000 2015   20151003  20160521           21
6       1.0000000 2016   20161001  20170520           21

Check if the new generated dataframe has missing value.

m_n <- colSums(is.na(a_c_s))
m_n
            aid             url            type          gender             sid 
              0               0               0               0               0 
       featured   update_anchor    n_episodes.x season_fraction            year 
              0               0               0               0               0 
     first_epid       last_epid    n_episodes.y 
              0               0               0 

Perfect!The new dataframe doesn’t has any missing value.

Next,let’s take a look at the distribution of year ,season_fraction and gender of actors!

Check the unique value of gender

print(unique(a_c_s$gender))
[1] "female"  "male"    "unknown"
ggplot(a_c_s, aes(fill=gender, y=season_fraction, x=year)) + 
    geom_bar(position="stack", stat="identity")

  labs(x = "Year", y = "season_fraction of Episodes") +
  theme_minimal()
NULL