             aid           url  type  gender
1  Kate McKinnon   /Cast/?KaMc  cast  female
2    Alex Moffat   /Cast/?AlMo  cast    male
3     Ego Nwodim   /Cast/?EgNw  cast unknown
4     Chris Redd   /Cast/?ChRe  cast    male
5 Kenan Thompson   /Cast/?KeTh  cast    male
6 Carey Mulligan /Guests/?3677 guest    andy
               aid sid featured first_epid last_epid update_anchor n_episodes season_fraction
1 A. Whitney Brown  11     True   19860222        NA         False          8       0.4444444
2 A. Whitney Brown  12     True         NA        NA         False         20       1.0000000
3 A. Whitney Brown  13     True         NA        NA         False         13       1.0000000
4 A. Whitney Brown  14     True         NA        NA         False         20       1.0000000
5 A. Whitney Brown  15     True         NA        NA         False         20       1.0000000
6 A. Whitney Brown  16     True         NA        NA         False         20       1.0000000
The SNL data comes from three separate sources: SNL actors, SNL casts, and SNL seasons.
The SNL actors dataset lists every cast member and guest who has appeared on SNL (2,306 rows). Each row represents a single actor and records their gender and whether they were a cast member or a guest.
The SNL casts dataset has more variables. It covers cast members only, not guests, and has 614 rows. For each cast member it records the seasons in which they appeared, the number of episodes they appeared in each season, and the dates of their first and last episodes in that season. Most of the first and last episode values are NA; a date is recorded only when it differs from the first or last episode of that season. We will deal with this later in the analysis.
The SNL casts data already has the format we want for our analysis, with each row representing an “actor-season.” We will use it as the base of our final data set.
Lastly, the SNL seasons dataset contains information on the 46 seasons of SNL, including the year, dates of the first and last episodes, and the episode count per season. Each row in the dataset represents a single season.
Tidy & Mutate Data
Before joining the data sets, I will do some tidying. After the join, I will apply further mutations to clean the final data set.
To start with, I will filter the guests out of the SNL actors data set, because apart from gender we have no information about them that we could use in an analysis.
Code
# filtering out guests from actors data
snl_actors <- snl_actors %>%
  filter(type == "cast") %>%
  select(aid, gender)
Two columns in the SNL casts data set, update_anchor (whether the cast member was an update anchor) and featured (whether they were featured), are stored as character strings but should be logical. I convert both columns to logical below.
Code
# showing character class before mutation
class(snl_casts$update_anchor)
[1] "character"
Code
# mutating to change to logical
snl_casts <- snl_casts %>%
  mutate(update_anchor = case_when(update_anchor == "True" ~ TRUE,
                                   update_anchor == "False" ~ FALSE)) %>%
  mutate(featured = case_when(featured == "True" ~ TRUE,
                              featured == "False" ~ FALSE))

# showing logical class after mutation
class(snl_casts$update_anchor)
[1] "logical"
Code
class(snl_casts$featured)
[1] "logical"
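One thing worth checking after a conversion like this: case_when() returns NA for any value that matches none of its conditions, so a stray spelling such as "TRUE" or "true" would silently become missing. The following is a minimal sketch of such a check on a made-up table; toy_casts and the helper to_logical() are hypothetical names used only for illustration, not part of the SNL data or the analysis above.

```r
library(dplyr)

# Toy stand-in for snl_casts; all values are invented for illustration.
toy_casts <- data.frame(
  aid = c("Actor A", "Actor B", "Actor C"),
  featured = c("True", "False", "True"),
  update_anchor = c("False", "False", "True"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map "True"/"False" strings to logical values.
# Any other string matches no condition and therefore becomes NA.
to_logical <- function(x) {
  case_when(x == "True" ~ TRUE, x == "False" ~ FALSE)
}

# Apply the same conversion to both columns in one step.
toy_converted <- toy_casts %>%
  mutate(across(c(featured, update_anchor), to_logical))

# Count values that failed to convert; zero means every entry was recognised.
n_unconverted <- sum(is.na(toy_converted$featured)) +
  sum(is.na(toy_converted$update_anchor))
```

If n_unconverted were greater than zero on the real data, the offending strings could be inspected with unique() on the original character column before converting.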
Joining Data
As we join the three data sets, we will use the SNL casts data as the primary data set and add information from the other data sets into this. The final data set should have 614 cases, with each case representing an “actor-season”.
To do this, we will first add the information about each season into the SNL casts data set. This includes the first and last date of the season, the number of episodes, and the year of the season.
Next, we will use the SNL actors data set to add the gender of the cast member into our SNL casts data set.
Code
# combining seasons data INTO casts data
snl_castsandseasons <- left_join(snl_casts, snl_seasons, by = "sid")

# combining actors data INTO casts and seasons data
snl_castsseasonsandactors <- left_join(snl_castsandseasons, snl_actors, by = "aid")

head(snl_castsseasonsandactors)
               aid sid featured first_epid.x last_epid.x update_anchor
1 A. Whitney Brown  11     TRUE     19860222          NA         FALSE
2 A. Whitney Brown  12     TRUE           NA          NA         FALSE
3 A. Whitney Brown  13     TRUE           NA          NA         FALSE
4 A. Whitney Brown  14     TRUE           NA          NA         FALSE
5 A. Whitney Brown  15     TRUE           NA          NA         FALSE
6 A. Whitney Brown  16     TRUE           NA          NA         FALSE
  n_episodes.x season_fraction year first_epid.y last_epid.y n_episodes.y
1            8       0.4444444 1985     19851109    19860524           18
2           20       1.0000000 1986     19861011    19870523           20
3           13       1.0000000 1987     19871017    19880227           13
4           20       1.0000000 1988     19881008    19890520           20
5           20       1.0000000 1989     19890930    19900519           20
6           20       1.0000000 1990     19900929    19910518           20
  gender
1   male
2   male
3   male
4   male
5   male
6   male
Code
dim(snl_castsseasonsandactors)
[1] 614 13
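The row count staying at 614 is what we expect from a left join whose lookup keys are unique. As a hedged illustration of how that expectation can be checked explicitly, here is a small sketch on made-up tables; toy_casts, toy_seasons, and toy_actors are hypothetical stand-ins, and the checks shown are one possible approach rather than part of the original analysis.

```r
library(dplyr)

# Toy stand-ins for the three tables; all values are invented.
toy_casts <- data.frame(
  aid = c("Actor A", "Actor A", "Actor B"),
  sid = c(1, 2, 2)
)
toy_seasons <- data.frame(sid = c(1, 2), year = c(1975, 1976))
toy_actors <- data.frame(
  aid = c("Actor A", "Actor B"),
  gender = c("female", "male")
)

# A left join can only add rows when a key is duplicated in the lookup
# table, so confirm that each key appears once before joining.
stopifnot(!any(duplicated(toy_seasons$sid)))
stopifnot(!any(duplicated(toy_actors$aid)))

toy_joined <- toy_casts %>%
  left_join(toy_seasons, by = "sid") %>%
  left_join(toy_actors, by = "aid")

# The joined table should have exactly as many rows as the casts table.
stopifnot(nrow(toy_joined) == nrow(toy_casts))

# Cast rows with no matching actor would end up with gender = NA;
# anti_join() lists them so they can be inspected.
unmatched <- anti_join(toy_casts, toy_actors, by = "aid")
```

An empty unmatched table means every cast member found a gender value in the actors lookup.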
After joining the three data sets, I have a data set with all the relevant variables, but it still needs some tidying up.
Currently, there are four date columns in the data, two for the first episode and two for the last episode. Since our cases are “actor-seasons,” I will combine these in a way that reflects the first and last episode dates of the season, unless the actor was only present for part of the season. In those cases, the dates will reflect the first or last episode in which they were involved. This will reduce the column count to 11.
I will also convert the numeric date columns to the Date class.
Finally, there are two episode count columns: one for the number of episodes a cast member was involved in and one for the number of episodes in the season. I will rename them to make that distinction clear.
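The coalescing and date conversion described above can be sketched on a two-row toy table. The numbers mirror the first two rows of the joined output shown earlier, but the table name toy and the result name toy_dates are hypothetical and used only for this illustration.

```r
library(dplyr)
library(lubridate)

# Toy joined table. The .x columns hold actor-specific dates (mostly NA)
# and the .y columns hold the season's first and last episode dates.
toy <- data.frame(
  first_epid.x = c(19860222, NA),
  first_epid.y = c(19851109, 19861011),
  last_epid.x = c(NA_real_, NA_real_),
  last_epid.y = c(19860524, 19870523)
)

toy_dates <- toy %>%
  mutate(
    # coalesce() returns the first non-missing value at each position,
    # so the actor-specific date wins whenever it is present.
    # ymd() then parses the yyyymmdd number into a Date.
    first_episode = ymd(coalesce(first_epid.x, first_epid.y)),
    last_episode = ymd(coalesce(last_epid.x, last_epid.y))
  ) %>%
  select(first_episode, last_episode)
```

In the first row the actor joined mid-season, so first_episode takes the actor-specific value (1986-02-22) instead of the season premiere; in the second row both actor-specific values are missing, so the season dates are used.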
Code
# creating final combined dataset
snl_all <- snl_castsseasonsandactors %>%
  # combining multiple first and last episode date columns to reflect dates participated by actors
  mutate(first_episode = coalesce(first_epid.x, first_epid.y),
         last_episode = coalesce(last_epid.x, last_epid.y)) %>%
  # changing numeric values to be dates
  mutate(first_episode = ymd(first_episode),
         last_episode = ymd(last_episode)) %>%
  # removing unused date columns
  select(-c(first_epid.x, first_epid.y, last_epid.x, last_epid.y)) %>%
  # renaming for clarity
  rename("actor_episodes" = n_episodes.x) %>%
  rename("season_episodes" = n_episodes.y)

# printing dimensions and summary
dim(snl_casts)
[1] 614 8
Code
dim(snl_all)
[1] 614 11
Code
head(snl_all)
               aid sid featured update_anchor actor_episodes season_fraction
1 A. Whitney Brown  11     TRUE         FALSE              8       0.4444444
2 A. Whitney Brown  12     TRUE         FALSE             20       1.0000000
3 A. Whitney Brown  13     TRUE         FALSE             13       1.0000000
4 A. Whitney Brown  14     TRUE         FALSE             20       1.0000000
5 A. Whitney Brown  15     TRUE         FALSE             20       1.0000000
6 A. Whitney Brown  16     TRUE         FALSE             20       1.0000000
  year season_episodes gender first_episode last_episode
1 1985              18   male    1986-02-22   1986-05-24
2 1986              20   male    1986-10-11   1987-05-23
3 1987              13   male    1987-10-17   1988-02-27
4 1988              20   male    1988-10-08   1989-05-20
5 1989              20   male    1989-09-30   1990-05-19
6 1990              20   male    1990-09-29   1991-05-18
Generated by summarytools 1.0.1 (R version 4.2.2) 2023-05-06
Source Code
---
title: "Challenge 8"
author: "Thrishul"
description: ""
date: "05/05/2023"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_8
---

Before we read in the data, we’ll need to load the dplyr, tidyr, ggplot2, and readr packages (all attached by the tidyverse), plus lubridate for date handling.

```{r}
library(tidyverse)
library(ggplot2)
library(lubridate)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```

# Read in data

```{r}
snl_actors <- read.csv("_data/snl_actors.csv")
dim(snl_actors)
```

```{r}
head(snl_actors)
```

```{r}
snl_casts <- read.csv("_data/snl_casts.csv")
dim(snl_casts)
```

```{r}
head(snl_casts)
```

```{r}
snl_seasons <- read.csv("_data/snl_seasons.csv")
dim(snl_seasons)
```

```{r}
head(snl_seasons)
```

The SNL data comes from three separate sources: SNL actors, SNL casts, and SNL seasons.

The SNL actors dataset lists every cast member and guest who has appeared on SNL (2,306 rows). Each row represents a single actor and records their gender and whether they were a cast member or a guest.

The SNL casts dataset has more variables. It covers cast members only, not guests, and has 614 rows. For each cast member it records the seasons in which they appeared, the number of episodes they appeared in each season, and the dates of their first and last episodes in that season. Most of the first and last episode values are NA; a date is recorded only when it differs from the first or last episode of that season. We will deal with this later in the analysis.

The SNL casts data already has the format we want for our analysis, with each row representing an “actor-season.” We will use it as the base of our final data set.

Lastly, the SNL seasons dataset contains information on the 46 seasons of SNL, including the year, dates of the first and last episodes, and the episode count per season. Each row in the dataset represents a single season.

# Tidy & Mutate Data

Before joining the data sets, I will do some tidying. After the join, I will apply further mutations to clean the final data set.

To start with, I will filter the guests out of the SNL actors data set, because apart from gender we have no information about them that we could use in an analysis.

```{r}
# filtering out guests from actors data
snl_actors <- snl_actors %>%
  filter(type == "cast") %>%
  select(aid, gender)
```

Two columns in the SNL casts data set, update_anchor (whether the cast member was an update anchor) and featured (whether they were featured), are stored as character strings but should be logical. I convert both columns to logical below.

```{r}
# showing character class before mutation
class(snl_casts$update_anchor)
```

```{r}
# mutating to change to logical
snl_casts <- snl_casts %>%
  mutate(update_anchor = case_when(update_anchor == "True" ~ TRUE,
                                   update_anchor == "False" ~ FALSE)) %>%
  mutate(featured = case_when(featured == "True" ~ TRUE,
                              featured == "False" ~ FALSE))

# showing logical class after mutation
class(snl_casts$update_anchor)
```

```{r}
class(snl_casts$featured)
```

# Joining Data

As we join the three data sets, we will use the SNL casts data as the primary data set and add information from the other data sets into this. The final data set should have 614 cases, with each case representing an “actor-season”.

To do this, we will first add the information about each season into the SNL casts data set. This includes the first and last date of the season, the number of episodes, and the year of the season.

Next, we will use the SNL actors data set to add the gender of the cast member into our SNL casts data set.

```{r}
# combining seasons data INTO casts data
snl_castsandseasons <- left_join(snl_casts, snl_seasons, by = "sid")

# combining actors data INTO casts and seasons data
snl_castsseasonsandactors <- left_join(snl_castsandseasons, snl_actors, by = "aid")

head(snl_castsseasonsandactors)
```

```{r}
dim(snl_castsseasonsandactors)
```

After joining the three data sets, I have a data set with all the relevant variables, but it still needs some tidying up.

Currently, there are four date columns in the data, two for the first episode and two for the last episode. Since our cases are "actor-seasons," I will combine these in a way that reflects the first and last episode dates of the season, unless the actor was only present for part of the season. In those cases, the dates will reflect the first or last episode in which they were involved. This will reduce the column count to 11.

I will also convert the numeric date columns to the Date class.

Finally, there are two episode count columns: one for the number of episodes a cast member was involved in and one for the number of episodes in the season. I will rename them to make that distinction clear.

```{r}
# creating final combined dataset
snl_all <- snl_castsseasonsandactors %>%
  # combining multiple first and last episode date columns to reflect dates participated by actors
  mutate(first_episode = coalesce(first_epid.x, first_epid.y),
         last_episode = coalesce(last_epid.x, last_epid.y)) %>%
  # changing numeric values to be dates
  mutate(first_episode = ymd(first_episode),
         last_episode = ymd(last_episode)) %>%
  # removing unused date columns
  select(-c(first_epid.x, first_epid.y, last_epid.x, last_epid.y)) %>%
  # renaming for clarity
  rename("actor_episodes" = n_episodes.x) %>%
  rename("season_episodes" = n_episodes.y)

# printing dimensions and summary
dim(snl_casts)
```

```{r}
dim(snl_all)
```

```{r}
head(snl_all)
```

```{r}
print(summarytools::dfSummary(snl_all, valid.col = FALSE), method = 'render')
```