library(tidyverse)
library(ggplot2)
library(lubridate)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Challenge 8 Akhilesh
Challenge Overview
Today’s challenge is to:
- read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data (as needed, including sanity checks)
- mutate variables as needed (including sanity checks)
- join two or more data sets and analyze some aspect of the joined data
(be sure to only include the category tags for the data you use!)
Read in data
Read in snl datasets: - snl_actors - snl_casts - snl_seasons
<-read_csv("_data/snl_actors.csv", show_col_types = FALSE)
snl_actors<-read_csv("_data/snl_casts.csv", show_col_types = FALSE)
snl_casts<-read_csv("_data/snl_seasons.csv", show_col_types = FALSE) snl_seasons
Briefly describe the data
“fed_rate” dataset contains year, month, day wise figures of Federal Funds Target Rate, Federal Funds Upper Target, Federal Funds Lower Target, Effective Federal Funds Rate, Real GDP (Percent Change), Unemployment Rate and Inflation Rate from July 1954 to March 2017 “debt” dataset contains
‘snl_seasons’, ‘snl_casts’ and ‘snl_actors’ datasets contains various datasets pertaining to ‘season’, ‘actor casting’ and ‘actors’ details’ for ‘Saturday Night Live Show’.Saturday Night Live (often abbreviated to SNL) is an American late-night live television sketch comedy and variety show created by Lorne Michaels and developed by Dick Ebersol. Michaels currently serves as the program’s showrunner. The show premiere was hosted by George Carlin on NBC on October 11, 1975, under the original title NBC’s Saturday Night.
‘snl_seasons’ contains ‘Saturday Night Live’ show details for following parameters/columns: - ‘year’: ‘Saturday Night Live Show’ Season year,
- ‘first_epid’: first episode id for the season year,
- ‘last_epid’: last episode id for the season year and
- ‘n_episodes’: number of episodes for the season year ‘snl_seasons’ has 46 obervations and 5 columns, where each observation is contain details of separate season of ‘Saturday Night Live Show’.
“snl_casts” dataset contains casting details of actors for different seasons of ‘Saturday Night Live Show’. It contains following parameters/columns: - sid: Season Id, for which the actor has appeared in the show - featured: if the actor was featured - first_epid: first episode of appearance of the actor - last_epid: last episode of appearance of the actor - update_anchor: if the actor was update ancor in the show - n_episodes: number of episodes in the season
‘snl_casts’ has 46 obervations and 5 columns, where each observation is contain details of separate season of ‘Saturday Night Live Show’
“snl_actors” dataset contains details of actors for ‘Saturday Night Live Show’. It contains ‘actor id’, ‘web url’, ‘actor type’ and ‘gender type’ in columns ‘aid’, ‘url’, ‘type’, and ‘gender’ respectively. Dataset has 2306 observations and 4 columns.
Tidy Data (as needed)
Mutate to convert column variable type of ‘first_epid’, ‘last_epid’ to date format using lubridate package, for ‘snl_seasons’ and ‘snl_casts’ datasets.
Mutate to convert column variables to factor, dataset wise details as given below: - snl_seasons: year - snl_casts: aid, featured, update_anchor - snl_actors: type, gender
Rename ‘first_epid’, ‘last_epi’ & ‘n_episodes’ columns in ‘snl_casts’ dataset to ‘first_epid_cast’, ‘last_epid_cast’ & ‘n_episodes_cast’ respectively, as these columns contain details specific to an actor and there are similar columns in ‘snl_seasons’ dataset for specific season.
<-snl_seasons %>%
snl_seasonsmutate_at(vars(first_epid, last_epid), ymd) %>%
mutate_at(vars(year), as.factor)
<-snl_casts %>%
snl_castsselect(-c(season_fraction)) %>%
mutate_at(vars(first_epid, last_epid), ymd) %>%
mutate_at(vars(aid, featured, update_anchor), as.factor) %>%
rename(first_epid_cast = first_epid,
last_epid_cast = last_epid,
n_episodes_cast = n_episodes)
<-snl_actors %>%
snl_actorsselect(-c(url)) %>%
mutate_at(vars(type, gender), as.factor)
#sanity check, check class of column variable to verify the class/format of date columns as converted above
sapply(snl_seasons, class)
sid year first_epid last_epid n_episodes
"numeric" "factor" "Date" "Date" "numeric"
sapply(snl_casts, class)
aid sid featured first_epid_cast last_epid_cast
"factor" "numeric" "factor" "Date" "Date"
update_anchor n_episodes_cast
"factor" "numeric"
sapply(snl_actors, class)
aid type gender
"character" "factor" "factor"
Join Data
- Left join snl_casts & snl_actors dataset on ‘aid’ key and named the join as ‘cast_actors_left_join’
- Left join cast_actors_left_join & snl_seasons dataset on ‘sid’ key and named the join as ‘cast_actors_season_left_join’
<- left_join(snl_casts, snl_actors,
cast_actors_left_join by = c("aid" = "aid"))
dim(cast_actors_left_join)
[1] 614 9
head(cast_actors_left_join)
# A tibble: 6 × 9
aid sid featu…¹ first_ep…² last_epi…³ updat…⁴ n_epi…⁵ type gender
<chr> <dbl> <fct> <date> <date> <fct> <dbl> <fct> <fct>
1 A. Whitney B… 11 TRUE 1986-02-22 NA FALSE 8 cast male
2 A. Whitney B… 12 TRUE NA NA FALSE 20 cast male
3 A. Whitney B… 13 TRUE NA NA FALSE 13 cast male
4 A. Whitney B… 14 TRUE NA NA FALSE 20 cast male
5 A. Whitney B… 15 TRUE NA NA FALSE 20 cast male
6 A. Whitney B… 16 TRUE NA NA FALSE 20 cast male
# … with abbreviated variable names ¹featured, ²first_epid_cast,
# ³last_epid_cast, ⁴update_anchor, ⁵n_episodes_cast
<- left_join(cast_actors_left_join, snl_seasons,
cast_actors_season_left_join by = c("sid" = "sid"))
dim(cast_actors_season_left_join)
[1] 614 13
head(cast_actors_season_left_join)
# A tibble: 6 × 13
aid sid featu…¹ first_ep…² last_epi…³ updat…⁴ n_epi…⁵ type gender year
<chr> <dbl> <fct> <date> <date> <fct> <dbl> <fct> <fct> <fct>
1 A. Whi… 11 TRUE 1986-02-22 NA FALSE 8 cast male 1985
2 A. Whi… 12 TRUE NA NA FALSE 20 cast male 1986
3 A. Whi… 13 TRUE NA NA FALSE 13 cast male 1987
4 A. Whi… 14 TRUE NA NA FALSE 20 cast male 1988
5 A. Whi… 15 TRUE NA NA FALSE 20 cast male 1989
6 A. Whi… 16 TRUE NA NA FALSE 20 cast male 1990
# … with 3 more variables: first_epid <date>, last_epid <date>,
# n_episodes <dbl>, and abbreviated variable names ¹featured,
# ²first_epid_cast, ³last_epid_cast, ⁴update_anchor, ⁵n_episodes_cast
# ℹ Use `colnames()` to see all variable names