Challenge 8 Akhilesh

challenge_8
railroads
snl
faostat
debt
Joining Data
Author

Akhilesh Kumar

Published

August 25, 2022

library(tidyverse)
library(ggplot2)
library(lubridate)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. mutate variables as needed (including sanity checks)
  4. join two or more data sets and analyze some aspect of the joined data

(be sure to only include the category tags for the data you use!)

Read in data

Read in snl datasets: - snl_actors - snl_casts - snl_seasons

snl_actors<-read_csv("_data/snl_actors.csv", show_col_types = FALSE)
snl_casts<-read_csv("_data/snl_casts.csv", show_col_types = FALSE)
snl_seasons<-read_csv("_data/snl_seasons.csv", show_col_types = FALSE)

Briefly describe the data

“fed_rate” dataset contains year, month, day wise figures of Federal Funds Target Rate, Federal Funds Upper Target, Federal Funds Lower Target, Effective Federal Funds Rate, Real GDP (Percent Change), Unemployment Rate and Inflation Rate from July 1954 to March 2017 “debt” dataset contains

‘snl_seasons’, ‘snl_casts’ and ‘snl_actors’ datasets contains various datasets pertaining to ‘season’, ‘actor casting’ and ‘actors’ details’ for ‘Saturday Night Live Show’.Saturday Night Live (often abbreviated to SNL) is an American late-night live television sketch comedy and variety show created by Lorne Michaels and developed by Dick Ebersol. Michaels currently serves as the program’s showrunner. The show premiere was hosted by George Carlin on NBC on October 11, 1975, under the original title NBC’s Saturday Night.

‘snl_seasons’ contains ‘Saturday Night Live’ show details for following parameters/columns: - ‘year’: ‘Saturday Night Live Show’ Season year,
- ‘first_epid’: first episode id for the season year,
- ‘last_epid’: last episode id for the season year and
- ‘n_episodes’: number of episodes for the season year ‘snl_seasons’ has 46 obervations and 5 columns, where each observation is contain details of separate season of ‘Saturday Night Live Show’.

“snl_casts” dataset contains casting details of actors for different seasons of ‘Saturday Night Live Show’. It contains following parameters/columns: - sid: Season Id, for which the actor has appeared in the show - featured: if the actor was featured - first_epid: first episode of appearance of the actor - last_epid: last episode of appearance of the actor - update_anchor: if the actor was update ancor in the show - n_episodes: number of episodes in the season

‘snl_casts’ has 46 obervations and 5 columns, where each observation is contain details of separate season of ‘Saturday Night Live Show’

“snl_actors” dataset contains details of actors for ‘Saturday Night Live Show’. It contains ‘actor id’, ‘web url’, ‘actor type’ and ‘gender type’ in columns ‘aid’, ‘url’, ‘type’, and ‘gender’ respectively. Dataset has 2306 observations and 4 columns.

Tidy Data (as needed)

Mutate to convert column variable type of ‘first_epid’, ‘last_epid’ to date format using lubridate package, for ‘snl_seasons’ and ‘snl_casts’ datasets.

Mutate to convert column variables to factor, dataset wise details as given below: - snl_seasons: year - snl_casts: aid, featured, update_anchor - snl_actors: type, gender

Rename ‘first_epid’, ‘last_epi’ & ‘n_episodes’ columns in ‘snl_casts’ dataset to ‘first_epid_cast’, ‘last_epid_cast’ & ‘n_episodes_cast’ respectively, as these columns contain details specific to an actor and there are similar columns in ‘snl_seasons’ dataset for specific season.

snl_seasons<-snl_seasons %>% 
  mutate_at(vars(first_epid, last_epid), ymd) %>% 
  mutate_at(vars(year), as.factor)

snl_casts<-snl_casts %>% 
  select(-c(season_fraction)) %>% 
  mutate_at(vars(first_epid, last_epid), ymd) %>% 
  mutate_at(vars(aid, featured, update_anchor), as.factor) %>% 
  rename(first_epid_cast = first_epid,
         last_epid_cast = last_epid,
         n_episodes_cast = n_episodes)

snl_actors<-snl_actors %>% 
  select(-c(url)) %>% 
  mutate_at(vars(type, gender), as.factor) 


#sanity check, check class of column variable to verify the class/format of date columns as converted above

sapply(snl_seasons, class)
       sid       year first_epid  last_epid n_episodes 
 "numeric"   "factor"     "Date"     "Date"  "numeric" 
sapply(snl_casts, class)
            aid             sid        featured first_epid_cast  last_epid_cast 
       "factor"       "numeric"        "factor"          "Date"          "Date" 
  update_anchor n_episodes_cast 
       "factor"       "numeric" 
sapply(snl_actors, class)
        aid        type      gender 
"character"    "factor"    "factor" 

Join Data

  • Left join snl_casts & snl_actors dataset on ‘aid’ key and named the join as ‘cast_actors_left_join’
  • Left join cast_actors_left_join & snl_seasons dataset on ‘sid’ key and named the join as ‘cast_actors_season_left_join’
cast_actors_left_join <- left_join(snl_casts, snl_actors,
                  by = c("aid" = "aid"))
                   
dim(cast_actors_left_join)
[1] 614   9
head(cast_actors_left_join)
# A tibble: 6 × 9
  aid             sid featu…¹ first_ep…² last_epi…³ updat…⁴ n_epi…⁵ type  gender
  <chr>         <dbl> <fct>   <date>     <date>     <fct>     <dbl> <fct> <fct> 
1 A. Whitney B…    11 TRUE    1986-02-22 NA         FALSE         8 cast  male  
2 A. Whitney B…    12 TRUE    NA         NA         FALSE        20 cast  male  
3 A. Whitney B…    13 TRUE    NA         NA         FALSE        13 cast  male  
4 A. Whitney B…    14 TRUE    NA         NA         FALSE        20 cast  male  
5 A. Whitney B…    15 TRUE    NA         NA         FALSE        20 cast  male  
6 A. Whitney B…    16 TRUE    NA         NA         FALSE        20 cast  male  
# … with abbreviated variable names ¹​featured, ²​first_epid_cast,
#   ³​last_epid_cast, ⁴​update_anchor, ⁵​n_episodes_cast
cast_actors_season_left_join <- left_join(cast_actors_left_join, snl_seasons,
                   by = c("sid" = "sid"))

dim(cast_actors_season_left_join)
[1] 614  13
head(cast_actors_season_left_join)
# A tibble: 6 × 13
  aid       sid featu…¹ first_ep…² last_epi…³ updat…⁴ n_epi…⁵ type  gender year 
  <chr>   <dbl> <fct>   <date>     <date>     <fct>     <dbl> <fct> <fct>  <fct>
1 A. Whi…    11 TRUE    1986-02-22 NA         FALSE         8 cast  male   1985 
2 A. Whi…    12 TRUE    NA         NA         FALSE        20 cast  male   1986 
3 A. Whi…    13 TRUE    NA         NA         FALSE        13 cast  male   1987 
4 A. Whi…    14 TRUE    NA         NA         FALSE        20 cast  male   1988 
5 A. Whi…    15 TRUE    NA         NA         FALSE        20 cast  male   1989 
6 A. Whi…    16 TRUE    NA         NA         FALSE        20 cast  male   1990 
# … with 3 more variables: first_epid <date>, last_epid <date>,
#   n_episodes <dbl>, and abbreviated variable names ¹​featured,
#   ²​first_epid_cast, ³​last_epid_cast, ⁴​update_anchor, ⁵​n_episodes_cast
# ℹ Use `colnames()` to see all variable names