Challenge 8 Instructions

challenge_8
railroads
snl
faostat
debt
Joining Data
Author

Kevin Martell Luya

Published

May 22, 2023

library(tidyverse)
library(ggplot2)
library(dplyr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. mutate variables as needed (including sanity checks)
  4. join two or more data sets and analyze some aspect of the joined data

(be sure to only include the category tags for the data you use!)

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • military marriages ⭐⭐
  • faostat ⭐⭐
  • railroads ⭐⭐⭐
  • fed_rate ⭐⭐⭐
  • debt ⭐⭐⭐
  • us_hh ⭐⭐⭐⭐
  • snl ⭐⭐⭐⭐⭐
actors <- read.csv(file = "_data/snl_actors.csv",
  header=TRUE,
  sep = ",")

casts <- read.csv(file = "_data/snl_casts.csv",
  header=TRUE,
  sep = ","
  )

seasons <- read.csv(file = "_data/snl_seasons.csv",
  header=TRUE,
  sep = ","
  )
head(actors)
             aid           url  type  gender
1  Kate McKinnon   /Cast/?KaMc  cast  female
2    Alex Moffat   /Cast/?AlMo  cast    male
3     Ego Nwodim   /Cast/?EgNw  cast unknown
4     Chris Redd   /Cast/?ChRe  cast    male
5 Kenan Thompson   /Cast/?KeTh  cast    male
6 Carey Mulligan /Guests/?3677 guest    andy
head(casts)
               aid sid featured first_epid last_epid update_anchor n_episodes
1 A. Whitney Brown  11     True   19860222        NA         False          8
2 A. Whitney Brown  12     True         NA        NA         False         20
3 A. Whitney Brown  13     True         NA        NA         False         13
4 A. Whitney Brown  14     True         NA        NA         False         20
5 A. Whitney Brown  15     True         NA        NA         False         20
6 A. Whitney Brown  16     True         NA        NA         False         20
  season_fraction
1       0.4444444
2       1.0000000
3       1.0000000
4       1.0000000
5       1.0000000
6       1.0000000
head(seasons)
  sid year first_epid last_epid n_episodes
1   1 1975   19751011  19760731         24
2   2 1976   19760918  19770521         22
3   3 1977   19770924  19780520         20
4   4 1978   19781007  19790526         20
5   5 1979   19791013  19800524         20
6   6 1980   19801115  19810411         13

Briefly describe the data

The datasets are related to Saturday Night Live (SNL) and contain different types of information. The snl_casts dataset has data on the episodes and seasons in which actors were cast, while the snl_seasons dataset has information about each season, such as the number of episodes and the year it started. The snl_actors dataset contains information about each actor, including their role and type of appearance. The actors dataset has 2306 rows and 4 columns, the casts dataset has 614 rows and 8 columns, and the seasons dataset has 46 rows and 5 columns.

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

actors  <- na.omit(actors)
#casts   <- na.omit(casts)
#seasons   <- na.omit(seasons)

Join Data

The casts dataframe has information about cast members in each season, but not their gender. Gender information is stored in the actors dataframe, which does not have information about the seasons. To merge the two dataframes, left_join() will be used based on actor ID. Guest stars and crew members will be filtered out using the filter() function. Finally, the count() function will be used to summarize gender makeup by season.

casts_seasons <- casts %>% left_join(seasons ,by="sid")

#sanity check 
head(casts_seasons)
               aid sid featured first_epid.x last_epid.x update_anchor
1 A. Whitney Brown  11     True     19860222          NA         False
2 A. Whitney Brown  12     True           NA          NA         False
3 A. Whitney Brown  13     True           NA          NA         False
4 A. Whitney Brown  14     True           NA          NA         False
5 A. Whitney Brown  15     True           NA          NA         False
6 A. Whitney Brown  16     True           NA          NA         False
  n_episodes.x season_fraction year first_epid.y last_epid.y n_episodes.y
1            8       0.4444444 1985     19851109    19860524           18
2           20       1.0000000 1986     19861011    19870523           20
3           13       1.0000000 1987     19871017    19880227           13
4           20       1.0000000 1988     19881008    19890520           20
5           20       1.0000000 1989     19890930    19900519           20
6           20       1.0000000 1990     19900929    19910518           20
seasons_casts_actors <- casts_seasons %>% 
  left_join(actors, by = "aid")

# sanity check
head(seasons_casts_actors)
               aid sid featured first_epid.x last_epid.x update_anchor
1 A. Whitney Brown  11     True     19860222          NA         False
2 A. Whitney Brown  12     True           NA          NA         False
3 A. Whitney Brown  13     True           NA          NA         False
4 A. Whitney Brown  14     True           NA          NA         False
5 A. Whitney Brown  15     True           NA          NA         False
6 A. Whitney Brown  16     True           NA          NA         False
  n_episodes.x season_fraction year first_epid.y last_epid.y n_episodes.y
1            8       0.4444444 1985     19851109    19860524           18
2           20       1.0000000 1986     19861011    19870523           20
3           13       1.0000000 1987     19871017    19880227           13
4           20       1.0000000 1988     19881008    19890520           20
5           20       1.0000000 1989     19890930    19900519           20
6           20       1.0000000 1990     19900929    19910518           20
          url type gender
1 /Cast/?AWBr cast   male
2 /Cast/?AWBr cast   male
3 /Cast/?AWBr cast   male
4 /Cast/?AWBr cast   male
5 /Cast/?AWBr cast   male
6 /Cast/?AWBr cast   male

After left joining three table, we enriched this table by changing the date format

seasons_casts_actors <- mutate( seasons_casts_actors, 
                   first_epid.x = as.Date( as.character(first_epid.x), format= "%Y%m%d"), 
                   last_epid.y = as.Date(as.character(last_epid.x), format = "%Y%m%d"))

#sanity check
table(seasons_casts_actors$first_epid.x)

1977-01-15 1979-10-13 1979-11-17 1979-12-15 1980-01-26 1980-04-09 1980-12-13 
         1          1          2          1          3          2          3 
1980-12-20 1981-04-11 1986-02-22 1986-03-22 1989-01-21 1989-03-25 1990-10-27 
         1          5          1          1          1          1          1 
1990-11-10 1991-02-09 1991-11-16 1993-10-02 1993-10-09 1994-03-12 1995-01-14 
         2          2          2          1          2          1          1 
1995-02-25 1995-04-08 1996-03-16 1999-10-23 2000-05-06 2002-03-02 2005-05-07 
         1          1          1          1          1          1          1 
2005-11-12 2008-02-23 2008-11-15 2012-04-07 2014-01-18 2014-03-01 2014-10-25 
         1          1          2          1          1          1          1 
head(seasons_casts_actors)
               aid sid featured first_epid.x last_epid.x update_anchor
1 A. Whitney Brown  11     True   1986-02-22          NA         False
2 A. Whitney Brown  12     True         <NA>          NA         False
3 A. Whitney Brown  13     True         <NA>          NA         False
4 A. Whitney Brown  14     True         <NA>          NA         False
5 A. Whitney Brown  15     True         <NA>          NA         False
6 A. Whitney Brown  16     True         <NA>          NA         False
  n_episodes.x season_fraction year first_epid.y last_epid.y n_episodes.y
1            8       0.4444444 1985     19851109        <NA>           18
2           20       1.0000000 1986     19861011        <NA>           20
3           13       1.0000000 1987     19871017        <NA>           13
4           20       1.0000000 1988     19881008        <NA>           20
5           20       1.0000000 1989     19890930        <NA>           20
6           20       1.0000000 1990     19900929        <NA>           20
          url type gender
1 /Cast/?AWBr cast   male
2 /Cast/?AWBr cast   male
3 /Cast/?AWBr cast   male
4 /Cast/?AWBr cast   male
5 /Cast/?AWBr cast   male
6 /Cast/?AWBr cast   male

Visualization

library(ggplot2)
ggplot(data = seasons_casts_actors, aes(x = year, y = mean(n_episodes.x), fill = gender)) +
  geom_bar(stat = "identity") +
  labs(x = "Year", y = "Number of Episodes", title = "Number of Episodes per Season") +
  theme_minimal()