Challenge 8

challenge_8

snl

Joining Data

Author

Lindsay Jones

Published

August 25, 2022

library(tidyverse)
library(ggplot2)
library(lubridate)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
tidy data (as needed, including sanity checks)
mutate variables as needed (including sanity checks)
join two or more data sets and analyze some aspect of the joined data

(be sure to only include the category tags for the data you use!)

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

faostat ⭐⭐
railroads ⭐⭐⭐
fed_rate ⭐⭐⭐
debt ⭐⭐⭐
us_hh ⭐⭐⭐⭐
snl ⭐⭐⭐⭐⭐

library(readr)
seasons <- read_csv("_data/snl_seasons.csv", show_col_types = FALSE)

casts <- read_csv("_data/snl_casts.csv", show_col_types = FALSE)

actors <- read_csv("_data/snl_actors.csv", show_col_types = FALSE)

Briefly describe the data

Tidy Data (as needed)

“Seasons” and “Casts” both use the column title “n_episodes”, but they mean different things- the former refers to number of episodes aired in the season, the latter refers to the number of episodes a cast member appeared in during the season. Similarly, they both use “first_epid” and “last_epid”. I will rename the columns accordingly.

#rename columns
seasons <- rename(seasons, "n_eps_aired" = "n_episodes")

casts <- rename(casts, "n_eps_seen" = "n_episodes",
                      "first_seen" = "first_epid",
                      "last_seen" = "last_epid")

Are there any variables that require mutation to be usable in your analysis stream? For example, do you need to calculate new values in order to graph them? Can string values be represented numerically? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

Document your work here.

#mutate season dates for readability
seasons %>%
  mutate(first_epid = ymd(first_epid), last_epid = ymd(last_epid))

# A tibble: 46 × 5
     sid  year first_epid last_epid  n_eps_aired
   <dbl> <dbl> <date>     <date>           <dbl>
 1     1  1975 1975-10-11 1976-07-31          24
 2     2  1976 1976-09-18 1977-05-21          22
 3     3  1977 1977-09-24 1978-05-20          20
 4     4  1978 1978-10-07 1979-05-26          20
 5     5  1979 1979-10-13 1980-05-24          20
 6     6  1980 1980-11-15 1981-04-11          13
 7     7  1981 1981-10-03 1982-05-22          20
 8     8  1982 1982-09-25 1983-05-14          20
 9     9  1983 1983-10-08 1984-05-12          19
10    10  1984 1984-10-06 1985-04-13          17
# … with 36 more rows
# ℹ Use `print(n = ...)` to see more rows

casts %>%
  mutate(first_seen = ymd(first_seen), last_seen = ymd(last_seen))

# A tibble: 614 × 8
   aid                sid featured first_seen last_seen update…¹ n_eps…² seaso…³
   <chr>            <dbl> <lgl>    <date>     <date>    <lgl>      <dbl>   <dbl>
 1 A. Whitney Brown    11 TRUE     1986-02-22 NA        FALSE          8   0.444
 2 A. Whitney Brown    12 TRUE     NA         NA        FALSE         20   1    
 3 A. Whitney Brown    13 TRUE     NA         NA        FALSE         13   1    
 4 A. Whitney Brown    14 TRUE     NA         NA        FALSE         20   1    
 5 A. Whitney Brown    15 TRUE     NA         NA        FALSE         20   1    
 6 A. Whitney Brown    16 TRUE     NA         NA        FALSE         20   1    
 7 Alan Zweibel         5 TRUE     1980-04-09 NA        FALSE          5   0.25 
 8 Sasheer Zamata      39 TRUE     2014-01-18 NA        FALSE         11   0.524
 9 Sasheer Zamata      40 TRUE     NA         NA        FALSE         21   1    
10 Sasheer Zamata      41 FALSE    NA         NA        FALSE         21   1    
# … with 604 more rows, and abbreviated variable names ¹update_anchor,
#   ²n_eps_seen, ³season_fraction
# ℹ Use `print(n = ...)` to see more rows

Join Data

snl <- inner_join(seasons, casts, "sid")
snl <- inner_join(snl, actors, "aid")
snl

# A tibble: 614 × 15
     sid  year first_epid last_e…¹ n_eps…² aid   featu…³ first…⁴ last_…⁵ updat…⁶
   <dbl> <dbl>      <dbl>    <dbl>   <dbl> <chr> <lgl>     <dbl>   <dbl> <lgl>  
 1     1  1975   19751011 19760731      24 Gild… FALSE        NA NA      FALSE  
 2     1  1975   19751011 19760731      24 Mich… FALSE        NA  1.98e7 FALSE  
 3     1  1975   19751011 19760731      24 Lara… FALSE        NA NA      FALSE  
 4     1  1975   19751011 19760731      24 Garr… FALSE        NA NA      FALSE  
 5     1  1975   19751011 19760731      24 Jane… FALSE        NA NA      FALSE  
 6     1  1975   19751011 19760731      24 Geor… FALSE        NA  1.98e7 FALSE  
 7     1  1975   19751011 19760731      24 Chev… FALSE        NA NA      TRUE   
 8     1  1975   19751011 19760731      24 John… FALSE        NA NA      FALSE  
 9     1  1975   19751011 19760731      24 Dan … FALSE        NA NA      FALSE  
10     2  1976   19760918 19770521      22 Gild… FALSE        NA NA      FALSE  
# … with 604 more rows, 5 more variables: n_eps_seen <dbl>,
#   season_fraction <dbl>, url <chr>, type <chr>, gender <chr>, and abbreviated
#   variable names ¹last_epid, ²n_eps_aired, ³featured, ⁴first_seen,
#   ⁵last_seen, ⁶update_anchor
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Apparently the mutating of dates doesn’t carry over to the join, so I’ll do that again.

#remove year column
snl <- snl%>%
  select(-year)

#mutate season dates for readability
snl %>%
  mutate(first_epid = ymd(first_epid), 
         last_epid = ymd(last_epid),
         first_seen = ymd(first_seen), 
         last_seen = ymd(last_seen))

# A tibble: 614 × 14
     sid first_epid last_epid  n_eps_aired aid     featu…¹ first_seen last_seen 
   <dbl> <date>     <date>           <dbl> <chr>   <lgl>   <date>     <date>    
 1     1 1975-10-11 1976-07-31          24 Gilda … FALSE   NA         NA        
 2     1 1975-10-11 1976-07-31          24 Michae… FALSE   NA         1975-11-08
 3     1 1975-10-11 1976-07-31          24 Larain… FALSE   NA         NA        
 4     1 1975-10-11 1976-07-31          24 Garret… FALSE   NA         NA        
 5     1 1975-10-11 1976-07-31          24 Jane C… FALSE   NA         NA        
 6     1 1975-10-11 1976-07-31          24 George… FALSE   NA         1975-10-11
 7     1 1975-10-11 1976-07-31          24 Chevy … FALSE   NA         NA        
 8     1 1975-10-11 1976-07-31          24 John B… FALSE   NA         NA        
 9     1 1975-10-11 1976-07-31          24 Dan Ay… FALSE   NA         NA        
10     2 1976-09-18 1977-05-21          22 Gilda … FALSE   NA         NA        
# … with 604 more rows, 6 more variables: update_anchor <lgl>,
#   n_eps_seen <dbl>, season_fraction <dbl>, url <chr>, type <chr>,
#   gender <chr>, and abbreviated variable name ¹featured
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names