Joining SNL Datasets

challenge_8
snl
Using Join Functions to Combine Two or More Datasets into One Dataset
Author

Kris Smole

Published

April 26, 2023

library(tidyverse)
library(ggplot2)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. tidy data (as needed, including sanity checks)
  2. mutate variables as needed (including sanity checks)
  3. join two or more data sets and analyze some aspect of the joined data (see detailed instruction below)

(be sure to only include the category tags for the data you use!)

Reading in the SNL files: Actors, Casts, Seasons

Brief description of the SNL data sets

Three datasets of the NBC television program, Saturday Night Live (SNL), exist with three themes: Actors, Casts, and Seasons. Let’s take a look at the content and features of each dataset, and determine if and how these three datasets might be combined on common features to provide the combined dataset we want for performing the analysis we’d like to do.

The actors data set format and data content:

First let’s look at the actors file, and consider what columns might be options to use for joining with the other two files:

#creating actors variable from file snl_actors.csv:
snl_actors <- read_csv("_data/snl_actors.csv")
#The dataset of the file snl_actors.csv has the following dimensions:
dim(snl_actors)
[1] 2306    4
head(snl_actors)
# A tibble: 6 × 4
  aid            url           type  gender 
  <chr>          <chr>         <chr> <chr>  
1 Kate McKinnon  /Cast/?KaMc   cast  female 
2 Alex Moffat    /Cast/?AlMo   cast  male   
3 Ego Nwodim     /Cast/?EgNw   cast  unknown
4 Chris Redd     /Cast/?ChRe   cast  male   
5 Kenan Thompson /Cast/?KeTh   cast  male   
6 Carey Mulligan /Guests/?3677 guest andy   
#Remove url column from actors file
snl_actors <- snl_actors %>% select(-c(url))
#display head of snl_actors data set to confirm removal of url column
head(snl_actors)
# A tibble: 6 × 3
  aid            type  gender 
  <chr>          <chr> <chr>  
1 Kate McKinnon  cast  female 
2 Alex Moffat    cast  male   
3 Ego Nwodim     cast  unknown
4 Chris Redd     cast  male   
5 Kenan Thompson cast  male   
6 Carey Mulligan guest andy   

I removed the url column from the actors file, and the output confirms the change took place: the url column is no longer included.

The actors file has 2306 rows; it originally had 4 columns and now has 3 after removing the url column.
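Since aid will serve as a join key later, it is worth a quick sanity check that it uniquely identifies each actor. A minimal sketch, using the snl_actors object created above:

#sanity check: zero rows returned means no aid value appears more than once
snl_actors %>%
  count(aid) %>%
  filter(n > 1)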

Let’s take a look at the actors data set using the dfSummary function from summarytools:

print(summarytools::dfSummary(snl_actors,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary: snl_actors
Dimensions: 2306 x 3 | Duplicates: 0

aid [character]: 'N Sync, 070 Shake, 10,000 Maniacs, 14 Karat Soul, 2 Chainz, 3-D, 3RDEYEGIRL, 50 Cent, A Tribe Called Quest, A. Whitney Brown: 1 each (0.0%); [2296 others]: 2296 (99.6%); Missing: 0 (0.0%)
type [character]: cast 154 (6.7%), crew 170 (7.4%), guest 1926 (83.5%), unknown 56 (2.4%); Missing: 0 (0.0%)
gender [character]: andy 21 (0.9%), female 671 (29.1%), male 1226 (53.2%), unknown 388 (16.8%); Missing: 0 (0.0%)

Summarytools provides more information than we strictly need here, but seeing all of it may inspire additional analysis, or prompt us to investigate aspects of the data set we weren't previously focusing on.

The casts data set format and data content:

Here’s the basic file characteristics for the casts data set:

#creating casts variable from file snl_casts.csv
snl_casts <- read_csv("_data/snl_casts.csv")
#The format and content of the dataset from the file snl_casts.csv
dim(snl_casts)
[1] 614   8
head(snl_casts)
# A tibble: 6 × 8
  aid                sid featured first_epid last_epid update_…¹ n_epi…² seaso…³
  <chr>            <dbl> <lgl>         <dbl>     <dbl> <lgl>       <dbl>   <dbl>
1 A. Whitney Brown    11 TRUE       19860222        NA FALSE           8   0.444
2 A. Whitney Brown    12 TRUE             NA        NA FALSE          20   1    
3 A. Whitney Brown    13 TRUE             NA        NA FALSE          13   1    
4 A. Whitney Brown    14 TRUE             NA        NA FALSE          20   1    
5 A. Whitney Brown    15 TRUE             NA        NA FALSE          20   1    
6 A. Whitney Brown    16 TRUE             NA        NA FALSE          20   1    
# … with abbreviated variable names ¹​update_anchor, ²​n_episodes,
#   ³​season_fraction

The casts file has 614 rows and 8 columns.

print(summarytools::dfSummary(snl_casts,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary: snl_casts
Dimensions: 614 x 8 | Duplicates: 0

aid [character]: Kenan Thompson 18 (2.9%), Darrell Hammond 14 (2.3%), Seth Meyers 13 (2.1%), Al Franken 11 (1.8%), Fred Armisen 11 (1.8%), Kate McKinnon 10 (1.6%), Tim Meadows 10 (1.6%), Aidy Bryant 9 (1.5%), Bobby Moynihan 9 (1.5%), Cecily Strong 9 (1.5%); [146 others]: 500 (81.4%); Missing: 0 (0.0%)
sid [numeric]: Mean (sd): 25.5 (13.1); min ≤ med ≤ max: 1 ≤ 26 ≤ 46; IQR (CV): 22 (0.5); 46 distinct values; Missing: 0 (0.0%)
featured [logical]: FALSE 451 (73.5%), TRUE 163 (26.5%); Missing: 0 (0.0%)
first_epid [numeric]: Mean (sd): 19909634 (111264.8); min ≤ med ≤ max: 19770115 ≤ 19901110 ≤ 20141025; IQR (CV): 156624.2 (0); 35 distinct values; Missing: 564 (91.9%)
last_epid [numeric]: Mean (sd): 19944038 (126122.2); min ≤ med ≤ max: 19751011 ≤ 19950225 ≤ 20140201; IQR (CV): 190005 (0); 17 distinct values; Missing: 597 (97.2%)
update_anchor [logical]: FALSE 541 (88.1%), TRUE 73 (11.9%); Missing: 0 (0.0%)
n_episodes [numeric]: Mean (sd): 18.7 (4); min ≤ med ≤ max: 1 ≤ 20 ≤ 24; IQR (CV): 2 (0.2); 22 distinct values; Missing: 0 (0.0%)
season_fraction [numeric]: Mean (sd): 0.9 (0.2); min ≤ med ≤ max: 0 ≤ 1 ≤ 1; IQR (CV): 0 (0.2); 36 distinct values; Missing: 0 (0.0%)

The seasons data set format and data content:

Here’s the basic file characteristics for the seasons data set:

#creating seasons variable from file snl_seasons.csv:
snl_seasons <- read_csv("_data/snl_seasons.csv")
#Showing the format and contents of the data set of the file snl_seasons.csv 
dim(snl_seasons)
[1] 46  5
head(snl_seasons)
# A tibble: 6 × 5
    sid  year first_epid last_epid n_episodes
  <dbl> <dbl>      <dbl>     <dbl>      <dbl>
1     1  1975   19751011  19760731         24
2     2  1976   19760918  19770521         22
3     3  1977   19770924  19780520         20
4     4  1978   19781007  19790526         20
5     5  1979   19791013  19800524         20
6     6  1980   19801115  19810411         13
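The first_epid and last_epid columns in these files are stored as eight-digit yyyymmdd integers. If date arithmetic were needed, a mutate step along the following lines could convert them to proper dates; this is only a sketch, lubridate is not loaded above, and the snl_seasons_dated name is purely illustrative:

#illustrative sketch only: convert yyyymmdd integers to Date columns
library(lubridate)
snl_seasons_dated <- snl_seasons %>%
  mutate(first_epid = ymd(first_epid),
         last_epid  = ymd(last_epid))
head(snl_seasons_dated)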

Now, let’s take a look at the seasons data set using the dfSummary function from summarytools:

print(summarytools::dfSummary(snl_seasons,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary: snl_seasons
Dimensions: 46 x 5 | Duplicates: 0

sid [numeric]: Mean (sd): 23.5 (13.4); min ≤ med ≤ max: 1 ≤ 23.5 ≤ 46; IQR (CV): 22.5 (0.6); 46 distinct values; Missing: 0 (0.0%)
year [numeric]: Mean (sd): 1997.5 (13.4); min ≤ med ≤ max: 1975 ≤ 1997.5 ≤ 2020; IQR (CV): 22.5 (0); 46 distinct values; Missing: 0 (0.0%)
first_epid [numeric]: Mean (sd): 19975965 (134209.3); min ≤ med ≤ max: 19751011 ≤ 19975926 ≤ 20201003; IQR (CV): 224910.2 (0); 46 distinct values; Missing: 0 (0.0%)
last_epid [numeric]: Mean (sd): 19985509 (134223.9); min ≤ med ≤ max: 19760731 ≤ 19985512 ≤ 20210410; IQR (CV): 225066.2 (0); 46 distinct values; Missing: 0 (0.0%)
n_episodes [numeric]: Mean (sd): 19.7 (2.3); min ≤ med ≤ max: 12 ≤ 20 ≤ 24; IQR (CV): 1 (0.1); counts: 12: 1 (2.2%), 13: 2 (4.3%), 17: 2 (4.3%), 18: 2 (4.3%), 19: 3 (6.5%), 20: 23 (50.0%), 21: 7 (15.2%), 22: 5 (10.9%), 24: 1 (2.2%); Missing: 0 (0.0%)

How the data sets will be joined:

Looking at the contents and formats of the three data sets, we can see from the column names that no single column is shared by all three. The casts and seasons data sets share a common key in their columns entitled “sid”, and the casts and actors data sets share a common key in their columns entitled “aid” (a quick check of the shared names is sketched after the steps below). We’ll join the three data sets in two steps within a piped function set:

1. Join snl_casts to snl_seasons.

2. Join the data set created in Step 1 to snl_actors.
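Before running the joins, a quick check of the shared column names confirms this plan. This is a minimal sketch using the three objects read in above; note that snl_casts and snl_seasons share more column names than just sid (first_epid, last_epid, n_episodes), which is why the joined result below ends up with .x/.y suffixed columns:

#which column names are shared between each pair of data sets?
intersect(names(snl_casts), names(snl_seasons))
intersect(names(snl_casts), names(snl_actors))
intersect(names(snl_seasons), names(snl_actors))
#no column name is common to all three data sets
Reduce(intersect, list(names(snl_actors), names(snl_casts), names(snl_seasons)))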

Tidy Data steps needed?

Not every single column of the three data sets is needed; specifically, the column in snl_actors entitled “url” adds nothing to our analysis. When reviewing the individual data sets above, I removed the url column and checked that the removal was successful. Retaining the rest of the columns and rows leaves room for unanticipated analysis, and because the files are not extraordinarily large, we can easily keep everything else without burdening our coding work or analysis. Before joining, it is also worth a quick sanity check that the join keys line up across the data sets, as sketched below.
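Here is a minimal sanity-check sketch of the join keys, using anti_join on the objects created above; each call returns the rows with no match in the other data set, so zero rows means the key lines up:

#rows of snl_casts whose season id has no match in snl_seasons (expecting none)
snl_casts %>%
  anti_join(snl_seasons, by = "sid")
#rows of snl_casts whose actor has no match in snl_actors (expecting none)
snl_casts %>%
  anti_join(snl_actors, by = "aid")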

Joining the data sets

Joining all 3 data sets:

#full join of casts to seasons on "sid", then a full join of that result to actors on "aid"

SNL_csa <- full_join(snl_casts, snl_seasons, by = "sid") %>%
  full_join(., snl_actors, by = "aid")

head(SNL_csa)
# A tibble: 6 × 14
  aid          sid featu…¹ first…² last_…³ updat…⁴ n_epi…⁵ seaso…⁶  year first…⁷
  <chr>      <dbl> <lgl>     <dbl>   <dbl> <lgl>     <dbl>   <dbl> <dbl>   <dbl>
1 A. Whitne…    11 TRUE     1.99e7      NA FALSE         8   0.444  1985  1.99e7
2 A. Whitne…    12 TRUE    NA           NA FALSE        20   1      1986  1.99e7
3 A. Whitne…    13 TRUE    NA           NA FALSE        13   1      1987  1.99e7
4 A. Whitne…    14 TRUE    NA           NA FALSE        20   1      1988  1.99e7
5 A. Whitne…    15 TRUE    NA           NA FALSE        20   1      1989  1.99e7
6 A. Whitne…    16 TRUE    NA           NA FALSE        20   1      1990  1.99e7
# … with 4 more variables: last_epid.y <dbl>, n_episodes.y <dbl>, type <chr>,
#   gender <chr>, and abbreviated variable names ¹​featured, ²​first_epid.x,
#   ³​last_epid.x, ⁴​update_anchor, ⁵​n_episodes.x, ⁶​season_fraction,
#   ⁷​first_epid.y
dim(SNL_csa)
[1] 2764   14

The combined data set of the 3 files for SNL has 2764 rows and 14 columns. This is what we should expect: the casts/seasons join keeps the 614 cast-season rows, and the full join with actors then adds one row for each of the 2,150 actors (mostly guests and crew) who never appear in the casts file, giving 2,764 rows in total. A quick check of that arithmetic is sketched below.
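Here is a minimal sketch of that arithmetic, assuming (as the summary above suggests) that each aid appears only once in snl_actors:

#cast-season rows carried through from the casts/seasons join
nrow(snl_casts)
#actors with no row in the casts file (guests, crew, and others)
sum(!snl_actors$aid %in% snl_casts$aid)
#the two together should equal the row count of the joined data set
nrow(snl_casts) + sum(!snl_actors$aid %in% snl_casts$aid) == nrow(SNL_csa)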

We see the join of all three data sets was successful.

We can now see the full summarytools view of all three data sets combined into one data set:

print(summarytools::dfSummary(SNL_csa,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary: SNL_csa
Dimensions: 2764 x 14 | Duplicates: 0

aid [character]: Kenan Thompson 18 (0.7%), Darrell Hammond 14 (0.5%), Seth Meyers 13 (0.5%), Al Franken 11 (0.4%), Fred Armisen 11 (0.4%), Kate McKinnon 10 (0.4%), Tim Meadows 10 (0.4%), Aidy Bryant 9 (0.3%), Bobby Moynihan 9 (0.3%), Cecily Strong 9 (0.3%); [2296 others]: 2650 (95.9%); Missing: 0 (0.0%)
sid [numeric]: Mean (sd): 25.5 (13.1); min ≤ med ≤ max: 1 ≤ 26 ≤ 46; IQR (CV): 22 (0.5); 46 distinct values; Missing: 2150 (77.8%)
featured [logical]: FALSE 451 (73.5%), TRUE 163 (26.5%); Missing: 2150 (77.8%)
first_epid.x [numeric]: Mean (sd): 19909634 (111264.8); min ≤ med ≤ max: 19770115 ≤ 19901110 ≤ 20141025; IQR (CV): 156624.2 (0); 35 distinct values; Missing: 2714 (98.2%)
last_epid.x [numeric]: Mean (sd): 19944038 (126122.2); min ≤ med ≤ max: 19751011 ≤ 19950225 ≤ 20140201; IQR (CV): 190005 (0); 17 distinct values; Missing: 2747 (99.4%)
update_anchor [logical]: FALSE 541 (88.1%), TRUE 73 (11.9%); Missing: 2150 (77.8%)
n_episodes.x [numeric]: Mean (sd): 18.7 (4); min ≤ med ≤ max: 1 ≤ 20 ≤ 24; IQR (CV): 2 (0.2); 22 distinct values; Missing: 2150 (77.8%)
season_fraction [numeric]: Mean (sd): 0.9 (0.2); min ≤ med ≤ max: 0 ≤ 1 ≤ 1; IQR (CV): 0 (0.2); 36 distinct values; Missing: 2150 (77.8%)
year [numeric]: Mean (sd): 1999.5 (13.1); min ≤ med ≤ max: 1975 ≤ 2000 ≤ 2020; IQR (CV): 22 (0); 46 distinct values; Missing: 2150 (77.8%)
first_epid.y [numeric]: Mean (sd): 19995688 (130607); min ≤ med ≤ max: 19751011 ≤ 20001007 ≤ 20201003; IQR (CV): 219994 (0); 46 distinct values; Missing: 2150 (77.8%)
last_epid.y [numeric]: Mean (sd): 20005232 (130621.3); min ≤ med ≤ max: 19760731 ≤ 20010519 ≤ 20210410; IQR (CV): 220000 (0); 46 distinct values; Missing: 2150 (77.8%)
n_episodes.y [numeric]: Mean (sd): 19.7 (2.1); min ≤ med ≤ max: 12 ≤ 20 ≤ 24; IQR (CV): 1 (0.1); counts: 12: 12 (2.0%), 13: 24 (3.9%), 17: 30 (4.9%), 18: 31 (5.0%), 19: 37 (6.0%), 20: 297 (48.4%), 21: 111 (18.1%), 22: 63 (10.3%), 24: 9 (1.5%); Missing: 2150 (77.8%)
type [character]: cast 607 (22.0%), crew 170 (6.2%), guest 1926 (69.7%), unknown 61 (2.2%); Missing: 0 (0.0%)
gender [character]: andy 21 (0.8%), female 813 (29.4%), male 1540 (55.7%), unknown 390 (14.1%); Missing: 0 (0.0%)

Analysis of New, Combined SNL Dataset

ggplot(SNL_csa, aes(x=`year`, y=`n_episodes.y`)) +
    geom_point()+
    labs(title = "Number of SNL Episodes by Year, 1975-2020")

We can quickly and easily conclude that the most frequent count of episodes per year is 20. A quick tabulation of the seasons data, sketched below, points to the same answer.
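Episode counts are a season-level attribute, and the joined data repeats them once per cast member and guest, so the seasons table alone gives a cleaner cross-check:

#how many seasons had each episode count? (20 episodes is the most common)
snl_seasons %>%
  count(n_episodes, sort = TRUE)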

CONCLUSION

Combining data sets with the join functions is powerful and essential. Identifying common column content across the data sets is one of the first steps, after determining how you want your end result to appear. Some pairs of data sets do not share a common column at all and must be joined through a third data set, with each of the two original data sets sharing a different common column with that third data set. That is exactly the role snl_casts plays here: it links snl_seasons (via sid) to snl_actors (via aid). A generic sketch of that bridging pattern is shown below.
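As a closing illustration of that bridging pattern, here is a hedged sketch repeating the same chain with inner joins so that only matched rows are kept; the seasons_to_actors name is purely illustrative:

#link snl_seasons to snl_actors through snl_casts, keeping only matched rows
seasons_to_actors <- snl_seasons %>%
  inner_join(snl_casts, by = "sid") %>%
  inner_join(snl_actors, by = "aid")
head(seasons_to_actors)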