Challenge 8

Categories: challenge_8, snl, tidyverse, summarytools, Joining Data

Author: Saaradhaa M
Published: August 25, 2022

library(tidyverse)
library(summarytools)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

I’m going to try to put together 3 SNL datasets - let’s read all three in.

snl_actors <- read_csv("_data/snl_actors.csv", show_col_types = FALSE)[, -2]
snl_actors <- complete(snl_actors)

snl_casts <- read_csv("_data/snl_casts.csv", show_col_types = FALSE)
snl_casts <- complete(snl_casts)

snl_seasons <- read_csv("_data/snl_seasons.csv", show_col_types = FALSE)
snl_seasons <- complete(snl_seasons)

Briefly describe and tidy data

print(dfSummary(snl_actors, varnumbers = FALSE, plain.ascii = FALSE, graph.magnif = 0.30, style = "grid", valid.col = FALSE), 
      method = 'render', table.classes = 'table-condensed')

Data Frame Summary: snl_actors
Dimensions: 2306 x 3 | Duplicates: 0

  • aid [character]: top values (’N Sync, 070 Shake, 10,000 Maniacs, 14 Karat Soul, 2 Chainz, 3-D, 3RDEYEGIRL, 50 Cent, A Tribe Called Quest, A. Whitney Brown) each appear once (0.0%); 2296 others (99.6%). Missing: 0 (0.0%).
  • type [character]: cast 154 (6.7%), crew 170 (7.4%), guest 1926 (83.5%), unknown 56 (2.4%). Missing: 0 (0.0%).
  • gender [character]: andy 21 (0.9%), female 671 (29.1%), male 1226 (53.2%), unknown 388 (16.8%). Missing: 0 (0.0%).

print(dfSummary(snl_casts, varnumbers = FALSE, plain.ascii = FALSE, graph.magnif = 0.30, style = "grid", valid.col = FALSE), 
      method = 'render', table.classes = 'table-condensed')

Data Frame Summary: snl_casts
Dimensions: 614 x 8 | Duplicates: 0

  • aid [character]: 156 distinct values; most frequent: Kenan Thompson 18 (2.9%), Darrell Hammond 14 (2.3%), Seth Meyers 13 (2.1%), Al Franken 11 (1.8%), Fred Armisen 11 (1.8%), Kate McKinnon 10 (1.6%), Tim Meadows 10 (1.6%), Aidy Bryant 9 (1.5%), Bobby Moynihan 9 (1.5%), Cecily Strong 9 (1.5%); 146 others 500 (81.4%). Missing: 0 (0.0%).
  • sid [numeric]: mean (sd) 25.5 (13.1); min ≤ med ≤ max: 1 ≤ 26 ≤ 46; IQR (CV) 22 (0.5); 46 distinct values. Missing: 0 (0.0%).
  • featured [logical]: FALSE 451 (73.5%), TRUE 163 (26.5%). Missing: 0 (0.0%).
  • first_epid [numeric]: mean (sd) 19909634 (111264.8); min ≤ med ≤ max: 19770115 ≤ 19901110 ≤ 20141025; IQR (CV) 156624.2 (0); 35 distinct values. Missing: 564 (91.9%).
  • last_epid [numeric]: mean (sd) 19944038 (126122.2); min ≤ med ≤ max: 19751011 ≤ 19950225 ≤ 20140201; IQR (CV) 190005 (0); 17 distinct values. Missing: 597 (97.2%).
  • update_anchor [logical]: FALSE 541 (88.1%), TRUE 73 (11.9%). Missing: 0 (0.0%).
  • n_episodes [numeric]: mean (sd) 18.7 (4); min ≤ med ≤ max: 1 ≤ 20 ≤ 24; IQR (CV) 2 (0.2); 22 distinct values. Missing: 0 (0.0%).
  • season_fraction [numeric]: mean (sd) 0.9 (0.2); min ≤ med ≤ max: 0 ≤ 1 ≤ 1; IQR (CV) 0 (0.2); 36 distinct values. Missing: 0 (0.0%).

  • Tidying snl_actors (2306 rows, 3 columns):
    • No missing data.
    • We can leave “aid” as a character column and convert type and gender to factors.
  • Tidying snl_casts (614 rows, 8 columns):
    • Missing data in first_epid and last_epid - this is fine, because we used complete() to make the NAs explicit.
    • “aid” is also present in this dataset, which hopefully makes it intuitive to put the two datasets together (no renaming needed). There are far fewer values for Actor ID, so this might just be a subset of the other dataset - see the quick check after this list.
  • Tidying snl_seasons (46 rows, 5 columns):
    • No tidying needed for now - I’ll come back to this later.
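Quick check that the actors in snl_casts are indeed a subset of snl_actors (a sketch using base R set functions):

# TRUE if every aid in snl_casts also appears in snl_actors.
all(unique(snl_casts$aid) %in% snl_actors$aid)

# any cast aids that are NOT in snl_actors (empty if snl_casts is a subset).
setdiff(snl_casts$aid, snl_actors$aid)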
# convert type and gender to factors.
snl_actors <- snl_actors %>%
  mutate(type = as.factor(type), gender = as.factor(gender))

# sanity check.
head(snl_actors)

Join Data (Attempt 1)

snl_mega <- snl_actors %>%
  left_join(snl_casts, by = "aid")
print(dfSummary(snl_mega, varnumbers = FALSE, plain.ascii = FALSE, graph.magnif = 0.30, style = "grid", valid.col = FALSE), 
      method = 'render', table.classes = 'table-condensed')

Data Frame Summary: snl_mega
Dimensions: 2764 x 10 | Duplicates: 0

  • aid [character]: most frequent: Kenan Thompson 18 (0.7%), Darrell Hammond 14 (0.5%), Seth Meyers 13 (0.5%), Al Franken 11 (0.4%), Fred Armisen 11 (0.4%), Kate McKinnon 10 (0.4%), Tim Meadows 10 (0.4%), Aidy Bryant 9 (0.3%), Bobby Moynihan 9 (0.3%), Cecily Strong 9 (0.3%); 2296 others 2650 (95.9%). Missing: 0 (0.0%).
  • type [factor]: cast 607 (22.0%), crew 170 (6.2%), guest 1926 (69.7%), unknown 61 (2.2%). Missing: 0 (0.0%).
  • gender [factor]: andy 21 (0.8%), female 813 (29.4%), male 1540 (55.7%), unknown 390 (14.1%). Missing: 0 (0.0%).
  • sid [numeric]: mean (sd) 25.5 (13.1); min ≤ med ≤ max: 1 ≤ 26 ≤ 46; IQR (CV) 22 (0.5); 46 distinct values. Missing: 2150 (77.8%).
  • featured [logical]: FALSE 451 (73.5%), TRUE 163 (26.5%). Missing: 2150 (77.8%).
  • first_epid [numeric]: mean (sd) 19909634 (111264.8); min ≤ med ≤ max: 19770115 ≤ 19901110 ≤ 20141025; IQR (CV) 156624.2 (0); 35 distinct values. Missing: 2714 (98.2%).
  • last_epid [numeric]: mean (sd) 19944038 (126122.2); min ≤ med ≤ max: 19751011 ≤ 19950225 ≤ 20140201; IQR (CV) 190005 (0); 17 distinct values. Missing: 2747 (99.4%).
  • update_anchor [logical]: FALSE 541 (88.1%), TRUE 73 (11.9%). Missing: 2150 (77.8%).
  • n_episodes [numeric]: mean (sd) 18.7 (4); min ≤ med ≤ max: 1 ≤ 20 ≤ 24; IQR (CV) 2 (0.2); 22 distinct values. Missing: 2150 (77.8%).
  • season_fraction [numeric]: mean (sd) 0.9 (0.2); min ≤ med ≤ max: 0 ≤ 1 ≤ 1; IQR (CV) 0 (0.2); 36 distinct values. Missing: 2150 (77.8%).

After joining, we have 10 columns, which was expected (3 + 8 - 1 = 10). We have a lot more missing values, which was also expected - the snl_casts dataset, which has more columns, has far fewer rows. How did we get 2764 rows, though? Since 2306 + 614 - 2764 = 156, I think 156 actors appear in both datasets (each of their single rows in snl_actors gets expanded into one row per season from snl_casts) - let’s try to verify this.

# the actors that appear in both snl_casts and snl_actors.
snl_casts %>%
  select(aid) %>%
  filter(aid %in% snl_actors$aid) %>%
  unique()
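The same overlap can be counted directly with base R set functions - a sketch, where n_overlap is just an illustrative name:

# number of actor IDs that appear in both datasets.
n_overlap <- length(intersect(snl_actors$aid, snl_casts$aid))
n_overlap

# expected row count after the left join: each matched actor's single row in
# snl_actors is replaced by one row per season they have in snl_casts.
nrow(snl_actors) - n_overlap + nrow(snl_casts)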

Ok, we’ve verified that. I want to keep the snl_mega values just for these 156 actors, so as to minimize the number of NA values. In Excel I would usually do an INDEX/MATCH for this - how would I do this in R?
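The closest tidyverse analogue of INDEX/MATCH here is a join that only keeps the matched actors - a sketch, where snl_matched is just an illustrative name:

# inner_join() keeps only the actors that appear in both tables,
# so none of the cast columns come back entirely NA.
snl_matched <- snl_actors %>%
  inner_join(snl_casts, by = "aid")

# semi_join() is the purely filtering version: it keeps the matching rows of
# snl_actors without adding any columns from snl_casts.
snl_actors %>%
  semi_join(snl_casts, by = "aid")

inner_join() drops the unmatched rows entirely, which is one way to avoid the large blocks of NA values seen in snl_mega.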

Join Data (Attempt 2)

I’m also going to try to add 1 column from snl_mega to snl_seasons (another dataset): the number of actors per season.

# create subset of distinct actor-season pairs.
sub <- snl_mega %>%
  select(aid, sid) %>%
  filter(!is.na(aid)) %>%
  distinct()

# get number of actors per season.
sub_final <- sub %>%
  group_by(sid) %>%
  summarise(actors_per_season = n())

# sanity check.
sum(sub_final$actors_per_season)
[1] 2764
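That matches the number of rows in snl_mega. A quick look at where those 2764 pairs come from (a sketch) - splitting them by whether a season ID is present:

# pairs with a season ID come from snl_casts; pairs without one are actors
# (largely guests) who never appear in snl_casts.
sub %>%
  ungroup() %>%
  summarise(with_season = sum(!is.na(sid)),
            without_season = sum(is.na(sid)))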

Ok, now let me try joining them.

snl_szns <- snl_seasons %>%
  right_join(sub_final, by = "sid")
print(dfSummary(snl_szns, varnumbers = FALSE, plain.ascii = FALSE, graph.magnif = 0.30, style = "grid", valid.col = FALSE), 
      method = 'render', table.classes = 'table-condensed')

Data Frame Summary: snl_szns
Dimensions: 47 x 6 | Duplicates: 0

  • sid [numeric]: mean (sd) 23.5 (13.4); min ≤ med ≤ max: 1 ≤ 23.5 ≤ 46; IQR (CV) 22.5 (0.6); 46 distinct values. Missing: 1 (2.1%).
  • year [numeric]: mean (sd) 1997.5 (13.4); min ≤ med ≤ max: 1975 ≤ 1997.5 ≤ 2020; IQR (CV) 22.5 (0); 46 distinct values. Missing: 1 (2.1%).
  • first_epid [numeric]: mean (sd) 19975965 (134209.3); min ≤ med ≤ max: 19751011 ≤ 19975926 ≤ 20201003; IQR (CV) 224910.2 (0); 46 distinct values. Missing: 1 (2.1%).
  • last_epid [numeric]: mean (sd) 19985509 (134223.9); min ≤ med ≤ max: 19760731 ≤ 19985512 ≤ 20210410; IQR (CV) 225066.2 (0); 46 distinct values. Missing: 1 (2.1%).
  • n_episodes [numeric]: mean (sd) 19.7 (2.3); min ≤ med ≤ max: 12 ≤ 20 ≤ 24; IQR (CV) 1 (0.1); values: 12: 1 (2.2%), 13: 2 (4.3%), 17: 2 (4.3%), 18: 2 (4.3%), 19: 3 (6.5%), 20: 23 (50.0%), 21: 7 (15.2%), 22: 5 (10.9%), 24: 1 (2.2%). Missing: 1 (2.1%).
  • actors_per_season [integer]: mean (sd) 58.8 (311.7); min ≤ med ≤ max: 8 ≤ 14 ≤ 2150; IQR (CV) 5 (5.3); 13 distinct values. Missing: 0 (0.0%).

Ok - from the univariate graph produced by summarytools, it looks like the seasons in the middle had the most actors. We do need to qualify this statement, though: actors with no season information (the rows with a missing sid, largely guests) were kept as their own group in the join, which is where the 2150 outlier in actors_per_season comes from.
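One way to double-check this while setting the NA-season group aside (a sketch):

# drop the row with a missing season ID before ranking seasons by cast size.
snl_szns %>%
  filter(!is.na(sid)) %>%
  select(sid, year, actors_per_season) %>%
  arrange(desc(actors_per_season)) %>%
  head()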