Challenge 8

challenge_8

snl

tidyverse

summarytools

Joining Data

Author

Saaradhaa M

Published

August 25, 2022

library(tidyverse)
library(summarytools)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

I’m gonna try to put together 3 SNL datasets - let’s read both of them in.

snl_actors <- (read_csv("_data/snl_actors.csv", show_col_types = FALSE) [,-2])
snl_actors <- complete(snl_actors)

snl_casts <- read_csv("_data/snl_casts.csv", show_col_types = FALSE)
snl_casts <- complete(snl_casts)

snl_seasons <- read_csv("_data/snl_seasons.csv", show_col_types = FALSE)
snl_seasons <- complete(snl_seasons)

Briefly describe and tidy data

print(dfSummary(snl_actors, varnumbers = FALSE, plain.ascii = FALSE, graph.magnif = 0.30, style = "grid", valid.col = FALSE), 
      method = 'render', table.classes = 'table-condensed')

Data Frame Summary

snl_actors

Dimensions: 2306 x 3
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

aid [character]

1. 'N Sync

2. 070 Shake

3. 10,000 Maniacs

4. 14 Karat Soul

5. 2 Chainz

6. 3-D

7. 3RDEYEGIRL

8. 50 Cent

9. A Tribe Called Quest

10. A. Whitney Brown

[ 2296 others ]

1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
2296	(	99.6%	)

0 (0.0%)

type [character]

1. cast

2. crew

3. guest

4. unknown

154	(	6.7%	)
170	(	7.4%	)
1926	(	83.5%	)
56	(	2.4%	)

0 (0.0%)

gender [character]

1. andy

2. female

3. male

4. unknown

21	(	0.9%	)
671	(	29.1%	)
1226	(	53.2%	)
388	(	16.8%	)

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-24

print(dfSummary(snl_casts, varnumbers = FALSE, plain.ascii = FALSE, graph.magnif = 0.30, style = "grid", valid.col = FALSE), 
      method = 'render', table.classes = 'table-condensed')

Data Frame Summary

snl_casts

Dimensions: 614 x 8
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

aid [character]

1. Kenan Thompson

2. Darrell Hammond

3. Seth Meyers

4. Al Franken

5. Fred Armisen

6. Kate McKinnon

7. Tim Meadows

8. Aidy Bryant

9. Bobby Moynihan

10. Cecily Strong

[ 146 others ]

18	(	2.9%	)
14	(	2.3%	)
13	(	2.1%	)
11	(	1.8%	)
11	(	1.8%	)
10	(	1.6%	)
10	(	1.6%	)
9	(	1.5%	)
9	(	1.5%	)
9	(	1.5%	)
500	(	81.4%	)

0 (0.0%)

sid [numeric]

Mean (sd) : 25.5 (13.1)

min ≤ med ≤ max:

1 ≤ 26 ≤ 46

IQR (CV) : 22 (0.5)

46 distinct values

0 (0.0%)

featured [logical]

1. FALSE

2. TRUE

451	(	73.5%	)
163	(	26.5%	)

0 (0.0%)

first_epid [numeric]

Mean (sd) : 19909634 (111264.8)

min ≤ med ≤ max:

19770115 ≤ 19901110 ≤ 20141025

IQR (CV) : 156624.2 (0)

35 distinct values

564 (91.9%)

last_epid [numeric]

Mean (sd) : 19944038 (126122.2)

min ≤ med ≤ max:

19751011 ≤ 19950225 ≤ 20140201

IQR (CV) : 190005 (0)

17 distinct values

597 (97.2%)

update_anchor [logical]

1. FALSE

2. TRUE

541	(	88.1%	)
73	(	11.9%	)

0 (0.0%)

n_episodes [numeric]

Mean (sd) : 18.7 (4)

min ≤ med ≤ max:

1 ≤ 20 ≤ 24

IQR (CV) : 2 (0.2)

22 distinct values

0 (0.0%)

season_fraction [numeric]

Mean (sd) : 0.9 (0.2)

min ≤ med ≤ max:

0 ≤ 1 ≤ 1

IQR (CV) : 0 (0.2)

36 distinct values

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-24

Tidying snl_actors (2306 rows, 3 columns):
- No missing data.
- We can leave “aid” as a character column and convert type and gender to factors.
Tidying snl_casts (614 rows, 8 columns):
- Missing data in first_epid and last_epid - this is fine because we used complete() to make the NAs explicit.
- “aid” is also present in this dataset, which hopefully makes it intuitive to put the 2 datasets together (no renaming needed). There are much fewer values for Actor ID, so this might just be a subset of the other dataset.
Tidying snl_seasons (46 rows, 5 columns):
No tidying needed for now - I’ll come back to this later.

# mutation.
snl_actors <- snl_actors %>% mutate(type = as.factor(type), gender = as.factor(gender))
                                    
# sanity check.
head(snl_actors)

Join Data (Attempt 1)

snl_mega <- snl_actors %>%
  left_join(snl_casts, by = "aid")
print(dfSummary(snl_mega, varnumbers = FALSE, plain.ascii = FALSE, graph.magnif = 0.30, style = "grid", valid.col = FALSE), 
      method = 'render', table.classes = 'table-condensed')

Data Frame Summary

snl_mega

Dimensions: 2764 x 10
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

aid [character]

1. Kenan Thompson

2. Darrell Hammond

3. Seth Meyers

4. Al Franken

5. Fred Armisen

6. Kate McKinnon

7. Tim Meadows

8. Aidy Bryant

9. Bobby Moynihan

10. Cecily Strong

[ 2296 others ]

18	(	0.7%	)
14	(	0.5%	)
13	(	0.5%	)
11	(	0.4%	)
11	(	0.4%	)
10	(	0.4%	)
10	(	0.4%	)
9	(	0.3%	)
9	(	0.3%	)
9	(	0.3%	)
2650	(	95.9%	)

0 (0.0%)

type [factor]

1. cast

2. crew

3. guest

4. unknown

607	(	22.0%	)
170	(	6.2%	)
1926	(	69.7%	)
61	(	2.2%	)

0 (0.0%)

gender [factor]

1. andy

2. female

3. male

4. unknown

21	(	0.8%	)
813	(	29.4%	)
1540	(	55.7%	)
390	(	14.1%	)

0 (0.0%)

sid [numeric]

Mean (sd) : 25.5 (13.1)

min ≤ med ≤ max:

1 ≤ 26 ≤ 46

IQR (CV) : 22 (0.5)

46 distinct values

2150 (77.8%)

featured [logical]

1. FALSE

2. TRUE

451	(	73.5%	)
163	(	26.5%	)

2150 (77.8%)

first_epid [numeric]

Mean (sd) : 19909634 (111264.8)

min ≤ med ≤ max:

19770115 ≤ 19901110 ≤ 20141025

IQR (CV) : 156624.2 (0)

35 distinct values

2714 (98.2%)

last_epid [numeric]

Mean (sd) : 19944038 (126122.2)

min ≤ med ≤ max:

19751011 ≤ 19950225 ≤ 20140201

IQR (CV) : 190005 (0)

17 distinct values

2747 (99.4%)

update_anchor [logical]

1. FALSE

2. TRUE

541	(	88.1%	)
73	(	11.9%	)

2150 (77.8%)

n_episodes [numeric]

Mean (sd) : 18.7 (4)

min ≤ med ≤ max:

1 ≤ 20 ≤ 24

IQR (CV) : 2 (0.2)

22 distinct values

2150 (77.8%)

season_fraction [numeric]

Mean (sd) : 0.9 (0.2)

min ≤ med ≤ max:

0 ≤ 1 ≤ 1

IQR (CV) : 0 (0.2)

36 distinct values

2150 (77.8%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-24

After joining, we have 10 columns, which was expected (3 + 8 - 1 = 10). We have a lot more missing values, which was also expected - the snl_casts dataset, which has more columns, has much fewer rows. How did we get 2764 rows, though –> 2306 + 614 - 2764 = 156. I think there are only 156 overlaps in both datasets - let’s try to verify this.

snl_casts %>% select(aid) %>%
  filter(snl_casts$aid %in% snl_actors$aid) %>% unique()

Ok, we’ve verified that. I want to map over the values from snl_mega just for these 156 actors, so as to minimize the number of NA values. I would usually do an index/match on Excel for this - how would I do this in R?

Join Data (Attempt 2)

I’m also going to try to add 1 column from snl_mega to snl_seasons (another dataset) –> number of actors per season.

# create subset.
sub <- snl_mega %>% select(aid, sid) %>%  
  filter(! is.na(aid)) %>% 
  group_by(sid) %>% unique()

# get number of actors per season.
sub_final <- sub %>% group_by(sid) %>% summarise("actors_per_season"=n())

# sanity check.
sum(sub_final$actors_per_season)

[1] 2764

Ok, now let me try joining them.

snl_szns <- snl_seasons %>%
  right_join(sub_final, by = "sid")
print(dfSummary(snl_szns, varnumbers = FALSE, plain.ascii = FALSE, graph.magnif = 0.30, style = "grid", valid.col = FALSE), 
      method = 'render', table.classes = 'table-condensed')

Data Frame Summary

snl_szns

Dimensions: 47 x 6
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

sid [numeric]

Mean (sd) : 23.5 (13.4)

min ≤ med ≤ max:

1 ≤ 23.5 ≤ 46

IQR (CV) : 22.5 (0.6)

46 distinct values

1 (2.1%)

year [numeric]

Mean (sd) : 1997.5 (13.4)

min ≤ med ≤ max:

1975 ≤ 1997.5 ≤ 2020

IQR (CV) : 22.5 (0)

46 distinct values

1 (2.1%)

first_epid [numeric]

Mean (sd) : 19975965 (134209.3)

min ≤ med ≤ max:

19751011 ≤ 19975926 ≤ 20201003

IQR (CV) : 224910.2 (0)

46 distinct values

1 (2.1%)

last_epid [numeric]

Mean (sd) : 19985509 (134223.9)

min ≤ med ≤ max:

19760731 ≤ 19985512 ≤ 20210410

IQR (CV) : 225066.2 (0)

46 distinct values

1 (2.1%)

n_episodes [numeric]

Mean (sd) : 19.7 (2.3)

min ≤ med ≤ max:

12 ≤ 20 ≤ 24

IQR (CV) : 1 (0.1)

12	:	1	(	2.2%	)
13	:	2	(	4.3%	)
17	:	2	(	4.3%	)
18	:	2	(	4.3%	)
19	:	3	(	6.5%	)
20	:	23	(	50.0%	)
21	:	7	(	15.2%	)
22	:	5	(	10.9%	)
24	:	1	(	2.2%	)

1 (2.1%)

actors_per_season [integer]

Mean (sd) : 58.8 (311.7)

min ≤ med ≤ max:

8 ≤ 14 ≤ 2150

IQR (CV) : 5 (5.3)

13 distinct values

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-24

Ok - from the univariate graph produced by summarytools, it looks like the seasons in the middle had the most actors. We do need to qualify this statement, because we filtered out NA values when calculating actors per season.