Challenge_8_Jyoti

challenge_8

summarytools

Joining Data

Author

Jyoti Rani

Published

August 22, 2022

Including libraries

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2

Warning: package 'ggplot2' was built under R version 4.2.2

Warning: package 'tibble' was built under R version 4.2.2

Warning: package 'stringr' was built under R version 4.2.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(summarytools)

Warning: package 'summarytools' was built under R version 4.2.2


Attaching package: 'summarytools'

The following object is masked from 'package:tibble':

    view

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

I read 3 SNL datasets, together

snl_actors <- (read_csv("_data/snl_actors.csv", show_col_types = FALSE) [,-2])
snl_actors <- complete(snl_actors)

snl_casts <- read_csv("_data/snl_casts.csv", show_col_types = FALSE)
snl_casts <- complete(snl_casts)

snl_seasons <- read_csv("_data/snl_seasons.csv", show_col_types = FALSE)
snl_seasons <- complete(snl_seasons)

Briefly describe and tidy data

print(dfSummary(snl_actors, varnumbers = FALSE, plain.ascii = FALSE, graph.magnif = 0.30, style = "grid", valid.col = FALSE), 
      method = 'render', table.classes = 'table-condensed')

Data Frame Summary

snl_actors

Dimensions: 2306 x 3
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

aid [character]

1. 'N Sync

2. 070 Shake

3. 10,000 Maniacs

4. 14 Karat Soul

5. 2 Chainz

6. 3-D

7. 3RDEYEGIRL

8. 50 Cent

9. A Tribe Called Quest

10. A$AP Rocky

[ 2296 others ]

1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
2296	(	99.6%	)

0 (0.0%)

type [character]

1. cast

2. crew

3. guest

4. unknown

154	(	6.7%	)
170	(	7.4%	)
1926	(	83.5%	)
56	(	2.4%	)

0 (0.0%)

gender [character]

1. andy

2. female

3. male

4. unknown

21	(	0.9%	)
671	(	29.1%	)
1226	(	53.2%	)
388	(	16.8%	)

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-23

print(dfSummary(snl_casts, varnumbers = FALSE, plain.ascii = FALSE, graph.magnif = 0.30, style = "grid", valid.col = FALSE), 
      method = 'render', table.classes = 'table-condensed')

Data Frame Summary

snl_casts

Dimensions: 614 x 8
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

aid [character]

1. Kenan Thompson

2. Darrell Hammond

3. Seth Meyers

4. Al Franken

5. Fred Armisen

6. Kate McKinnon

7. Tim Meadows

8. Aidy Bryant

9. Bobby Moynihan

10. Cecily Strong

[ 146 others ]

18	(	2.9%	)
14	(	2.3%	)
13	(	2.1%	)
11	(	1.8%	)
11	(	1.8%	)
10	(	1.6%	)
10	(	1.6%	)
9	(	1.5%	)
9	(	1.5%	)
9	(	1.5%	)
500	(	81.4%	)

0 (0.0%)

sid [numeric]

Mean (sd) : 25.5 (13.1)

min ≤ med ≤ max:

1 ≤ 26 ≤ 46

IQR (CV) : 22 (0.5)

46 distinct values

0 (0.0%)

featured [logical]

1. FALSE

2. TRUE

451	(	73.5%	)
163	(	26.5%	)

0 (0.0%)

first_epid [numeric]

Mean (sd) : 19909635 (111264.8)

min ≤ med ≤ max:

19770115 ≤ 19901110 ≤ 20141025

IQR (CV) : 156624.2 (0)

35 distinct values

564 (91.9%)

last_epid [numeric]

Mean (sd) : 19944038 (126122.2)

min ≤ med ≤ max:

19751011 ≤ 19950225 ≤ 20140201

IQR (CV) : 190005 (0)

17 distinct values

597 (97.2%)

update_anchor [logical]

1. FALSE

2. TRUE

541	(	88.1%	)
73	(	11.9%	)

0 (0.0%)

n_episodes [numeric]

Mean (sd) : 18.7 (4)

min ≤ med ≤ max:

1 ≤ 20 ≤ 24

IQR (CV) : 2 (0.2)

22 distinct values

0 (0.0%)

season_fraction [numeric]

Mean (sd) : 0.9 (0.2)

min ≤ med ≤ max:

0 ≤ 1 ≤ 1

IQR (CV) : 0 (0.2)

36 distinct values

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-23

Tidying snl_actors (2306 rows, 3 columns):
- No missing data.
- We can leave “aid” as a character column and convert type and gender to factors.
Tidying snl_casts (614 rows, 8 columns):
- Missing data in first_epid and last_epid - this is fine because we used complete() to make the NAs explicit.
- “aid” is also present in this dataset, which hopefully makes it intuitive to put the 2 datasets together (no renaming needed). There are much fewer values for Actor ID, so this might just be a subset of the other dataset.
Tidying snl_seasons (46 rows, 5 columns):

# mutation.
snl_actors <- snl_actors %>% mutate(type = as.factor(type), gender = as.factor(gender))
                                    
# sanity check.
head(snl_actors)

Join Data 1

snl_mega <- snl_actors %>%
  left_join(snl_casts, by = "aid")
print(dfSummary(snl_mega, varnumbers = FALSE, plain.ascii = FALSE, graph.magnif = 0.30, style = "grid", valid.col = FALSE), 
      method = 'render', table.classes = 'table-condensed')

Data Frame Summary

snl_mega

Dimensions: 2764 x 10
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

aid [character]

1. Kenan Thompson

2. Darrell Hammond

3. Seth Meyers

4. Al Franken

5. Fred Armisen

6. Kate McKinnon

7. Tim Meadows

8. Aidy Bryant

9. Bobby Moynihan

10. Cecily Strong

[ 2296 others ]

18	(	0.7%	)
14	(	0.5%	)
13	(	0.5%	)
11	(	0.4%	)
11	(	0.4%	)
10	(	0.4%	)
10	(	0.4%	)
9	(	0.3%	)
9	(	0.3%	)
9	(	0.3%	)
2650	(	95.9%	)

0 (0.0%)

type [factor]

1. cast

2. crew

3. guest

4. unknown

607	(	22.0%	)
170	(	6.2%	)
1926	(	69.7%	)
61	(	2.2%	)

0 (0.0%)

gender [factor]

1. andy

2. female

3. male

4. unknown

21	(	0.8%	)
813	(	29.4%	)
1540	(	55.7%	)
390	(	14.1%	)

0 (0.0%)

sid [numeric]

Mean (sd) : 25.5 (13.1)

min ≤ med ≤ max:

1 ≤ 26 ≤ 46

IQR (CV) : 22 (0.5)

46 distinct values

2150 (77.8%)

featured [logical]

1. FALSE

2. TRUE

451	(	73.5%	)
163	(	26.5%	)

2150 (77.8%)

first_epid [numeric]

Mean (sd) : 19909635 (111264.8)

min ≤ med ≤ max:

19770115 ≤ 19901110 ≤ 20141025

IQR (CV) : 156624.2 (0)

35 distinct values

2714 (98.2%)

last_epid [numeric]

Mean (sd) : 19944038 (126122.2)

min ≤ med ≤ max:

19751011 ≤ 19950225 ≤ 20140201

IQR (CV) : 190005 (0)

17 distinct values

2747 (99.4%)

update_anchor [logical]

1. FALSE

2. TRUE

541	(	88.1%	)
73	(	11.9%	)

2150 (77.8%)

n_episodes [numeric]

Mean (sd) : 18.7 (4)

min ≤ med ≤ max:

1 ≤ 20 ≤ 24

IQR (CV) : 2 (0.2)

22 distinct values

2150 (77.8%)

season_fraction [numeric]

Mean (sd) : 0.9 (0.2)

min ≤ med ≤ max:

0 ≤ 1 ≤ 1

IQR (CV) : 0 (0.2)

36 distinct values

2150 (77.8%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-23

After joining, we have 10 columns. We have a lot more missing values, which was also expected - the snl_casts dataset, which has more columns, has much fewer rows. There are 156 overlaps in both datasets - let’s try to verify this.

snl_casts %>% select(aid) %>%
  filter(snl_casts$aid %in% snl_actors$aid) %>% unique()

Now I map over the values from snl_mega just for these 156 actors, so as to minimize the number of NA values. I would usually do an index/match on Excel for this - how would I do this in R?

Join Data2

add 1 column from snl_mega to snl_seasons (another dataset) –> number of actors per season.

# create subset.
sub <- snl_mega %>% select(aid, sid) %>%  
  filter(! is.na(aid)) %>% 
  group_by(sid) %>% unique()

# get number of actors per season.
sub_final <- sub %>% group_by(sid) %>% summarise("actors_per_season"=n())

# sanity check.
sum(sub_final$actors_per_season)

[1] 2764

Now I join them.

snl_szns <- snl_seasons %>%
  right_join(sub_final, by = "sid")
print(dfSummary(snl_szns, varnumbers = FALSE, plain.ascii = FALSE, graph.magnif = 0.30, style = "grid", valid.col = FALSE), 
      method = 'render', table.classes = 'table-condensed')

Data Frame Summary

snl_szns

Dimensions: 47 x 6
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

sid [numeric]

Mean (sd) : 23.5 (13.4)

min ≤ med ≤ max:

1 ≤ 23.5 ≤ 46

IQR (CV) : 22.5 (0.6)

46 distinct values

1 (2.1%)

year [numeric]

Mean (sd) : 1997.5 (13.4)

min ≤ med ≤ max:

1975 ≤ 1997.5 ≤ 2020

IQR (CV) : 22.5 (0)

46 distinct values

1 (2.1%)

first_epid [numeric]

Mean (sd) : 19975965 (134209.3)

min ≤ med ≤ max:

19751011 ≤ 19975927 ≤ 20201003

IQR (CV) : 224910.2 (0)

46 distinct values

1 (2.1%)

last_epid [numeric]

Mean (sd) : 19985509 (134223.9)

min ≤ med ≤ max:

19760731 ≤ 19985512 ≤ 20210410

IQR (CV) : 225066.2 (0)

46 distinct values

1 (2.1%)

n_episodes [numeric]

Mean (sd) : 19.7 (2.3)

min ≤ med ≤ max:

12 ≤ 20 ≤ 24

IQR (CV) : 1 (0.1)

12	:	1	(	2.2%	)
13	:	2	(	4.3%	)
17	:	2	(	4.3%	)
18	:	2	(	4.3%	)
19	:	3	(	6.5%	)
20	:	23	(	50.0%	)
21	:	7	(	15.2%	)
22	:	5	(	10.9%	)
24	:	1	(	2.2%	)

1 (2.1%)

actors_per_season [integer]

Mean (sd) : 58.8 (311.7)

min ≤ med ≤ max:

8 ≤ 14 ≤ 2150

IQR (CV) : 5 (5.3)

13 distinct values

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-23

Ok - from the univariate graph produced by summarytools, it looks like the seasons in the middle had the most actors. We do need to qualify this statement, because we filtered out NA values when calculating actors per season.