Challenge_8_Jyoti

challenge_8
summarytools
Joining Data
Author

Jyoti Rani

Published

August 22, 2022

Including libraries

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
Warning: package 'ggplot2' was built under R version 4.2.2
Warning: package 'tibble' was built under R version 4.2.2
Warning: package 'stringr' was built under R version 4.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(summarytools)
Warning: package 'summarytools' was built under R version 4.2.2

Attaching package: 'summarytools'

The following object is masked from 'package:tibble':

    view
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

I read 3 SNL datasets, together

snl_actors <- (read_csv("_data/snl_actors.csv", show_col_types = FALSE) [,-2])
snl_actors <- complete(snl_actors)

snl_casts <- read_csv("_data/snl_casts.csv", show_col_types = FALSE)
snl_casts <- complete(snl_casts)

snl_seasons <- read_csv("_data/snl_seasons.csv", show_col_types = FALSE)
snl_seasons <- complete(snl_seasons)

Briefly describe and tidy data

print(dfSummary(snl_actors, varnumbers = FALSE, plain.ascii = FALSE, graph.magnif = 0.30, style = "grid", valid.col = FALSE), 
      method = 'render', table.classes = 'table-condensed')

Data Frame Summary

snl_actors

Dimensions: 2306 x 3
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
aid [character]
1. 'N Sync
2. 070 Shake
3. 10,000 Maniacs
4. 14 Karat Soul
5. 2 Chainz
6. 3-D
7. 3RDEYEGIRL
8. 50 Cent
9. A Tribe Called Quest
10. A$AP Rocky
[ 2296 others ]
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
2296(99.6%)
0 (0.0%)
type [character]
1. cast
2. crew
3. guest
4. unknown
154(6.7%)
170(7.4%)
1926(83.5%)
56(2.4%)
0 (0.0%)
gender [character]
1. andy
2. female
3. male
4. unknown
21(0.9%)
671(29.1%)
1226(53.2%)
388(16.8%)
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-23

print(dfSummary(snl_casts, varnumbers = FALSE, plain.ascii = FALSE, graph.magnif = 0.30, style = "grid", valid.col = FALSE), 
      method = 'render', table.classes = 'table-condensed')

Data Frame Summary

snl_casts

Dimensions: 614 x 8
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
aid [character]
1. Kenan Thompson
2. Darrell Hammond
3. Seth Meyers
4. Al Franken
5. Fred Armisen
6. Kate McKinnon
7. Tim Meadows
8. Aidy Bryant
9. Bobby Moynihan
10. Cecily Strong
[ 146 others ]
18(2.9%)
14(2.3%)
13(2.1%)
11(1.8%)
11(1.8%)
10(1.6%)
10(1.6%)
9(1.5%)
9(1.5%)
9(1.5%)
500(81.4%)
0 (0.0%)
sid [numeric]
Mean (sd) : 25.5 (13.1)
min ≤ med ≤ max:
1 ≤ 26 ≤ 46
IQR (CV) : 22 (0.5)
46 distinct values 0 (0.0%)
featured [logical]
1. FALSE
2. TRUE
451(73.5%)
163(26.5%)
0 (0.0%)
first_epid [numeric]
Mean (sd) : 19909635 (111264.8)
min ≤ med ≤ max:
19770115 ≤ 19901110 ≤ 20141025
IQR (CV) : 156624.2 (0)
35 distinct values 564 (91.9%)
last_epid [numeric]
Mean (sd) : 19944038 (126122.2)
min ≤ med ≤ max:
19751011 ≤ 19950225 ≤ 20140201
IQR (CV) : 190005 (0)
17 distinct values 597 (97.2%)
update_anchor [logical]
1. FALSE
2. TRUE
541(88.1%)
73(11.9%)
0 (0.0%)
n_episodes [numeric]
Mean (sd) : 18.7 (4)
min ≤ med ≤ max:
1 ≤ 20 ≤ 24
IQR (CV) : 2 (0.2)
22 distinct values 0 (0.0%)
season_fraction [numeric]
Mean (sd) : 0.9 (0.2)
min ≤ med ≤ max:
0 ≤ 1 ≤ 1
IQR (CV) : 0 (0.2)
36 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-23

  • Tidying snl_actors (2306 rows, 3 columns):
    • No missing data.
    • We can leave “aid” as a character column and convert type and gender to factors.
  • Tidying snl_casts (614 rows, 8 columns):
    • Missing data in first_epid and last_epid - this is fine because we used complete() to make the NAs explicit.
    • “aid” is also present in this dataset, which hopefully makes it intuitive to put the 2 datasets together (no renaming needed). There are much fewer values for Actor ID, so this might just be a subset of the other dataset.
  • Tidying snl_seasons (46 rows, 5 columns):
# mutation.
snl_actors <- snl_actors %>% mutate(type = as.factor(type), gender = as.factor(gender))
                                    
# sanity check.
head(snl_actors)

Join Data 1

snl_mega <- snl_actors %>%
  left_join(snl_casts, by = "aid")
print(dfSummary(snl_mega, varnumbers = FALSE, plain.ascii = FALSE, graph.magnif = 0.30, style = "grid", valid.col = FALSE), 
      method = 'render', table.classes = 'table-condensed')

Data Frame Summary

snl_mega

Dimensions: 2764 x 10
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
aid [character]
1. Kenan Thompson
2. Darrell Hammond
3. Seth Meyers
4. Al Franken
5. Fred Armisen
6. Kate McKinnon
7. Tim Meadows
8. Aidy Bryant
9. Bobby Moynihan
10. Cecily Strong
[ 2296 others ]
18(0.7%)
14(0.5%)
13(0.5%)
11(0.4%)
11(0.4%)
10(0.4%)
10(0.4%)
9(0.3%)
9(0.3%)
9(0.3%)
2650(95.9%)
0 (0.0%)
type [factor]
1. cast
2. crew
3. guest
4. unknown
607(22.0%)
170(6.2%)
1926(69.7%)
61(2.2%)
0 (0.0%)
gender [factor]
1. andy
2. female
3. male
4. unknown
21(0.8%)
813(29.4%)
1540(55.7%)
390(14.1%)
0 (0.0%)
sid [numeric]
Mean (sd) : 25.5 (13.1)
min ≤ med ≤ max:
1 ≤ 26 ≤ 46
IQR (CV) : 22 (0.5)
46 distinct values 2150 (77.8%)
featured [logical]
1. FALSE
2. TRUE
451(73.5%)
163(26.5%)
2150 (77.8%)
first_epid [numeric]
Mean (sd) : 19909635 (111264.8)
min ≤ med ≤ max:
19770115 ≤ 19901110 ≤ 20141025
IQR (CV) : 156624.2 (0)
35 distinct values 2714 (98.2%)
last_epid [numeric]
Mean (sd) : 19944038 (126122.2)
min ≤ med ≤ max:
19751011 ≤ 19950225 ≤ 20140201
IQR (CV) : 190005 (0)
17 distinct values 2747 (99.4%)
update_anchor [logical]
1. FALSE
2. TRUE
541(88.1%)
73(11.9%)
2150 (77.8%)
n_episodes [numeric]
Mean (sd) : 18.7 (4)
min ≤ med ≤ max:
1 ≤ 20 ≤ 24
IQR (CV) : 2 (0.2)
22 distinct values 2150 (77.8%)
season_fraction [numeric]
Mean (sd) : 0.9 (0.2)
min ≤ med ≤ max:
0 ≤ 1 ≤ 1
IQR (CV) : 0 (0.2)
36 distinct values 2150 (77.8%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-23

After joining, we have 10 columns. We have a lot more missing values, which was also expected - the snl_casts dataset, which has more columns, has much fewer rows. There are 156 overlaps in both datasets - let’s try to verify this.

snl_casts %>% select(aid) %>%
  filter(snl_casts$aid %in% snl_actors$aid) %>% unique()

Now I map over the values from snl_mega just for these 156 actors, so as to minimize the number of NA values. I would usually do an index/match on Excel for this - how would I do this in R?

Join Data2

add 1 column from snl_mega to snl_seasons (another dataset) –> number of actors per season.

# create subset.
sub <- snl_mega %>% select(aid, sid) %>%  
  filter(! is.na(aid)) %>% 
  group_by(sid) %>% unique()

# get number of actors per season.
sub_final <- sub %>% group_by(sid) %>% summarise("actors_per_season"=n())

# sanity check.
sum(sub_final$actors_per_season)
[1] 2764

Now I join them.

snl_szns <- snl_seasons %>%
  right_join(sub_final, by = "sid")
print(dfSummary(snl_szns, varnumbers = FALSE, plain.ascii = FALSE, graph.magnif = 0.30, style = "grid", valid.col = FALSE), 
      method = 'render', table.classes = 'table-condensed')

Data Frame Summary

snl_szns

Dimensions: 47 x 6
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
sid [numeric]
Mean (sd) : 23.5 (13.4)
min ≤ med ≤ max:
1 ≤ 23.5 ≤ 46
IQR (CV) : 22.5 (0.6)
46 distinct values 1 (2.1%)
year [numeric]
Mean (sd) : 1997.5 (13.4)
min ≤ med ≤ max:
1975 ≤ 1997.5 ≤ 2020
IQR (CV) : 22.5 (0)
46 distinct values 1 (2.1%)
first_epid [numeric]
Mean (sd) : 19975965 (134209.3)
min ≤ med ≤ max:
19751011 ≤ 19975927 ≤ 20201003
IQR (CV) : 224910.2 (0)
46 distinct values 1 (2.1%)
last_epid [numeric]
Mean (sd) : 19985509 (134223.9)
min ≤ med ≤ max:
19760731 ≤ 19985512 ≤ 20210410
IQR (CV) : 225066.2 (0)
46 distinct values 1 (2.1%)
n_episodes [numeric]
Mean (sd) : 19.7 (2.3)
min ≤ med ≤ max:
12 ≤ 20 ≤ 24
IQR (CV) : 1 (0.1)
12:1(2.2%)
13:2(4.3%)
17:2(4.3%)
18:2(4.3%)
19:3(6.5%)
20:23(50.0%)
21:7(15.2%)
22:5(10.9%)
24:1(2.2%)
1 (2.1%)
actors_per_season [integer]
Mean (sd) : 58.8 (311.7)
min ≤ med ≤ max:
8 ≤ 14 ≤ 2150
IQR (CV) : 5 (5.3)
13 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-23

Ok - from the univariate graph produced by summarytools, it looks like the seasons in the middle had the most actors. We do need to qualify this statement, because we filtered out NA values when calculating actors per season.