challenge_8
snl
Joining Data
Author

Mani Shanker Kamarapu

Published

August 25, 2022

Code
library(tidyverse)
library(ggplot2)
library(scales)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

I have chosen SNL data sets. There are a total of 3 data sets in SNL, but they are similar as they are taken from same resource. First I would read the data sets and check them and determine the rows we can use to combine the data sets.

Code
actors <- read_csv("_data/snl_actors.csv")
casts <- read_csv("_data/snl_casts.csv")
seasons <- read_csv("_data/snl_seasons.csv")
print(summarytools::dfSummary(actors, 
                              varnumbers = FALSE, 
                              plain.ascii = FALSE, 
                              style = "grid", 
                              graph.magnif = 0.70, 
                              valid.col = FALSE), 
      method = 'render', 
      table.classes = 'table-condensed')

Data Frame Summary

actors

Dimensions: 2306 x 4
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
aid [character]
1. 'N Sync
2. 070 Shake
3. 10,000 Maniacs
4. 14 Karat Soul
5. 2 Chainz
6. 3-D
7. 3RDEYEGIRL
8. 50 Cent
9. A Tribe Called Quest
10. A$AP Rocky
[ 2296 others ]
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
2296(99.6%)
0 (0.0%)
url [character]
1. /Cast/?AbEl
2. /Cast/?AdSa
3. /Cast/?AiBr
4. /Cast/?AlFr
5. /Cast/?AlMo
6. /Cast/?AlZw
7. /Cast/?AmPo
8. /Cast/?AnDi
9. /Cast/?AnGa
10. /Cast/?AnMH
[ 2239 others ]
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
2239(99.6%)
57 (2.5%)
type [character]
1. cast
2. crew
3. guest
4. unknown
154(6.7%)
170(7.4%)
1926(83.5%)
56(2.4%)
0 (0.0%)
gender [character]
1. andy
2. female
3. male
4. unknown
21(0.9%)
671(29.1%)
1226(53.2%)
388(16.8%)
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.1.3)
2022-09-04

Code
print(summarytools::dfSummary(casts, 
                              varnumbers = FALSE, 
                              plain.ascii = FALSE, 
                              style = "grid", 
                              graph.magnif = 0.70, 
                              valid.col = FALSE), 
      method = 'render', 
      table.classes = 'table-condensed')

Data Frame Summary

casts

Dimensions: 614 x 8
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
aid [character]
1. Kenan Thompson
2. Darrell Hammond
3. Seth Meyers
4. Al Franken
5. Fred Armisen
6. Kate McKinnon
7. Tim Meadows
8. Aidy Bryant
9. Bobby Moynihan
10. Cecily Strong
[ 146 others ]
18(2.9%)
14(2.3%)
13(2.1%)
11(1.8%)
11(1.8%)
10(1.6%)
10(1.6%)
9(1.5%)
9(1.5%)
9(1.5%)
500(81.4%)
0 (0.0%)
sid [numeric]
Mean (sd) : 25.5 (13.1)
min ≤ med ≤ max:
1 ≤ 26 ≤ 46
IQR (CV) : 22 (0.5)
46 distinct values 0 (0.0%)
featured [logical]
1. FALSE
2. TRUE
451(73.5%)
163(26.5%)
0 (0.0%)
first_epid [numeric]
Mean (sd) : 19909635 (111264.8)
min ≤ med ≤ max:
19770115 ≤ 19901110 ≤ 20141025
IQR (CV) : 156624.2 (0)
35 distinct values 564 (91.9%)
last_epid [numeric]
Mean (sd) : 19944038 (126122.2)
min ≤ med ≤ max:
19751011 ≤ 19950225 ≤ 20140201
IQR (CV) : 190005 (0)
17 distinct values 597 (97.2%)
update_anchor [logical]
1. FALSE
2. TRUE
541(88.1%)
73(11.9%)
0 (0.0%)
n_episodes [numeric]
Mean (sd) : 18.7 (4)
min ≤ med ≤ max:
1 ≤ 20 ≤ 24
IQR (CV) : 2 (0.2)
22 distinct values 0 (0.0%)
season_fraction [numeric]
Mean (sd) : 0.9 (0.2)
min ≤ med ≤ max:
0 ≤ 1 ≤ 1
IQR (CV) : 0 (0.2)
36 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.1.3)
2022-09-04

Code
print(summarytools::dfSummary(seasons, 
                              varnumbers = FALSE, 
                              plain.ascii = FALSE, 
                              style = "grid", 
                              graph.magnif = 0.70, 
                              valid.col = FALSE), 
      method = 'render', 
      table.classes = 'table-condensed')

Data Frame Summary

seasons

Dimensions: 46 x 5
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
sid [numeric]
Mean (sd) : 23.5 (13.4)
min ≤ med ≤ max:
1 ≤ 23.5 ≤ 46
IQR (CV) : 22.5 (0.6)
46 distinct values 0 (0.0%)
year [numeric]
Mean (sd) : 1997.5 (13.4)
min ≤ med ≤ max:
1975 ≤ 1997.5 ≤ 2020
IQR (CV) : 22.5 (0)
46 distinct values 0 (0.0%)
first_epid [numeric]
Mean (sd) : 19975965 (134209.3)
min ≤ med ≤ max:
19751011 ≤ 19975927 ≤ 20201003
IQR (CV) : 224910.2 (0)
46 distinct values 0 (0.0%)
last_epid [numeric]
Mean (sd) : 19985509 (134223.9)
min ≤ med ≤ max:
19760731 ≤ 19985512 ≤ 20210410
IQR (CV) : 225066.2 (0)
46 distinct values 0 (0.0%)
n_episodes [numeric]
Mean (sd) : 19.7 (2.3)
min ≤ med ≤ max:
12 ≤ 20 ≤ 24
IQR (CV) : 1 (0.1)
12:1(2.2%)
13:2(4.3%)
17:2(4.3%)
18:2(4.3%)
19:3(6.5%)
20:23(50.0%)
21:7(15.2%)
22:5(10.9%)
24:1(2.2%)
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.1.3)
2022-09-04

Briefly describe the data

The data sets are read and before combining the data sets we need to tidy the data as needed so it makes it simple to join data and remove the unwanted and empty observations. So we tidy them separately and then join them.

Tidy Data (as needed)

Code
actors <- actors %>%
  select(!url)
Code
casts <- casts %>%
  select(aid, sid, season_fraction)

The data sets are now tidy and I have observed that we have common variables between the data sets, First between actors and casts we can join using aid(actor id) variable and then join with seasons using Sid(season id) variable.

Join Data

inner join function

The inner join function is used to join data sets, where the resulting data set contains the intersection f both data sets.

Code
joined_actors_casts <- actors %>%
  inner_join(casts, by = "aid")

all_joined <- joined_actors_casts %>%
  inner_join(seasons, by = "sid")

all_joined
Code
print(summarytools::dfSummary(all_joined, 
                              varnumbers = FALSE, 
                              plain.ascii = FALSE, 
                              style = "grid", 
                              graph.magnif = 0.70, 
                              valid.col = FALSE), 
      method = 'render', 
      table.classes = 'table-condensed')

Data Frame Summary

all_joined

Dimensions: 614 x 9
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
aid [character]
1. Kenan Thompson
2. Darrell Hammond
3. Seth Meyers
4. Al Franken
5. Fred Armisen
6. Kate McKinnon
7. Tim Meadows
8. Aidy Bryant
9. Bobby Moynihan
10. Cecily Strong
[ 146 others ]
18(2.9%)
14(2.3%)
13(2.1%)
11(1.8%)
11(1.8%)
10(1.6%)
10(1.6%)
9(1.5%)
9(1.5%)
9(1.5%)
500(81.4%)
0 (0.0%)
type [character]
1. cast
2. unknown
607(98.9%)
7(1.1%)
0 (0.0%)
gender [character]
1. female
2. male
3. unknown
194(31.6%)
416(67.8%)
4(0.7%)
0 (0.0%)
sid [numeric]
Mean (sd) : 25.5 (13.1)
min ≤ med ≤ max:
1 ≤ 26 ≤ 46
IQR (CV) : 22 (0.5)
46 distinct values 0 (0.0%)
season_fraction [numeric]
Mean (sd) : 0.9 (0.2)
min ≤ med ≤ max:
0 ≤ 1 ≤ 1
IQR (CV) : 0 (0.2)
36 distinct values 0 (0.0%)
year [numeric]
Mean (sd) : 1999.5 (13.1)
min ≤ med ≤ max:
1975 ≤ 2000 ≤ 2020
IQR (CV) : 22 (0)
46 distinct values 0 (0.0%)
first_epid [numeric]
Mean (sd) : 19995688 (130607)
min ≤ med ≤ max:
19751011 ≤ 20001007 ≤ 20201003
IQR (CV) : 219994 (0)
46 distinct values 0 (0.0%)
last_epid [numeric]
Mean (sd) : 20005232 (130621.3)
min ≤ med ≤ max:
19760731 ≤ 20010519 ≤ 20210410
IQR (CV) : 220000 (0)
46 distinct values 0 (0.0%)
n_episodes [numeric]
Mean (sd) : 19.7 (2.1)
min ≤ med ≤ max:
12 ≤ 20 ≤ 24
IQR (CV) : 1 (0.1)
12:12(2.0%)
13:24(3.9%)
17:30(4.9%)
18:31(5.0%)
19:37(6.0%)
20:297(48.4%)
21:111(18.1%)
22:63(10.3%)
24:9(1.5%)
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.1.3)
2022-09-04

Now the data sets are joined into a complete data set, but we have to remove some values to make it tidy and it is the best we can do to describe and plot the graphs.

Code
yg <- all_joined %>%
  filter(!gender == "unknown") %>%
  group_by(year) %>%
  count(gender)
sg <- all_joined %>%
  filter(!gender == "unknown") %>%
  group_by(year) %>%
  count(gender)

yg %>%
  ggplot(aes(year, n, color = gender)) +
  geom_point() +
  geom_line() +
  theme_minimal() +
  labs(title = "The gender analysis based on year")

As per the graph, the male are more than female as casts in different seasons as the year passes on.