Challenge 8

challenge_8

snl

Joining Data

Author

Mani Shanker Kamarapu

Published

August 25, 2022

Code

library(tidyverse)
library(ggplot2)
library(scales)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

I have chosen SNL data sets. There are a total of 3 data sets in SNL, but they are similar as they are taken from same resource. First I would read the data sets and check them and determine the rows we can use to combine the data sets.

Code

actors <- read_csv("_data/snl_actors.csv")
casts <- read_csv("_data/snl_casts.csv")
seasons <- read_csv("_data/snl_seasons.csv")
print(summarytools::dfSummary(actors, 
                              varnumbers = FALSE, 
                              plain.ascii = FALSE, 
                              style = "grid", 
                              graph.magnif = 0.70, 
                              valid.col = FALSE), 
      method = 'render', 
      table.classes = 'table-condensed')

Data Frame Summary

actors

Dimensions: 2306 x 4
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

aid [character]

1. 'N Sync

2. 070 Shake

3. 10,000 Maniacs

4. 14 Karat Soul

5. 2 Chainz

6. 3-D

7. 3RDEYEGIRL

8. 50 Cent

9. A Tribe Called Quest

10. A$AP Rocky

[ 2296 others ]

1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
2296	(	99.6%	)

0 (0.0%)

url [character]

1. /Cast/?AbEl

2. /Cast/?AdSa

3. /Cast/?AiBr

4. /Cast/?AlFr

5. /Cast/?AlMo

6. /Cast/?AlZw

7. /Cast/?AmPo

8. /Cast/?AnDi

9. /Cast/?AnGa

10. /Cast/?AnMH

[ 2239 others ]

1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
2239	(	99.6%	)

57 (2.5%)

type [character]

1. cast

2. crew

3. guest

4. unknown

154	(	6.7%	)
170	(	7.4%	)
1926	(	83.5%	)
56	(	2.4%	)

0 (0.0%)

gender [character]

1. andy

2. female

3. male

4. unknown

21	(	0.9%	)
671	(	29.1%	)
1226	(	53.2%	)
388	(	16.8%	)

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.1.3)
2022-09-04

Code

print(summarytools::dfSummary(casts, 
                              varnumbers = FALSE, 
                              plain.ascii = FALSE, 
                              style = "grid", 
                              graph.magnif = 0.70, 
                              valid.col = FALSE), 
      method = 'render', 
      table.classes = 'table-condensed')

Data Frame Summary

casts

Dimensions: 614 x 8
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

aid [character]

1. Kenan Thompson

2. Darrell Hammond

3. Seth Meyers

4. Al Franken

5. Fred Armisen

6. Kate McKinnon

7. Tim Meadows

8. Aidy Bryant

9. Bobby Moynihan

10. Cecily Strong

[ 146 others ]

18	(	2.9%	)
14	(	2.3%	)
13	(	2.1%	)
11	(	1.8%	)
11	(	1.8%	)
10	(	1.6%	)
10	(	1.6%	)
9	(	1.5%	)
9	(	1.5%	)
9	(	1.5%	)
500	(	81.4%	)

0 (0.0%)

sid [numeric]

Mean (sd) : 25.5 (13.1)

min ≤ med ≤ max:

1 ≤ 26 ≤ 46

IQR (CV) : 22 (0.5)

46 distinct values

0 (0.0%)

featured [logical]

1. FALSE

2. TRUE

451	(	73.5%	)
163	(	26.5%	)

0 (0.0%)

first_epid [numeric]

Mean (sd) : 19909635 (111264.8)

min ≤ med ≤ max:

19770115 ≤ 19901110 ≤ 20141025

IQR (CV) : 156624.2 (0)

35 distinct values

564 (91.9%)

last_epid [numeric]

Mean (sd) : 19944038 (126122.2)

min ≤ med ≤ max:

19751011 ≤ 19950225 ≤ 20140201

IQR (CV) : 190005 (0)

17 distinct values

597 (97.2%)

update_anchor [logical]

1. FALSE

2. TRUE

541	(	88.1%	)
73	(	11.9%	)

0 (0.0%)

n_episodes [numeric]

Mean (sd) : 18.7 (4)

min ≤ med ≤ max:

1 ≤ 20 ≤ 24

IQR (CV) : 2 (0.2)

22 distinct values

0 (0.0%)

season_fraction [numeric]

Mean (sd) : 0.9 (0.2)

min ≤ med ≤ max:

0 ≤ 1 ≤ 1

IQR (CV) : 0 (0.2)

36 distinct values

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.1.3)
2022-09-04

Code

print(summarytools::dfSummary(seasons, 
                              varnumbers = FALSE, 
                              plain.ascii = FALSE, 
                              style = "grid", 
                              graph.magnif = 0.70, 
                              valid.col = FALSE), 
      method = 'render', 
      table.classes = 'table-condensed')

Data Frame Summary

seasons

Dimensions: 46 x 5
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

sid [numeric]

Mean (sd) : 23.5 (13.4)

min ≤ med ≤ max:

1 ≤ 23.5 ≤ 46

IQR (CV) : 22.5 (0.6)

46 distinct values

0 (0.0%)

year [numeric]

Mean (sd) : 1997.5 (13.4)

min ≤ med ≤ max:

1975 ≤ 1997.5 ≤ 2020

IQR (CV) : 22.5 (0)

46 distinct values

0 (0.0%)

first_epid [numeric]

Mean (sd) : 19975965 (134209.3)

min ≤ med ≤ max:

19751011 ≤ 19975927 ≤ 20201003

IQR (CV) : 224910.2 (0)

46 distinct values

0 (0.0%)

last_epid [numeric]

Mean (sd) : 19985509 (134223.9)

min ≤ med ≤ max:

19760731 ≤ 19985512 ≤ 20210410

IQR (CV) : 225066.2 (0)

46 distinct values

0 (0.0%)

n_episodes [numeric]

Mean (sd) : 19.7 (2.3)

min ≤ med ≤ max:

12 ≤ 20 ≤ 24

IQR (CV) : 1 (0.1)

12	:	1	(	2.2%	)
13	:	2	(	4.3%	)
17	:	2	(	4.3%	)
18	:	2	(	4.3%	)
19	:	3	(	6.5%	)
20	:	23	(	50.0%	)
21	:	7	(	15.2%	)
22	:	5	(	10.9%	)
24	:	1	(	2.2%	)

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.1.3)
2022-09-04

Briefly describe the data

The data sets are read and before combining the data sets we need to tidy the data as needed so it makes it simple to join data and remove the unwanted and empty observations. So we tidy them separately and then join them.

Tidy Data (as needed)

Code

actors <- actors %>%
  select(!url)

Code

casts <- casts %>%
  select(aid, sid, season_fraction)

The data sets are now tidy and I have observed that we have common variables between the data sets, First between actors and casts we can join using aid(actor id) variable and then join with seasons using Sid(season id) variable.

Join Data

inner join function

The inner join function is used to join data sets, where the resulting data set contains the intersection f both data sets.

Code

joined_actors_casts <- actors %>%
  inner_join(casts, by = "aid")

all_joined <- joined_actors_casts %>%
  inner_join(seasons, by = "sid")

all_joined

Code

print(summarytools::dfSummary(all_joined, 
                              varnumbers = FALSE, 
                              plain.ascii = FALSE, 
                              style = "grid", 
                              graph.magnif = 0.70, 
                              valid.col = FALSE), 
      method = 'render', 
      table.classes = 'table-condensed')

Data Frame Summary

all_joined

Dimensions: 614 x 9
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

aid [character]

1. Kenan Thompson

2. Darrell Hammond

3. Seth Meyers

4. Al Franken

5. Fred Armisen

6. Kate McKinnon

7. Tim Meadows

8. Aidy Bryant

9. Bobby Moynihan

10. Cecily Strong

[ 146 others ]

18	(	2.9%	)
14	(	2.3%	)
13	(	2.1%	)
11	(	1.8%	)
11	(	1.8%	)
10	(	1.6%	)
10	(	1.6%	)
9	(	1.5%	)
9	(	1.5%	)
9	(	1.5%	)
500	(	81.4%	)

0 (0.0%)

type [character]

1. cast

2. unknown

607	(	98.9%	)
7	(	1.1%	)

0 (0.0%)

gender [character]

1. female

2. male

3. unknown

194	(	31.6%	)
416	(	67.8%	)
4	(	0.7%	)

0 (0.0%)

sid [numeric]

Mean (sd) : 25.5 (13.1)

min ≤ med ≤ max:

1 ≤ 26 ≤ 46

IQR (CV) : 22 (0.5)

46 distinct values

0 (0.0%)

season_fraction [numeric]

Mean (sd) : 0.9 (0.2)

min ≤ med ≤ max:

0 ≤ 1 ≤ 1

IQR (CV) : 0 (0.2)

36 distinct values

0 (0.0%)

year [numeric]

Mean (sd) : 1999.5 (13.1)

min ≤ med ≤ max:

1975 ≤ 2000 ≤ 2020

IQR (CV) : 22 (0)

46 distinct values

0 (0.0%)

first_epid [numeric]

Mean (sd) : 19995688 (130607)

min ≤ med ≤ max:

19751011 ≤ 20001007 ≤ 20201003

IQR (CV) : 219994 (0)

46 distinct values

0 (0.0%)

last_epid [numeric]

Mean (sd) : 20005232 (130621.3)

min ≤ med ≤ max:

19760731 ≤ 20010519 ≤ 20210410

IQR (CV) : 220000 (0)

46 distinct values

0 (0.0%)

n_episodes [numeric]

Mean (sd) : 19.7 (2.1)

min ≤ med ≤ max:

12 ≤ 20 ≤ 24

IQR (CV) : 1 (0.1)

12	:	12	(	2.0%	)
13	:	24	(	3.9%	)
17	:	30	(	4.9%	)
18	:	31	(	5.0%	)
19	:	37	(	6.0%	)
20	:	297	(	48.4%	)
21	:	111	(	18.1%	)
22	:	63	(	10.3%	)
24	:	9	(	1.5%	)

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.1.3)
2022-09-04

Now the data sets are joined into a complete data set, but we have to remove some values to make it tidy and it is the best we can do to describe and plot the graphs.

Code

yg <- all_joined %>%
  filter(!gender == "unknown") %>%
  group_by(year) %>%
  count(gender)
sg <- all_joined %>%
  filter(!gender == "unknown") %>%
  group_by(year) %>%
  count(gender)

yg %>%
  ggplot(aes(year, n, color = gender)) +
  geom_point() +
  geom_line() +
  theme_minimal() +
  labs(title = "The gender analysis based on year")

As per the graph, the male are more than female as casts in different seasons as the year passes on.