Challenge 8

challenge_8

snl

Joining Data

Author

Miranda Manka

Published

August 25, 2022

library(tidyverse)
library(ggplot2)
library(lubridate)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
tidy data (as needed, including sanity checks)
mutate variables as needed (including sanity checks)
join two or more data sets and analyze some aspect of the joined data

Read in data

snl_seasons = read_csv("_data/snl_seasons.csv", show_col_types = FALSE)

snl_casts = read_csv("_data/snl_casts.csv", show_col_types = FALSE)

Briefly describe the data

The first dataset, snl_seasons is about the show “Saturday Night Live” which I can see started in 1975 and is still going. This dataset contains 46 observations with information about the different seasons. The variables are “sid” (season), “year” (year the season took place), “first_epid” (first episode of the season in the format YYYYMMDD), “last_epid” (last episode of the season in the format YYYYMMDD), and “n_episodes” (number of episodes in the season). The mean number of episodes per season is 19.7, the minimum is 12 episodes in 2007, and the maximum is 24 episodes in the first season.

print(summarytools::dfSummary(snl_seasons,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

snl_seasons

Dimensions: 46 x 5
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

sid [numeric]

Mean (sd) : 23.5 (13.4)

min ≤ med ≤ max:

1 ≤ 23.5 ≤ 46

IQR (CV) : 22.5 (0.6)

46 distinct values

0 (0.0%)

year [numeric]

Mean (sd) : 1997.5 (13.4)

min ≤ med ≤ max:

1975 ≤ 1997.5 ≤ 2020

IQR (CV) : 22.5 (0)

46 distinct values

0 (0.0%)

first_epid [numeric]

Mean (sd) : 19975965 (134209.3)

min ≤ med ≤ max:

19751011 ≤ 19975927 ≤ 20201003

IQR (CV) : 224910.2 (0)

46 distinct values

0 (0.0%)

last_epid [numeric]

Mean (sd) : 19985509 (134223.9)

min ≤ med ≤ max:

19760731 ≤ 19985512 ≤ 20210410

IQR (CV) : 225066.2 (0)

46 distinct values

0 (0.0%)

n_episodes [numeric]

Mean (sd) : 19.7 (2.3)

min ≤ med ≤ max:

12 ≤ 20 ≤ 24

IQR (CV) : 1 (0.1)

12	:	1	(	2.2%	)
13	:	2	(	4.3%	)
17	:	2	(	4.3%	)
18	:	2	(	4.3%	)
19	:	3	(	6.5%	)
20	:	23	(	50.0%	)
21	:	7	(	15.2%	)
22	:	5	(	10.9%	)
24	:	1	(	2.2%	)

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-28

The second dataset, snl_casts has 614 observations, where each row contains a cast member for each different season they are in. The variables are “aid” (the person), “sid” (season), “featured” (true or false, not sure what this specifically represents - maybe if they were featured or a side part), “first_epid” (first episode they appear in for that season in the format YYYYMMDD), “last_epid” (last episode they appear in for that season in the format YYYYMMDD), “update_anchor” (true or false for weekend update anchor), “n_episodes” (number of episodes they appear in for that season), and “season_fraction” (the fraction of the number of episodes they appear in out of the total number of episodes in the season).

print(summarytools::dfSummary(snl_casts,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

snl_casts

Dimensions: 614 x 8
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

aid [character]

1. Kenan Thompson

2. Darrell Hammond

3. Seth Meyers

4. Al Franken

5. Fred Armisen

6. Kate McKinnon

7. Tim Meadows

8. Aidy Bryant

9. Bobby Moynihan

10. Cecily Strong

[ 146 others ]

18	(	2.9%	)
14	(	2.3%	)
13	(	2.1%	)
11	(	1.8%	)
11	(	1.8%	)
10	(	1.6%	)
10	(	1.6%	)
9	(	1.5%	)
9	(	1.5%	)
9	(	1.5%	)
500	(	81.4%	)

0 (0.0%)

sid [numeric]

Mean (sd) : 25.5 (13.1)

min ≤ med ≤ max:

1 ≤ 26 ≤ 46

IQR (CV) : 22 (0.5)

46 distinct values

0 (0.0%)

featured [logical]

1. FALSE

2. TRUE

451	(	73.5%	)
163	(	26.5%	)

0 (0.0%)

first_epid [numeric]

Mean (sd) : 19909635 (111264.8)

min ≤ med ≤ max:

19770115 ≤ 19901110 ≤ 20141025

IQR (CV) : 156624.2 (0)

35 distinct values

564 (91.9%)

last_epid [numeric]

Mean (sd) : 19944038 (126122.2)

min ≤ med ≤ max:

19751011 ≤ 19950225 ≤ 20140201

IQR (CV) : 190005 (0)

17 distinct values

597 (97.2%)

update_anchor [logical]

1. FALSE

2. TRUE

541	(	88.1%	)
73	(	11.9%	)

0 (0.0%)

n_episodes [numeric]

Mean (sd) : 18.7 (4)

min ≤ med ≤ max:

1 ≤ 20 ≤ 24

IQR (CV) : 2 (0.2)

22 distinct values

0 (0.0%)

season_fraction [numeric]

Mean (sd) : 0.9 (0.2)

min ≤ med ≤ max:

0 ≤ 1 ≤ 1

IQR (CV) : 0 (0.2)

36 distinct values

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-28

Tidy Data

To tidy the data, first I renamed the first_epid, last-epid, and n_episodes variables found in both datasets to differentiate them so that when the join is complete, there is a name difference between the variables.

snl_seasons = snl_seasons %>% 
  rename(first_epid_s = first_epid, last_epid_s = last_epid, 
     n_episodes_s = n_episodes)

snl_casts = snl_casts %>% 
  rename(name = aid, first_epid_c = first_epid, last_epid_c = last_epid,
     n_episodes_c = n_episodes)

Next, I converted the first_epid and last_epid variables in each dataset (now already renamed) to dates because they are currently in YYYYMMDD format but not actually dates.

snl_seasons = snl_seasons %>%
  mutate(first_epid_s = ymd(first_epid_s), last_epid_s = ymd(last_epid_s))

snl_casts = snl_casts %>%
  mutate(first_epid_c = ymd(first_epid_c), last_epid_c = ymd(last_epid_c))

Join Data

Both datasets seem ready to join now. To do this, I will use a left join. I will be joining snl_casts (614 observations) into snl_seasons (46 observations). I will use every variable for both datasets, and I will be joining on the variable “sid” (the key variable) because in snl_seasons it uniquely describes the data and is available in snl_casts. After the join there should be 614 rows and 12 columns (the sum of both datasets columns minus one because the key variable is not counted twice).

snl_joined = left_join(snl_seasons, snl_casts, by = "sid")

After joining, I can see that the new dataset is as expected, 614 rows and 12 columns.

print(summarytools::dfSummary(snl_joined,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

snl_joined

Dimensions: 614 x 12
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

sid [numeric]

Mean (sd) : 25.5 (13.1)

min ≤ med ≤ max:

1 ≤ 26 ≤ 46

IQR (CV) : 22 (0.5)

46 distinct values

0 (0.0%)

year [numeric]

Mean (sd) : 1999.5 (13.1)

min ≤ med ≤ max:

1975 ≤ 2000 ≤ 2020

IQR (CV) : 22 (0)

46 distinct values

0 (0.0%)

first_epid_s [Date]

min : 1975-10-11

med : 2000-10-07

max : 2020-10-03

range : 44y 11m 22d

46 distinct values

0 (0.0%)

last_epid_s [Date]

min : 1976-07-31

med : 2001-05-19

max : 2021-04-10

range : 44y 8m 10d

46 distinct values

0 (0.0%)

n_episodes_s [numeric]

Mean (sd) : 19.7 (2.1)

min ≤ med ≤ max:

12 ≤ 20 ≤ 24

IQR (CV) : 1 (0.1)

12	:	12	(	2.0%	)
13	:	24	(	3.9%	)
17	:	30	(	4.9%	)
18	:	31	(	5.0%	)
19	:	37	(	6.0%	)
20	:	297	(	48.4%	)
21	:	111	(	18.1%	)
22	:	63	(	10.3%	)
24	:	9	(	1.5%	)

0 (0.0%)

name [character]

1. Kenan Thompson

2. Darrell Hammond

3. Seth Meyers

4. Al Franken

5. Fred Armisen

6. Kate McKinnon

7. Tim Meadows

8. Aidy Bryant

9. Bobby Moynihan

10. Cecily Strong

[ 146 others ]

18	(	2.9%	)
14	(	2.3%	)
13	(	2.1%	)
11	(	1.8%	)
11	(	1.8%	)
10	(	1.6%	)
10	(	1.6%	)
9	(	1.5%	)
9	(	1.5%	)
9	(	1.5%	)
500	(	81.4%	)

0 (0.0%)

featured [logical]

1. FALSE

2. TRUE

451	(	73.5%	)
163	(	26.5%	)

0 (0.0%)

first_epid_c [Date]

min : 1977-01-15

med : 1990-11-10

max : 2014-10-25

range : 37y 9m 10d

35 distinct values

564 (91.9%)

last_epid_c [Date]

min : 1975-10-11

med : 1995-02-25

max : 2014-02-01

range : 38y 3m 21d

17 distinct values

597 (97.2%)

update_anchor [logical]

1. FALSE

2. TRUE

541	(	88.1%	)
73	(	11.9%	)

0 (0.0%)

n_episodes_c [numeric]

Mean (sd) : 18.7 (4)

min ≤ med ≤ max:

1 ≤ 20 ≤ 24

IQR (CV) : 2 (0.2)

22 distinct values

0 (0.0%)

season_fraction [numeric]

Mean (sd) : 0.9 (0.2)

min ≤ med ≤ max:

0 ≤ 1 ≤ 1

IQR (CV) : 0 (0.2)

36 distinct values

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-28