Challenge 8

challenge_8

Jaswanth Reddy Kommuru

snl

Joining Data

Author

Jaswanth Reddy Kommuru

Published

May 11, 2023

library(tidyverse)
library(ggplot2)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
tidy data (as needed, including sanity checks)
mutate variables as needed (including sanity checks)
join two or more data sets and analyze some aspect of the joined data

(be sure to only include the category tags for the data you use!)

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

military marriages ⭐⭐
faostat ⭐⭐
railroads ⭐⭐⭐
fed_rate ⭐⭐⭐
debt ⭐⭐⭐
us_hh ⭐⭐⭐⭐
snl ⭐⭐⭐⭐⭐

adata <- read.csv("~/Documents/601/601_Spring_2023/posts/_data/snl_actors.csv")
cdata <- read.csv("~/Documents/601/601_Spring_2023/posts/_data/snl_casts.csv")
sdata <- read.csv("~/Documents/601/601_Spring_2023/posts/_data/snl_seasons.csv")

Briefly describe the data

head(sdata)

  sid year first_epid last_epid n_episodes
1   1 1975   19751011  19760731         24
2   2 1976   19760918  19770521         22
3   3 1977   19770924  19780520         20
4   4 1978   19781007  19790526         20
5   5 1979   19791013  19800524         20
6   6 1980   19801115  19810411         13

head(cdata)

               aid sid featured first_epid last_epid update_anchor n_episodes
1 A. Whitney Brown  11     True   19860222        NA         False          8
2 A. Whitney Brown  12     True         NA        NA         False         20
3 A. Whitney Brown  13     True         NA        NA         False         13
4 A. Whitney Brown  14     True         NA        NA         False         20
5 A. Whitney Brown  15     True         NA        NA         False         20
6 A. Whitney Brown  16     True         NA        NA         False         20
  season_fraction
1       0.4444444
2       1.0000000
3       1.0000000
4       1.0000000
5       1.0000000
6       1.0000000

head(adata)

             aid           url  type  gender
1  Kate McKinnon   /Cast/?KaMc  cast  female
2    Alex Moffat   /Cast/?AlMo  cast    male
3     Ego Nwodim   /Cast/?EgNw  cast unknown
4     Chris Redd   /Cast/?ChRe  cast    male
5 Kenan Thompson   /Cast/?KeTh  cast    male
6 Carey Mulligan /Guests/?3677 guest    andy

str(sdata)

'data.frame':   46 obs. of  5 variables:
 $ sid       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ year      : int  1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 ...
 $ first_epid: int  19751011 19760918 19770924 19781007 19791013 19801115 19811003 19820925 19831008 19841006 ...
 $ last_epid : int  19760731 19770521 19780520 19790526 19800524 19810411 19820522 19830514 19840512 19850413 ...
 $ n_episodes: int  24 22 20 20 20 13 20 20 19 17 ...

str(cdata)

'data.frame':   614 obs. of  8 variables:
 $ aid            : chr  "A. Whitney Brown" "A. Whitney Brown" "A. Whitney Brown" "A. Whitney Brown" ...
 $ sid            : int  11 12 13 14 15 16 5 39 40 41 ...
 $ featured       : chr  "True" "True" "True" "True" ...
 $ first_epid     : num  19860222 NA NA NA NA ...
 $ last_epid      : num  NA NA NA NA NA NA NA NA NA NA ...
 $ update_anchor  : chr  "False" "False" "False" "False" ...
 $ n_episodes     : int  8 20 13 20 20 20 5 11 21 21 ...
 $ season_fraction: num  0.444 1 1 1 1 ...

str(adata)

'data.frame':   2306 obs. of  4 variables:
 $ aid   : chr  "Kate McKinnon" "Alex Moffat" "Ego Nwodim" "Chris Redd" ...
 $ url   : chr  "/Cast/?KaMc" "/Cast/?AlMo" "/Cast/?EgNw" "/Cast/?ChRe" ...
 $ type  : chr  "cast" "cast" "cast" "cast" ...
 $ gender: chr  "female" "male" "unknown" "male" ...

dim(adata)

[1] 2306    4

dim(sdata)

[1] 46  5

dim(cdata)

[1] 614   8

colnames(snldata)

Error in is.data.frame(x): object 'snldata' not found

dim(snldata)

Error in eval(expr, envir, enclos): object 'snldata' not found

summary(adata)

     aid                url                type              gender         
 Length:2306        Length:2306        Length:2306        Length:2306       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character

summary(cdata)

     aid                 sid          featured           first_epid      
 Length:614         Min.   : 1.00   Length:614         Min.   :19770115  
 Class :character   1st Qu.:15.00   Class :character   1st Qu.:19801215  
 Mode  :character   Median :26.00   Mode  :character   Median :19901110  
                    Mean   :25.47                      Mean   :19909634  
                    3rd Qu.:37.00                      3rd Qu.:19957839  
                    Max.   :46.00                      Max.   :20141025  
                                                       NA's   :564       
   last_epid        update_anchor        n_episodes    season_fraction  
 Min.   :19751011   Length:614         Min.   : 1.00   Min.   :0.04167  
 1st Qu.:19850112   Class :character   1st Qu.:19.00   1st Qu.:1.00000  
 Median :19950225   Mode  :character   Median :20.00   Median :1.00000  
 Mean   :19944038                      Mean   :18.73   Mean   :0.94827  
 3rd Qu.:20040117                      3rd Qu.:21.00   3rd Qu.:1.00000  
 Max.   :20140201                      Max.   :24.00   Max.   :1.00000  
 NA's   :597

summary(sdata)

      sid             year        first_epid         last_epid       
 Min.   : 1.00   Min.   :1975   Min.   :19751011   Min.   :19760731  
 1st Qu.:12.25   1st Qu.:1986   1st Qu.:19863512   1st Qu.:19872949  
 Median :23.50   Median :1998   Median :19975926   Median :19985512  
 Mean   :23.50   Mean   :1998   Mean   :19975965   Mean   :19985509  
 3rd Qu.:34.75   3rd Qu.:2009   3rd Qu.:20088423   3rd Qu.:20098015  
 Max.   :46.00   Max.   :2020   Max.   :20201003   Max.   :20210410  
   n_episodes  
 Min.   :12.0  
 1st Qu.:20.0  
 Median :20.0  
 Mean   :19.7  
 3rd Qu.:21.0  
 Max.   :24.0

Separate and well-organized data are contained in the data frames. The list of people who have appeared on the show, including actors, guests, musical guests, and staff members, is vast in the Actors Data Frame. Each item is specific to each actor and includes information about their gender and type of role. The Casts Data Frame, on the other hand, concentrates on actors who were a part of the cast during a specific season. Each item corresponds to a particular actor and contains details like the number of episodes they appeared in that season, the dates of their first and last appearances, whether they were a Weekend Update anchor, and their featured status. The Seasons Data Frame, contains details about each distinct season. Each entry is for a specific season and contains information about that season, including the year it debuted, the premiere and finale dates, and the total number of episodes.

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

sdata <- sdata %>%
  mutate(across(c(first_epid, last_epid), ymd))

cdata <- cdata %>%
  mutate(across(c(first_epid, last_epid), ymd))

Are there any variables that require mutation to be usable in your analysis stream? For example, do you need to calculate new values in order to graph them? Can string values be represented numerically? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

Document your work here.

Join Data

Be sure to include a sanity check, and double-check that case count is correct!

adata <- adata %>%
  filter(type == "cast") %>%
  select(-url)

member_count <- n_distinct(cdata$aid)
member_count

[1] 156

joined_data <- full_join(adata, cdata, by = "aid") %>%
        select(c(aid, gender, sid, featured, update_anchor))

head(joined_data)

            aid gender sid featured update_anchor
1 Kate McKinnon female  37     True         False
2 Kate McKinnon female  38     True         False
3 Kate McKinnon female  39    False         False
4 Kate McKinnon female  40    False         False
5 Kate McKinnon female  41    False         False
6 Kate McKinnon female  42    False         False