DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 8 Solution

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Challenge Tasks:

Challenge 8 Solution

challenge_8
shelton
snl
Joining Data
Author

Dane Shelton

Published

November 10, 2022

Challenge Tasks:

Today’s challenge is to:

  1. read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. mutate variables as needed (including sanity checks)
  4. join two or more data sets and analyze some aspect of the joined data

(be sure to only include the category tags for the data you use!)

  • 1. Read in data: SNL
  • 2. Tidy Data (as needed)
  • 3. Mutate
  • 4. Join Data

Read in and describe the following datasets, using R package readr and command read_csv.

  • snl_actors.csv
  • snl_casts.csv
snl_actors.csv

snl_actors.csv provides a list of all 2000+ personalities that have appeared on or worked on an SNL episode,type: whether they’re a guest or regular cast member, and their gender.

snl_casts.csv

snl_casts.csv goes more in depth, providing a list of each cast member for seasons 1 - 46 of the show (1975 - 2018). the 614 bservations are identified by several variables including cast member name, season id, whether they were the update_anchor, whether they were a featured cast member, and the total number of n_episodes they appeared in for the given season.

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

# Rename Columns and Remove First/Last/Percent
cast_join <- cast_og %>%
                rename("Name" = aid, 
                       "Season" = sid, 
                       "Featured"= featured,
                       "Anchor"=update_anchor,
                       "Episode Count"=n_episodes)%>%
                select(c(Name:Featured, 
                         Anchor:`Episode Count`))

# Rename and Remove URl
act_join <- act_og %>% 
                rename("Name" = aid, 
                       "Type" = type,
                       "Gender" = gender) %>%
                select(c("Name","Type","Gender" ))

Are there any variables that require mutation to be usable in your analysis stream?

Yes; I am going to mutate the Type variable for act_join. Gender is listed as unknown for most musical acts and celebrities, so I am going to create a Musician/Celeb category within Type. This isn’t a brilliant fix as Gender is correct for some musicians, so they will be listed as normal Guests. We’ll proceed anyways as it removes the identifier unknown and adds another layer to the data.

Document your work.

act_mutate <-  act_join%>%
              mutate(Type = 
                       case_when(`Gender` == 'unknown' ~ 'Musician/Celeb',
                                 `Type` == 'guest' ~ 'Guest',
                               `Type` == 'cast' ~ 'Cast',
                               `Type` == 'crew' ~ 'Crew',
                               ),
                     Gender=
                       case_when(`Gender` == 'male' ~ 'M',
                                 `Gender` == 'female' ~ 'F',
                                 `Gender` == 'unknown' ~ 'NA',
                                 `Gender` == 'andy' ~ 'NA'))
cast_join
act_mutate

For the cast_join data, our keys are Nameand Season; Name is also a foreign key related to the act_mutate data. Let’s first check to ensure there are no counts above 1 for cast_join primary keys.

# Primary Keys Count
cast_join %>%
  count(Name, Season)%>%
      filter(n>1)

Let’s use a variety of joins to combine act_mutate and cast_join

Mutating Join 1 - Full Join
# Mutating Join: All appearances: cast,crew,guest,celeb, musician
act_cast_full <- full_join(cast_join, act_mutate, by='Name')

act_cast_full

A full_join is used to match observations that share the same key, which is Name in cast_join and act_mutate. All observations are kept though, meaning we have several N/A in our data as not all guests have information cast members do.

Mutating Join 2 - Cast Only
# Left Joing: Only those that appeared in cast_join, could also use inner_join so only key matches

cast_only <- left_join(cast_join, act_mutate, by='Name')

cast_only 

A left_join(by='Name') is used to join the two datasets by Name keeping only the observations in the cast_join dataset with a matching name in act_mutate

Unique Cast Members
# How many unique cast members, how many appearances did they make total?
unique_cast <- cast_only %>% 
  group_by(Name)%>%
  summarize(Name, Gender, Type, `Total Apps` = sum(`Episode Count`))%>%
  distinct(Name, Gender, Type, `Total Apps`)%>%
  arrange(desc(`Total Apps`))

unique_cast <- unique_cast %>%
                mutate(Gender=
                         case_when(`Gender`=='M' ~ 'M',
                                   `Gender` == 'F'~ 'F',
                                   `Gender` == 'NA' ~ 'F'),
                       Type=('Cast'))
unique_cast

cast_only is used with other dplyr verbs to create a table of 156 unique cast members that worked on at least one season on SNL. They’re arranged by the total number of episodes they appeared in.

Filtering Join 1 - Non-Cast Members
# Just Actors: Non-Cast
act_only <- anti_join(act_mutate, cast_join)

act_only

A filtering join is used on the rows (mutating joins are for columns). An anti-join removes all the observations in act_mutate that are also present in cast_join, leaving us with only non-cast members. From this table we could filter out observations based on Gender or Type for interesting lists.

Source Code
---
title: "Challenge 8 Solution"
author: "Dane Shelton"
description: "Joining Data"
date: "11/10/2022"
format:
  html:
    df-print: paged
    callout-icon: false
    callout-appearance: simple
    toc: true
    code-copy: true
    code-tools: true
categories:
  - challenge_8
  - shelton
  - snl
---

```{r}
#| label: setup
#| echo: false
#| warning: false
#| message: false

library(tidyverse)
library(ggplot2)
library(summarytools)

knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)
```

## Challenge Tasks:

Today's challenge is to:

1)  read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
2)  tidy data (as needed, including sanity checks)
3)  mutate variables as needed (including sanity checks)
4)  join two or more data sets and analyze some aspect of the joined data

(be sure to only include the category tags for the data you use!)

:::panel-tabset

## 1. Read in data: SNL

Read in and describe the following datasets, using R package `readr` and command `read_csv`.

 
  - `snl_actors.csv`
  - `snl_casts.csv`
  

```{r}
#| label: Read In US HH
#| utput: true

# Read In Using ReadXl

cast_og <-  read_csv('_data/snl_casts.csv', show_col_types = F)
act_og <-  read_csv('_data/snl_actors.csv', show_col_types=F)
```

:::{.callout-note collapse="true"}
## `snl_actors.csv`
```{r}
#| label: head(act)
head(act_og, n=5)
```

`snl_actors.csv` provides a list of all 2000+ personalities that have appeared on or worked on an SNL episode,`type`: whether they're a guest or regular cast member, and their `gender`.
:::

:::{.callout-note collapse="true"}

## `snl_casts.csv`
```{r}
#| label: head(cast)
head(cast_og, n=5)
```

`snl_casts.csv` goes more in depth, providing a list of each cast member for seasons 1 - 46 of the show (1975 - 2018). the 614 bservations are identified by several variables including cast member `name`, `season id`, whether they were the `update_anchor`, whether they were a `featured` cast member, and the total number of `n_episodes` they appeared in for the given season. 

:::

```{r}
#| label: quick eda
#| include: false

(summarytools::dfSummary(cast_og))
```

## 2. Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

```{r}
#| label: quick tidy 
#| echo: true

# Rename Columns and Remove First/Last/Percent
cast_join <- cast_og %>%
                rename("Name" = aid, 
                       "Season" = sid, 
                       "Featured"= featured,
                       "Anchor"=update_anchor,
                       "Episode Count"=n_episodes)%>%
                select(c(Name:Featured, 
                         Anchor:`Episode Count`))

# Rename and Remove URl
act_join <- act_og %>% 
                rename("Name" = aid, 
                       "Type" = type,
                       "Gender" = gender) %>%
                select(c("Name","Type","Gender" ))

```

## 3. Mutate

Are there any variables that require mutation to be usable in your analysis stream? 

Yes; I am going to mutate the `Type` variable for `act_join`. `Gender` is listed as unknown for most musical acts and celebrities, so I am going to create a `Musician/Celeb` category within `Type`. This isn't a brilliant fix as `Gender` is correct for some musicians, so they will be listed as normal `Guests`. We'll proceed anyways as it removes the identifier `unknown` and adds another layer to the data.


Document your work.

```{r}
#| label: mutate act_join
#| echo: true

act_mutate <-  act_join%>%
              mutate(Type = 
                       case_when(`Gender` == 'unknown' ~ 'Musician/Celeb',
                                 `Type` == 'guest' ~ 'Guest',
                               `Type` == 'cast' ~ 'Cast',
                               `Type` == 'crew' ~ 'Crew',
                               ),
                     Gender=
                       case_when(`Gender` == 'male' ~ 'M',
                                 `Gender` == 'female' ~ 'F',
                                 `Gender` == 'unknown' ~ 'NA',
                                 `Gender` == 'andy' ~ 'NA'))

```

:::{.callout-note collapse="true"}

## `cast_join`
```{r}
#| label: cast_join
#| output: true

head(cast_join, n=5)

```
:::


:::{.callout-note collapse="true"}

## `act_mutate`
```{r}
#| label: act_mutate
#| output: true

head(act_mutate, n=5)

```
:::


## 4. Join Data

For the `cast_join` data, our keys are `Name`and `Season`; `Name` is also a foreign key related to the `act_mutate` data. Let's first check to ensure there are no counts above 1 for `cast_join` primary keys.

```{r}
#| label: cast_join Primary keys
#| include: true
#| echo: true

# Primary Keys Count
cast_join %>%
  count(Name, Season)%>%
      filter(n>1)
```

Let's use a variety of joins to combine `act_mutate` and `cast_join`

::: {.callout-note collapse="true"}

##  Mutating Join 1 - Full Join
```{r}
#| label: joins 1
#| include: true
#| echo: true

# Mutating Join: All appearances: cast,crew,guest,celeb, musician
act_cast_full <- full_join(cast_join, act_mutate, by='Name')

act_cast_full
```
A `full_join` is used to match observations that share the same key, which is `Name` in `cast_join` and `act_mutate`. All observations are kept though, meaning we have several N/A in our data as not all guests have information cast members do.
:::


::: {.callout-note collapse="true}

## Mutating Join 2 - Cast Only
```{r}
#| label: joins 2
#| echo: true
#| include: true

# Left Joing: Only those that appeared in cast_join, could also use inner_join so only key matches

cast_only <- left_join(cast_join, act_mutate, by='Name')

cast_only 
```

A `left_join(by='Name')` is used to join the two datasets by `Name` keeping only the observations in the `cast_join` dataset with a matching name in `act_mutate` 
:::


:::{.callout-note collapse="true}

## Unique Cast Members
```{r}
#| label: joins 3
#| echo: true
#| include: true

# How many unique cast members, how many appearances did they make total?
unique_cast <- cast_only %>% 
  group_by(Name)%>%
  summarize(Name, Gender, Type, `Total Apps` = sum(`Episode Count`))%>%
  distinct(Name, Gender, Type, `Total Apps`)%>%
  arrange(desc(`Total Apps`))

unique_cast <- unique_cast %>%
                mutate(Gender=
                         case_when(`Gender`=='M' ~ 'M',
                                   `Gender` == 'F'~ 'F',
                                   `Gender` == 'NA' ~ 'F'),
                       Type=('Cast'))
unique_cast

```

`cast_only` is used with other `dplyr` verbs to create a table of 156 unique cast members that worked on at least one season on SNL. They're arranged by the total number of episodes they appeared in.
:::


:::{.callout-note collapse="true"}

## Filtering Join 1 - Non-Cast Members
```{r}
#| label: joins 4 filtering
#| echo: true
#| include: true
      
# Just Actors: Non-Cast
act_only <- anti_join(act_mutate, cast_join)

act_only
```

A filtering join is used on the rows (mutating joins are for columns). An anti-join removes all the observations in `act_mutate` that are also present in `cast_join`, leaving us with only non-cast members. From this table we could filter out observations based on `Gender` or `Type` for interesting lists.
:::

:::