Challenge 8

challenge_8
railroads
snl
faostat
debt
Joining Data
Author

Priya Marla

Published

January 25, 2023

library(tidyverse)
library(ggplot2)
library(here)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. mutate variables as needed (including sanity checks)
  4. join two or more data sets and analyze some aspect of the joined data

(be sure to only include the category tags for the data you use!)

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • military marriages ⭐⭐
  • faostat ⭐⭐
  • railroads ⭐⭐⭐
  • fed_rate ⭐⭐⭐
  • debt ⭐⭐⭐
  • us_hh ⭐⭐⭐⭐
  • snl ⭐⭐⭐⭐⭐
snl_actors <- here("posts","_data","snl_actors.csv") %>%
  read_csv()

snl_casts <- here("posts","_data","snl_casts.csv") %>%
  read_csv()

snl_seasons <- here("posts","_data","snl_seasons.csv") %>%
  read_csv()

snl_actors
# A tibble: 2,306 × 4
   aid            url           type  gender 
   <chr>          <chr>         <chr> <chr>  
 1 Kate McKinnon  /Cast/?KaMc   cast  female 
 2 Alex Moffat    /Cast/?AlMo   cast  male   
 3 Ego Nwodim     /Cast/?EgNw   cast  unknown
 4 Chris Redd     /Cast/?ChRe   cast  male   
 5 Kenan Thompson /Cast/?KeTh   cast  male   
 6 Carey Mulligan /Guests/?3677 guest andy   
 7 Marcus Mumford /Guests/?3679 guest male   
 8 Aidy Bryant    /Cast/?AiBr   cast  female 
 9 Steve Higgins  /Crew/?StHi   crew  male   
10 Mikey Day      /Cast/?MiDa   cast  male   
# … with 2,296 more rows
snl_casts
# A tibble: 614 × 8
   aid                sid featured first_epid last_epid update…¹ n_epi…² seaso…³
   <chr>            <dbl> <lgl>         <dbl>     <dbl> <lgl>      <dbl>   <dbl>
 1 A. Whitney Brown    11 TRUE       19860222        NA FALSE          8   0.444
 2 A. Whitney Brown    12 TRUE             NA        NA FALSE         20   1    
 3 A. Whitney Brown    13 TRUE             NA        NA FALSE         13   1    
 4 A. Whitney Brown    14 TRUE             NA        NA FALSE         20   1    
 5 A. Whitney Brown    15 TRUE             NA        NA FALSE         20   1    
 6 A. Whitney Brown    16 TRUE             NA        NA FALSE         20   1    
 7 Alan Zweibel         5 TRUE       19800409        NA FALSE          5   0.25 
 8 Sasheer Zamata      39 TRUE       20140118        NA FALSE         11   0.524
 9 Sasheer Zamata      40 TRUE             NA        NA FALSE         21   1    
10 Sasheer Zamata      41 FALSE            NA        NA FALSE         21   1    
# … with 604 more rows, and abbreviated variable names ¹​update_anchor,
#   ²​n_episodes, ³​season_fraction
snl_seasons
# A tibble: 46 × 5
     sid  year first_epid last_epid n_episodes
   <dbl> <dbl>      <dbl>     <dbl>      <dbl>
 1     1  1975   19751011  19760731         24
 2     2  1976   19760918  19770521         22
 3     3  1977   19770924  19780520         20
 4     4  1978   19781007  19790526         20
 5     5  1979   19791013  19800524         20
 6     6  1980   19801115  19810411         13
 7     7  1981   19811003  19820522         20
 8     8  1982   19820925  19830514         20
 9     9  1983   19831008  19840512         19
10    10  1984   19841006  19850413         17
# … with 36 more rows

Briefly describe the data

There are 3 datasets describing the data about various actors and their appearances in the saturday night live show. snl_actors dataset contains data about various actors ever involved with snl show. snl_casts dataset contains data about various actors appearances in snl. snl_seasons dataset contains details about seasons aired.

There are total 46 seasons with 906 episodes, 2306 actors from year 1975 to 2020.

length(unique(snl_seasons$sid))
[1] 46
length(unique(snl_actors$aid))
[1] 2306
min(snl_seasons$year)
[1] 1975
max(snl_seasons$year)
[1] 2020
sum(snl_seasons$n_episodes)
[1] 906

Analysis

Here our goal is to find the count of various genders in the saturday night live show over the years. To get the year, gender and count we have to combine 3 different tables i.e snl_actors, snl_casts and snl_seasons.

Join Data

Be sure to include a sanity check, and double-check that case count is correct!

From snl_actors table we get the “gender” column and will get get “sid” from snl_casts which can later be used to combine with snl_seasons. We can combine this with snl_casts as they have a common column “aid” and name the new dataframe as actors_casts. We then combine actors_casts with snl_seasons where we get the “year”.

#Merging snl_actors and snl_casts
actors_casts <- merge(snl_actors, snl_casts, by = 'aid')

#merging actors_casts and snl_seasons based on gender
actors_casts_seasons_gender <- actors_casts %>%
  merge(snl_seasons, by = 'sid') %>%
  group_by(sid, year, gender) %>%
  count(sid, gender)
actors_casts_seasons_gender
# A tibble: 95 × 4
# Groups:   sid, year, gender [95]
     sid  year gender     n
   <dbl> <dbl> <chr>  <int>
 1     1  1975 female     3
 2     1  1975 male       6
 3     2  1976 female     3
 4     2  1976 male       5
 5     3  1977 female     3
 6     3  1977 male       6
 7     4  1978 female     3
 8     4  1978 male       6
 9     5  1979 female     3
10     5  1979 male      12
# … with 85 more rows
#plotting graph
actors_casts_seasons_gender %>%
  ggplot(aes(year, n, col = gender)) +
  geom_line() +
  ylab("count") +
  ggtitle("Various gender count over the years")

#merging actors_casts and snl_seasons
actors_casts_seasons <- actors_casts %>%
  merge(snl_seasons, by = 'sid') %>%
  group_by(sid, year) %>%
  count(sid)
actors_casts_seasons
# A tibble: 46 × 3
# Groups:   sid, year [46]
     sid  year     n
   <dbl> <dbl> <int>
 1     1  1975     9
 2     2  1976     8
 3     3  1977     9
 4     4  1978     9
 5     5  1979    15
 6     6  1980    15
 7     7  1981     8
 8     8  1982     8
 9     9  1983     9
10    10  1984    10
# … with 36 more rows
mean(actors_casts_seasons$n)
[1] 13.34783
min(actors_casts_seasons$n)
[1] 8
max(actors_casts_seasons$n)
[1] 20
#plotting graph
actors_casts_seasons %>%
  ggplot(aes(x = year)) +
  geom_line(aes(y = n), color = "blue") +
  geom_point(aes(y = n), color = "red") +
  ylab("count") +
  ggtitle("Count of cast over the years")

From the above visualization we can see that there is rise in number of employees in each season over the years. Number of females are consistently less less than the number of males. The second graph shows that the number of actors over the years. We can infer that there has been increase in the actors over years with minimum number of actors per season are 8 recorded in between 1975 and 1985 and maximum are 20 in 2020.