library(tidyverse)
library(ggplot2)
library(dplyr)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Challenge 8 Instructions
Challenge Overview
Today’s challenge is to:
- read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data (as needed, including sanity checks)
- mutate variables as needed (including sanity checks)
- join two or more data sets and analyze some aspect of the joined data
(be sure to only include the category tags for the data you use!)
Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- military marriages ⭐⭐
- faostat ⭐⭐
- railroads ⭐⭐⭐
- fed_rate ⭐⭐⭐
- debt ⭐⭐⭐
- us_hh ⭐⭐⭐⭐
- snl ⭐⭐⭐⭐⭐
<- read.csv(file = "_data/snl_actors.csv",
actors header=TRUE,
sep = ",")
<- read.csv(file = "_data/snl_casts.csv",
casts header=TRUE,
sep = ","
)
<- read.csv(file = "_data/snl_seasons.csv",
seasons header=TRUE,
sep = ","
)head(actors)
aid url type gender
1 Kate McKinnon /Cast/?KaMc cast female
2 Alex Moffat /Cast/?AlMo cast male
3 Ego Nwodim /Cast/?EgNw cast unknown
4 Chris Redd /Cast/?ChRe cast male
5 Kenan Thompson /Cast/?KeTh cast male
6 Carey Mulligan /Guests/?3677 guest andy
head(casts)
aid sid featured first_epid last_epid update_anchor n_episodes
1 A. Whitney Brown 11 True 19860222 NA False 8
2 A. Whitney Brown 12 True NA NA False 20
3 A. Whitney Brown 13 True NA NA False 13
4 A. Whitney Brown 14 True NA NA False 20
5 A. Whitney Brown 15 True NA NA False 20
6 A. Whitney Brown 16 True NA NA False 20
season_fraction
1 0.4444444
2 1.0000000
3 1.0000000
4 1.0000000
5 1.0000000
6 1.0000000
head(seasons)
sid year first_epid last_epid n_episodes
1 1 1975 19751011 19760731 24
2 2 1976 19760918 19770521 22
3 3 1977 19770924 19780520 20
4 4 1978 19781007 19790526 20
5 5 1979 19791013 19800524 20
6 6 1980 19801115 19810411 13
Briefly describe the data
The datasets are related to Saturday Night Live (SNL) and contain different types of information. The snl_casts dataset has data on the episodes and seasons in which actors were cast, while the snl_seasons dataset has information about each season, such as the number of episodes and the year it started. The snl_actors dataset contains information about each actor, including their role and type of appearance. The actors dataset has 2306 rows and 4 columns, the casts dataset has 614 rows and 8 columns, and the seasons dataset has 46 rows and 5 columns.
Tidy Data (as needed)
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
<- na.omit(actors)
actors #casts <- na.omit(casts)
#seasons <- na.omit(seasons)
Join Data
The casts dataframe has information about cast members in each season, but not their gender. Gender information is stored in the actors dataframe, which does not have information about the seasons. To merge the two dataframes, left_join() will be used based on actor ID. Guest stars and crew members will be filtered out using the filter() function. Finally, the count() function will be used to summarize gender makeup by season.
<- casts %>% left_join(seasons ,by="sid")
casts_seasons
#sanity check
head(casts_seasons)
aid sid featured first_epid.x last_epid.x update_anchor
1 A. Whitney Brown 11 True 19860222 NA False
2 A. Whitney Brown 12 True NA NA False
3 A. Whitney Brown 13 True NA NA False
4 A. Whitney Brown 14 True NA NA False
5 A. Whitney Brown 15 True NA NA False
6 A. Whitney Brown 16 True NA NA False
n_episodes.x season_fraction year first_epid.y last_epid.y n_episodes.y
1 8 0.4444444 1985 19851109 19860524 18
2 20 1.0000000 1986 19861011 19870523 20
3 13 1.0000000 1987 19871017 19880227 13
4 20 1.0000000 1988 19881008 19890520 20
5 20 1.0000000 1989 19890930 19900519 20
6 20 1.0000000 1990 19900929 19910518 20
<- casts_seasons %>%
seasons_casts_actors left_join(actors, by = "aid")
# sanity check
head(seasons_casts_actors)
aid sid featured first_epid.x last_epid.x update_anchor
1 A. Whitney Brown 11 True 19860222 NA False
2 A. Whitney Brown 12 True NA NA False
3 A. Whitney Brown 13 True NA NA False
4 A. Whitney Brown 14 True NA NA False
5 A. Whitney Brown 15 True NA NA False
6 A. Whitney Brown 16 True NA NA False
n_episodes.x season_fraction year first_epid.y last_epid.y n_episodes.y
1 8 0.4444444 1985 19851109 19860524 18
2 20 1.0000000 1986 19861011 19870523 20
3 13 1.0000000 1987 19871017 19880227 13
4 20 1.0000000 1988 19881008 19890520 20
5 20 1.0000000 1989 19890930 19900519 20
6 20 1.0000000 1990 19900929 19910518 20
url type gender
1 /Cast/?AWBr cast male
2 /Cast/?AWBr cast male
3 /Cast/?AWBr cast male
4 /Cast/?AWBr cast male
5 /Cast/?AWBr cast male
6 /Cast/?AWBr cast male
After left joining three table, we enriched this table by changing the date format
<- mutate( seasons_casts_actors,
seasons_casts_actors first_epid.x = as.Date( as.character(first_epid.x), format= "%Y%m%d"),
last_epid.y = as.Date(as.character(last_epid.x), format = "%Y%m%d"))
#sanity check
table(seasons_casts_actors$first_epid.x)
1977-01-15 1979-10-13 1979-11-17 1979-12-15 1980-01-26 1980-04-09 1980-12-13
1 1 2 1 3 2 3
1980-12-20 1981-04-11 1986-02-22 1986-03-22 1989-01-21 1989-03-25 1990-10-27
1 5 1 1 1 1 1
1990-11-10 1991-02-09 1991-11-16 1993-10-02 1993-10-09 1994-03-12 1995-01-14
2 2 2 1 2 1 1
1995-02-25 1995-04-08 1996-03-16 1999-10-23 2000-05-06 2002-03-02 2005-05-07
1 1 1 1 1 1 1
2005-11-12 2008-02-23 2008-11-15 2012-04-07 2014-01-18 2014-03-01 2014-10-25
1 1 2 1 1 1 1
head(seasons_casts_actors)
aid sid featured first_epid.x last_epid.x update_anchor
1 A. Whitney Brown 11 True 1986-02-22 NA False
2 A. Whitney Brown 12 True <NA> NA False
3 A. Whitney Brown 13 True <NA> NA False
4 A. Whitney Brown 14 True <NA> NA False
5 A. Whitney Brown 15 True <NA> NA False
6 A. Whitney Brown 16 True <NA> NA False
n_episodes.x season_fraction year first_epid.y last_epid.y n_episodes.y
1 8 0.4444444 1985 19851109 <NA> 18
2 20 1.0000000 1986 19861011 <NA> 20
3 13 1.0000000 1987 19871017 <NA> 13
4 20 1.0000000 1988 19881008 <NA> 20
5 20 1.0000000 1989 19890930 <NA> 20
6 20 1.0000000 1990 19900929 <NA> 20
url type gender
1 /Cast/?AWBr cast male
2 /Cast/?AWBr cast male
3 /Cast/?AWBr cast male
4 /Cast/?AWBr cast male
5 /Cast/?AWBr cast male
6 /Cast/?AWBr cast male
Visualization
library(ggplot2)
ggplot(data = seasons_casts_actors, aes(x = year, y = mean(n_episodes.x), fill = gender)) +
geom_bar(stat = "identity") +
labs(x = "Year", y = "Number of Episodes", title = "Number of Episodes per Season") +
theme_minimal()