library(tidyverse)
library(ggplot2)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Challenge 8
Challenge Overview
Today’s challenge is to:
- read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data (as needed, including sanity checks)
- mutate variables as needed (including sanity checks)
- join two or more data sets and analyze some aspect of the joined data
(be sure to only include the category tags for the data you use!)
Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- military marriages ⭐⭐
- faostat ⭐⭐
- railroads ⭐⭐⭐
- fed_rate ⭐⭐⭐
- debt ⭐⭐⭐
- us_hh ⭐⭐⭐⭐
- snl ⭐⭐⭐⭐⭐
<- read_csv("_data/snl_actors.csv")
actors actors
# A tibble: 2,306 × 4
aid url type gender
<chr> <chr> <chr> <chr>
1 Kate McKinnon /Cast/?KaMc cast female
2 Alex Moffat /Cast/?AlMo cast male
3 Ego Nwodim /Cast/?EgNw cast unknown
4 Chris Redd /Cast/?ChRe cast male
5 Kenan Thompson /Cast/?KeTh cast male
6 Carey Mulligan /Guests/?3677 guest andy
7 Marcus Mumford /Guests/?3679 guest male
8 Aidy Bryant /Cast/?AiBr cast female
9 Steve Higgins /Crew/?StHi crew male
10 Mikey Day /Cast/?MiDa cast male
# ℹ 2,296 more rows
<- read_csv("_data/snl_casts.csv")
casts casts
# A tibble: 614 × 8
aid sid featured first_epid last_epid update_anchor n_episodes
<chr> <dbl> <lgl> <dbl> <dbl> <lgl> <dbl>
1 A. Whitney Brown 11 TRUE 19860222 NA FALSE 8
2 A. Whitney Brown 12 TRUE NA NA FALSE 20
3 A. Whitney Brown 13 TRUE NA NA FALSE 13
4 A. Whitney Brown 14 TRUE NA NA FALSE 20
5 A. Whitney Brown 15 TRUE NA NA FALSE 20
6 A. Whitney Brown 16 TRUE NA NA FALSE 20
7 Alan Zweibel 5 TRUE 19800409 NA FALSE 5
8 Sasheer Zamata 39 TRUE 20140118 NA FALSE 11
9 Sasheer Zamata 40 TRUE NA NA FALSE 21
10 Sasheer Zamata 41 FALSE NA NA FALSE 21
# ℹ 604 more rows
# ℹ 1 more variable: season_fraction <dbl>
<- read_csv("_data/snl_seasons.csv")
seasons seasons
# A tibble: 46 × 5
sid year first_epid last_epid n_episodes
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1975 19751011 19760731 24
2 2 1976 19760918 19770521 22
3 3 1977 19770924 19780520 20
4 4 1978 19781007 19790526 20
5 5 1979 19791013 19800524 20
6 6 1980 19801115 19810411 13
7 7 1981 19811003 19820522 20
8 8 1982 19820925 19830514 20
9 9 1983 19831008 19840512 19
10 10 1984 19841006 19850413 17
# ℹ 36 more rows
Briefly describe the data
colnames(actors)
[1] "aid" "url" "type" "gender"
colnames(casts)
[1] "aid" "sid" "featured" "first_epid"
[5] "last_epid" "update_anchor" "n_episodes" "season_fraction"
colnames(seasons)
[1] "sid" "year" "first_epid" "last_epid" "n_episodes"
The actors dataset provides general cast and gender information for the cast on SNL. The casts dataset tells us which SNL seasons each actor participated in (marked by “sid” variable) and details about each actor’s participation in that season. The seasons dataset provides general information about different SNL seasons only, such as the year the season aired, the start and end date, and the number of episodes in that season.
Tidy Data (as needed)
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
This data is already tidy because each relevant variable has its own column, each case its own row, and each value its own cell. One change we could possibly make is collapsing the first episode and last episode dates into one column for the casts and seasons datasets, but that would make us repeat actor names and season IDs which could look untidy.
Are there any variables that require mutation to be usable in your analysis stream? For example, do you need to calculate new values in order to graph them? Can string values be represented numerically? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Document your work here.
<- casts %>%
casts mutate(first_epid = ymd(first_epid), last_epid = ymd(last_epid))
casts
# A tibble: 614 × 8
aid sid featured first_epid last_epid update_anchor n_episodes
<chr> <dbl> <lgl> <date> <date> <lgl> <dbl>
1 A. Whitney Bro… 11 TRUE 1986-02-22 NA FALSE 8
2 A. Whitney Bro… 12 TRUE NA NA FALSE 20
3 A. Whitney Bro… 13 TRUE NA NA FALSE 13
4 A. Whitney Bro… 14 TRUE NA NA FALSE 20
5 A. Whitney Bro… 15 TRUE NA NA FALSE 20
6 A. Whitney Bro… 16 TRUE NA NA FALSE 20
7 Alan Zweibel 5 TRUE 1980-04-09 NA FALSE 5
8 Sasheer Zamata 39 TRUE 2014-01-18 NA FALSE 11
9 Sasheer Zamata 40 TRUE NA NA FALSE 21
10 Sasheer Zamata 41 FALSE NA NA FALSE 21
# ℹ 604 more rows
# ℹ 1 more variable: season_fraction <dbl>
<- seasons %>%
seasons mutate(first_epid = ymd(first_epid), last_epid = ymd(last_epid))
seasons
# A tibble: 46 × 5
sid year first_epid last_epid n_episodes
<dbl> <dbl> <date> <date> <dbl>
1 1 1975 1975-10-11 1976-07-31 24
2 2 1976 1976-09-18 1977-05-21 22
3 3 1977 1977-09-24 1978-05-20 20
4 4 1978 1978-10-07 1979-05-26 20
5 5 1979 1979-10-13 1980-05-24 20
6 6 1980 1980-11-15 1981-04-11 13
7 7 1981 1981-10-03 1982-05-22 20
8 8 1982 1982-09-25 1983-05-14 20
9 9 1983 1983-10-08 1984-05-12 19
10 10 1984 1984-10-06 1985-04-13 17
# ℹ 36 more rows
The first and last episode date columns in the casts and seasons datasets are not in the correct format. Hence, I mutated them into the lubridate ymd() format.
Join Data
<- actors %>%
merged left_join(casts, by = "aid") %>%
rename("actor_first_epid" = "first_epid", "actor_last_epid" = "last_epid", "n_actor_episodes" = "n_episodes") %>%
left_join(seasons, by = "sid") %>%
rename("season_first_epid" = "first_epid", "season_last_epid" = "last_epid") %>%
select(1, 4, 3, 2, 5, 6, 12:14, 7:11) %>%
arrange(season_first_epid)
merged
# A tibble: 2,764 × 14
aid gender type url sid featured year season_first_epid
<chr> <chr> <chr> <chr> <dbl> <lgl> <dbl> <date>
1 Chevy Chase male cast /Cast… 1 FALSE 1975 1975-10-11
2 Dan Aykroyd male cast /Cast… 1 FALSE 1975 1975-10-11
3 Garrett Morris male cast /Cast… 1 FALSE 1975 1975-10-11
4 John Belushi male cast /Cast… 1 FALSE 1975 1975-10-11
5 Michael O'Donoghue male cast /Cast… 1 FALSE 1975 1975-10-11
6 Jane Curtin female cast /Cast… 1 FALSE 1975 1975-10-11
7 Laraine Newman female cast /Cast… 1 FALSE 1975 1975-10-11
8 Gilda Radner female cast /Cast… 1 FALSE 1975 1975-10-11
9 George Coe male cast /Cast… 1 FALSE 1975 1975-10-11
10 Bill Murray male cast /Cast… 2 FALSE 1976 1976-09-18
# ℹ 2,754 more rows
# ℹ 6 more variables: season_last_epid <date>, actor_first_epid <date>,
# actor_last_epid <date>, update_anchor <lgl>, n_actor_episodes <dbl>,
# season_fraction <dbl>
We want to merge all these datasets into one. First, we match all the actor IDs in actors to those in casts and merge them using left_join(). We then rename the date and episode count columns so we can heave a clear differentiation when we make the next merge. Now we have a dataset with general actor information and their cast participation information for different SNL seasons. Finally, we match all the season IDs in this merged table to those in seasons and merge them using left_join(). We rename the newly added date columns such that they correspond to entries from the seasons dataset. Now, we have a fully joined dataset - merged - that contains general actor information, cast information for each actor, and general information for each season that each actor participated in. I have rearranged this dataset to look more coherent using select() and arranged the entries in ascending order of season start date.
Visualizations
%>%
merged ggplot(aes(sid, fill = gender)) +
geom_histogram(bins = 20) +
theme_minimal() +
labs(title = "Bar Plot", subtitle = "Cast composition for different SNL seasons", y = "No. of cast members", x = "Season ID", fill = "Gender")
This graphs tells us the number of cast members for different SNL seasons (denoted by their season IDs) and the gender composition of the casts for each season.