library(tidyverse)
library(ggplot2)
library(lubridate)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Challenge 8
Challenge Overview
Today’s challenge is to:
- read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data (as needed, including sanity checks)
- mutate variables as needed (including sanity checks)
- join two or more data sets and analyze some aspect of the joined data
(be sure to only include the category tags for the data you use!)
Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- faostat ⭐⭐
- railroads ⭐⭐⭐
- fed_rate ⭐⭐⭐
- debt ⭐⭐⭐
- us_hh ⭐⭐⭐⭐
- snl ⭐⭐⭐⭐⭐
library(readr)
<- read_csv("_data/snl_seasons.csv", show_col_types = FALSE)
seasons
<- read_csv("_data/snl_casts.csv", show_col_types = FALSE)
casts
<- read_csv("_data/snl_actors.csv", show_col_types = FALSE) actors
Briefly describe the data
Tidy Data (as needed)
“Seasons” and “Casts” both use the column title “n_episodes”, but they mean different things- the former refers to number of episodes aired in the season, the latter refers to the number of episodes a cast member appeared in during the season. Similarly, they both use “first_epid” and “last_epid”. I will rename the columns accordingly.
#rename columns
<- rename(seasons, "n_eps_aired" = "n_episodes")
seasons
<- rename(casts, "n_eps_seen" = "n_episodes",
casts "first_seen" = "first_epid",
"last_seen" = "last_epid")
Are there any variables that require mutation to be usable in your analysis stream? For example, do you need to calculate new values in order to graph them? Can string values be represented numerically? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Document your work here.
#mutate season dates for readability
%>%
seasons mutate(first_epid = ymd(first_epid), last_epid = ymd(last_epid))
# A tibble: 46 × 5
sid year first_epid last_epid n_eps_aired
<dbl> <dbl> <date> <date> <dbl>
1 1 1975 1975-10-11 1976-07-31 24
2 2 1976 1976-09-18 1977-05-21 22
3 3 1977 1977-09-24 1978-05-20 20
4 4 1978 1978-10-07 1979-05-26 20
5 5 1979 1979-10-13 1980-05-24 20
6 6 1980 1980-11-15 1981-04-11 13
7 7 1981 1981-10-03 1982-05-22 20
8 8 1982 1982-09-25 1983-05-14 20
9 9 1983 1983-10-08 1984-05-12 19
10 10 1984 1984-10-06 1985-04-13 17
# … with 36 more rows
# ℹ Use `print(n = ...)` to see more rows
%>%
casts mutate(first_seen = ymd(first_seen), last_seen = ymd(last_seen))
# A tibble: 614 × 8
aid sid featured first_seen last_seen update…¹ n_eps…² seaso…³
<chr> <dbl> <lgl> <date> <date> <lgl> <dbl> <dbl>
1 A. Whitney Brown 11 TRUE 1986-02-22 NA FALSE 8 0.444
2 A. Whitney Brown 12 TRUE NA NA FALSE 20 1
3 A. Whitney Brown 13 TRUE NA NA FALSE 13 1
4 A. Whitney Brown 14 TRUE NA NA FALSE 20 1
5 A. Whitney Brown 15 TRUE NA NA FALSE 20 1
6 A. Whitney Brown 16 TRUE NA NA FALSE 20 1
7 Alan Zweibel 5 TRUE 1980-04-09 NA FALSE 5 0.25
8 Sasheer Zamata 39 TRUE 2014-01-18 NA FALSE 11 0.524
9 Sasheer Zamata 40 TRUE NA NA FALSE 21 1
10 Sasheer Zamata 41 FALSE NA NA FALSE 21 1
# … with 604 more rows, and abbreviated variable names ¹update_anchor,
# ²n_eps_seen, ³season_fraction
# ℹ Use `print(n = ...)` to see more rows
Join Data
<- inner_join(seasons, casts, "sid")
snl <- inner_join(snl, actors, "aid")
snl snl
# A tibble: 614 × 15
sid year first_epid last_e…¹ n_eps…² aid featu…³ first…⁴ last_…⁵ updat…⁶
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <lgl> <dbl> <dbl> <lgl>
1 1 1975 19751011 19760731 24 Gild… FALSE NA NA FALSE
2 1 1975 19751011 19760731 24 Mich… FALSE NA 1.98e7 FALSE
3 1 1975 19751011 19760731 24 Lara… FALSE NA NA FALSE
4 1 1975 19751011 19760731 24 Garr… FALSE NA NA FALSE
5 1 1975 19751011 19760731 24 Jane… FALSE NA NA FALSE
6 1 1975 19751011 19760731 24 Geor… FALSE NA 1.98e7 FALSE
7 1 1975 19751011 19760731 24 Chev… FALSE NA NA TRUE
8 1 1975 19751011 19760731 24 John… FALSE NA NA FALSE
9 1 1975 19751011 19760731 24 Dan … FALSE NA NA FALSE
10 2 1976 19760918 19770521 22 Gild… FALSE NA NA FALSE
# … with 604 more rows, 5 more variables: n_eps_seen <dbl>,
# season_fraction <dbl>, url <chr>, type <chr>, gender <chr>, and abbreviated
# variable names ¹last_epid, ²n_eps_aired, ³featured, ⁴first_seen,
# ⁵last_seen, ⁶update_anchor
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
Apparently the mutating of dates doesn’t carry over to the join, so I’ll do that again.
#remove year column
<- snl%>%
snl select(-year)
#mutate season dates for readability
%>%
snl mutate(first_epid = ymd(first_epid),
last_epid = ymd(last_epid),
first_seen = ymd(first_seen),
last_seen = ymd(last_seen))
# A tibble: 614 × 14
sid first_epid last_epid n_eps_aired aid featu…¹ first_seen last_seen
<dbl> <date> <date> <dbl> <chr> <lgl> <date> <date>
1 1 1975-10-11 1976-07-31 24 Gilda … FALSE NA NA
2 1 1975-10-11 1976-07-31 24 Michae… FALSE NA 1975-11-08
3 1 1975-10-11 1976-07-31 24 Larain… FALSE NA NA
4 1 1975-10-11 1976-07-31 24 Garret… FALSE NA NA
5 1 1975-10-11 1976-07-31 24 Jane C… FALSE NA NA
6 1 1975-10-11 1976-07-31 24 George… FALSE NA 1975-10-11
7 1 1975-10-11 1976-07-31 24 Chevy … FALSE NA NA
8 1 1975-10-11 1976-07-31 24 John B… FALSE NA NA
9 1 1975-10-11 1976-07-31 24 Dan Ay… FALSE NA NA
10 2 1976-09-18 1977-05-21 22 Gilda … FALSE NA NA
# … with 604 more rows, 6 more variables: update_anchor <lgl>,
# n_eps_seen <dbl>, season_fraction <dbl>, url <chr>, type <chr>,
# gender <chr>, and abbreviated variable name ¹featured
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names