library(tidyverse)
library(ggplot2)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Challenge 8
Challenge Overview
Today’s challenge is to:
- read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data (as needed, including sanity checks)
- mutate variables as needed (including sanity checks)
- join two or more data sets and analyze some aspect of the joined data
(be sure to only include the category tags for the data you use!)
Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- military marriages ⭐⭐
- faostat ⭐⭐
- railroads ⭐⭐⭐
- fed_rate ⭐⭐⭐
- debt ⭐⭐⭐
- us_hh ⭐⭐⭐⭐
- snl ⭐⭐⭐⭐⭐
snl_actors <- read_csv("_data/snl_actors.csv")
head(snl_actors)
# A tibble: 6 × 4
aid url type gender
<chr> <chr> <chr> <chr>
1 Kate McKinnon /Cast/?KaMc cast female
2 Alex Moffat /Cast/?AlMo cast male
3 Ego Nwodim /Cast/?EgNw cast unknown
4 Chris Redd /Cast/?ChRe cast male
5 Kenan Thompson /Cast/?KeTh cast male
6 Carey Mulligan /Guests/?3677 guest andy
snl_casts <- read_csv("_data/snl_casts.csv")
head(snl_casts)
# A tibble: 6 × 8
aid sid featured first_epid last_epid update_…¹ n_epi…² seaso…³
<chr> <dbl> <lgl> <dbl> <dbl> <lgl> <dbl> <dbl>
1 A. Whitney Brown 11 TRUE 19860222 NA FALSE 8 0.444
2 A. Whitney Brown 12 TRUE NA NA FALSE 20 1
3 A. Whitney Brown 13 TRUE NA NA FALSE 13 1
4 A. Whitney Brown 14 TRUE NA NA FALSE 20 1
5 A. Whitney Brown 15 TRUE NA NA FALSE 20 1
6 A. Whitney Brown 16 TRUE NA NA FALSE 20 1
# … with abbreviated variable names ¹update_anchor, ²n_episodes,
# ³season_fraction
snl_seasons <- read_csv("_data/snl_seasons.csv")
head(snl_seasons)
# A tibble: 6 × 5
sid year first_epid last_epid n_episodes
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1975 19751011 19760731 24
2 2 1976 19760918 19770521 22
3 3 1977 19770924 19780520 20
4 4 1978 19781007 19790526 20
5 5 1979 19791013 19800524 20
6 6 1980 19801115 19810411 13
colnames(snl_actors)
[1] "aid" "url" "type" "gender"
colnames(snl_casts)
[1] "aid" "sid" "featured" "first_epid"
[5] "last_epid" "update_anchor" "n_episodes" "season_fraction"
colnames(snl_seasons)
[1] "sid" "year" "first_epid" "last_epid" "n_episodes"
Briefly describe the data
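A reasonable starting point for the description is to check how many rows each table has and whether its key columns are unique. The sketch below is only illustrative: `actors_toy` and `casts_toy` are hypothetical stand-ins built from a few of the rows printed above, not the full files.

```r
library(dplyr)
library(tibble)

# Toy stand-ins copied from the head() output above (not the full files)
actors_toy <- tibble(
  aid    = c("Kate McKinnon", "Alex Moffat", "Carey Mulligan"),
  type   = c("cast", "cast", "guest"),
  gender = c("female", "male", "andy")
)
casts_toy <- tibble(
  aid = c("A. Whitney Brown", "A. Whitney Brown"),
  sid = c(11, 12)
)

# Is there exactly one row per actor?
actors_unique <- n_distinct(actors_toy$aid) == nrow(actors_toy)

# Is there exactly one row per actor-season pair?
casts_unique <- nrow(distinct(casts_toy, aid, sid)) == nrow(casts_toy)

# How many actors of each type are there?
type_counts <- count(actors_toy, type)
```

Run against the real tables, the same checks would show whether `aid` identifies an actor and whether `aid` plus `sid` identifies a row of `snl_casts`.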
Tidy Data (as needed)
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
# A data frame cannot be called like a function, so snl_actors(na.rm = True)
# fails; R's logical constant is also TRUE, not True.
# Count the missing values in each column instead, without changing the data.
snl_actors %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
snl_casts %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
snl_seasons %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
Are there any variables that require mutation to be usable in your analysis stream? For example, do you need to calculate new values in order to graph them? Can string values be represented numerically? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Document your work here.
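Two mutations that might be useful here are parsing the numeric episode IDs as dates and turning `gender` into a factor. This is a minimal sketch on toy rows copied from the output above. It assumes that `first_epid` encodes a date as `yyyymmdd`, which the printed values suggest but the source does not state. The names `first_ep_date`, `casts_toy`, and `actors_toy` are hypothetical, and the factor level order is an arbitrary choice.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Toy rows copied from the head() output above (not the full files)
casts_toy <- tibble(
  aid        = c("A. Whitney Brown", "A. Whitney Brown"),
  sid        = c(11, 12),
  first_epid = c(19860222, NA)
)
actors_toy <- tibble(
  aid    = c("Kate McKinnon", "Ego Nwodim", "Carey Mulligan"),
  gender = c("female", "unknown", "andy")
)

# Parse the yyyymmdd number as a Date; missing IDs stay NA
casts_dated <- casts_toy %>%
  mutate(first_ep_date = ymd(first_epid))

# Store gender as a factor with an explicit level order for plotting
actors_typed <- actors_toy %>%
  mutate(gender = factor(gender, levels = c("female", "male", "unknown", "andy")))
```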
Join Data
Be sure to include a sanity check, and double-check that case count is correct!
snl_data <- snl_actors %>%
  select(aid, gender) %>%
  left_join(snl_casts, by = "aid")
head(snl_data)
# A tibble: 6 × 9
aid gender sid featured first_epid last_…¹ updat…² n_epi…³ seaso…⁴
<chr> <chr> <dbl> <lgl> <dbl> <dbl> <lgl> <dbl> <dbl>
1 Kate McKinnon female 37 TRUE 20120407 NA FALSE 5 0.227
2 Kate McKinnon female 38 TRUE NA NA FALSE 21 1
3 Kate McKinnon female 39 FALSE NA NA FALSE 21 1
4 Kate McKinnon female 40 FALSE NA NA FALSE 21 1
5 Kate McKinnon female 41 FALSE NA NA FALSE 21 1
6 Kate McKinnon female 42 FALSE NA NA FALSE 21 1
# … with abbreviated variable names ¹last_epid, ²update_anchor, ³n_episodes,
# ⁴season_fraction
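The case-count check requested for the join could be sketched as follows. If `aid` is unique in the actors table, a left join should return one row per matching cast row plus one row for every actor with no cast row. The toy tibbles are hypothetical stand-ins built from rows shown above.

```r
library(dplyr)
library(tibble)

# Toy stand-ins (not the full files)
actors_toy <- tibble(
  aid    = c("Kate McKinnon", "Carey Mulligan"),
  gender = c("female", "andy")
)
casts_toy <- tibble(
  aid = c("Kate McKinnon", "Kate McKinnon"),
  sid = c(37, 38)
)

joined_toy <- left_join(actors_toy, casts_toy, by = "aid")

# Actors with no cast row keep a single row filled with NA
unmatched <- anti_join(actors_toy, casts_toy, by = "aid")

# Cast rows that belong to an actor present in the actors table
matched_cast_rows <- semi_join(casts_toy, actors_toy, by = "aid")

# Expected size of the left join, assuming aid is unique in actors_toy
expected_rows <- nrow(matched_cast_rows) + nrow(unmatched)
stopifnot(nrow(joined_toy) == expected_rows)
```

The same comparison on the real tables would confirm whether `snl_data` has the expected number of rows, and `unmatched` would list the actors whose cast columns are all `NA`.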
Visualize Data
snl_data %>%
  ggplot(aes(x = gender, y = first_epid)) +
  geom_boxplot()
snl_data %>%
  ggplot(aes(gender)) +
  geom_bar()
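The gender column contains two values besides male and female: andy and unknown. Rows with these values have no first_epid, so they are missing from the boxplot even though they are counted in the bar chart. One likely reason is that some of these actors are guests (for example, Carey Mulligan) who do not appear in snl_casts, so the left join fills their cast columns with NA.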
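One way to confirm this reading would be to count, for each gender value, how many joined rows actually carry a `first_epid`. This is a sketch on a toy table. The first two rows come from the join output above, while the `unknown` and `andy` rows are hypothetical and only mirror the claim being checked.

```r
library(dplyr)
library(tibble)

# Toy stand-in for snl_data; the unknown and andy rows are hypothetical
joined_toy <- tibble(
  gender     = c("female", "female", "unknown", "andy"),
  first_epid = c(20120407, NA, NA, NA)
)

# Rows per gender, and how many of them have a non-missing first_epid
na_by_gender <- joined_toy %>%
  group_by(gender) %>%
  summarise(
    n_rows            = n(),
    n_with_first_epid = sum(!is.na(first_epid))
  )
```

A gender value whose `n_with_first_epid` is zero contributes nothing to the boxplot.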