library(tidyverse)
library(ggplot2)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Challenge 8
Challenge Overview
Today’s challenge is to:
- read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data (as needed, including sanity checks)
- mutate variables as needed (including sanity checks)
- join two or more data sets and analyze some aspect of the joined data
(be sure to only include the category tags for the data you use!)
Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- military marriages ⭐⭐
- faostat ⭐⭐
- railroads ⭐⭐⭐
- fed_rate ⭐⭐⭐
- debt ⭐⭐⭐
- us_hh ⭐⭐⭐⭐
- snl ⭐⭐⭐⭐⭐
adata <- read.csv("~/Documents/601/601_Spring_2023/posts/_data/snl_actors.csv")
cdata <- read.csv("~/Documents/601/601_Spring_2023/posts/_data/snl_casts.csv")
sdata <- read.csv("~/Documents/601/601_Spring_2023/posts/_data/snl_seasons.csv")
Briefly describe the data
head(sdata)
sid year first_epid last_epid n_episodes
1 1 1975 19751011 19760731 24
2 2 1976 19760918 19770521 22
3 3 1977 19770924 19780520 20
4 4 1978 19781007 19790526 20
5 5 1979 19791013 19800524 20
6 6 1980 19801115 19810411 13
head(cdata)
aid sid featured first_epid last_epid update_anchor n_episodes
1 A. Whitney Brown 11 True 19860222 NA False 8
2 A. Whitney Brown 12 True NA NA False 20
3 A. Whitney Brown 13 True NA NA False 13
4 A. Whitney Brown 14 True NA NA False 20
5 A. Whitney Brown 15 True NA NA False 20
6 A. Whitney Brown 16 True NA NA False 20
season_fraction
1 0.4444444
2 1.0000000
3 1.0000000
4 1.0000000
5 1.0000000
6 1.0000000
head(adata)
aid url type gender
1 Kate McKinnon /Cast/?KaMc cast female
2 Alex Moffat /Cast/?AlMo cast male
3 Ego Nwodim /Cast/?EgNw cast unknown
4 Chris Redd /Cast/?ChRe cast male
5 Kenan Thompson /Cast/?KeTh cast male
6 Carey Mulligan /Guests/?3677 guest andy
str(sdata)
'data.frame': 46 obs. of 5 variables:
$ sid : int 1 2 3 4 5 6 7 8 9 10 ...
$ year : int 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 ...
$ first_epid: int 19751011 19760918 19770924 19781007 19791013 19801115 19811003 19820925 19831008 19841006 ...
$ last_epid : int 19760731 19770521 19780520 19790526 19800524 19810411 19820522 19830514 19840512 19850413 ...
$ n_episodes: int 24 22 20 20 20 13 20 20 19 17 ...
str(cdata)
'data.frame': 614 obs. of 8 variables:
$ aid : chr "A. Whitney Brown" "A. Whitney Brown" "A. Whitney Brown" "A. Whitney Brown" ...
$ sid : int 11 12 13 14 15 16 5 39 40 41 ...
$ featured : chr "True" "True" "True" "True" ...
$ first_epid : num 19860222 NA NA NA NA ...
$ last_epid : num NA NA NA NA NA NA NA NA NA NA ...
$ update_anchor : chr "False" "False" "False" "False" ...
$ n_episodes : int 8 20 13 20 20 20 5 11 21 21 ...
$ season_fraction: num 0.444 1 1 1 1 ...
str(adata)
'data.frame': 2306 obs. of 4 variables:
$ aid : chr "Kate McKinnon" "Alex Moffat" "Ego Nwodim" "Chris Redd" ...
$ url : chr "/Cast/?KaMc" "/Cast/?AlMo" "/Cast/?EgNw" "/Cast/?ChRe" ...
$ type : chr "cast" "cast" "cast" "cast" ...
$ gender: chr "female" "male" "unknown" "male" ...
dim(adata)
[1] 2306 4
dim(sdata)
[1] 46 5
dim(cdata)
[1] 614 8
colnames(snldata)
Error in is.data.frame(x): object 'snldata' not found
dim(snldata)
Error in eval(expr, envir, enclos): object 'snldata' not found
summary(adata)
aid url type gender
Length:2306 Length:2306 Length:2306 Length:2306
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
summary(cdata)
aid sid featured first_epid
Length:614 Min. : 1.00 Length:614 Min. :19770115
Class :character 1st Qu.:15.00 Class :character 1st Qu.:19801215
Mode :character Median :26.00 Mode :character Median :19901110
Mean :25.47 Mean :19909634
3rd Qu.:37.00 3rd Qu.:19957839
Max. :46.00 Max. :20141025
NA's :564
last_epid update_anchor n_episodes season_fraction
Min. :19751011 Length:614 Min. : 1.00 Min. :0.04167
1st Qu.:19850112 Class :character 1st Qu.:19.00 1st Qu.:1.00000
Median :19950225 Mode :character Median :20.00 Median :1.00000
Mean :19944038 Mean :18.73 Mean :0.94827
3rd Qu.:20040117 3rd Qu.:21.00 3rd Qu.:1.00000
Max. :20140201 Max. :24.00 Max. :1.00000
NA's :597
summary(sdata)
sid year first_epid last_epid
Min. : 1.00 Min. :1975 Min. :19751011 Min. :19760731
1st Qu.:12.25 1st Qu.:1986 1st Qu.:19863512 1st Qu.:19872949
Median :23.50 Median :1998 Median :19975926 Median :19985512
Mean :23.50 Mean :1998 Mean :19975965 Mean :19985509
3rd Qu.:34.75 3rd Qu.:2009 3rd Qu.:20088423 3rd Qu.:20098015
Max. :46.00 Max. :2020 Max. :20201003 Max. :20210410
n_episodes
Min. :12.0
1st Qu.:20.0
Median :20.0
Mean :19.7
3rd Qu.:21.0
Max. :24.0
The data come as three separate, well-organized data frames. The actors data frame (adata, 2,306 rows) lists everyone who has appeared on the show, including actors, guests, musical guests, and staff members. Each row describes one person and records their gender and the type of role they had. The casts data frame (cdata, 614 rows) covers only cast members, with one row per actor per season: the number of episodes they appeared in that season, the fraction of the season that represents, whether they were a featured player, whether they anchored Weekend Update, and the dates of their first and last episodes. Those two date columns are mostly missing (564 and 597 NAs, respectively). The seasons data frame (sdata, 46 rows) has one row per season, with the year it debuted, the premiere and finale dates, and the total number of episodes.
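The "one row per ..." claims above can be checked directly. The following is a minimal sketch on toy stand-in tables, not the real SNL files; the helper `is_key()` and the toy rows are hypothetical and only illustrate the idea of testing whether a set of columns uniquely identifies each row.

```r
library(dplyr)

# Toy stand-ins for the three tables (hypothetical rows, not the real data)
toy_actors  <- tibble(aid = c("A", "B", "C"), type = c("cast", "cast", "guest"))
toy_casts   <- tibble(aid = c("A", "A", "B"), sid = c(1L, 2L, 1L))
toy_seasons <- tibble(sid = c(1L, 2L), year = c(1975L, 1976L))

# Hypothetical helper: TRUE when the named columns uniquely identify each row
is_key <- function(df, ...) {
  nrow(distinct(df, ...)) == nrow(df)
}

is_key(toy_actors, aid)      # one row per person
is_key(toy_casts, aid, sid)  # one row per actor per season
is_key(toy_casts, aid)       # FALSE here: actor "A" spans two seasons
is_key(toy_seasons, sid)     # one row per season
```

Running the same helper on adata, cdata, and sdata would confirm (or contradict) the description before any join is attempted.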
Tidy Data (as needed)
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
sdata <- sdata %>%
  mutate(across(c(first_epid, last_epid), ymd))
cdata <- cdata %>%
  mutate(across(c(first_epid, last_epid), ymd))
Are there any variables that require mutation to be usable in your analysis stream? For example, do you need to calculate new values in order to graph them? Can string values be represented numerically? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Document your work here.
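As one possible way to document the tidying and mutation steps, here is a small sketch on made-up rows. It assumes the date columns are stored as yyyymmdd numbers and the flag columns as the strings "True" and "False", as the `str()` output suggests; the table `toy_casts` and the column `first_date` are hypothetical names used only for illustration.

```r
library(dplyr)
library(lubridate)

# Made-up rows shaped like the casts table (not the real data)
toy_casts <- tibble(
  aid        = c("A", "B", "C"),
  first_epid = c(19860222, NA, 19751011),
  featured   = c("True", "False", "True")
)

parsed <- toy_casts %>%
  mutate(
    first_date = ymd(first_epid),   # yyyymmdd number -> Date
    featured   = featured == "True" # "True"/"False" string -> logical
  )

# Sanity check: parsing should not introduce new missing values
stopifnot(sum(is.na(parsed$first_date)) == sum(is.na(toy_casts$first_epid)))
```

Comparing the NA counts before and after parsing is a cheap check that no date failed to parse silently.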
Join Data
Be sure to include a sanity check, and double-check that case count is correct!
adata <- adata %>%
  filter(type == "cast") %>%
  select(-url)
member_count <- n_distinct(cdata$aid)
member_count
[1] 156
joined_data <- full_join(adata, cdata, by = "aid") %>%
  select(c(aid, gender, sid, featured, update_anchor))
head(joined_data)
aid gender sid featured update_anchor
1 Kate McKinnon female 37 True False
2 Kate McKinnon female 38 True False
3 Kate McKinnon female 39 False False
4 Kate McKinnon female 40 False False
5 Kate McKinnon female 41 False False
6 Kate McKinnon female 42 False False
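A case-count check for the join could look like the sketch below. It uses toy tables rather than the real data and assumes `aid` is unique in the actors table; under that assumption, a full join should return every casts row once plus one row for each actor with no casts entry. The names `toy_actors`, `toy_casts`, `actors_only`, and `casts_only` are hypothetical.

```r
library(dplyr)

# Toy tables (not the real data); "C" has no cast rows, "D" has no actor row
toy_actors <- tibble(aid = c("A", "B", "C"),
                     gender = c("female", "male", "unknown"))
toy_casts  <- tibble(aid = c("A", "A", "B", "D"),
                     sid = c(1L, 2L, 1L, 3L))

joined <- full_join(toy_actors, toy_casts, by = "aid")

# Rows that have no partner on the other side of the join
actors_only <- anti_join(toy_actors, toy_casts, by = "aid")
casts_only  <- anti_join(toy_casts, toy_actors, by = "aid")

# Assuming aid is unique in toy_actors: every cast row appears once,
# plus one extra row per actor without any cast row
expected_rows <- nrow(toy_casts) + nrow(actors_only)
stopifnot(nrow(joined) == expected_rows)
```

Inspecting `casts_only` is also worthwhile: any rows there are cast entries whose `aid` did not match an actor, which will show up in the joined data with a missing gender.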