library(tidyverse)
library(lubridate)
library(ggplot2)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Challenge 8
Challenge Overview
Today’s challenge is to:
- read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data (as needed, including sanity checks)
- mutate variables as needed (including sanity checks)
- join two or more data sets and analyze some aspect of the joined data
(be sure to only include the category tags for the data you use!)
Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- military marriages ⭐⭐
- faostat ⭐⭐
- railroads ⭐⭐⭐
- fed_rate ⭐⭐⭐
- debt ⭐⭐⭐
- us_hh ⭐⭐⭐⭐
- snl ⭐⭐⭐⭐⭐
<- read.csv("_data/snl_seasons.csv")
seasons_data <- read.csv("_data/snl_actors.csv")
actors_data <- read.csv("_data/snl_casts.csv")
casts_data
# Displaying first few rows of the datasets
head(seasons_data)
sid year first_epid last_epid n_episodes
1 1 1975 19751011 19760731 24
2 2 1976 19760918 19770521 22
3 3 1977 19770924 19780520 20
4 4 1978 19781007 19790526 20
5 5 1979 19791013 19800524 20
6 6 1980 19801115 19810411 13
head(actors_data)
aid url type gender
1 Kate McKinnon /Cast/?KaMc cast female
2 Alex Moffat /Cast/?AlMo cast male
3 Ego Nwodim /Cast/?EgNw cast unknown
4 Chris Redd /Cast/?ChRe cast male
5 Kenan Thompson /Cast/?KeTh cast male
6 Carey Mulligan /Guests/?3677 guest andy
head(casts_data)
aid sid featured first_epid last_epid update_anchor n_episodes
1 A. Whitney Brown 11 True 19860222 NA False 8
2 A. Whitney Brown 12 True NA NA False 20
3 A. Whitney Brown 13 True NA NA False 13
4 A. Whitney Brown 14 True NA NA False 20
5 A. Whitney Brown 15 True NA NA False 20
6 A. Whitney Brown 16 True NA NA False 20
season_fraction
1 0.4444444
2 1.0000000
3 1.0000000
4 1.0000000
5 1.0000000
6 1.0000000
# Dimensions of datasets
dim(seasons_data)
[1] 46 5
dim(actors_data)
[1] 2306 4
dim(casts_data)
[1] 614 8
# Summaries of datasets
summary(seasons_data)
sid year first_epid last_epid
Min. : 1.00 Min. :1975 Min. :19751011 Min. :19760731
1st Qu.:12.25 1st Qu.:1986 1st Qu.:19863512 1st Qu.:19872949
Median :23.50 Median :1998 Median :19975926 Median :19985512
Mean :23.50 Mean :1998 Mean :19975965 Mean :19985509
3rd Qu.:34.75 3rd Qu.:2009 3rd Qu.:20088423 3rd Qu.:20098015
Max. :46.00 Max. :2020 Max. :20201003 Max. :20210410
n_episodes
Min. :12.0
1st Qu.:20.0
Median :20.0
Mean :19.7
3rd Qu.:21.0
Max. :24.0
summary(actors_data)
aid url type gender
Length:2306 Length:2306 Length:2306 Length:2306
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
summary(casts_data)
aid sid featured first_epid
Length:614 Min. : 1.00 Length:614 Min. :19770115
Class :character 1st Qu.:15.00 Class :character 1st Qu.:19801215
Mode :character Median :26.00 Mode :character Median :19901110
Mean :25.47 Mean :19909634
3rd Qu.:37.00 3rd Qu.:19957839
Max. :46.00 Max. :20141025
NA's :564
last_epid update_anchor n_episodes season_fraction
Min. :19751011 Length:614 Min. : 1.00 Min. :0.04167
1st Qu.:19850112 Class :character 1st Qu.:19.00 1st Qu.:1.00000
Median :19950225 Mode :character Median :20.00 Median :1.00000
Mean :19944038 Mean :18.73 Mean :0.94827
3rd Qu.:20040117 3rd Qu.:21.00 3rd Qu.:1.00000
Max. :20140201 Max. :24.00 Max. :1.00000
NA's :597
Briefly describe the data
The data frames contain independent and well-organized data. The actors Data Frame consists of a comprehensive list of actors, guests, musical guests, and crew members who have appeared on the show. Each observation pertains to an individual actor and includes details about their role type and gender.
On the other hand, the casts data frame focuses on actors who were part of the cast during a specific season. Each observation represents an actor and includes information such as their featured status, dates of the first and last episodes they appeared in, whether they served as an anchor on weekend update, and the number of episodes they participated in during that season.
Lastly, the seasons data frame encompasses information about each specific season. Each observation corresponds to a particular season and includes data such as the year it aired, the dates of the first and last episodes, and the total number of episodes.
Tidy Data (as needed)
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
# Mutating seasons dataset
<- seasons_data %>% # (46 x 5)
seasons_data mutate(first_epid = ymd(first_epid), last_epid = ymd(last_epid))
# Mutating casts dataset
<- casts_data %>%
casts_data mutate(first_epid = ymd(first_epid), last_epid = ymd(last_epid))
Are there any variables that require mutation to be usable in your analysis stream? For example, do you need to calculate new values in order to graph them? Can string values be represented numerically? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Document your work here.
Join Data
Be sure to include a sanity check, and double-check that case count is correct!
# Sanity chack : cast members count :154
<-actors_data %>%
actors_data select(-url) %>%
filter(type == "cast")
# Saanity Check: cast members count : 156
%>%
casts_data select(aid)%>%
n_distinct()
[1] 156
<- full_join(actors_data, casts_data, by = "aid") %>%
fullyjoined_data select(c(aid, gender, sid, featured, update_anchor))
Join of both the tables have been completed. Let us perform some analysis.
head(fullyjoined_data)
aid gender sid featured update_anchor
1 Kate McKinnon female 37 True False
2 Kate McKinnon female 38 True False
3 Kate McKinnon female 39 False False
4 Kate McKinnon female 40 False False
5 Kate McKinnon female 41 False False
6 Kate McKinnon female 42 False False
%>%
fullyjoined_data filter(update_anchor == "True") %>%
select(gender)%>%
table() %>%
prop.table()
gender
female male
0.260274 0.739726
# When considering all seasons, approximately 74% of the hosts for weekend update are male.