library(tidyverse)
library(ggplot2)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Challenge 8 Instructions
Challenge Overview
Today’s challenge is to:
- read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data (as needed, including sanity checks)
- mutate variables as needed (including sanity checks)
- join two or more data sets and analyze some aspect of the joined data
(be sure to only include the category tags for the data you use!)
Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- military marriages ⭐⭐
- faostat ⭐⭐
- railroads ⭐⭐⭐
- fed_rate ⭐⭐⭐
- debt ⭐⭐⭐
- us_hh ⭐⭐⭐⭐
- snl ⭐⭐⭐⭐⭐
<- read.csv("_data/snl_seasons.csv")
seasons_df <- read.csv("_data/snl_actors.csv")
actors_df <- read.csv("_data/snl_casts.csv")
casts_df
# Displaying first few rows of the datasets
head(seasons_df)
sid year first_epid last_epid n_episodes
1 1 1975 19751011 19760731 24
2 2 1976 19760918 19770521 22
3 3 1977 19770924 19780520 20
4 4 1978 19781007 19790526 20
5 5 1979 19791013 19800524 20
6 6 1980 19801115 19810411 13
head(actors_df)
aid url type gender
1 Kate McKinnon /Cast/?KaMc cast female
2 Alex Moffat /Cast/?AlMo cast male
3 Ego Nwodim /Cast/?EgNw cast unknown
4 Chris Redd /Cast/?ChRe cast male
5 Kenan Thompson /Cast/?KeTh cast male
6 Carey Mulligan /Guests/?3677 guest andy
head(casts_df)
aid sid featured first_epid last_epid update_anchor n_episodes
1 A. Whitney Brown 11 True 19860222 NA False 8
2 A. Whitney Brown 12 True NA NA False 20
3 A. Whitney Brown 13 True NA NA False 13
4 A. Whitney Brown 14 True NA NA False 20
5 A. Whitney Brown 15 True NA NA False 20
6 A. Whitney Brown 16 True NA NA False 20
season_fraction
1 0.4444444
2 1.0000000
3 1.0000000
4 1.0000000
5 1.0000000
6 1.0000000
# Dimensions of datasets
dim(seasons_df)
[1] 46 5
dim(actors_df)
[1] 2306 4
dim(casts_df)
[1] 614 8
summary(actors_df)
aid url type gender
Length:2306 Length:2306 Length:2306 Length:2306
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
summary(seasons_df)
sid year first_epid last_epid
Min. : 1.00 Min. :1975 Min. :19751011 Min. :19760731
1st Qu.:12.25 1st Qu.:1986 1st Qu.:19863512 1st Qu.:19872949
Median :23.50 Median :1998 Median :19975926 Median :19985512
Mean :23.50 Mean :1998 Mean :19975965 Mean :19985509
3rd Qu.:34.75 3rd Qu.:2009 3rd Qu.:20088423 3rd Qu.:20098015
Max. :46.00 Max. :2020 Max. :20201003 Max. :20210410
n_episodes
Min. :12.0
1st Qu.:20.0
Median :20.0
Mean :19.7
3rd Qu.:21.0
Max. :24.0
summary(casts_df)
aid sid featured first_epid
Length:614 Min. : 1.00 Length:614 Min. :19770115
Class :character 1st Qu.:15.00 Class :character 1st Qu.:19801215
Mode :character Median :26.00 Mode :character Median :19901110
Mean :25.47 Mean :19909634
3rd Qu.:37.00 3rd Qu.:19957839
Max. :46.00 Max. :20141025
NA's :564
last_epid update_anchor n_episodes season_fraction
Min. :19751011 Length:614 Min. : 1.00 Min. :0.04167
1st Qu.:19850112 Class :character 1st Qu.:19.00 1st Qu.:1.00000
Median :19950225 Mode :character Median :20.00 Median :1.00000
Mean :19944038 Mean :18.73 Mean :0.94827
3rd Qu.:20040117 3rd Qu.:21.00 3rd Qu.:1.00000
Max. :20140201 Max. :24.00 Max. :1.00000
NA's :597
Briefly describe the data
The data frames contain organized and independent data. The actors Data Frame includes a comprehensive list of actors, guests, musical guests, and crew members from the show, with each entry representing an individual actor and providing details about their role type and gender.
The casts data frame focuses on actors who were part of a specific season’s cast. Each entry represents an actor and includes information like their featured status, the dates of their first and last episodes, whether they served as an anchor on weekend update, and the number of episodes they participated in during that season.
The seasons data frame provides information about each season individually. Each entry corresponds to a specific season and includes data such as the airing year, the dates of the first and last episodes, and the total number of episodes.
Tidy Data (as needed)
<- actors_df %>%
actors_df drop_na()
<- seasons_df %>%
seasons_df drop_na()
<- casts_df %>%
casts_df drop_na()
colnames(actors_df)
[1] "aid" "url" "type" "gender"
colnames(casts_df)
[1] "aid" "sid" "featured" "first_epid"
[5] "last_epid" "update_anchor" "n_episodes" "season_fraction"
colnames(seasons_df)
[1] "sid" "year" "first_epid" "last_epid" "n_episodes"
<- actors_df %>%
actors_df mutate(appearances = rowSums(select(., starts_with("ep_")), na.rm = TRUE))
Join Data
<- casts_df %>%
df_casts_actors left_join(actors_df, by = "aid") %>%
select(sid, type, gender, featured, appearances)
<- seasons_df %>%
df_data left_join(df_casts_actors, by = "sid")
colnames(df_data)
[1] "sid" "year" "first_epid" "last_epid" "n_episodes"
[6] "type" "gender" "featured" "appearances"
colnames(df_casts_actors)
[1] "sid" "type" "gender" "featured" "appearances"
<- casts_df %>%
joined_data select(-update_anchor, -season_fraction, -n_episodes) %>%
left_join(seasons_df, by="sid")
joined_data
[1] aid sid featured first_epid.x last_epid.x
[6] year first_epid.y last_epid.y n_episodes
<0 rows> (or 0-length row.names)
colnames(joined_data)
[1] "aid" "sid" "featured" "first_epid.x" "last_epid.x"
[6] "year" "first_epid.y" "last_epid.y" "n_episodes"