library(tidyverse)
library(ggplot2)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Challenge 8
Read in data
Read in one (or more) of the following datasets, using the correct R package and command. - snl ⭐⭐⭐⭐⭐
<- read.csv("_data/snl_actors.csv")
actors <- read.csv("_data/snl_seasons.csv")
seasons <- read.csv("_data/snl_casts.csv") casts
Briefly describe the data
show dimension of actors
dim(actors)
[1] 2306 4
Show the columns of actors
colnames(actors)
[1] "aid" "url" "type" "gender"
head(actors)
aid url type gender
1 Kate McKinnon /Cast/?KaMc cast female
2 Alex Moffat /Cast/?AlMo cast male
3 Ego Nwodim /Cast/?EgNw cast unknown
4 Chris Redd /Cast/?ChRe cast male
5 Kenan Thompson /Cast/?KeTh cast male
6 Carey Mulligan /Guests/?3677 guest andy
By examining the dimensions of these dataframes, we can observe that “Saturday Night Live” spanned across 46 seasons and involved a total of 2306 actors.The “actors” dataset focuses on specific information related to the actors themselves.
show dimension of seasons
dim(seasons)
[1] 46 5
Show the columns of seasons
colnames(seasons)
[1] "sid" "year" "first_epid" "last_epid" "n_episodes"
head(seasons)
sid year first_epid last_epid n_episodes
1 1 1975 19751011 19760731 24
2 2 1976 19760918 19770521 22
3 3 1977 19770924 19780520 20
4 4 1978 19781007 19790526 20
5 5 1979 19791013 19800524 20
6 6 1980 19801115 19810411 13
For seasons
dataset,the “seasons” dataset provides comprehensive details about each season of the show, including the premiere and finale episodes, the initial airing year of each season, and the overall number of episodes in each season`
show dimension of casts
dim(casts)
[1] 614 8
Show the columns of casts
colnames(casts)
[1] "aid" "sid" "featured" "first_epid"
[5] "last_epid" "update_anchor" "n_episodes" "season_fraction"
head(casts)
aid sid featured first_epid last_epid update_anchor n_episodes
1 A. Whitney Brown 11 True 19860222 NA False 8
2 A. Whitney Brown 12 True NA NA False 20
3 A. Whitney Brown 13 True NA NA False 13
4 A. Whitney Brown 14 True NA NA False 20
5 A. Whitney Brown 15 True NA NA False 20
6 A. Whitney Brown 16 True NA NA False 20
season_fraction
1 0.4444444
2 1.0000000
3 1.0000000
4 1.0000000
5 1.0000000
6 1.0000000
The “casts” dataset offers insights into the cast members’ participation on the show, including details such as their notable episodes, appearance counts, and other relevant data. ## Tidy data Check if own missing values.
<- colSums(is.na(actors))
m_actors m_actors
aid url type gender
0 0 0 0
<- colSums(is.na(seasons))
m_seasons m_seasons
sid year first_epid last_epid n_episodes
0 0 0 0 0
<- colSums(is.na(casts))
m_casts m_casts
aid sid featured first_epid last_epid
0 0 0 564 597
update_anchor n_episodes season_fraction
0 0 0
actors
and seasons
datasets have no missing values, casts
have missing values for first_epid
and last_epid
, but they are not used here, so I removed these two variables.
<- subset(casts,select = -c(first_epid,last_epid))
casts colnames(casts)
[1] "aid" "sid" "featured" "update_anchor"
[5] "n_episodes" "season_fraction"
Join Data
Inner join the 3 datasets according to aid
and sid
<- inner_join(actors,casts, by="aid")
a_c <- inner_join(a_c,seasons, by="sid")
a_c_s colnames(a_c_s)
[1] "aid" "url" "type" "gender"
[5] "sid" "featured" "update_anchor" "n_episodes.x"
[9] "season_fraction" "year" "first_epid" "last_epid"
[13] "n_episodes.y"
head(a_c_s)
aid url type gender sid featured update_anchor n_episodes.x
1 Kate McKinnon /Cast/?KaMc cast female 37 True False 5
2 Kate McKinnon /Cast/?KaMc cast female 38 True False 21
3 Kate McKinnon /Cast/?KaMc cast female 39 False False 21
4 Kate McKinnon /Cast/?KaMc cast female 40 False False 21
5 Kate McKinnon /Cast/?KaMc cast female 41 False False 21
6 Kate McKinnon /Cast/?KaMc cast female 42 False False 21
season_fraction year first_epid last_epid n_episodes.y
1 0.2272727 2011 20110924 20120519 22
2 1.0000000 2012 20120915 20130518 21
3 1.0000000 2013 20130928 20140517 21
4 1.0000000 2014 20140927 20150516 21
5 1.0000000 2015 20151003 20160521 21
6 1.0000000 2016 20161001 20170520 21
Check if the new generated dataframe has missing value.
<- colSums(is.na(a_c_s))
m_n m_n
aid url type gender sid
0 0 0 0 0
featured update_anchor n_episodes.x season_fraction year
0 0 0 0 0
first_epid last_epid n_episodes.y
0 0 0
Perfect!The new dataframe doesn’t has any missing value.
Next,let’s take a look at the distribution of year
,season_fraction
and gender
of actors!
Check the unique value of gender
print(unique(a_c_s$gender))
[1] "female" "male" "unknown"
ggplot(a_c_s, aes(fill=gender, y=season_fraction, x=year)) +
geom_bar(position="stack", stat="identity")
labs(x = "Year", y = "season_fraction of Episodes") +
theme_minimal()
NULL