library(tidyverse)
library(ggplot2)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Challenge 8 Solution
Challenge Overview
Today’s challenge is to:
- read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data (as needed, including sanity checks)
- mutate variables as needed (including sanity checks)
- join two or more data sets and analyze some aspect of the joined data
(be sure to only include the category tags for the data you use!)
Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- military marriages ⭐⭐
- faostat ⭐⭐
- railroads ⭐⭐⭐
- fed_rate ⭐⭐⭐
- debt ⭐⭐⭐
- us_hh ⭐⭐⭐⭐
- snl ⭐⭐⭐⭐⭐
actors <- read.csv("_data/snl_actors.csv")
head(actors)
aid url type gender
1 Kate McKinnon /Cast/?KaMc cast female
2 Alex Moffat /Cast/?AlMo cast male
3 Ego Nwodim /Cast/?EgNw cast unknown
4 Chris Redd /Cast/?ChRe cast male
5 Kenan Thompson /Cast/?KeTh cast male
6 Carey Mulligan /Guests/?3677 guest andy
dim(actors)
[1] 2306 4
seasons <- read.csv("_data/snl_seasons.csv")
head(seasons)
sid year first_epid last_epid n_episodes
1 1 1975 19751011 19760731 24
2 2 1976 19760918 19770521 22
3 3 1977 19770924 19780520 20
4 4 1978 19781007 19790526 20
5 5 1979 19791013 19800524 20
6 6 1980 19801115 19810411 13
dim(seasons)
[1] 46 5
casts <- read.csv("_data/snl_casts.csv")
head(casts, 20)
aid sid featured first_epid last_epid update_anchor n_episodes
1 A. Whitney Brown 11 True 19860222 NA False 8
2 A. Whitney Brown 12 True NA NA False 20
3 A. Whitney Brown 13 True NA NA False 13
4 A. Whitney Brown 14 True NA NA False 20
5 A. Whitney Brown 15 True NA NA False 20
6 A. Whitney Brown 16 True NA NA False 20
7 Alan Zweibel 5 True 19800409 NA False 5
8 Sasheer Zamata 39 True 20140118 NA False 11
9 Sasheer Zamata 40 True NA NA False 21
10 Sasheer Zamata 41 False NA NA False 21
11 Sasheer Zamata 42 False NA NA False 21
12 Bowen Yang 45 True NA NA False 18
13 Bowen Yang 46 True NA NA False 17
14 Fred Wolf 21 True NA NA False 20
15 Fred Wolf 22 True NA 19961019 False 3
16 Casey Wilson 33 True 20080223 NA False 8
17 Casey Wilson 34 True NA NA False 22
18 Kristen Wiig 31 True 20051112 NA False 15
19 Kristen Wiig 32 False NA NA False 20
20 Kristen Wiig 33 False NA NA False 12
season_fraction
1 0.4444444
2 1.0000000
3 1.0000000
4 1.0000000
5 1.0000000
6 1.0000000
7 0.2500000
8 0.5238095
9 1.0000000
10 1.0000000
11 1.0000000
12 1.0000000
13 1.0000000
14 1.0000000
15 0.1500000
16 0.6666667
17 1.0000000
18 0.7894737
19 1.0000000
20 1.0000000
Briefly describe the data
I chose the SNL dataset. It consists of three files:
1. Actors: details of the people who have appeared on SNL. There are 4 columns: ‘aid’ is the name of the person, ‘url’ is the URL associated with that aid, ‘type’ describes how the person is associated with the show (for example, cast or guest), and ‘gender’ is the recorded gender of the person. The file has 2306 rows.
2. Seasons: details of every season of SNL, one row per season. ‘sid’ is the season number, ‘year’ is the year the season started, ‘first_epid’ and ‘last_epid’ identify the first and last episodes of the season, and ‘n_episodes’ is the number of episodes. The file covers 46 seasons.
3. Casts: one row per cast member per season. It records the actor (‘aid’), the season (‘sid’), the ‘featured’ and ‘update_anchor’ flags, and episode details such as ‘n_episodes’ and ‘season_fraction’.
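The episode ids in the seasons preview (for example 19751011) look like air dates written as yyyymmdd. That reading is an assumption rather than something the data documentation confirms, but if it holds, a minimal sketch of converting them to proper dates could look like this. The vector `first_epid` below is a toy copy of three values from the preview, not the full column.

```r
# Toy values copied from the seasons preview (stored as integers, as read.csv would)
first_epid <- c(19751011L, 19760918L, 19770924L)

# Assumption: each id encodes an air date as yyyymmdd
first_air_date <- as.Date(as.character(first_epid), format = "%Y%m%d")

# Inspect the parsed dates and pull out the year as a quick sanity check
first_air_date
format(first_air_date, "%Y")
```

Comparing the extracted year with the ‘year’ column would be one way to check whether the assumption is reasonable.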
Tidy Data (as needed)
I will check each dataset for missing values, since missing values in a join key can produce unexpected extra rows when tables are joined. Only the first_epid and last_epid columns of the casts data contain missing values. We will not join on these columns, so we can leave them as they are.
colSums(is.na(actors))
aid url type gender
0 0 0 0
colSums(is.na(seasons))
sid year first_epid last_epid n_episodes
0 0 0 0 0
colSums(is.na(casts))
aid sid featured first_epid last_epid
0 0 0 564 597
update_anchor n_episodes season_fraction
0 0 0
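To illustrate why missing values in a join key matter, here is a small sketch on made-up tables (`left` and `right` are hypothetical and not part of the SNL data). By default, dplyr joins treat `NA` keys as equal to each other, and the `na_matches` argument changes that behaviour.

```r
library(dplyr)

# Hypothetical toy tables with missing keys on both sides
left  <- data.frame(key = c("a", NA), x = 1:2)
right <- data.frame(key = c("a", NA, NA), y = 1:3)

# Default: NA keys match each other, so the NA row on the left is repeated
default_join <- inner_join(left, right, by = "key")
nrow(default_join)

# na_matches = "never": rows with NA keys are treated as unmatched
strict_join <- inner_join(left, right, by = "key", na_matches = "never")
nrow(strict_join)
```

Since the join keys used here (‘aid’ and ‘sid’) have no missing values, this issue should not affect the joins in this analysis.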
Join Data
I will join the actors, casts and seasons data to plot the number of male and female actors in each season and see whether there is a gender imbalance, as is often claimed about SNL.
actors_casts <- actors %>% inner_join(casts, by = "aid")
actors_casts_seasons <- actors_casts %>% inner_join(seasons, by = "sid")
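A join like the one above can be sanity-checked by confirming that the lookup key is unique and that no rows are lost unexpectedly. The sketch below uses small made-up tables (`actors_toy` and `casts_toy` are hypothetical stand-ins, and "Someone Missing" and "Hypothetical Guest" are invented entries); the same checks could be pointed at the real data frames.

```r
library(dplyr)

# Hypothetical toy stand-ins for the actors and casts tables
actors_toy <- data.frame(
  aid    = c("Kate McKinnon", "Kenan Thompson", "Hypothetical Guest"),
  gender = c("female", "male", "unknown")
)
casts_toy <- data.frame(
  aid = c("Kate McKinnon", "Kate McKinnon", "Kenan Thompson", "Someone Missing"),
  sid = c(45, 46, 46, 46)
)

# 1. The lookup key should be unique, otherwise the join duplicates cast rows
stopifnot(!any(duplicated(actors_toy$aid)))

# 2. Cast rows without a matching actor are silently dropped by inner_join
unmatched <- anti_join(casts_toy, actors_toy, by = "aid")

# 3. Rows kept plus rows dropped should add up to the original cast rows
joined <- inner_join(actors_toy, casts_toy, by = "aid")
stopifnot(nrow(joined) + nrow(unmatched) == nrow(casts_toy))
```

If `unmatched` is not empty on the real data, it is worth looking at those rows before interpreting any counts.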
Data Visualization
The plot shows that the number of female and male cast members differs across seasons, which points to a gender imbalance on SNL.
seasons_gender <- actors_casts_seasons %>%
  select(sid, gender)
seasons_gender <- seasons_gender %>% group_by(sid, gender) %>% summarise(count = n())
# Create the line plot
ggplot(seasons_gender, aes(x = sid, y = count, group = gender, color = gender)) +
  geom_line() +
  labs(title = "Gender of Actors across seasons", x = "Seasons", y = "Count", color = "Gender")
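Raw counts depend on how large the cast was in a given season, so one possible follow-up is to compare shares within each season instead. The sketch below uses invented numbers in the same shape as `seasons_gender` (`toy_counts` is hypothetical and does not reflect the real SNL counts).

```r
library(dplyr)

# Hypothetical counts with the same columns as seasons_gender
toy_counts <- data.frame(
  sid    = c(1, 1, 2, 2),
  gender = c("female", "male", "female", "male"),
  count  = c(3, 6, 4, 4)
)

# Share of each gender within a season: count divided by the season total
toy_share <- toy_counts %>%
  group_by(sid) %>%
  mutate(share = count / sum(count)) %>%
  ungroup()

toy_share
```

Plotting `share` instead of `count` on the y axis would make seasons with different cast sizes easier to compare.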