library(tidyverse)
library(ggplot2)
library(here)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Challenge 8
Challenge Overview
Today’s challenge is to:
- read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data (as needed, including sanity checks)
- mutate variables as needed (including sanity checks)
- join two or more data sets and analyze some aspect of the joined data
(be sure to only include the category tags for the data you use!)
Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- military marriages ⭐⭐
- faostat ⭐⭐
- railroads ⭐⭐⭐
- fed_rate ⭐⭐⭐
- debt ⭐⭐⭐
- us_hh ⭐⭐⭐⭐
- snl ⭐⭐⭐⭐⭐
<- here("posts","_data","snl_actors.csv") %>%
snl_actors read_csv()
<- here("posts","_data","snl_casts.csv") %>%
snl_casts read_csv()
<- here("posts","_data","snl_seasons.csv") %>%
snl_seasons read_csv()
snl_actors
# A tibble: 2,306 × 4
aid url type gender
<chr> <chr> <chr> <chr>
1 Kate McKinnon /Cast/?KaMc cast female
2 Alex Moffat /Cast/?AlMo cast male
3 Ego Nwodim /Cast/?EgNw cast unknown
4 Chris Redd /Cast/?ChRe cast male
5 Kenan Thompson /Cast/?KeTh cast male
6 Carey Mulligan /Guests/?3677 guest andy
7 Marcus Mumford /Guests/?3679 guest male
8 Aidy Bryant /Cast/?AiBr cast female
9 Steve Higgins /Crew/?StHi crew male
10 Mikey Day /Cast/?MiDa cast male
# … with 2,296 more rows
snl_casts
# A tibble: 614 × 8
aid sid featured first_epid last_epid update…¹ n_epi…² seaso…³
<chr> <dbl> <lgl> <dbl> <dbl> <lgl> <dbl> <dbl>
1 A. Whitney Brown 11 TRUE 19860222 NA FALSE 8 0.444
2 A. Whitney Brown 12 TRUE NA NA FALSE 20 1
3 A. Whitney Brown 13 TRUE NA NA FALSE 13 1
4 A. Whitney Brown 14 TRUE NA NA FALSE 20 1
5 A. Whitney Brown 15 TRUE NA NA FALSE 20 1
6 A. Whitney Brown 16 TRUE NA NA FALSE 20 1
7 Alan Zweibel 5 TRUE 19800409 NA FALSE 5 0.25
8 Sasheer Zamata 39 TRUE 20140118 NA FALSE 11 0.524
9 Sasheer Zamata 40 TRUE NA NA FALSE 21 1
10 Sasheer Zamata 41 FALSE NA NA FALSE 21 1
# … with 604 more rows, and abbreviated variable names ¹update_anchor,
# ²n_episodes, ³season_fraction
snl_seasons
# A tibble: 46 × 5
sid year first_epid last_epid n_episodes
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1975 19751011 19760731 24
2 2 1976 19760918 19770521 22
3 3 1977 19770924 19780520 20
4 4 1978 19781007 19790526 20
5 5 1979 19791013 19800524 20
6 6 1980 19801115 19810411 13
7 7 1981 19811003 19820522 20
8 8 1982 19820925 19830514 20
9 9 1983 19831008 19840512 19
10 10 1984 19841006 19850413 17
# … with 36 more rows
Briefly describe the data
There are 3 datasets describing the data about various actors and their appearances in the saturday night live show. snl_actors dataset contains data about various actors ever involved with snl show. snl_casts dataset contains data about various actors appearances in snl. snl_seasons dataset contains details about seasons aired.
There are total 46 seasons with 906 episodes, 2306 actors from year 1975 to 2020.
length(unique(snl_seasons$sid))
[1] 46
length(unique(snl_actors$aid))
[1] 2306
min(snl_seasons$year)
[1] 1975
max(snl_seasons$year)
[1] 2020
sum(snl_seasons$n_episodes)
[1] 906
Analysis
Here our goal is to find the count of various genders in the saturday night live show over the years. To get the year, gender and count we have to combine 3 different tables i.e snl_actors, snl_casts and snl_seasons.
Join Data
Be sure to include a sanity check, and double-check that case count is correct!
From snl_actors table we get the “gender” column and will get get “sid” from snl_casts which can later be used to combine with snl_seasons. We can combine this with snl_casts as they have a common column “aid” and name the new dataframe as actors_casts. We then combine actors_casts with snl_seasons where we get the “year”.
#Merging snl_actors and snl_casts
<- merge(snl_actors, snl_casts, by = 'aid')
actors_casts
#merging actors_casts and snl_seasons based on gender
<- actors_casts %>%
actors_casts_seasons_gender merge(snl_seasons, by = 'sid') %>%
group_by(sid, year, gender) %>%
count(sid, gender)
actors_casts_seasons_gender
# A tibble: 95 × 4
# Groups: sid, year, gender [95]
sid year gender n
<dbl> <dbl> <chr> <int>
1 1 1975 female 3
2 1 1975 male 6
3 2 1976 female 3
4 2 1976 male 5
5 3 1977 female 3
6 3 1977 male 6
7 4 1978 female 3
8 4 1978 male 6
9 5 1979 female 3
10 5 1979 male 12
# … with 85 more rows
#plotting graph
%>%
actors_casts_seasons_gender ggplot(aes(year, n, col = gender)) +
geom_line() +
ylab("count") +
ggtitle("Various gender count over the years")
#merging actors_casts and snl_seasons
<- actors_casts %>%
actors_casts_seasons merge(snl_seasons, by = 'sid') %>%
group_by(sid, year) %>%
count(sid)
actors_casts_seasons
# A tibble: 46 × 3
# Groups: sid, year [46]
sid year n
<dbl> <dbl> <int>
1 1 1975 9
2 2 1976 8
3 3 1977 9
4 4 1978 9
5 5 1979 15
6 6 1980 15
7 7 1981 8
8 8 1982 8
9 9 1983 9
10 10 1984 10
# … with 36 more rows
mean(actors_casts_seasons$n)
[1] 13.34783
min(actors_casts_seasons$n)
[1] 8
max(actors_casts_seasons$n)
[1] 20
#plotting graph
%>%
actors_casts_seasons ggplot(aes(x = year)) +
geom_line(aes(y = n), color = "blue") +
geom_point(aes(y = n), color = "red") +
ylab("count") +
ggtitle("Count of cast over the years")
From the above visualization we can see that there is rise in number of employees in each season over the years. Number of females are consistently less less than the number of males. The second graph shows that the number of actors over the years. We can infer that there has been increase in the actors over years with minimum number of actors per season are 8 recorded in between 1975 and 1985 and maximum are 20 in 2020.