Challenge 8

challenge_8
railroads
snl
faostat
debt
Joining Data
Author

Henry Mitrano

Published

January 29, 2022

library(tidyverse)
library(ggplot2)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. mutate variables as needed (including sanity checks)
  4. join two or more data sets and analyze some aspect of the joined data

(be sure to only include the category tags for the data you use!)

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • military marriages ⭐⭐
  • faostat ⭐⭐
  • railroads ⭐⭐⭐
  • fed_rate ⭐⭐⭐
  • debt ⭐⭐⭐
  • us_hh ⭐⭐⭐⭐
  • snl ⭐⭐⭐⭐⭐
data1 <- read_csv("_data/snl_actors.csv")
data2 <- read_csv("_data/snl_casts.csv")
head(data1)
# A tibble: 6 × 4
  aid            url           type  gender 
  <chr>          <chr>         <chr> <chr>  
1 Kate McKinnon  /Cast/?KaMc   cast  female 
2 Alex Moffat    /Cast/?AlMo   cast  male   
3 Ego Nwodim     /Cast/?EgNw   cast  unknown
4 Chris Redd     /Cast/?ChRe   cast  male   
5 Kenan Thompson /Cast/?KeTh   cast  male   
6 Carey Mulligan /Guests/?3677 guest andy   
head(data2)
# A tibble: 6 × 8
  aid                sid featured first_epid last_epid update_…¹ n_epi…² seaso…³
  <chr>            <dbl> <lgl>         <dbl>     <dbl> <lgl>       <dbl>   <dbl>
1 A. Whitney Brown    11 TRUE       19860222        NA FALSE           8   0.444
2 A. Whitney Brown    12 TRUE             NA        NA FALSE          20   1    
3 A. Whitney Brown    13 TRUE             NA        NA FALSE          13   1    
4 A. Whitney Brown    14 TRUE             NA        NA FALSE          20   1    
5 A. Whitney Brown    15 TRUE             NA        NA FALSE          20   1    
6 A. Whitney Brown    16 TRUE             NA        NA FALSE          20   1    
# … with abbreviated variable names ¹​update_anchor, ²​n_episodes,
#   ³​season_fraction

Briefly describe the data

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

view(data1)
view(data2)

joined<-bind_rows(data1, data2)
view(joined)

Above, we run views on each of our two data sets so we can look at them in their entirety, and understand how they compare and try to pinpoint a way in which we can join them. We also run the default “bind_rows” command for collecting multiple data sets and morphing them into one.

When we do this we see that it has effectively failed. All of our fields from each table are still ther, now conjoined together into one table, but unfortunately we’ve lost every value from the SECOND table. Note that all the columns from dataset data1 (url, type, gender), the first table, still have all their previous values intact. This is because when running a bind_rows, it defaults to maintaining the data from the first argument passed, first and formost.

We need to merge our sets in a way that ensures we aren’t losing any information from either of our inputs. We can do that by using function “merge”, and passing in a common argument, a column that both original data sets share, to serve as the reference point of combining the data. We can think of this kind of like a “join” in SQL programming- we are combining 2 (or more) tables based on a single related key between them.

merged <- merge(data1,data2,by="aid")

view(merged)

Now, we have a proper combination of the two sets that still includes all original data from both. There are no values that need mutating or combination, but we do have a fair few of NA values. However, there appears to be some data that is missing, and not every cast member has a first_ep or last_ep value. We can leave it so that the data is a littl more comprehensive for those cast members that do, but it’s something I’d look at removing. ## Join Data

Be sure to include a sanity check, and double-check that case count is correct!