challenge_1
Reading in data and creating a post
Author

Paarth Tandon

Published

December 26, 2022

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

Read in the Data

# read in the data using readr
rail <- read_csv("_data/railroad_2012_clean_county.csv")
# view a few data points
head(rail)
# A tibble: 6 × 3
  state county               total_employees
  <chr> <chr>                          <dbl>
1 AE    APO                                2
2 AK    ANCHORAGE                          7
3 AK    FAIRBANKS NORTH STAR               2
4 AK    JUNEAU                             3
5 AK    MATANUSKA-SUSITNA                  2
6 AK    SITKA                              1

Reading the data was straightforward with readr's read_csv.
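As a small sketch of how the read step could be made stricter, the column types can be declared explicitly instead of being guessed. The inline sample below is a hypothetical stand-in for the real file, and passing literal data through I() assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical inline sample standing in for the real file at
# "_data/railroad_2012_clean_county.csv" (two rows copied from the output above).
sample_csv <- I("state,county,total_employees\nAE,APO,2\nAK,ANCHORAGE,7\n")

# Declare the expected type of each column so that a malformed file is
# reported as parsing problems instead of silently changing a column's type.
rail_sample <- read_csv(
  sample_csv,
  col_types = cols(
    state = col_character(),
    county = col_character(),
    total_employees = col_double()
  )
)

rail_sample
```

With the real file, the same col_types argument would be passed alongside the file path.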

Describe the data

As head shows, this CSV has three columns: state <chr>, county <chr>, and total_employees <dbl>. Two of the columns hold character data and one holds doubles. Each row gives the number of employees for one state-county pair.

u_states <- rail$state %>%
                unique() %>%
                length()
sprintf('# unique states: %s', u_states)
[1] "# unique states: 53"
u_counties <- rail$county %>%
                unique() %>%
                length()
sprintf('# unique counties: %s', u_counties)
[1] "# unique counties: 1709"
range_emp <- rail$total_employees %>%
                range()
sprintf('range of total employees: [%s, %s]', range_emp[1], range_emp[2])
[1] "range of total employees: [1, 8207]"
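The same counts can also be gathered in a single summarise call. The sketch below uses a hypothetical toy tibble (rail_toy, with made-up employee counts) rather than the real data. It also counts distinct state-county pairs, because a county name such as WASHINGTON can occur in more than one state, so the number of unique county names may be smaller than the number of counties.

```r
library(dplyr)

# Hypothetical toy data with the same three columns as `rail`;
# the employee counts are made up for illustration.
rail_toy <- tibble(
  state = c("AK", "AK", "TX", "VA"),
  county = c("ANCHORAGE", "JUNEAU", "WASHINGTON", "WASHINGTON"),
  total_employees = c(7, 3, 5, 4)
)

rail_summary <- rail_toy %>%
  summarise(
    n_states = n_distinct(state),
    # Unique county *names*: WASHINGTON is counted once here.
    n_county_names = n_distinct(county),
    # Unique (state, county) combinations: WASHINGTON is counted per state.
    n_state_county_pairs = n_distinct(state, county),
    min_employees = min(total_employees),
    max_employees = max(total_employees)
  )

rail_summary
```

Swapping rail_toy for rail would produce the equivalent one-row summary for the full dataset.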

One anomaly in this data is that the state column has 53 unique values, even though the USA has only 50 states. We can find out what is causing this using filter.

fifty_states <- c("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY")

filter(rail, ! state %in% fifty_states)
# A tibble: 3 × 3
  state county        total_employees
  <chr> <chr>                   <dbl>
1 AE    APO                         2
2 AP    APO                         1
3 DC    WASHINGTON DC             279

The obvious row in this output is DC, which represents Washington, D.C., the federal capital district. I was confused about AE and AP. After further research, I found out that AE covers the Armed Forces in Europe, Africa, the Middle East, and Canada, and AP covers the Armed Forces in the Pacific. Mystery solved!
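As an alternative to typing the fifty abbreviations by hand, base R ships a built-in constant, state.abb, that holds them. The sketch below applies it to a hypothetical toy tibble (rail_toy) whose rows are copied from the outputs above; it is one possible shortcut, not the approach used in this post.

```r
library(dplyr)

# Hypothetical toy rows copied from the printed outputs above.
rail_toy <- tibble(
  state = c("AE", "AK", "AP", "DC"),
  county = c("APO", "ANCHORAGE", "APO", "WASHINGTON DC"),
  total_employees = c(2, 7, 1, 279)
)

# state.abb comes with base R (datasets package) and contains the
# two-letter abbreviations of the 50 states, without DC or military codes.
non_states <- rail_toy %>%
  filter(!state %in% state.abb)

non_states
```

Using the built-in constant avoids the risk of a typo or a missing entry in a manually written vector.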

I believe that this data was collected from each railroad station in the United States. It is most likely collected for bookkeeping purposes, but I could see it being used for analysis of which railroad stations need more employees, and which are overstaffed. Of course, answering these questions would require more data to be combined with this dataset.