Challenge 2: Railroad Wrangling

challenge_2
Daniel Hannon
railroads
Author

Daniel Hannon

Published

March 7, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE)

Reading in the Data

Code
railroad_data <- readxl::read_excel("_data/StateCounty2012.xls")
slice(railroad_data, 1)
# A tibble: 1 × 6
  `TOTAL RAILROAD EMPLOYMENT BY STATE AND COUNTY` ...2  ...3  ...4  ...5  ...6 
  <chr>                                           <chr> <lgl> <chr> <lgl> <chr>
1 CALENDAR YEAR 2012                              <NA>  NA    <NA>  NA    <NA> 

Looking at the original spreadsheet, we can see based on the title that this data is about the number of railroad employees in the United states, categorized by state and county for the year 2012. The data does not start until line 3 and columns 2 and 4 contain no information, so we can read in the data, skip the first two rows and select out the empty columns.

Code
railroad_data <-readxl::read_excel("_data/StateCounty2012.xls", skip= 3) %>%
                select(STATE, COUNTY, TOTAL)
head(railroad_data)
# A tibble: 6 × 3
  STATE     COUNTY               TOTAL
  <chr>     <chr>                <dbl>
1 AE        APO                      2
2 AE Total1 <NA>                     2
3 AK        ANCHORAGE                7
4 AK        FAIRBANKS NORTH STAR     2
5 AK        JUNEAU                   3
6 AK        MATANUSKA-SUSITNA        2

We can see that each observation is the number of railway workers for a given county in 2012. However, there also contains observations about the total for each state. These observations contain an N/A in the county space and use the word Total in the state column. We can filter these out to have a data frame for all the state totals and one for all the unique counties.

Code
#A data frame of just totals
state_totals <- railroad_data %>%
                filter(grepl("Total", STATE)) %>%
                select(!COUNTY)

#A data frame of no totals 
removed_state_totals <- railroad_data %>%
                        filter(!grepl("Total", STATE))

state_totals
# A tibble: 54 × 2
   STATE     TOTAL
   <chr>     <dbl>
 1 AE Total1     2
 2 AK Total    103
 3 AL Total   4257
 4 AP Total1     1
 5 AR Total   3871
 6 AZ Total   3153
 7 CA Total  13137
 8 CO Total   3650
 9 CT Total   2592
10 DC Total    279
# … with 44 more rows
Code
head(removed_state_totals)
# A tibble: 6 × 3
  STATE COUNTY               TOTAL
  <chr> <chr>                <dbl>
1 AE    APO                      2
2 AK    ANCHORAGE                7
3 AK    FAIRBANKS NORTH STAR     2
4 AK    JUNEAU                   3
5 AK    MATANUSKA-SUSITNA        2
6 AK    SITKA                    1

The State totals should only have 50 observations, because there are only 50 states, but there is are 54. This is because Armed Forces Pacific, Armed Forces Europe, DC and the Grand Total are all saved in the frame. For the purposes of gaining insights about state totals specifically, these will be filtered out of the state_totals tibble before descriptive statistics are done.

Code
state_totals <- state_totals %>%
                filter(!STATE %in% c("AE Total1", "AP Total1", "DC Total", "Grand Total"))

Summary Statistics

State level

First lets look at the statistics of the state totals:

Code
summarize(state_totals, Mean = mean(TOTAL), Median = median(TOTAL), Minimum =min(TOTAL), Maximum = max(TOTAL), Std = sd(TOTAL), IQR= IQR(TOTAL))
# A tibble: 1 × 6
   Mean Median Minimum Maximum   Std   IQR
  <dbl>  <dbl>   <dbl>   <dbl> <dbl> <dbl>
1  5103  3712.       4   19839 4777.  4953

The totals for each state vary heavily, with a standard deviation of over 4.7 thousand compared to the mean of 5.1 thousand. It also looks like the values are skewed with a few very large states pulling the average up. This can be seen by the mean being almost half a standard deviation higher than the median, and can also be seen by the range of values going from 4 to 19839. In order to try to get a closer look at the county level, we can summarize over our other tibble.

County Level

Code
summarize(removed_state_totals, Mean = mean(TOTAL, na.rm =TRUE), Median = median(TOTAL, na.rm =TRUE), Minimum =min(TOTAL, na.rm =TRUE), Maximum = max(TOTAL, na.rm =TRUE), Std = sd(TOTAL, na.rm =TRUE), IQR= IQR(TOTAL, na.rm =TRUE))
# A tibble: 1 × 6
   Mean Median Minimum Maximum   Std   IQR
  <dbl>  <dbl>   <dbl>   <dbl> <dbl> <dbl>
1  87.4     21       1    8207  284.    58

Much like the State Total data, this data is highly varied and skewed. Again we can see that the mean is much higher than the mean, but in the case of this data the IQR is much smaller than the Standard Deviation. This also suggests that there are a few observations much higher than the average, and that these are skewing the data.

Code
removed_state_totals %>% 
  group_by(STATE)%>%
  select(TOTAL)%>%
  summarize_all(mean, na.rm= TRUE) %>%
  summarize(Mean = mean(TOTAL, na.rm =TRUE), Median = median(TOTAL, na.rm =TRUE), Minimum =min(TOTAL, na.rm =TRUE), Maximum = max(TOTAL, na.rm =TRUE), Std = sd(TOTAL, na.rm =TRUE), IQR= IQR(TOTAL, na.rm =TRUE))
Adding missing grouping variables: `STATE`
# A tibble: 1 × 6
   Mean Median Minimum Maximum   Std   IQR
  <dbl>  <dbl>   <dbl>   <dbl> <dbl> <dbl>
1  116.   63.8       1     662  128.  89.3

The data above shows the descriptive statistics over the mean number of railroad employees per county per state. We can see that the range descreases, as the highest average number of workers per county in a state is 662, while the lowest is still 1. We can still see the skew from before, but it is less impactful with a lower standard deviation and a smaller gap between the median and mean. So while the average county has 87 workers according to the last summary, we can see that the average mean number of employees per county state wide is 115.