Challenge 1 Solution - Railroad Employment

challenge_1
railroads
Linus Jen
Reading in data and creating a post for US State/County Railroad Employment Data in 2012
Author

Linus Jen

Published

May 30, 2023

Code
library(tidyverse)
library(readxl)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a dataset, and

  2. describe the dataset using both words and any supporting information (e.g., tables, etc)

Read in the Data

For this homework, I decided to look at the StateCounty2012.xls dataset. I opened the file and noticed that the first three rows are part of the header, so I skipped those lines using the skip argument in the read_xls() function.

Code
# Import data
data <- read_xls("_data/StateCounty2012.xls", skip=3)

Data preprocessing

To ensure that the data was properly loaded in, I also looked at the top 10 rows of the dataset. I also noticed that there were only 3 columns in the .xls file, but 5 columns in our loaded dataset. To explore further, I applied the is.na() function to all columns, suspecting that the second and fourth columns were loaded improperly and were fully blank. After confirming this, I removed those columns from the dataset. Lastly, I noticed that there were some totals included in our rows, and these occurred when the COUNTY value was empty. We need to be mindful not to remove CANADA though as an observation, even though it doesn’t have a COUNTY value!

Code
# Check head
print(head(data, 10))
# A tibble: 10 × 5
   STATE     ...2  COUNTY               ...4  TOTAL
   <chr>     <lgl> <chr>                <lgl> <dbl>
 1 AE        NA    APO                  NA        2
 2 AE Total1 NA    <NA>                 NA        2
 3 AK        NA    ANCHORAGE            NA        7
 4 AK        NA    FAIRBANKS NORTH STAR NA        2
 5 AK        NA    JUNEAU               NA        3
 6 AK        NA    MATANUSKA-SUSITNA    NA        2
 7 AK        NA    SITKA                NA        1
 8 AK        NA    SKAGWAY MUNICIPALITY NA       88
 9 AK Total  NA    <NA>                 NA      103
10 AL        NA    AUTAUGA              NA      102
Code
# Find dimensions
print(dim(data))
[1] 2990    5
Code
# Check if columns are completely blank
sapply(data, function(x) all(is.na(x)))  # We see that columns 2 and 4 are empty!
 STATE   ...2 COUNTY   ...4  TOTAL 
 FALSE   TRUE  FALSE   TRUE  FALSE 
Code
# Check where `COUNTY` is blank
data %>% filter(is.na(COUNTY))  # Notice that when `COUNTY` is empty, values are totals are not important

Thus, the following chunk of code cleans our dataset to only have US states, counties, and the total railroad employment.

Code
# Remove the 2 columns and when `COUNTY` is empty
data_clean <- data %>%
  select("STATE", "COUNTY", "TOTAL") %>%
  filter(!is.na(COUNTY) | (STATE == "CANADA"))

# Check dimensions
dim(data_clean)
[1] 2931    3

Describe the data

Overview / Summary

From the Excel file title, this dataset represents the total railroad employment by state and county for 2012. This data is on a county level, and most likely counts the number of railroad employees for each county, with summations for each state also included in the original .xls file (though it was removed for this dataset). This data might prove valuable if combined with other datasets, such as seeing how the employment per county or state might change over time (if historical data was gathered), checking for relationships being railroad tracks built or existing in each county and the number of employees in 2012 (if data was collected about the railroad track mileage itself), or how railroad employment per county relates back to state statistics, such as population, land size of the state, or unemployment rates of the state (assuming this data is gathered).

From the findings below, we see that there are a total of 256,094 reported railroad employees in 2012. There are a total of 54 unique “counties” that include 2 US territories, Canada, and the District of Columbia. Overall, there seems to be a skew in the distribution of where the railroad employees are, though further geographical analysis needs to be done to make any concrete assumptions.

Data Exploration

This section will provide a deeper dive into the data, and the exact commands used to get the information for the introduction and overview section above.

First, let’s look at the head of the data and the various data types.

Code
# Quick glance of data
head(data_clean, 10)
Code
glimpse(data_clean)
Rows: 2,931
Columns: 3
$ STATE  <chr> "AE", "AK", "AK", "AK", "AK", "AK", "AK", "AL", "AL", "AL", "AL…
$ COUNTY <chr> "APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU", "MATANUSK…
$ TOTAL  <dbl> 2, 7, 2, 3, 2, 1, 88, 102, 143, 1, 25, 154, 13, 29, 45, 13, 9, …
Code
# Check for missingness
colSums(is.na(data_clean))
 STATE COUNTY  TOTAL 
     0      1      0 
Code
# Lastly, check for total number of railroad employees reported
sum(data_clean$TOTAL)
[1] 256094

From the output above, we see that there are 2,931 observations (rows) with 3 columns of the state name, the county name, and the total number of people employed in that county/state combination. Additionally, we see that there is 1 missing value for the COUNTY column, which is a result of all people employed in Canada being aggregated into one row. Lastly, there were a total of 256,094 total reported railroad employees in 2012.

Next, let’s look at what states are included in our dataset and the number of counties make up each state.

Code
# Number of states included in dataset
print(paste0("Number of unique states: ", length(unique(data_clean$STATE))))
[1] "Number of unique states: 54"
Code
# Check if there are counties with no employes
data_clean %>% 
  filter(TOTAL == 0) %>%
  dim()
[1] 0 3
Code
# Check number of counties per state
num_counties <- table(data_clean$STATE)
print("Number of counties ")
[1] "Number of counties "
Code
num_counties[order(num_counties, decreasing=TRUE)]

    TX     GA     KY     MO     IL     IA     KS     NC     IN     VA     TN 
   221    152    119    115    103     99     95     94     92     92     91 
    NE     OH     MN     MI     MS     OK     AR     WI     AL     FL     PA 
    89     88     86     78     78     73     72     69     67     67     65 
    LA     NY     CO     CA     MT     WV     SD     ND     SC     WA     ID 
    63     61     57     55     53     53     52     49     46     39     36 
    OR     NM     UT     MD     WY     NJ     ME     AZ     VT     MA     NV 
    33     29     25     24     22     21     16     15     14     12     12 
    NH     CT     AK     RI     DE     HI     AE     AP CANADA     DC 
    10      8      6      5      3      3      1      1      1      1 

In the print statement, we see that there are 54 unique “states”. Upon further analysis, we see that this is because Canada is counted as a state in addition to the 50 US states, and several US territories and the US capital are included: “AE” is the armed forces Africa, “AP” is the armed forces in the Pacific, and DC is the District of Columbia. Additionally, we see that ONLY counties with railroad employees are included in this list, so we cannot assume that all counties are included in this dataset.

From our table, We see that Texas, Georgia, and Kentucky are the top three states that have the most counties with railroad employees.

Next, we’ll look at the railroad employment for each state.

Code
state_agg <- data_clean %>% 
  group_by(STATE) %>%
  summarise(total_employment = sum(TOTAL)) %>%
  arrange(desc(total_employment))
state_agg

From the table above, we see that Texas, Illinois, and New York make up the top three states with the most people employed working on railroads, while Hawaii, army forces in Africa, and army forces in the Pacific make up the states with the least number of railroad employees. This isn’t shocking. One hypothesis could be that many of the larger and more central states have more railroad employees. This is probably because products need to move across the country, so these states need more railroad tracks (and thus employees to build and maintain these tracks) for goods to move across the country.

Lastly, we’ll look at a simple distribution of the employee data across counties and states.

Code
# 5 number summary for county level
summary(data_clean$TOTAL)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    7.00   21.00   87.37   65.00 8207.00 
Code
# 5 number summary for state level
summary(state_agg$total_employment)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1    1652    3353    4742    5936   19839 

On the county level, the data is clearly skewed right, as the median of 21 is far below the mean of 87.37. We see that there’s a max of 8,207 railroad employees in one country, which is pretty large. However, there’s also a fair amount of counties with very little employees, as about 25% of counties have 7 or less railroad employees.

On the state level, the numbers are far greater for number of railroad employees. The distribution is still skewed right, with the mean of 3,353 still being less than the mean of 4,742. There’s a minimum of 1 railroad employee in a state, and a maximum of 19,839 railroad employees (in Texas, as seen in the previous table / code chunk).