Challenge 1 - Railroad Employment

challenge_1
railroads
Justine Shakespeare
Author

Justine Shakespeare

Published

February 15, 2023

Code
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Reading in the Data

First we’ll load the tidyverse and readxl packages and use the read_excel() function to read in the dataset. When I first loaded this dataset I noted that the data started on the third row, so I’ll specify that R should skip the first two lines when I import the data.

Code
library(tidyverse)
library(readxl)

RailroadEmployment <- read_excel("_data/StateCounty2012.xls", skip=2)

Data Description

This dataset contains information about the number of railroad employees by county and state in 2012. In these data, rows are individual counties and columns are the state, country, and number of employees. It is not clear from the dataset itself where this data is from/who published it.

Data Cleaning

Before summarizing the data, we’ll tidy it up a bit. Using the head() function we can see that the dataset includes 5 columns, but data is stored in only three.

Code
head(RailroadEmployment)
# A tibble: 6 × 5
  STATE     ...2  COUNTY               ...4  TOTAL
  <chr>     <lgl> <chr>                <lgl> <dbl>
1 AE        NA    APO                  NA        2
2 AE Total1 NA    <NA>                 NA        2
3 AK        NA    ANCHORAGE            NA        7
4 AK        NA    FAIRBANKS NORTH STAR NA        2
5 AK        NA    JUNEAU               NA        3
6 AK        NA    MATANUSKA-SUSITNA    NA        2

We can use the select() function to isolate only the columns with data and assign these to a new dataframe in R. We’ll use head() to check our work.

Code
RailroadEmployment <- select(RailroadEmployment, STATE, COUNTY, TOTAL)

head(RailroadEmployment)
# A tibble: 6 × 3
  STATE     COUNTY               TOTAL
  <chr>     <chr>                <dbl>
1 AE        APO                      2
2 AE Total1 <NA>                     2
3 AK        ANCHORAGE                7
4 AK        FAIRBANKS NORTH STAR     2
5 AK        JUNEAU                   3
6 AK        MATANUSKA-SUSITNA        2

Using the n_distinct() function we can see that there are 111 distinct values in the STATE column. Since we know there are 50 states in the US, this indicates there are some additional values that we may not have expected or want to keep for analysis.

Code
RailroadEmployment %>%
  select(STATE)%>%
  n_distinct(.)
[1] 111

We can use piping and the functions group_by(), summarize(), and print() to inspect the values of the STATE column more closely to get an idea of what is stored here.

Code
RailroadEmployment %>% 
  group_by(STATE) %>% 
  summarize(mean(TOTAL)) %>% 
  print(n=111)
# A tibble: 111 × 2
    STATE                                               `mean(TOTAL)`
    <chr>                                                       <dbl>
  1 1  Military designation.                                    NA   
  2 AE                                                           2   
  3 AE Total1                                                    2   
  4 AK                                                          17.2 
  5 AK Total                                                   103   
  6 AL                                                          63.5 
  7 AL Total                                                  4257   
  8 AP                                                           1   
  9 AP Total1                                                    1   
 10 AR                                                          53.8 
 11 AR Total                                                  3871   
 12 AZ                                                         210.  
 13 AZ Total                                                  3153   
 14 CA                                                         239.  
 15 CA Total                                                 13137   
 16 CANADA                                                     662   
 17 CO                                                          64.0 
 18 CO Total                                                  3650   
 19 CT                                                         324   
 20 CT Total                                                  2592   
 21 DC                                                         279   
 22 DC Total                                                   279   
 23 DE                                                         498.  
 24 DE Total                                                  1495   
 25 FL                                                         111.  
 26 FL Total                                                  7419   
 27 GA                                                          56.6 
 28 GA Total                                                  8605   
 29 Grand Total                                             255432   
 30 HI                                                           1.33
 31 HI Total                                                     4   
 32 IA                                                          40.6 
 33 IA Total                                                  4019   
 34 ID                                                          43.4 
 35 ID Total                                                  1563   
 36 IL                                                         186.  
 37 IL Total                                                 19131   
 38 IN                                                          92.8 
 39 IN Total                                                  8537   
 40 KS                                                          64.1 
 41 KS Total                                                  6092   
 42 KY                                                          40.4 
 43 KY Total                                                  4811   
 44 LA                                                          62.1 
 45 LA Total                                                  3915   
 46 MA                                                         282.  
 47 MA Total                                                  3379   
 48 MD                                                         196.  
 49 MD Total                                                  4709   
 50 ME                                                          40.9 
 51 ME Total                                                   654   
 52 MI                                                          50.4 
 53 MI Total                                                  3932   
 54 MN                                                          63.6 
 55 MN Total                                                  5467   
 56 MO                                                          73.2 
 57 MO Total                                                  8419   
 58 MS                                                          27.1 
 59 MS Total                                                  2111   
 60 MT                                                          62.8 
 61 MT Total                                                  3327   
 62 NC                                                          33.4 
 63 NC Total                                                  3143   
 64 ND                                                          45.0 
 65 ND Total                                                  2204   
 66 NE                                                         148.  
 67 NE Total                                                 13176   
 68 NH                                                          39.3 
 69 NH Total                                                   393   
 70 NJ                                                         397.  
 71 NJ Total                                                  8329   
 72 NM                                                          67.5 
 73 NM Total                                                  1958   
 74 NOTE:  Excludes 2,896 employees without an address.         NA   
 75 NV                                                          62.2 
 76 NV Total                                                   746   
 77 NY                                                         280.  
 78 NY Total                                                 17050   
 79 OH                                                         103.  
 80 OH Total                                                  9056   
 81 OK                                                          31.8 
 82 OK Total                                                  2318   
 83 OR                                                          70.4 
 84 OR Total                                                  2322   
 85 PA                                                         196.  
 86 PA Total                                                 12769   
 87 RI                                                          97.4 
 88 RI Total                                                   487   
 89 SC                                                          49.9 
 90 SC Total                                                  2296   
 91 SD                                                          18.2 
 92 SD Total                                                   949   
 93 TN                                                          54.4 
 94 TN Total                                                  4952   
 95 TX                                                          89.8 
 96 TX Total                                                 19839   
 97 UT                                                          76.7 
 98 UT Total                                                  1917   
 99 VA                                                          82.1 
100 VA Total                                                  7551   
101 VT                                                          18.5 
102 VT Total                                                   259   
103 WA                                                         134.  
104 WA Total                                                  5222   
105 WI                                                          54.7 
106 WI Total                                                  3773   
107 WV                                                          60.6 
108 WV Total                                                  3213   
109 WY                                                         131.  
110 WY Total                                                  2876   
111 <NA>                                                        NA   

There are a few issues we see here: It looks as if there are Total rows in this dataset - that is, there are rows that add up the total for each state. We don’t need those in our data, since we can use R functions to total employees by state. It also looks as if CANADA is listed as a state. Assuming we are only interested in the US, let’s remove that as well.

We can use the filter() and grepl() function to remove both the total rows and Canada from this dataset.

Code
RailroadEmployment <- RailroadEmployment %>% 
  filter(!grepl("Total", STATE))

RailroadEmployment <- RailroadEmployment %>% 
  filter(!grepl("CANADA", STATE))

Finally, we noticed we had some NAs in both the STATE and the TOTAL columns. We can use the na.omit() function to remove these from the dataset.

Code
RailroadEmployment <- print(na.omit(RailroadEmployment))
# A tibble: 2,930 × 3
   STATE COUNTY               TOTAL
   <chr> <chr>                <dbl>
 1 AE    APO                      2
 2 AK    ANCHORAGE                7
 3 AK    FAIRBANKS NORTH STAR     2
 4 AK    JUNEAU                   3
 5 AK    MATANUSKA-SUSITNA        2
 6 AK    SITKA                    1
 7 AK    SKAGWAY MUNICIPALITY    88
 8 AL    AUTAUGA                102
 9 AL    BALDWIN                143
10 AL    BARBOUR                  1
# … with 2,920 more rows

Let’s run the same code chunk from earlier to check that our edits have been made and the data is clean.

Code
RailroadEmployment %>% 
  group_by(STATE) %>% 
  summarize(mean(TOTAL)) %>% 
  print(n=111)
# A tibble: 53 × 2
   STATE `mean(TOTAL)`
   <chr>         <dbl>
 1 AE             2   
 2 AK            17.2 
 3 AL            63.5 
 4 AP             1   
 5 AR            53.8 
 6 AZ           210.  
 7 CA           239.  
 8 CO            64.0 
 9 CT           324   
10 DC           279   
11 DE           498.  
12 FL           111.  
13 GA            56.6 
14 HI             1.33
15 IA            40.6 
16 ID            43.4 
17 IL           186.  
18 IN            92.8 
19 KS            64.1 
20 KY            40.4 
21 LA            62.1 
22 MA           282.  
23 MD           196.  
24 ME            40.9 
25 MI            50.4 
26 MN            63.6 
27 MO            73.2 
28 MS            27.1 
29 MT            62.8 
30 NC            33.4 
31 ND            45.0 
32 NE           148.  
33 NH            39.3 
34 NJ           397.  
35 NM            67.5 
36 NV            62.2 
37 NY           280.  
38 OH           103.  
39 OK            31.8 
40 OR            70.4 
41 PA           196.  
42 RI            97.4 
43 SC            49.9 
44 SD            18.2 
45 TN            54.4 
46 TX            89.8 
47 UT            76.7 
48 VA            82.1 
49 VT            18.5 
50 WA           134.  
51 WI            54.7 
52 WV            60.6 
53 WY           131.  

This is more like it! Now we have only 53 distinct values listed in the STATE columns: the 50 states, Washington DC (listed as DC), and two overseas military designations (listed as AE and AP)

Let’s move on to summarizing the data.


Data Summary

Using the dim() function we can see below that the dataframe has 2,930 rows/observations and 3 columns/variables.

Code
dim(RailroadEmployment)
[1] 2930    3

We can use the head() function to see more information about this dataset, including the column names, and the type of data stored in each column. The first two columns, STATE and COUNTY, are character variables, while the last column, TOTAL, is a double variable, which means it contains numeric data.

Code
head(RailroadEmployment)
# A tibble: 6 × 3
  STATE COUNTY               TOTAL
  <chr> <chr>                <dbl>
1 AE    APO                      2
2 AK    ANCHORAGE                7
3 AK    FAIRBANKS NORTH STAR     2
4 AK    JUNEAU                   3
5 AK    MATANUSKA-SUSITNA        2
6 AK    SITKA                    1

Using the summary() function we can see that the minimum number of employees in county is 1, the maximum is 8207, the median is 21 employees, and the mean is a little over 87 employees.

Code
summary(RailroadEmployment)
    STATE              COUNTY              TOTAL        
 Length:2930        Length:2930        Min.   :   1.00  
 Class :character   Class :character   1st Qu.:   7.00  
 Mode  :character   Mode  :character   Median :  21.00  
                                       Mean   :  87.18  
                                       3rd Qu.:  65.00  
                                       Max.   :8207.00  

Let’s take a closer look at this dataset to get a sense of which county and which state each have the most railraod employees. We can use piping and the arrange() function in R to see which counties have the most railroad employees:

Code
RailroadEmployment %>% 
  arrange(desc(TOTAL))
# A tibble: 2,930 × 3
   STATE COUNTY           TOTAL
   <chr> <chr>            <dbl>
 1 IL    COOK              8207
 2 TX    TARRANT           4235
 3 NE    DOUGLAS           3797
 4 NY    SUFFOLK           3685
 5 VA    INDEPENDENT CITY  3249
 6 FL    DUVAL             3073
 7 CA    SAN BERNARDINO    2888
 8 CA    LOS ANGELES       2545
 9 TX    HARRIS            2535
10 NE    LINCOLN           2289
# … with 2,920 more rows

This shows that Cook County in Illinois has the most railroad employees. Given that the major city, Chicago is located in Cook County, this is not surprising. It is interesting that Cook county has almost twice as many railroad employees as the next county on the list, Tarrant county in Texas.

This data can also shed light on which states have the most railroad employees. We can use piping and the group_by(), summarize(), sum() and arrange() functions to show the states with the most railroad employees:

Code
employees_by_state <- RailroadEmployment %>% 
  group_by(STATE) %>% 
  summarize(total_state_employees=sum(TOTAL))

employees_by_state %>% 
  arrange(desc(total_state_employees))
# A tibble: 53 × 2
   STATE total_state_employees
   <chr>                 <dbl>
 1 TX                    19839
 2 IL                    19131
 3 NY                    17050
 4 NE                    13176
 5 CA                    13137
 6 PA                    12769
 7 OH                     9056
 8 GA                     8605
 9 IN                     8537
10 MO                     8419
# … with 43 more rows

This shows that Texas is the state with the most railroad employees.