Challenge 1 - Rail roads

challenge_1
Shaunak Padhye
railroads
Author

Shaunak Padhye

Published

May 11, 2023

Setup

We will load the required packages.

Code
library(tidyverse)
library(readxl)

knitr::opts_chunk$set(echo = TRUE)

Reading in data files

The “StateCounty2012.xls” dataset will be used for this challenge. The dataset will be loaded using the read_excel() function from the readxl package. When the data was first loaded, it was observed that the first two rows did not contain useful information and hence those two rows were skipped.

Code
state_county <- read_excel("_data/StateCounty2012.xls", skip=2)
state_county <- data.frame(state_county)
head(state_county)
      STATE ...2               COUNTY ...4 TOTAL
1        AE   NA                  APO   NA     2
2 AE Total1   NA                 <NA>   NA     2
3        AK   NA            ANCHORAGE   NA     7
4        AK   NA FAIRBANKS NORTH STAR   NA     2
5        AK   NA               JUNEAU   NA     3
6        AK   NA    MATANUSKA-SUSITNA   NA     2

We can also observe that the second and the fourth columns contain null values and those columns can be dropped.

Code
state_county <- state_county[, -c(2,4)]
head(state_county)
      STATE               COUNTY TOTAL
1        AE                  APO     2
2 AE Total1                 <NA>     2
3        AK            ANCHORAGE     7
4        AK FAIRBANKS NORTH STAR     2
5        AK               JUNEAU     3
6        AK    MATANUSKA-SUSITNA     2

Data Description

This dataset provides information on the number of railroad employees in each county and state for the year 2012. Each row represents a county, and the columns display the state, county, and number of employees. There is no information about the origin of the data.

Data Cleaning

Now our dataset is ready for further inspection. We can start by creating an initial summary of the data by using the summary() and str() functions. These functions provide some basic statistics for the dataset.

Code
summary(state_county)
    STATE              COUNTY              TOTAL         
 Length:2990        Length:2990        Min.   :     1.0  
 Class :character   Class :character   1st Qu.:     7.0  
 Mode  :character   Mode  :character   Median :    22.0  
                                       Mean   :   256.9  
                                       3rd Qu.:    71.0  
                                       Max.   :255432.0  
                                       NA's   :5         
Code
str(state_county)
'data.frame':   2990 obs. of  3 variables:
 $ STATE : chr  "AE" "AE Total1" "AK" "AK" ...
 $ COUNTY: chr  "APO" NA "ANCHORAGE" "FAIRBANKS NORTH STAR" ...
 $ TOTAL : num  2 2 7 2 3 2 1 88 103 102 ...

We can observe that the STATE column has 2990 values. However, the number of states in the United States is much less. We can print out the unique values for this column to check.

Code
unique(state_county$STATE)
  [1] "AE"                                                 
  [2] "AE Total1"                                          
  [3] "AK"                                                 
  [4] "AK Total"                                           
  [5] "AL"                                                 
  [6] "AL Total"                                           
  [7] "AP"                                                 
  [8] "AP Total1"                                          
  [9] "AR"                                                 
 [10] "AR Total"                                           
 [11] "AZ"                                                 
 [12] "AZ Total"                                           
 [13] "CA"                                                 
 [14] "CA Total"                                           
 [15] "CO"                                                 
 [16] "CO Total"                                           
 [17] "CT"                                                 
 [18] "CT Total"                                           
 [19] "DC"                                                 
 [20] "DC Total"                                           
 [21] "DE"                                                 
 [22] "DE Total"                                           
 [23] "FL"                                                 
 [24] "FL Total"                                           
 [25] "GA"                                                 
 [26] "GA Total"                                           
 [27] "HI"                                                 
 [28] "HI Total"                                           
 [29] "IA"                                                 
 [30] "IA Total"                                           
 [31] "ID"                                                 
 [32] "ID Total"                                           
 [33] "IL"                                                 
 [34] "IL Total"                                           
 [35] "IN"                                                 
 [36] "IN Total"                                           
 [37] "KS"                                                 
 [38] "KS Total"                                           
 [39] "KY"                                                 
 [40] "KY Total"                                           
 [41] "LA"                                                 
 [42] "LA Total"                                           
 [43] "MA"                                                 
 [44] "MA Total"                                           
 [45] "MD"                                                 
 [46] "MD Total"                                           
 [47] "ME"                                                 
 [48] "ME Total"                                           
 [49] "MI"                                                 
 [50] "MI Total"                                           
 [51] "MN"                                                 
 [52] "MN Total"                                           
 [53] "MO"                                                 
 [54] "MO Total"                                           
 [55] "MS"                                                 
 [56] "MS Total"                                           
 [57] "MT"                                                 
 [58] "MT Total"                                           
 [59] "NC"                                                 
 [60] "NC Total"                                           
 [61] "ND"                                                 
 [62] "ND Total"                                           
 [63] "NE"                                                 
 [64] "NE Total"                                           
 [65] "NH"                                                 
 [66] "NH Total"                                           
 [67] "NJ"                                                 
 [68] "NJ Total"                                           
 [69] "NM"                                                 
 [70] "NM Total"                                           
 [71] "NV"                                                 
 [72] "NV Total"                                           
 [73] "NY"                                                 
 [74] "NY Total"                                           
 [75] "OH"                                                 
 [76] "OH Total"                                           
 [77] "OK"                                                 
 [78] "OK Total"                                           
 [79] "OR"                                                 
 [80] "OR Total"                                           
 [81] "PA"                                                 
 [82] "PA Total"                                           
 [83] "RI"                                                 
 [84] "RI Total"                                           
 [85] "SC"                                                 
 [86] "SC Total"                                           
 [87] "SD"                                                 
 [88] "SD Total"                                           
 [89] "TN"                                                 
 [90] "TN Total"                                           
 [91] "TX"                                                 
 [92] "TX Total"                                           
 [93] "UT"                                                 
 [94] "UT Total"                                           
 [95] "VA"                                                 
 [96] "VA Total"                                           
 [97] "VT"                                                 
 [98] "VT Total"                                           
 [99] "WA"                                                 
[100] "WA Total"                                           
[101] "WI"                                                 
[102] "WI Total"                                           
[103] "WV"                                                 
[104] "WV Total"                                           
[105] "WY"                                                 
[106] "WY Total"                                           
[107] "Grand Total"                                        
[108] NA                                                   
[109] "CANADA"                                             
[110] "1  Military designation."                           
[111] "NOTE:  Excludes 2,896 employees without an address."

To understand this dataset in more detail lets isolate the state “MA” and check the dataset.

Code
subset(state_county, STATE == "MA", select = c(COUNTY,TOTAL))
         COUNTY TOTAL
1137 BARNSTABLE    44
1138  BERKSHIRE    50
1139    BRISTOL   232
1140      ESSEX   314
1141   FRANKLIN   113
1142    HAMPDEN   202
1143  HAMPSHIRE    68
1144  MIDDLESEX   673
1145    NORFOLK   386
1146   PLYMOUTH   429
1147    SUFFOLK   558
1148  WORCESTER   310

As we can see above, the dataset contains the total for each county in “MA”.

We can also see the total number of emoloyees in “MA” by checking the “MA Total” state.

Code
subset(state_county, STATE == "MA Total", select = c(COUNTY,TOTAL))
     COUNTY TOTAL
1149   <NA>  3379

Based on this we understand that this dataset contains the total number of employees for each county in each state as well as the total number of employees for each state. We can split the dataset into two. One dataframe will contain the values for the states and their counties, while one dataframe will contain values for the entire states.

Code
match_pattern <- grepl("Total",state_county$STATE)

state_total <- state_county[match_pattern, ]
state_total
           STATE COUNTY  TOTAL
2      AE Total1   <NA>      2
9       AK Total   <NA>    103
77      AL Total   <NA>   4257
79     AP Total1   <NA>      1
152     AR Total   <NA>   3871
168     AZ Total   <NA>   3153
224     CA Total   <NA>  13137
282     CO Total   <NA>   3650
291     CT Total   <NA>   2592
293     DC Total   <NA>    279
297     DE Total   <NA>   1495
365     FL Total   <NA>   7419
518     GA Total   <NA>   8605
522     HI Total   <NA>      4
622     IA Total   <NA>   4019
659     ID Total   <NA>   1563
763     IL Total   <NA>  19131
856     IN Total   <NA>   8537
952     KS Total   <NA>   6092
1072    KY Total   <NA>   4811
1136    LA Total   <NA>   3915
1149    MA Total   <NA>   3379
1174    MD Total   <NA>   4709
1191    ME Total   <NA>    654
1270    MI Total   <NA>   3932
1357    MN Total   <NA>   5467
1473    MO Total   <NA>   8419
1552    MS Total   <NA>   2111
1606    MT Total   <NA>   3327
1701    NC Total   <NA>   3143
1751    ND Total   <NA>   2204
1841    NE Total   <NA>  13176
1852    NH Total   <NA>    393
1874    NJ Total   <NA>   8329
1904    NM Total   <NA>   1958
1917    NV Total   <NA>    746
1979    NY Total   <NA>  17050
2068    OH Total   <NA>   9056
2142    OK Total   <NA>   2318
2176    OR Total   <NA>   2322
2242    PA Total   <NA>  12769
2248    RI Total   <NA>    487
2295    SC Total   <NA>   2296
2348    SD Total   <NA>    949
2440    TN Total   <NA>   4952
2662    TX Total   <NA>  19839
2688    UT Total   <NA>   1917
2781    VA Total   <NA>   7551
2796    VT Total   <NA>    259
2836    WA Total   <NA>   5222
2906    WI Total   <NA>   3773
2960    WV Total   <NA>   3213
2983    WY Total   <NA>   2876
2984 Grand Total   <NA> 255432
Code
state_county <- state_county[!match_pattern, ]
head(state_county)
  STATE               COUNTY TOTAL
1    AE                  APO     2
3    AK            ANCHORAGE     7
4    AK FAIRBANKS NORTH STAR     2
5    AK               JUNEAU     3
6    AK    MATANUSKA-SUSITNA     2
7    AK                SITKA     1

Finally, we remove all the null values from both the datasets.

Code
state_total <- na.omit(state_total)
state_county <- na.omit(state_county)

Summary

We end up with two datasets: 1) One dataset contains the total number of employees for each county in each state 2) One dataset contains the total number of employees for each state