Challenge 2

challenge_2

Matt Eckstein

railroad_2012_clean_county.csv

Author

Matt Eckstein

Published

March 1, 2023

Code

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE)

Loading and viewing dataset

Code

railroad_data <- read.csv("_data/railroad_2012_clean_county.csv")

head(railroad_data)

  state               county total_employees
1    AE                  APO               2
2    AK            ANCHORAGE               7
3    AK FAIRBANKS NORTH STAR               2
4    AK               JUNEAU               3
5    AK    MATANUSKA-SUSITNA               2
6    AK                SITKA               1

Code

summarize(railroad_data)

data frame with 0 columns and 1 row
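The empty result above is expected: dplyr's summarize() called with no summary expressions returns a one-row data frame with no columns. If a quick overview of the data was the goal, base R's summary() or str() is probably closer to what was wanted. A minimal sketch on a made-up stand-in data frame (toy_railroad and its values are hypothetical, not taken from the CSV):

```r
# Hypothetical stand-in for railroad_data; values are invented for illustration.
toy_railroad <- data.frame(
  state = c("AK", "AK", "TX", "TX"),
  county = c("ANCHORAGE", "JUNEAU", "HARRIS", "TARRANT"),
  total_employees = c(7L, 3L, 40L, 10L)
)

# summary() on a numeric column reports min, quartiles, median, mean, and max.
overview <- summary(toy_railroad$total_employees)
print(overview)

# str() is a compact alternative that shows column types and the first few values.
str(toy_railroad)
```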

These data were likely gathered from a survey of occupations across geographies conducted by a federal agency such as the Bureau of Labor Statistics.

Each case is a county in the United States with at least one railroad worker. (The state and county columns are both essential for defining a case, since some county names occur in more than one state, and the state column is necessary for disambiguation.) The total_employees column indicates the number of railroad employees in the relevant county.

Code

railroad_data %>%
  summarize(median = median(total_employees), mean = mean(total_employees))

  median     mean
1     21 87.17816

Of US counties with at least one railroad employee, the median county had 21 railroad employees, while the mean county had slightly more than 87. This suggests that a handful of counties with very large numbers of railroad employees are dragging the mean upwards relative to more typical counties.
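To see how a single large value can pull the mean well above the median, here is a small sketch with invented counts (toy_counts is hypothetical, not drawn from the dataset):

```r
# Invented employee counts: mostly small counties plus one very large one.
toy_counts <- c(1, 2, 5, 21, 30, 45, 5000)

median(toy_counts)  # the middle value, unaffected by the size of the outlier
mean(toy_counts)    # pulled upward by the single large value

# Dropping the largest value shifts the median only a little
# but changes the mean a great deal.
without_max <- toy_counts[toy_counts != max(toy_counts)]
median(without_max)
mean(without_max)
```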

Railroad workers by state

Code

states <- select(railroad_data, state)
table(states)

states
 AE  AK  AL  AP  AR  AZ  CA  CO  CT  DC  DE  FL  GA  HI  IA  ID  IL  IN  KS  KY 
  1   6  67   1  72  15  55  57   8   1   3  67 152   3  99  36 103  92  95 119 
 LA  MA  MD  ME  MI  MN  MO  MS  MT  NC  ND  NE  NH  NJ  NM  NV  NY  OH  OK  OR 
 63  12  24  16  78  86 115  78  53  94  49  89  10  21  29  12  61  88  73  33 
 PA  RI  SC  SD  TN  TX  UT  VA  VT  WA  WI  WV  WY 
 65   5  46  52  91 221  25  92  14  39  69  53  22

Code

prop.table(table(states))

states
          AE           AK           AL           AP           AR           AZ 
0.0003412969 0.0020477816 0.0228668942 0.0003412969 0.0245733788 0.0051194539 
          CA           CO           CT           DC           DE           FL 
0.0187713311 0.0194539249 0.0027303754 0.0003412969 0.0010238908 0.0228668942 
          GA           HI           IA           ID           IL           IN 
0.0518771331 0.0010238908 0.0337883959 0.0122866894 0.0351535836 0.0313993174 
          KS           KY           LA           MA           MD           ME 
0.0324232082 0.0406143345 0.0215017065 0.0040955631 0.0081911263 0.0054607509 
          MI           MN           MO           MS           MT           NC 
0.0266211604 0.0293515358 0.0392491468 0.0266211604 0.0180887372 0.0320819113 
          ND           NE           NH           NJ           NM           NV 
0.0167235495 0.0303754266 0.0034129693 0.0071672355 0.0098976109 0.0040955631 
          NY           OH           OK           OR           PA           RI 
0.0208191126 0.0300341297 0.0249146758 0.0112627986 0.0221843003 0.0017064846 
          SC           SD           TN           TX           UT           VA 
0.0156996587 0.0177474403 0.0310580205 0.0754266212 0.0085324232 0.0313993174 
          VT           WA           WI           WV           WY 
0.0047781570 0.0133105802 0.0235494881 0.0180887372 0.0075085324

Among all states and state-equivalents, the number of counties and county-equivalents that have at least one railroad worker ranges from one (in Washington, DC and each of the two military entities) to 221 (in Texas). This is roughly commensurate with what one might expect, given the overall number of counties in each state. About 7.5% of all counties and county-equivalents that have at least one railroad worker are in Texas. (Although the overall impact is small, note that the data in the proportional table are slightly distorted by the fact that the table aggregates railroad worker data for all of Virginia’s independent cities as one entry rather than breaking them out as separate county-equivalents.)
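Tables like the two above are easier to read when sorted and expressed as percentages. A minimal base R sketch on an invented vector of state codes (toy_states is hypothetical):

```r
# Invented state codes, one entry per county row.
toy_states <- c("TX", "TX", "TX", "GA", "GA", "DC")

# Count rows per state and sort from most to fewest counties.
county_counts <- sort(table(toy_states), decreasing = TRUE)
county_counts

# Convert counts to percentages of all rows, rounded to one decimal place.
county_share <- round(100 * prop.table(county_counts), 1)
county_share
```

Applied to the real data, the Texas share works out to 221 / 2930, or about 7.5%.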

Code

railroad_data %>%
  select(state, county)%>%
  n_distinct()

[1] 2930

This shows that there are 2930 cases (consisting of state-county combinations) in the data.
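If the number of distinct state-county pairs equals the number of rows, each row is its own case. A sketch of that check in base R, on an invented data frame (toy_cases is hypothetical):

```r
# Invented rows; note that the county name "WASHINGTON" appears in two states.
toy_cases <- data.frame(
  state = c("AL", "AR", "AR", "TX"),
  county = c("WASHINGTON", "WASHINGTON", "PULASKI", "HARRIS"),
  total_employees = c(4L, 12L, 30L, 55L)
)

# Number of distinct state-county pairs.
n_pairs <- nrow(unique(toy_cases[c("state", "county")]))

# TRUE when every row is a unique state-county pair.
pairs_identify_rows <- n_pairs == nrow(toy_cases)

# County names alone are not unique, which is why state is needed as well.
n_county_names <- length(unique(toy_cases$county))

c(n_pairs = n_pairs, n_rows = nrow(toy_cases), n_county_names = n_county_names)
```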

Central tendency and dispersion, overall and grouped by state

Note that mfv(), the function used below to calculate the mode, comes from the modeest package. I ran install.packages("modeest") in my console rather than adding it to the Quarto document, to avoid triggering an unwanted install on the computer of anyone else who runs the document's code.

Code

library(modeest)

Registered S3 method overwritten by 'rmutil':
  method         from
  print.response httr

Code

railroad_data %>%
  summarize(mean(total_employees))

  mean(total_employees)
1              87.17816

Code

railroad_data %>%
  summarize(median(total_employees))

  median(total_employees)
1                      21

Code

railroad_data %>%
  summarize(mfv(total_employees))

  mfv(total_employees)
1                    1

Code

railroad_data %>%
  summarize(min(total_employees))

  min(total_employees)
1                    1

Code

railroad_data %>%
  summarize(max(total_employees))

  max(total_employees)
1                 8207

Code

railroad_data %>%
  summarize(IQR(total_employees))

  IQR(total_employees)
1                   58

Code

railroad_data %>%
  group_by(state) %>%
  summarize(mean(total_employees))

# A tibble: 53 x 2
   state `mean(total_employees)`
   <chr>                   <dbl>
 1 AE                        2  
 2 AK                       17.2
 3 AL                       63.5
 4 AP                        1  
 5 AR                       53.8
 6 AZ                      210. 
 7 CA                      239. 
 8 CO                       64.0
 9 CT                      324  
10 DC                      279  
# ... with 43 more rows

Code

railroad_data %>%
  group_by(state) %>%
  summarize(median(total_employees))

# A tibble: 53 x 2
   state `median(total_employees)`
   <chr>                     <dbl>
 1 AE                          2  
 2 AK                          2.5
 3 AL                         26  
 4 AP                          1  
 5 AR                         16.5
 6 AZ                         94  
 7 CA                         61  
 8 CO                         10  
 9 CT                        125  
10 DC                        279  
# ... with 43 more rows

Code

railroad_data %>%
  group_by(state) %>%
  summarize(mfv(total_employees))

Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
i Please use `reframe()` instead.
i When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.

`summarise()` has grouped output by 'state'. You can override using the
`.groups` argument.

# A tibble: 165 x 2
# Groups:   state [53]
   state `mfv(total_employees)`
   <chr>                  <int>
 1 AE                         2
 2 AK                         2
 3 AL                         7
 4 AL                        11
 5 AP                         1
 6 AR                         5
 7 AZ                         3
 8 AZ                        10
 9 AZ                        18
10 AZ                        37
# ... with 155 more rows

Code

railroad_data %>%
  group_by(state) %>%
  summarize(sd(total_employees))

# A tibble: 53 x 2
   state `sd(total_employees)`
   <chr>                 <dbl>
 1 AE                     NA  
 2 AK                     34.8
 3 AL                    130. 
 4 AP                     NA  
 5 AR                    131. 
 6 AZ                    228. 
 7 CA                    549. 
 8 CO                    128. 
 9 CT                    520. 
10 DC                     NA  
# ... with 43 more rows

Code

railroad_data %>%
  group_by(state) %>%
  summarize(min(total_employees))

# A tibble: 53 x 2
   state `min(total_employees)`
   <chr>                  <int>
 1 AE                         2
 2 AK                         1
 3 AL                         1
 4 AP                         1
 5 AR                         1
 6 AZ                         3
 7 CA                         1
 8 CO                         1
 9 CT                        26
10 DC                       279
# ... with 43 more rows

Code

railroad_data %>%
  group_by(state) %>%
  summarize(max(total_employees))

# A tibble: 53 x 2
   state `max(total_employees)`
   <chr>                  <int>
 1 AE                         2
 2 AK                        88
 3 AL                       990
 4 AP                         1
 5 AR                       972
 6 AZ                       749
 7 CA                      2888
 8 CO                       553
 9 CT                      1561
10 DC                       279
# ... with 43 more rows

Code

railroad_data %>%
  group_by(state) %>%
  summarize(IQR(total_employees))

# A tibble: 53 x 2
   state `IQR(total_employees)`
   <chr>                  <dbl>
 1 AE                       0  
 2 AK                       4  
 3 AL                      47  
 4 AP                       0  
 5 AR                      33.8
 6 AZ                     296  
 7 CA                     188  
 8 CO                      39  
 9 CT                     167. 
10 DC                       0  
# ... with 43 more rows

Explaining and interpreting the above

I calculated measures of central tendency and dispersion both for the dataset as a whole and grouped by state. I also considered grouping by county (within states, so that same-named counties in different states are not pooled together), but dropped it: each state-county combination has only one row, so every group would hold a single value and the grouped statistics would simply reproduce the raw data.
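The separate grouped calls above can also be combined into one summary table. A minimal sketch with dplyr on invented data (toy_railroad and the output column names are hypothetical); it also shows why a state with a single row gets NA for the standard deviation:

```r
library(dplyr)

# Invented data: "DC" has one row, "TX" has three.
toy_railroad <- data.frame(
  state = c("DC", "TX", "TX", "TX"),
  county = c("WASHINGTON DC", "HARRIS", "TARRANT", "BEXAR"),
  total_employees = c(279L, 10L, 20L, 60L)
)

# One row per state, with every statistic in its own named column.
state_summary <- toy_railroad %>%
  group_by(state) %>%
  summarize(
    n_counties = n(),
    mean_employees = mean(total_employees),
    median_employees = median(total_employees),
    sd_employees = sd(total_employees),  # NA when a group has one row
    iqr_employees = IQR(total_employees),
    min_employees = min(total_employees),
    max_employees = max(total_employees)
  )

state_summary
```

Naming each statistic inside summarize() also avoids backtick-quoted column names such as `mean(total_employees)` in the output.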

I found it notable how much the mean number of railroad workers per county varies by state. Some of this variation comes from the fact that some states (e.g. California) have relatively few counties for the size of their populations and thus have many people (and, hence, railroad workers) in each county. Other states, such as the Dakotas, have many counties relative to their populations and so do not have many railroad workers in their average county. Some interesting factors cause departures from this general pattern, though. Some states, such as Hawaii, do not have very railroad-friendly geography and have fewer railroad workers per county than one might otherwise expect. Nebraska also stands out as a bit of an outlier on the high side relative to other lightly populated Midwestern states with large numbers of counties, in part because Omaha is a significant railroad hub (https://www.greatamericanstations.com/stations/omaha-ne-oma/).

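It’s also notable that in some states no two counties with railroad workers share the same employee count. In that case every value ties as the most frequent, so mfv() returns all of them as modes. Ties of this kind are why the grouped mode table has 165 rows for only 53 states.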
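To illustrate how ties produce several modes per state, here is a sketch using a hand-written helper, all_modes() (hypothetical, written in base R to stand in for modeest::mfv()), together with dplyr's reframe(), which the deprecation warning above recommends when a group can return more than one row. It assumes dplyr 1.1.0 or later, and the data are invented:

```r
library(dplyr)

# Hypothetical helper: return every value tied for the highest frequency.
all_modes <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

# Invented data: "AK" repeats the value 2, "AZ" has three distinct values.
toy_railroad <- data.frame(
  state = c("AK", "AK", "AK", "AZ", "AZ", "AZ"),
  total_employees = c(2, 2, 7, 3, 10, 18)
)

# reframe() allows any number of rows per group and returns ungrouped output.
state_modes <- toy_railroad %>%
  group_by(state) %>%
  reframe(mode_employees = all_modes(total_employees))

state_modes
```

Here "AK" contributes one row (the repeated value 2), while "AZ" contributes three rows because none of its values repeats.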