challenge_2
railroads
Author

Emma Rasmussen

Published

August 16, 2022

Code
library(tidyverse)
library(readxl)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

##Reading in the Data/Tidying

Reading the data set into R and skipping the first 3 rows so the header includes state, county, and total

Code
StateCounty2012OG<-read_excel("_data/StateCounty2012.xls",
                            skip=3)
StateCounty2012OG
# A tibble: 2,990 × 5
   STATE     ...2  COUNTY               ...4  TOTAL
   <chr>     <lgl> <chr>                <lgl> <dbl>
 1 AE        NA    APO                  NA        2
 2 AE Total1 NA    <NA>                 NA        2
 3 AK        NA    ANCHORAGE            NA        7
 4 AK        NA    FAIRBANKS NORTH STAR NA        2
 5 AK        NA    JUNEAU               NA        3
 6 AK        NA    MATANUSKA-SUSITNA    NA        2
 7 AK        NA    SITKA                NA        1
 8 AK        NA    SKAGWAY MUNICIPALITY NA       88
 9 AK Total  NA    <NA>                 NA      103
10 AL        NA    AUTAUGA              NA      102
# … with 2,980 more rows
# ℹ Use `print(n = ...)` to see more rows

Selecting for columns that contain variables:

Code
StateCounty2012<- select(StateCounty2012OG, "STATE", "COUNTY", "TOTAL")
StateCounty2012
# A tibble: 2,990 × 3
   STATE     COUNTY               TOTAL
   <chr>     <chr>                <dbl>
 1 AE        APO                      2
 2 AE Total1 <NA>                     2
 3 AK        ANCHORAGE                7
 4 AK        FAIRBANKS NORTH STAR     2
 5 AK        JUNEAU                   3
 6 AK        MATANUSKA-SUSITNA        2
 7 AK        SITKA                    1
 8 AK        SKAGWAY MUNICIPALITY    88
 9 AK Total  <NA>                   103
10 AL        AUTAUGA                102
# … with 2,980 more rows
# ℹ Use `print(n = ...)` to see more rows

Removing state employee total rows:

Code
StateCounty2012<-filter(StateCounty2012, `COUNTY` !="NA")
StateCounty2012
# A tibble: 2,930 × 3
   STATE COUNTY               TOTAL
   <chr> <chr>                <dbl>
 1 AE    APO                      2
 2 AK    ANCHORAGE                7
 3 AK    FAIRBANKS NORTH STAR     2
 4 AK    JUNEAU                   3
 5 AK    MATANUSKA-SUSITNA        2
 6 AK    SITKA                    1
 7 AK    SKAGWAY MUNICIPALITY    88
 8 AL    AUTAUGA                102
 9 AL    BALDWIN                143
10 AL    BARBOUR                  1
# … with 2,920 more rows
# ℹ Use `print(n = ...)` to see more rows

Renaming the TOTAL column to more accurate describe the variable:

Code
StateCounty2012<-rename(StateCounty2012, Total_employees=TOTAL)
StateCounty2012
# A tibble: 2,930 × 3
   STATE COUNTY               Total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# … with 2,920 more rows
# ℹ Use `print(n = ...)` to see more rows

The code below returns the highest and lowest number of counties (with railroad employees) by state.

Code
table(StateCounty2012$STATE)%>%
  max()
[1] 221
Code
table(StateCounty2012$STATE)%>%
  min()
[1] 1

Describe the data

This data was likely gathered from government employment records or employment records directly from the railroads or Department of Transportation. Cases- counties within states, variables- states, and the number of employees at railroads within each county.

Dimensions of the “tidied” dataset:

Code
dim(StateCounty2012)
[1] 2930    3

The data set has 2930 rows (counties with railroads) and 3 columns (State, County, and number of employees).

The code below creates a table with a count of counties by state. For example, Florida (FL) has 67 counties with railroads.

Code
table(StateCounty2012$STATE)

 AE  AK  AL  AP  AR  AZ  CA  CO  CT  DC  DE  FL  GA  HI  IA  ID  IL  IN  KS  KY 
  1   6  67   1  72  15  55  57   8   1   3  67 152   3  99  36 103  92  95 119 
 LA  MA  MD  ME  MI  MN  MO  MS  MT  NC  ND  NE  NH  NJ  NM  NV  NY  OH  OK  OR 
 63  12  24  16  78  86 115  78  53  94  49  89  10  21  29  12  61  88  73  33 
 PA  RI  SC  SD  TN  TX  UT  VA  VT  WA  WI  WV  WY 
 65   5  46  52  91 221  25  92  14  39  69  53  22 

I was shocked to see that Texas has 221 counties with people who work for railroads, compared to MA with 12 counties. A quick Google Search says Texas actually has 254 counties. Since Texas has the greatest number of counties with railroad employees, I will look closer at the distribution of employees across Texas counties.

Provide Grouped Summary Statistics

Filtering the subgroup of New Hampshire counties: (I used to live on the NH border)

Code
filter(StateCounty2012, STATE == "NH")
# A tibble: 10 × 3
   STATE COUNTY       Total_employees
   <chr> <chr>                  <dbl>
 1 NH    BELKNAP                    2
 2 NH    CARROLL                   12
 3 NH    CHESHIRE                  28
 4 NH    COOS                      19
 5 NH    GRAFTON                    7
 6 NH    HILLSBOROUGH             136
 7 NH    MERRIMACK                  9
 8 NH    ROCKINGHAM               146
 9 NH    STRAFFORD                 27
10 NH    SULLIVAN                   7

Calculating the Mode:

Code
NHTable<-filter(StateCounty2012, STATE == "NH")%>%
  arrange(Total_employees)%>%
  count(Total_employees)%>%
  arrange(desc(n))
NHTable$Total_employees[1]
[1] 7

While not super helpful, the most common number of employees at NH railroads by county is 7.

Other summary statistics for the subgroup of NH counties:

Code
StateCounty2012
# A tibble: 2,930 × 3
   STATE COUNTY               Total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# … with 2,920 more rows
# ℹ Use `print(n = ...)` to see more rows
Code
StateCounty2012%>%
  filter(STATE == "NH") %>%
  summarize("NHmin"= min(Total_employees), "NHmax"= max(Total_employees), "NHmean"= mean(Total_employees), "NHmedian"= median(Total_employees), "NHmode"= mode(Total_employees), "NHsd" = sd(Total_employees), "NHIQR"= IQR(Total_employees))
# A tibble: 1 × 7
  NHmin NHmax NHmean NHmedian NHmode   NHsd NHIQR
  <dbl> <dbl>  <dbl>    <dbl> <chr>   <dbl> <dbl>
1     2   146   39.3     15.5 numeric  54.3  20.2

Finding the county in each state with the largest number of employees:

Code
StateCountyLarge<-group_by(StateCounty2012, STATE)%>%
  arrange(StateCounty2012, ("STATE"), desc("Total_employees"))%>%
  slice(1)
StateCountyLarge
# A tibble: 53 × 3
# Groups:   STATE [53]
   STATE COUNTY        Total_employees
   <chr> <chr>                   <dbl>
 1 AE    APO                         2
 2 AK    ANCHORAGE                   7
 3 AL    AUTAUGA                   102
 4 AP    APO                         1
 5 AR    ARKANSAS                   11
 6 AZ    APACHE                    270
 7 CA    ALAMEDA                   346
 8 CO    ADAMS                     553
 9 CT    FAIRFIELD                 486
10 DC    WASHINGTON DC             279
# … with 43 more rows
# ℹ Use `print(n = ...)` to see more rows

Finding the mean of the largest total employee numbers in each state (this is a pretty useless stat but I was just trying to find some summary statistics from a grouped/sliced data set). I don’t love this code, there are too many variables but it’s not working when I edit it any more to try to calculate the median etc.

Code
StateMax<-summarize(StateCountyLarge, mt= mean(Total_employees))
TotalStateMean<-summarize(StateMax, mt=mean(mt))
print(TotalStateMean)
# A tibble: 1 × 1
     mt
  <dbl>
1  93.0

The mean of the largest total employee numbers across the 53 included states/territories is 93.03774 employees.

Explain and Interpret

I chose NH because I was curious about the distribution of railroad employees by county in that area (I grew up on the NH border). There is lots of variability in NH railroad employment. Rockingham county has 146 employees, while Belknap county has only 2 railroad employees. I then tried to slice out the counties with the most employees by state to see how many employees there are on average at the “biggest” railroad counties in each state. I was surprised comparing the NH max to the mean of the largest railroads by county in each state(the second data analyzed that was sliced). I was surprise given the mean number of employees at the “largest” railroad/county in each state was only 93, but the max for NH is 146. I don’t think of NH has a big state or having a lot of railroads. One thing that might expplain this is that states like Texas (over 100) have more counties than NH (10) skewing the mean (93.0377). In comparison, NH would have larger employment numbers by county given it only has 10 counties.