library(tidyverse)
knitr::opts_chunk$set(echo = TRUE)
Matt Eckstein
March 1, 2023
Loading and viewing dataset
state county total_employees
1 AE APO 2
2 AK ANCHORAGE 7
3 AK FAIRBANKS NORTH STAR 2
4 AK JUNEAU 3
5 AK MATANUSKA-SUSITNA 2
6 AK SITKA 1
data frame with 0 columns and 1 row
These data were likely gathered from a survey of occupations across geographies conducted by a federal agency such as the Bureau of Labor Statistics.
Each case is a county in the United States with at least one railroad worker. (The state and county columns are both essential for defining a case, since some county names occur in more than one state, and the state column is necessary for disambiguation.) The total_employees column indicates the number of railroad employees in the relevant county.
median mean
1 21 87.17816
Among US counties with at least one railroad employee, the median is 21 railroad employees per county, while the mean is slightly more than 87. The gap indicates a right-skewed distribution: a handful of counties with very large numbers of railroad employees pull the mean upward relative to more typical counties.
Railroad workers by state
states
AE AK AL AP AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY
1 6 67 1 72 15 55 57 8 1 3 67 152 3 99 36 103 92 95 119
LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR
63 12 24 16 78 86 115 78 53 94 49 89 10 21 29 12 61 88 73 33
PA RI SC SD TN TX UT VA VT WA WI WV WY
65 5 46 52 91 221 25 92 14 39 69 53 22
states
AE AK AL AP AR AZ
0.0003412969 0.0020477816 0.0228668942 0.0003412969 0.0245733788 0.0051194539
CA CO CT DC DE FL
0.0187713311 0.0194539249 0.0027303754 0.0003412969 0.0010238908 0.0228668942
GA HI IA ID IL IN
0.0518771331 0.0010238908 0.0337883959 0.0122866894 0.0351535836 0.0313993174
KS KY LA MA MD ME
0.0324232082 0.0406143345 0.0215017065 0.0040955631 0.0081911263 0.0054607509
MI MN MO MS MT NC
0.0266211604 0.0293515358 0.0392491468 0.0266211604 0.0180887372 0.0320819113
ND NE NH NJ NM NV
0.0167235495 0.0303754266 0.0034129693 0.0071672355 0.0098976109 0.0040955631
NY OH OK OR PA RI
0.0208191126 0.0300341297 0.0249146758 0.0112627986 0.0221843003 0.0017064846
SC SD TN TX UT VA
0.0156996587 0.0177474403 0.0310580205 0.0754266212 0.0085324232 0.0313993174
VT WA WI WV WY
0.0047781570 0.0133105802 0.0235494881 0.0180887372 0.0075085324
Among all states and state-equivalents, the number of counties and county-equivalents that have at least one railroad worker ranges from one (in Washington, DC and each of the two military entities) to 221 (in Texas). This is roughly commensurate with what one might expect, given the overall number of counties in each state. About 7.5% of all counties and county-equivalents that have at least one railroad worker are in Texas. (Although the overall impact is small, note that the data in the proportional table are slightly distorted by the fact that the table aggregates railroad worker data for all of Virginia’s independent cities as one entry rather than breaking them out as separate county-equivalents.)
This shows that there are 2930 cases (state-county combinations) in the data, which matches the total of the per-state counts in the table above.
Note that mfv(), the function used to calculate the mode in these summary statistics, is part of the modeest package. I ran install.packages("modeest") in my console rather than adding it to the Quarto document, to avoid triggering an unwanted install on the computer of anyone else running the code.
Registered S3 method overwritten by 'rmutil':
method from
print.response httr
mean(total_employees)
1 87.17816
median(total_employees)
1 21
mfv(total_employees)
1 1
min(total_employees)
1 1
max(total_employees)
1 8207
IQR(total_employees)
1 58
# A tibble: 53 x 2
state `mean(total_employees)`
<chr> <dbl>
1 AE 2
2 AK 17.2
3 AL 63.5
4 AP 1
5 AR 53.8
6 AZ 210.
7 CA 239.
8 CO 64.0
9 CT 324
10 DC 279
# ... with 43 more rows
# A tibble: 53 x 2
state `median(total_employees)`
<chr> <dbl>
1 AE 2
2 AK 2.5
3 AL 26
4 AP 1
5 AR 16.5
6 AZ 94
7 CA 61
8 CO 10
9 CT 125
10 DC 279
# ... with 43 more rows
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
i Please use `reframe()` instead.
i When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
`summarise()` has grouped output by 'state'. You can override using the
`.groups` argument.
# A tibble: 165 x 2
# Groups: state [53]
state `mfv(total_employees)`
<chr> <int>
1 AE 2
2 AK 2
3 AL 7
4 AL 11
5 AP 1
6 AR 5
7 AZ 3
8 AZ 10
9 AZ 18
10 AZ 37
# ... with 155 more rows
# A tibble: 53 x 2
state `sd(total_employees)`
<chr> <dbl>
1 AE NA
2 AK 34.8
3 AL 130.
4 AP NA
5 AR 131.
6 AZ 228.
7 CA 549.
8 CO 128.
9 CT 520.
10 DC NA
# ... with 43 more rows
# A tibble: 53 x 2
state `min(total_employees)`
<chr> <int>
1 AE 2
2 AK 1
3 AL 1
4 AP 1
5 AR 1
6 AZ 3
7 CA 1
8 CO 1
9 CT 26
10 DC 279
# ... with 43 more rows
# A tibble: 53 x 2
state `max(total_employees)`
<chr> <int>
1 AE 2
2 AK 88
3 AL 990
4 AP 1
5 AR 972
6 AZ 749
7 CA 2888
8 CO 553
9 CT 1561
10 DC 279
# ... with 43 more rows
# A tibble: 53 x 2
state `IQR(total_employees)`
<chr> <dbl>
1 AE 0
2 AK 4
3 AL 47
4 AP 0
5 AR 33.8
6 AZ 296
7 CA 188
8 CO 39
9 CT 167.
10 DC 0
# ... with 43 more rows
I chose to calculate the measures of central tendency and dispersion both for the data as a whole and grouped by state. I also considered grouping by county within state (to avoid pooling same-named counties from different states), but dropped it once I realized that each state-county pair has only one row, so every group statistic would simply reproduce that county's own value.
I found it notable how much the mean number of railroad workers per county varies by state. Some of this variation is accounted for by the fact that some states (e.g. California) have relatively few counties for the size of their populations and thus have many people (and, hence, railroad workers) in each county. Other states, such as the Dakotas, which have many counties relative to the sizes of their populations, do not have many railroad workers in their average county. Some interesting factors cause variation in this general pattern, though. Some states, such as Hawaii, do not have very railroad-friendly geography and have fewer railroad workers per county than one might otherwise expect. Also, Nebraska stands out as a bit of an outlier on the high side relative to other relatively lightly populated Midwestern states with large numbers of counties, in part since Omaha is a significant railroad hub (https://www.greatamericanstations.com/stations/omaha-ne-oma/).
It’s also notable that in some states no two counties share the same number of railroad employees, so every value ties as a mode and mfv() returns a large number of modes for those states.
---
title: "Challenge 2"
author: "Matt Eckstein"
description: "Challenge 2 - Matt Eckstein - Railroad Data"
date: "03/01/2023"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
  - Matt Eckstein
  - railroad_2012_clean_county.csv
---
```{r}
#| label: setup
#| warning: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE)
```
## Loading and viewing dataset
```{r}
railroad_data <- read.csv("_data/railroad_2012_clean_county.csv")
head(railroad_data)
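# Note: summarize() with no arguments returns an empty one-row data frame;
# summary(railroad_data) would give per-column summaries instead
summarize(railroad_data)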
```
These data were likely gathered from a survey of occupations across geographies conducted by a federal agency such as the Bureau of Labor Statistics.
Each case is a county in the United States with at least one railroad worker. (The state and county columns are both essential for defining a case, since some county names occur in more than one state, and the state column is necessary for disambiguation.) The total_employees column indicates the number of railroad employees in the relevant county.
```{r}
railroad_data %>%
  summarize(median = median(total_employees), mean = mean(total_employees))
```
Among US counties with at least one railroad employee, the median is 21 railroad employees per county, while the mean is slightly more than 87. The gap indicates a right-skewed distribution: a handful of counties with very large numbers of railroad employees pull the mean upward relative to more typical counties.
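As a small illustration of why a few large counties separate the mean from the median, here is a sketch on made-up numbers. The vector `toy_counts` is hypothetical and is not drawn from the railroad file.

```{r}
# Hypothetical employee counts for seven counties (illustrative only)
toy_counts <- c(1, 2, 5, 20, 30, 60, 5000)

toy_median <- median(toy_counts)  # middle value, unaffected by the extreme county
toy_mean <- mean(toy_counts)      # pulled upward by the single very large county

c(median = toy_median, mean = toy_mean)

# Dropping the largest county brings the mean back near the median
trimmed_mean <- mean(toy_counts[toy_counts < max(toy_counts)])
trimmed_mean
```

The same comparison could be run on `total_employees` to see how much the largest counties contribute to the overall mean.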
## Railroad workers by state
```{r}
states <- select(railroad_data, state)
table(states)
prop.table(table(states))
```
Among all states and state-equivalents, the number of counties and county-equivalents that have at least one railroad worker ranges from one (in Washington, DC and each of the two military entities) to 221 (in Texas). This is roughly commensurate with what one might expect, given the overall number of counties in each state. About 7.5% of all counties and county-equivalents that have at least one railroad worker are in Texas. (Although the overall impact is small, note that the data in the proportional table are slightly distorted by the fact that the table aggregates railroad worker data for all of Virginia's independent cities as one entry rather than breaking them out as separate county-equivalents.)
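The counts and proportions above can also be kept together in one data frame, which makes it easier to sort states by their share. The following is a minimal sketch using dplyr's `count()`; the data frame `toy_states` and its values are hypothetical stand-ins for the railroad data.

```{r}
library(dplyr)

# Hypothetical toy data with the same column names as the railroad file
toy_states <- data.frame(
  state = c("TX", "TX", "TX", "AK", "AK", "DC"),
  county = c("HARRIS", "DALLAS", "BEXAR", "ANCHORAGE", "JUNEAU", "WASHINGTON DC"),
  total_employees = c(120L, 80L, 40L, 7L, 3L, 279L)
)

state_shares <- toy_states %>%
  count(state, name = "n_counties") %>%              # counties per state
  mutate(share = n_counties / sum(n_counties)) %>%   # proportion of all counties
  arrange(desc(n_counties))

state_shares
```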
```{r}
railroad_data %>%
  select(state, county) %>%
  n_distinct()
```
This shows that there are 2930 cases (state-county combinations) in the data, which matches the total of the per-state counts from `table(states)` above.
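One way to confirm that state-county pairs really do identify each case is to compare the number of distinct pairs with the number of rows and to list any repeated pairs. This is a sketch on a hypothetical data frame, `toy_pairs`, not on the railroad file itself.

```{r}
library(dplyr)

# Hypothetical toy data: "LAKE" appears in two different states
toy_pairs <- data.frame(
  state = c("IL", "IN", "IL", "KY"),
  county = c("LAKE", "LAKE", "COOK", "UNION"),
  total_employees = c(10L, 4L, 300L, 2L)
)

# County names alone repeat across states
n_county_names <- n_distinct(toy_pairs$county)

# State-county pairs identify each row when this equals nrow()
n_pairs <- toy_pairs %>% distinct(state, county) %>% nrow()
n_pairs == nrow(toy_pairs)

# Any pair listed here would be a duplicated case
duplicated_pairs <- toy_pairs %>%
  count(state, county) %>%
  filter(n > 1)
duplicated_pairs
```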
## Summary statistics overall and grouped by state
Note that mfv(), the function used to calculate the mode in these summary statistics, is part of the modeest package. I ran install.packages("modeest") in my console rather than adding it to the Quarto document, to avoid triggering an unwanted install on the computer of anyone else running the code.
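A possible middle ground, sketched below, is to check whether the package is available and print a hint if it is not, without installing anything. The variable name `has_modeest` is my own; `requireNamespace()` is base R.

```{r}
# TRUE if modeest is already installed; nothing is installed automatically
has_modeest <- requireNamespace("modeest", quietly = TRUE)

if (!has_modeest) {
  message('Package "modeest" is missing; run install.packages("modeest") in the console first.')
}
```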
```{r}
library(modeest)

# Overall measures of central tendency and dispersion
railroad_data %>%
  summarize(mean(total_employees))
railroad_data %>%
  summarize(median(total_employees))
railroad_data %>%
  summarize(mfv(total_employees))
railroad_data %>%
  summarize(min(total_employees))
railroad_data %>%
  summarize(max(total_employees))
railroad_data %>%
  summarize(IQR(total_employees))

# The same measures, plus standard deviation, grouped by state
railroad_data %>%
  group_by(state) %>%
  summarize(mean(total_employees))
railroad_data %>%
  group_by(state) %>%
  summarize(median(total_employees))
# mfv() can return several modes per state, which triggers the dplyr 1.1.0
# warning about summarise() returning more than one row per group
railroad_data %>%
  group_by(state) %>%
  summarize(mfv(total_employees))
railroad_data %>%
  group_by(state) %>%
  summarize(sd(total_employees))
railroad_data %>%
  group_by(state) %>%
  summarize(min(total_employees))
railroad_data %>%
  group_by(state) %>%
  summarize(max(total_employees))
railroad_data %>%
  group_by(state) %>%
  summarize(IQR(total_employees))
```
## Explaining and interpreting the above
I chose to calculate the measures of central tendency and dispersion both for the data as a whole and grouped by state. I also considered grouping by county within state (to avoid pooling same-named counties from different states), but dropped it once I realized that each state-county pair has only one row, so every group statistic would simply reproduce that county's own value.
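The per-state statistics could also be gathered into a single table with one `summarize()` call, which makes states easier to compare side by side. The sketch below uses a hypothetical data frame, `toy_employees`; the column names in the result are my own choice. It also shows why `sd()` is `NA` for states with a single county, as seen for AE, AP, and DC in the output.

```{r}
library(dplyr)

# Hypothetical toy data: DC has a single county-equivalent
toy_employees <- data.frame(
  state = c("AK", "AK", "AK", "AZ", "AZ", "DC"),
  total_employees = c(1, 2, 9, 10, 30, 279)
)

state_summary <- toy_employees %>%
  group_by(state) %>%
  summarize(
    n_counties = n(),
    mean = mean(total_employees),
    median = median(total_employees),
    sd = sd(total_employees),    # NA when a state has only one county
    min = min(total_employees),
    max = max(total_employees),
    iqr = IQR(total_employees)
  )

state_summary
```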
I found it notable how much the mean number of railroad workers per county varies by state. Some of this variation is accounted for by the fact that some states (e.g. California) have relatively few counties for the size of their populations and thus have many people (and, hence, railroad workers) in each county. Other states, such as the Dakotas, which have many counties relative to the sizes of their populations, do not have many railroad workers in their average county. Some interesting factors cause variation in this general pattern, though. Some states, such as Hawaii, do not have very railroad-friendly geography and have fewer railroad workers per county than one might otherwise expect. Also, Nebraska stands out as a bit of an outlier on the high side relative to other relatively lightly populated Midwestern states with large numbers of counties, in part since Omaha is a significant railroad hub (https://www.greatamericanstations.com/stations/omaha-ne-oma/).
It's also notable that in some states no two counties share the same number of railroad employees, so every value ties as a mode and mfv() returns a large number of modes for those states.
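Because a state can have several tied modes, the grouped mode calculation returns more than one row per group. The dplyr warning in the output suggests `reframe()` for that case (available from dplyr 1.1.0). Below is a minimal sketch on hypothetical data; `all_modes()` is a helper I wrote as a stand-in for `mfv()` so the example runs without modeest, and I have not checked that it matches `mfv()` in every case.

```{r}
library(dplyr)

# Hypothetical helper standing in for modeest::mfv(): returns all tied modes
all_modes <- function(x) {
  ux <- unique(x)
  freq <- tabulate(match(x, ux))
  sort(ux[freq == max(freq)])
}

# Hypothetical toy data: AL has one repeated value, AZ has none
toy_modes <- data.frame(
  state = c("AL", "AL", "AL", "AZ", "AZ", "AZ"),
  total_employees = c(7L, 7L, 11L, 3L, 10L, 18L)
)

# reframe() allows any number of rows per group and returns ungrouped output
modes_by_state <- toy_modes %>%
  group_by(state) %>%
  reframe(mode = all_modes(total_employees))

# Counting modes per state shows where every county value is unique
mode_counts <- modes_by_state %>% count(state, name = "n_modes")
mode_counts
```

A state whose `n_modes` equals its number of counties is one where every county has a distinct employee count.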