Challenge 5

Introduction to Visualization
Author

Nick Boonstra

Published

August 22, 2022

library(tidyverse)
library(ggplot2)
library(summarytools)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

rr_orig<-read_csv("_data/railroad_2012_clean_county.csv")

rr_orig
# A tibble: 2,930 × 3
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# … with 2,920 more rows
# ℹ Use `print(n = ...)` to see more rows

Briefly describe the data

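This data set records the number of railroad employees in each U.S. county, plus a few overseas military postal locations (for example, APO under the code AE). Each of the 2,930 rows is one state-county pair with its employee total, and the file name suggests the counts are from 2012.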
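As a small sketch of how the non-state codes could be spotted, the snippet below compares state codes against base R's built-in `state.abb` vector. The `codes` vector holds only the codes visible in the first printed rows above; the names `codes`, `known`, and `non_state` are illustrative and not part of the original analysis.

```r
# State codes visible in the first rows of the printed tibble above
codes <- c("AE", "AK", "AL")

# state.abb is base R's built-in vector of the 50 state abbreviations;
# DC is added by hand because state.abb does not include it
known <- c(state.abb, "DC")

# Codes that are neither a state nor DC
non_state <- setdiff(codes, known)
non_state
```

On the full data, the same check would presumably be `setdiff(unique(rr_orig$state), known)`.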

print(dfSummary(rr_orig, varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

rr_orig

Dimensions: 2930 x 3
Duplicates: 0
Variable                    Stats / Values              Freqs (% of Valid)    Missing
state [character]           1. TX                        221 ( 7.5%)          0 (0.0%)
                            2. GA                        152 ( 5.2%)
                            3. KY                        119 ( 4.1%)
                            4. MO                        115 ( 3.9%)
                            5. IL                        103 ( 3.5%)
                            6. IA                         99 ( 3.4%)
                            7. KS                         95 ( 3.2%)
                            8. NC                         94 ( 3.2%)
                            9. IN                         92 ( 3.1%)
                            10. VA                        92 ( 3.1%)
                            [ 43 others ]               1748 (59.7%)
county [character]          1. WASHINGTON                 31 ( 1.1%)          0 (0.0%)
                            2. JEFFERSON                  26 ( 0.9%)
                            3. FRANKLIN                   24 ( 0.8%)
                            4. LINCOLN                    24 ( 0.8%)
                            5. JACKSON                    22 ( 0.8%)
                            6. MADISON                    19 ( 0.6%)
                            7. MONTGOMERY                 18 ( 0.6%)
                            8. CLAY                       17 ( 0.6%)
                            9. MARION                     17 ( 0.6%)
                            10. MONROE                    17 ( 0.6%)
                            [ 1699 others ]             2715 (92.7%)
total_employees [numeric]   Mean (sd) : 87.2 (283.6)    404 distinct values   0 (0.0%)
                            min ≤ med ≤ max:
                            1 ≤ 21 ≤ 8207
                            IQR (CV) : 58 (3.3)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-28

One thing I was particularly curious about is how the average number of employees per county compares across states:

rr_orig %>% 
  group_by(state) %>% 
  summarise(mean_emp=mean(total_employees,na.rm=TRUE)) %>% 
  arrange(desc(mean_emp)) %>% 
  slice(1:10)
# A tibble: 10 × 2
   state mean_emp
   <chr>    <dbl>
 1 DE        498.
 2 NJ        397.
 3 CT        324 
 4 MA        282.
 5 NY        280.
 6 DC        279 
 7 CA        239.
 8 AZ        210.
 9 PA        196.
10 MD        196.

The data show that Delaware has the highest average number of employees per county. This finding becomes even more interesting when looking at how few counties Delaware has:

rr_orig %>% 
  filter(state=="DE")
# A tibble: 3 × 3
  state county     total_employees
  <chr> <chr>                <dbl>
1 DE    KENT                   158
2 DE    NEW CASTLE            1275
3 DE    SUSSEX                  62

Clearly, New Castle County pulls the state mean sharply upward, especially since Delaware has only three counties. However, New Castle's total is far from the highest county-level employment in the country:
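To make the effect of that one county concrete, here is a minimal sketch comparing the mean and the median of Delaware's three counts. It uses only the numbers printed in the table above; the vector name `de_emp` and the derived names are illustrative.

```r
# Employee counts for Delaware's three counties, copied from the output above
de_emp <- c(KENT = 158, `NEW CASTLE` = 1275, SUSSEX = 62)

de_mean   <- mean(de_emp)    # (158 + 1275 + 62) / 3
de_median <- median(de_emp)  # middle value of 62, 158, 1275

# Share of Delaware's railroad employees located in New Castle County
nc_share <- de_emp[["NEW CASTLE"]] / sum(de_emp)

round(c(mean = de_mean, median = de_median, new_castle_share = nc_share), 3)
```

The mean (about 498) is more than three times the median (158), and New Castle alone accounts for roughly 85% of the state's total, which is why a state-level mean over so few counties should be read with some caution.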

rr_orig %>% 
  arrange(desc(total_employees)) %>% 
  slice(1:10)
# A tibble: 10 × 3
   state county           total_employees
   <chr> <chr>                      <dbl>
 1 IL    COOK                        8207
 2 TX    TARRANT                     4235
 3 NE    DOUGLAS                     3797
 4 NY    SUFFOLK                     3685
 5 VA    INDEPENDENT CITY            3249
 6 FL    DUVAL                       3073
 7 CA    SAN BERNARDINO              2888
 8 CA    LOS ANGELES                 2545
 9 TX    HARRIS                      2535
10 NE    LINCOLN                     2289

Perhaps unsurprisingly, Cook County, IL – home of a major transit center in Chicago – employs the most railroad workers of any county in the country. A bit more surprisingly, New Castle County's 1,275 employees are not nearly enough for it to register in the top ten, where the tenth-ranked county (Lincoln, NE) has 2,289!

Tidy Data (as needed)

These data are already tidy!

Visualization

Using ggplot2, I overlaid a kernel density estimate on a histogram of the state-level averages, that is, the mean number of railroad employees per county within each state.

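rr_orig %>% 
  # one row per state: mean employees across that state's counties
  group_by(state) %>% 
  summarise(mean_emp=mean(total_employees)) %>% 
  ggplot(aes(x=mean_emp)) +
  # after_stat(density) replaces the deprecated ..density.. notation and
  # puts the histogram on the same scale as the density curve
  geom_histogram(aes(y=after_stat(density)),bins=50,alpha=0.5,fill="red") +
  geom_density(fill="blue",alpha=0.2) +
  theme_bw() +
  labs(title="Average Number of Employees per County, by State",
       x="Mean Employees per County",
       y="Density")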

Because this data set contains only one numeric variable (total_employees), I was not sure how to create a bivariate visualization.
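One possible way around this, offered as a sketch rather than as part of the original analysis, is to derive a second variable by aggregating: the number of counties per state can be plotted against the state's total railroad employment. The helper `summarise_by_state()` and the small `rr_sample` tibble are hypothetical; the sample rows are copied from the outputs shown earlier, and the full `rr_orig` would be passed in their place.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: collapse county rows to one row per state
summarise_by_state <- function(df) {
  df %>%
    group_by(state) %>%
    summarise(n_counties = n(),                  # counties listed for the state
              total_emp  = sum(total_employees), # statewide employee total
              .groups    = "drop")
}

# A few rows copied from the printed outputs above, for illustration only
rr_sample <- tibble(
  state  = c("DE", "DE", "DE", "IL", "TX", "TX", "NE", "NE"),
  county = c("KENT", "NEW CASTLE", "SUSSEX", "COOK",
             "TARRANT", "HARRIS", "DOUGLAS", "LINCOLN"),
  total_employees = c(158, 1275, 62, 8207, 4235, 2535, 3797, 2289)
)

state_tbl <- summarise_by_state(rr_sample)

# Scatterplot: counties with railroad employees vs. total employees, per state
p <- ggplot(state_tbl, aes(x = n_counties, y = total_emp, label = state)) +
  geom_point() +
  geom_text(vjust = -0.8) +
  theme_bw() +
  labs(title = "Railroad Employment vs. Number of Counties, by State",
       x = "Counties with Railroad Employees",
       y = "Total Employees")
p
```

With the full data, `summarise_by_state(rr_orig)` would give one point per state code; the sample here is far too small to suggest any actual relationship.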