challenge_2
Railroad
Data wrangling: using group() and summarise()
Author

Shoshana Buck

Published

August 16, 2022

Code
library(tidyverse)
library(readxl)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in the Data

The state County data is looking at the total railroad employment by state and county in the United States and Canada using the calendar year of 2012. The source of this data is most likely taken from the United States Railroad Retirement Board. There is a total of 2,931 rows which reflect the counties and there are 3 columns which reflect the “State”, “County”, and the “total employees” within the county.

Code
# | Label: Reading in data
StateCounty<-read_excel("_data/StateCounty2012.xls",
                     skip = 4,
                     col_names= c("state", "delete",  "county",
                                  "delete", "employees"))%>%
  select(!contains("delete"))%>%
  filter(!str_detect(state, "Total"))

StateCounty<-head(StateCounty, -2)%>%
  mutate(county = ifelse(state=="CANADA", "CANADA", county))

StateCounty
# A tibble: 2,931 × 3
   state county               employees
   <chr> <chr>                    <dbl>
 1 AE    APO                          2
 2 AK    ANCHORAGE                    7
 3 AK    FAIRBANKS NORTH STAR         2
 4 AK    JUNEAU                       3
 5 AK    MATANUSKA-SUSITNA            2
 6 AK    SITKA                        1
 7 AK    SKAGWAY MUNICIPALITY        88
 8 AL    AUTAUGA                    102
 9 AL    BALDWIN                    143
10 AL    BARBOUR                      1
# … with 2,921 more rows
# ℹ Use `print(n = ...)` to see more rows

Describe the data

Though there are 2,929 state-counties there are only 53 unique States and 1708 unique counties, I found this by including n_distinct in the summarise() function.

Code
StateCounty %>% 
  summarise(across(c(state), n_distinct))
# A tibble: 1 × 1
  state
  <int>
1    54
Code
StateCounty %>% 
  summarise(across(c(county), n_distinct))
# A tibble: 1 × 1
  county
   <int>
1   1710

Provide Grouped Summary Statistics

Code
StateCounty %>%
  group_by(state)
# A tibble: 2,931 × 3
# Groups:   state [54]
   state county               employees
   <chr> <chr>                    <dbl>
 1 AE    APO                          2
 2 AK    ANCHORAGE                    7
 3 AK    FAIRBANKS NORTH STAR         2
 4 AK    JUNEAU                       3
 5 AK    MATANUSKA-SUSITNA            2
 6 AK    SITKA                        1
 7 AK    SKAGWAY MUNICIPALITY        88
 8 AL    AUTAUGA                    102
 9 AL    BALDWIN                    143
10 AL    BARBOUR                      1
# … with 2,921 more rows
# ℹ Use `print(n = ...)` to see more rows
Code
StateCounty %>% 
  summarise(total.employees=sum(employees))
# A tibble: 1 × 1
  total.employees
            <dbl>
1          256094
Code
StateCounty %>% 
  group_by(state,county) %>% 
  summarise(total.employees=sum(employees)) %>% 
  arrange(desc(total.employees))
# A tibble: 2,931 × 3
# Groups:   state [54]
   state county           total.employees
   <chr> <chr>                      <dbl>
 1 IL    COOK                        8207
 2 TX    TARRANT                     4235
 3 NE    DOUGLAS                     3797
 4 NY    SUFFOLK                     3685
 5 VA    INDEPENDENT CITY            3249
 6 FL    DUVAL                       3073
 7 CA    SAN BERNARDINO              2888
 8 CA    LOS ANGELES                 2545
 9 TX    HARRIS                      2535
10 NE    LINCOLN                     2289
# … with 2,921 more rows
# ℹ Use `print(n = ...)` to see more rows

Explain and Interpret

During the calendar year of 2012 the United States Railroad Retirement Board had 255,395 employees in the United States and Canada. I thought it was interesting that Cook county of Illinois had 8207 employees, which was the most employees as of 2012.