Challenge 2 Instructions

challenge_2
railroad
question
Author

Roy Yoon

Published

August 16, 2022

Code
library(tidyverse)
#library(readr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in the Dataset “railroad_2012_clean_county.csv”

Code
railroad <- read_csv("_data/railroad_2012_clean_county.csv")

railroad
# A tibble: 2,930 × 3
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# … with 2,920 more rows
# ℹ Use `print(n = ...)` to see more rows
Code
dim(railroad)
[1] 2930    3
Code
colnames(railroad)
[1] "state"           "county"          "total_employees"

The dataset has 2,930 rows and three variables: ‘state’, ‘county’, and ‘total_employees’.

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

Code
colnames(railroad)
[1] "state"           "county"          "total_employees"
Code
dim(railroad)
[1] 2930    3
Code
railroad 
# A tibble: 2,930 × 3
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# … with 2,920 more rows
# ℹ Use `print(n = ...)` to see more rows
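A few quick checks can back up a description of the cases and variables. The sketch below is only an illustration: `railroad_sample` is a hypothetical stand-in built by hand from the ten rows printed above, so its results describe that subset and not the full 2,930-row dataset. The same calls should work on `railroad` itself.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for `railroad`: the ten rows printed above, typed in
# by hand so the sketch runs without the CSV file.
railroad_sample <- tribble(
  ~state, ~county,                ~total_employees,
  "AE",   "APO",                    2,
  "AK",   "ANCHORAGE",              7,
  "AK",   "FAIRBANKS NORTH STAR",   2,
  "AK",   "JUNEAU",                 3,
  "AK",   "MATANUSKA-SUSITNA",      2,
  "AK",   "SITKA",                  1,
  "AK",   "SKAGWAY MUNICIPALITY",  88,
  "AL",   "AUTAUGA",              102,
  "AL",   "BALDWIN",              143,
  "AL",   "BARBOUR",                1
)

# A case should be one state/county pair: count pairs that appear more than once
n_repeated_pairs <- railroad_sample %>%
  count(state, county) %>%
  filter(n > 1) %>%
  nrow()

# Number of distinct state codes and number of missing cells
n_states  <- n_distinct(railroad_sample$state)
n_missing <- sum(is.na(railroad_sample))

# Spread of the employee counts across counties
summary(railroad_sample$total_employees)
```

If `n_repeated_pairs` is zero, each row can be read as one county (or county-like unit) within a state, with `total_employees` as the railroad employee count for that unit.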

Summary Statistics

First, I tried grouping the data by state to get a single total employee count per state (regardless of county).

Code
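# sum total_employees within each state; tally() names the result column `n`
state_grouped_railroad <- railroad %>%
  select(state, total_employees) %>%
  group_by(state) %>%
  tally(total_employees)

# rename `n` back to total_employees
state_grouped_railroad <- rename(state_grouped_railroad, total_employees = n)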

state_grouped_railroad
# A tibble: 53 × 2
   state total_employees
   <chr>           <dbl>
 1 AE                  2
 2 AK                103
 3 AL               4257
 4 AP                  1
 5 AR               3871
 6 AZ               3153
 7 CA              13137
 8 CO               3650
 9 CT               2592
10 DC                279
# … with 43 more rows
# ℹ Use `print(n = ...)` to see more rows
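The group, tally, and rename steps above can probably be collapsed into one call, since dplyr's `count()` accepts a `wt` argument (the column to sum) and a `name` argument (the output column name). This is a minimal sketch on a hand-typed subset, not the full data: `railroad_sample` is a hypothetical name, and its AL total covers only the three AL counties listed here.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for `railroad`: the ten rows printed earlier
railroad_sample <- tribble(
  ~state, ~county,                ~total_employees,
  "AE",   "APO",                    2,
  "AK",   "ANCHORAGE",              7,
  "AK",   "FAIRBANKS NORTH STAR",   2,
  "AK",   "JUNEAU",                 3,
  "AK",   "MATANUSKA-SUSITNA",      2,
  "AK",   "SITKA",                  1,
  "AK",   "SKAGWAY MUNICIPALITY",  88,
  "AL",   "AUTAUGA",              102,
  "AL",   "BALDWIN",              143,
  "AL",   "BARBOUR",                1
)

# Weighted count: sums total_employees per state and names the column directly
state_totals <- railroad_sample %>%
  count(state, wt = total_employees, name = "total_employees")

state_totals
```

The six AK rows are the complete set for that state in the printed output, so the AK value here should agree with the 103 shown above.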

A better way of producing the results above

Code
railroad
# A tibble: 2,930 × 3
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# … with 2,920 more rows
# ℹ Use `print(n = ...)` to see more rows
Code
test_railroad <- railroad %>% group_by(state)

test_railroad
# A tibble: 2,930 × 3
# Groups:   state [53]
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# … with 2,920 more rows
# ℹ Use `print(n = ...)` to see more rows
Code
# finding the total employees for each state

test_railroad %>%
  summarise(total_employees = sum(total_employees))
# A tibble: 53 × 2
   state total_employees
   <chr>           <dbl>
 1 AE                  2
 2 AK                103
 3 AL               4257
 4 AP                  1
 5 AR               3871
 6 AZ               3153
 7 CA              13137
 8 CO               3650
 9 CT               2592
10 DC                279
# … with 43 more rows
# ℹ Use `print(n = ...)` to see more rows
Code
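# trying to sort the state totals in descending order
# (commented out; note that running this as written would overwrite
# test_railroad with the state-level totals)

# test_railroad <- test_railroad %>%
#   summarise(total_employees = sum(total_employees))

# test_railroad %>% arrange(desc(total_employees))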
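The commented-out attempt above can likely be made to work by piping the summary straight into `arrange()` and storing it under a new name, so the county-level data stays available for the later steps. A minimal sketch, assuming a hand-typed subset (`railroad_sample` and `state_totals_desc` are hypothetical names, and the ordering shown applies to this subset only):

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for `railroad`: the ten rows printed earlier
railroad_sample <- tribble(
  ~state, ~county,                ~total_employees,
  "AE",   "APO",                    2,
  "AK",   "ANCHORAGE",              7,
  "AK",   "FAIRBANKS NORTH STAR",   2,
  "AK",   "JUNEAU",                 3,
  "AK",   "MATANUSKA-SUSITNA",      2,
  "AK",   "SITKA",                  1,
  "AK",   "SKAGWAY MUNICIPALITY",  88,
  "AL",   "AUTAUGA",              102,
  "AL",   "BALDWIN",              143,
  "AL",   "BARBOUR",                1
)

# Summarise per state, then sort from the largest total to the smallest.
# The result goes into a new object, so the county-level rows are untouched.
state_totals_desc <- railroad_sample %>%
  group_by(state) %>%
  summarise(total_employees = sum(total_employees)) %>%
  arrange(desc(total_employees))

state_totals_desc
```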
Code
# finding the county or counties in each state with the most employees
test_railroad %>% filter(total_employees == max(total_employees))
# A tibble: 53 × 3
# Groups:   state [53]
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    SKAGWAY MUNICIPALITY              88
 3 AL    JEFFERSON                        990
 4 AP    APO                                1
 5 AR    PULASKI                          972
 6 AZ    PIMA                             749
 7 CA    SAN BERNARDINO                  2888
 8 CO    ADAMS                            553
 9 CT    NEW HAVEN                       1561
10 DC    WASHINGTON DC                    279
# … with 43 more rows
# ℹ Use `print(n = ...)` to see more rows
Code
# finding the county or counties in each state with the fewest employees; several states have more than one county tied at the minimum, so the result has more rows than states
test_railroad %>% filter(total_employees == min(total_employees))
# A tibble: 170 × 3
# Groups:   state [53]
   state county   total_employees
   <chr> <chr>              <dbl>
 1 AE    APO                    2
 2 AK    SITKA                  1
 3 AL    BARBOUR                1
 4 AL    HENRY                  1
 5 AP    APO                    1
 6 AR    NEWTON                 1
 7 AZ    GREENLEE               3
 8 CA    MONO                   1
 9 CO    BENT                   1
10 CO    CHEYENNE               1
# … with 160 more rows
# ℹ Use `print(n = ...)` to see more rows
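The tie behaviour seen above (170 rows for 53 states) can be made explicit with dplyr's `slice_max()` and `slice_min()`, which work per group and take a `with_ties` argument. The sketch below is an assumption-laden illustration: `railroad_sample` is a hypothetical tibble typed in from rows that appear in the printed outputs above, not the full dataset.

```r
library(dplyr)
library(tibble)

# Hypothetical subset, hand-typed from rows shown in the outputs above
railroad_sample <- tribble(
  ~state, ~county,                ~total_employees,
  "AK",   "ANCHORAGE",              7,
  "AK",   "SITKA",                  1,
  "AK",   "SKAGWAY MUNICIPALITY",  88,
  "AL",   "BALDWIN",              143,
  "AL",   "BARBOUR",                1,
  "AL",   "HENRY",                  1,
  "AL",   "JEFFERSON",            990
)

by_state <- railroad_sample %>% group_by(state)

# County with the most employees in each state
top_county <- by_state %>%
  slice_max(total_employees, n = 1)

# Counties with the fewest employees, keeping every county tied at the minimum
bottom_with_ties <- by_state %>%
  slice_min(total_employees, n = 1, with_ties = TRUE)

# Same, but keeping exactly one row per state even when there is a tie
bottom_no_ties <- by_state %>%
  slice_min(total_employees, n = 1, with_ties = FALSE)
```

With `with_ties = TRUE` (the default) the result matches the `filter(total_employees == min(total_employees))` approach; `with_ties = FALSE` is an option when exactly one row per state is wanted, at the cost of dropping the other tied counties.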

Explain and Interpret

Work is still in progress.

Thought process:

  • examine which state has the most/least ‘total_employees’

  • identify which county has the most/least ‘total_employees’ in the state with the most/least ‘total_employees’

  • look at which county has the most/least ‘total_employees’ overall, and how that compares to the state values

  • examine the average, min, max across states/counties
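The steps in this list could be sketched roughly as follows. This is one possible outline and not a finished analysis: `railroad_sample` is a hypothetical stand-in typed in from the first ten rows printed earlier, and the summary column names (`state_total`, `county_mean`, and so on) are made up for the example. They are deliberately different from `total_employees` so that later expressions inside `summarise()` still refer to the original county-level column.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for `railroad`: the ten rows printed earlier
railroad_sample <- tribble(
  ~state, ~county,                ~total_employees,
  "AE",   "APO",                    2,
  "AK",   "ANCHORAGE",              7,
  "AK",   "FAIRBANKS NORTH STAR",   2,
  "AK",   "JUNEAU",                 3,
  "AK",   "MATANUSKA-SUSITNA",      2,
  "AK",   "SITKA",                  1,
  "AK",   "SKAGWAY MUNICIPALITY",  88,
  "AL",   "AUTAUGA",              102,
  "AL",   "BALDWIN",              143,
  "AL",   "BARBOUR",                1
)

# State total plus the average, min, and max county count within each state
state_summary <- railroad_sample %>%
  group_by(state) %>%
  summarise(
    n_counties  = n(),
    state_total = sum(total_employees),
    county_mean = mean(total_employees),
    county_min  = min(total_employees),
    county_max  = max(total_employees)
  )

# States with the most and the least employees
top_state    <- state_summary %>% slice_max(state_total, n = 1) %>% pull(state)
bottom_state <- state_summary %>% slice_min(state_total, n = 1) %>% pull(state)

# County with the most employees inside the top state
top_county_in_top_state <- railroad_sample %>%
  filter(state %in% top_state) %>%
  slice_max(total_employees, n = 1)

# County with the most employees overall, for comparison with the state values
top_county_overall <- railroad_sample %>%
  slice_max(total_employees, n = 1)
```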