Challenge 2 Instructions


Roy Yoon


August 16, 2022

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in the Dataset “railroad_2012_clean_county.csv”

library(tidyverse)
railroad <- read_csv("_data/railroad_2012_clean_county.csv")

# A tibble: 2,930 × 3
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# … with 2,920 more rows
# ℹ Use `print(n = ...)` to see more rows
[1] 2930    3
[1] "state"           "county"          "total_employees"

There are three variable names: ‘state’, ‘county’, and ‘total_employees’.

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

[1] "state"           "county"          "total_employees"
[1] 2930    3
# A tibble: 2,930 × 3
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# … with 2,920 more rows
# ℹ Use `print(n = ...)` to see more rows
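The output above matches what `colnames()`, `dim()`, and printing the tibble would return. As a minimal sketch of that inspection, the block below builds a small stand-in tibble from the first ten rows shown above (the name `railroad_sample` is hypothetical, not part of the original analysis) so it runs without the CSV file. Each row is one state-county pair, and `total_employees` is the employee count for that county.

```r
library(dplyr)

# Stand-in data: the first ten rows printed above (hypothetical name).
railroad_sample <- tibble(
  state = c("AE", "AK", "AK", "AK", "AK", "AK", "AK", "AL", "AL", "AL"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
             "MATANUSKA-SUSITNA", "SITKA", "SKAGWAY MUNICIPALITY",
             "AUTAUGA", "BALDWIN", "BARBOUR"),
  total_employees = c(2, 7, 2, 3, 2, 1, 88, 102, 143, 1)
)

colnames(railroad_sample)                 # variable names
dim(railroad_sample)                      # rows (state-county cases) by columns
n_states <- n_distinct(railroad_sample$state)  # distinct state codes in the sample
summary(railroad_sample$total_employees)  # spread of the county-level counts
```

On the full `railroad` tibble the same calls would report 2,930 rows and 3 columns, as in the output above.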

Summary Statistics

First, I tried grouping the data by state to get a single total employee count per state (regardless of county).

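state_grouped_railroad <- railroad %>%
    select(state, total_employees) %>%
    group_by(state) %>%
    # a weighted tally sums total_employees within each state into a column named n
    tally(wt = total_employees)

state_grouped_railroad <- rename(state_grouped_railroad, total_employees = n)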

# A tibble: 53 × 2
   state total_employees
   <chr>           <dbl>
 1 AE                  2
 2 AK                103
 3 AL               4257
 4 AP                  1
 5 AR               3871
 6 AZ               3153
 7 CA              13137
 8 CO               3650
 9 CT               2592
10 DC                279
# … with 43 more rows
# ℹ Use `print(n = ...)` to see more rows

A better way of producing the results above

# A tibble: 2,930 × 3
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# … with 2,920 more rows
# ℹ Use `print(n = ...)` to see more rows
test_railroad <- railroad %>% group_by(state)

# A tibble: 2,930 × 3
# Groups:   state [53]
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# … with 2,920 more rows
# ℹ Use `print(n = ...)` to see more rows
# finding the total employees for each state

test_railroad %>% summarise(
  total_employees = sum(total_employees)
)
# A tibble: 53 × 2
   state total_employees
   <chr>           <dbl>
 1 AE                  2
 2 AK                103
 3 AL               4257
 4 AP                  1
 5 AR               3871
 6 AZ               3153
 7 CA              13137
 8 CO               3650
 9 CT               2592
10 DC                279
# … with 43 more rows
# ℹ Use `print(n = ...)` to see more rows
# trying to find descending order

#test_railroad <- test_railroad %>% summarise(
  #total_employees = sum(total_employees)
#)

#test_railroad %>% arrange(desc(total_employees))
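The descending-order attempt above is commented out. One possible reason is that reassigning `test_railroad` would replace the grouped county-level data that the later `filter()` steps rely on. A minimal sketch that avoids this stores the state totals under a separate name (`state_totals` and `railroad_sample` are hypothetical names; the sample reuses the first ten rows printed earlier so the block runs on its own):

```r
library(dplyr)

# Stand-in data: the first ten rows of the railroad tibble (hypothetical name).
railroad_sample <- tibble(
  state = c("AE", "AK", "AK", "AK", "AK", "AK", "AK", "AL", "AL", "AL"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
             "MATANUSKA-SUSITNA", "SITKA", "SKAGWAY MUNICIPALITY",
             "AUTAUGA", "BALDWIN", "BARBOUR"),
  total_employees = c(2, 7, 2, 3, 2, 1, 88, 102, 143, 1)
)

# Sum employees per state, then sort from largest to smallest total.
# The result goes into a new object, so the county-level data stays intact.
state_totals <- railroad_sample %>%
  group_by(state) %>%
  summarise(total_employees = sum(total_employees)) %>%
  arrange(desc(total_employees))

state_totals
```

The totals here reflect only the ten sample rows, so they differ from the full-data figures wherever a state's counties are not all included.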
# finding the county(s) in each state with the most employees
test_railroad %>% filter(total_employees == max(total_employees))
# A tibble: 53 × 3
# Groups:   state [53]
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    SKAGWAY MUNICIPALITY              88
 3 AL    JEFFERSON                        990
 4 AP    APO                                1
 5 AR    PULASKI                          972
 6 AZ    PIMA                             749
 7 CA    SAN BERNARDINO                  2888
 8 CO    ADAMS                            553
 9 CT    NEW HAVEN                       1561
10 DC    WASHINGTON DC                    279
# … with 43 more rows
# ℹ Use `print(n = ...)` to see more rows
# finding the county(s) in each state with the fewest employees; in some states, multiple counties tie for the minimum
test_railroad %>% filter(total_employees == min(total_employees))
# A tibble: 170 × 3
# Groups:   state [53]
   state county   total_employees
   <chr> <chr>              <dbl>
 1 AE    APO                    2
 2 AK    SITKA                  1
 3 AL    BARBOUR                1
 4 AL    HENRY                  1
 5 AP    APO                    1
 6 AR    NEWTON                 1
 7 AZ    GREENLEE               3
 8 CA    MONO                   1
 9 CO    BENT                   1
10 CO    CHEYENNE               1
# … with 160 more rows
# ℹ Use `print(n = ...)` to see more rows
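The minimum filter returns 170 rows for 53 groups because ties are kept. As an alternative sketch, `slice_max()` and `slice_min()` from dplyr express the same per-state selection, and a follow-up `count()` shows how many counties tie in each state. The names `railroad_sample`, `largest`, `smallest`, and `tie_counts` are hypothetical; the sample uses rows that appear in the outputs above (the first ten rows plus AL/HENRY).

```r
library(dplyr)

# Stand-in data: rows taken from the outputs above (hypothetical name).
railroad_sample <- tibble(
  state = c("AE", "AK", "AK", "AK", "AK", "AK", "AK", "AL", "AL", "AL", "AL"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
             "MATANUSKA-SUSITNA", "SITKA", "SKAGWAY MUNICIPALITY",
             "AUTAUGA", "BALDWIN", "BARBOUR", "HENRY"),
  total_employees = c(2, 7, 2, 3, 2, 1, 88, 102, 143, 1, 1)
)

by_state <- railroad_sample %>% group_by(state)

# County (or counties, if tied) with the most employees in each state.
largest <- by_state %>%
  slice_max(total_employees, n = 1, with_ties = TRUE) %>%
  ungroup()

# County (or counties, if tied) with the fewest employees in each state.
smallest <- by_state %>%
  slice_min(total_employees, n = 1, with_ties = TRUE) %>%
  ungroup()

# Number of counties sharing the minimum in each state.
tie_counts <- smallest %>% count(state, name = "n_tied")
```

In this sample, AL has two counties (BARBOUR and HENRY) tied at the minimum, which is the same pattern that inflates the row count in the full output.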

Explain and Interpret

Work is still in progress.

Thought process:

  • examine which state has the most/least ‘total_employees’

  • identify which county has the most/least ‘total_employees’ in the state with the most/least ‘total_employees’

  • look at overall, which county has the most/least ‘total_employees’, and how does that compare to the state values

  • examine the average, min, max across states/counties
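The steps in this list could be sketched roughly as follows. This is only one possible approach, not the finished analysis: every object name is hypothetical, and the data is a ten-row stand-in taken from the tibble printed earlier, so the numbers are illustrative rather than results for the full dataset.

```r
library(dplyr)

# Stand-in data: the first ten rows of the railroad tibble (hypothetical name).
railroad_sample <- tibble(
  state = c("AE", "AK", "AK", "AK", "AK", "AK", "AK", "AL", "AL", "AL"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
             "MATANUSKA-SUSITNA", "SITKA", "SKAGWAY MUNICIPALITY",
             "AUTAUGA", "BALDWIN", "BARBOUR"),
  total_employees = c(2, 7, 2, 3, 2, 1, 88, 102, 143, 1)
)

# Per-state summary: total, number of counties, and county-level mean/min/max.
# New column names are used so later expressions still see the original column.
state_summary <- railroad_sample %>%
  group_by(state) %>%
  summarise(
    state_total = sum(total_employees),
    n_counties  = n(),
    mean_county = mean(total_employees),
    min_county  = min(total_employees),
    max_county  = max(total_employees)
  )

# States with the most and the fewest employees.
top_state    <- state_summary %>% slice_max(state_total, n = 1)
bottom_state <- state_summary %>% slice_min(state_total, n = 1)

# Counties inside the top state, largest first.
top_state_counties <- railroad_sample %>%
  filter(state %in% top_state$state) %>%
  arrange(desc(total_employees))

# Largest and smallest counties overall, to compare against the state totals.
top_county    <- railroad_sample %>% slice_max(total_employees, n = 1)
bottom_county <- railroad_sample %>% slice_min(total_employees, n = 1)
```

Swapping `railroad_sample` for the full `railroad` tibble would apply the same steps to all 2,930 rows. Because `slice_min()` keeps ties by default, `bottom_county` can contain more than one row.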