Challenge2_KatiePopiela

1) Read in a data set, and describe the data using both words and any supporting information (e.g., tables).

2) Provide summary statistics for different interesting groups within the data, and interpret those statistics.

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(readr)
library(dplyr)

#For this challenge, I will read in "railroad_2012_clean_county.csv". This dataset gives the number of railroad employees for each county in each state: one row per state-county pair, with three columns (state, county, total_employees).
countyrr_data <- read.csv("_data/railroad_2012_clean_county.csv")
head(countyrr_data)
  state               county total_employees
1    AE                  APO               2
2    AK            ANCHORAGE               7
3    AK FAIRBANKS NORTH STAR               2
4    AK               JUNEAU               3
5    AK    MATANUSKA-SUSITNA               2
6    AK                SITKA               1
dim(countyrr_data)
[1] 2930    3
#For confirmation, I got the dimensions of the table: 2,930 rows and 3 columns. With that many rows, the data must be filtered and grouped in some way in order to present specific information. As an example, I will filter the data down to one state (NY) and find the largest and smallest number of RR employees in a single county in that state.
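Grouping is the other option mentioned here. As a rough sketch of what a per-state summary could look like, the example below rebuilds only the six rows printed by head() above into a toy data frame (the name sample_rr and the summary column names are my own, not part of the dataset) and summarizes them by state. The same pipeline should work on countyrr_data, but that is not run here.

```r
library(dplyr)

# Toy data: only the six rows shown by head() above, NOT the full file
sample_rr <- data.frame(
  state = c("AE", "AK", "AK", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
             "MATANUSKA-SUSITNA", "SITKA"),
  total_employees = c(2L, 7L, 2L, 3L, 2L, 1L)
)

# One row per state: number of counties, total, mean and max employees
state_summary <- sample_rr %>%
  group_by(state) %>%
  summarize(
    counties = n(),
    total = sum(total_employees),
    mean = mean(total_employees),
    max = max(total_employees)
  )

print(state_summary)
```

Swapping sample_rr for countyrr_data would give one summary row per state code in the full file.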

countyrr_NY <- countyrr_data %>%
  filter(state=="NY")
min(countyrr_NY$total_employees)
[1] 5
max(countyrr_NY$total_employees)
[1] 3685
countyrr_NY %>%
  group_by(total_employees) %>%
  slice_min(order_by = county)
# A tibble: 54 × 3
# Groups:   total_employees [54]
   state county    total_employees
   <chr> <chr>               <int>
 1 NY    LEWIS                   5
 2 NY    YATES                   6
 3 NY    SCHUYLER                7
 4 NY    TOMPKINS                8
 5 NY    CORTLAND               11
 6 NY    SENECA                 13
 7 NY    SULLIVAN               14
 8 NY    JEFFERSON              19
 9 NY    FULTON                 20
10 NY    CHEMUNG                21
# … with 44 more rows
# ℹ Use `print(n = ...)` to see more rows
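One thing to note about the output above: grouping by total_employees and then slicing on county keeps one row per distinct employee count (54 groups, 54 rows), so if any counties share the same count, only the alphabetically first one is shown. A simpler way to get a sorted list, or just the extremes, is probably arrange() with slice_min() and slice_max(). The sketch below uses only the ten NY rows printed above as toy data (ny_sample is a made-up name), so its "most" county is the largest of those ten rows, not of all of NY.

```r
library(dplyr)

# Toy data: the ten NY rows printed above, NOT the full NY subset
ny_sample <- data.frame(
  state = "NY",
  county = c("LEWIS", "YATES", "SCHUYLER", "TOMPKINS", "CORTLAND",
             "SENECA", "SULLIVAN", "JEFFERSON", "FULTON", "CHEMUNG"),
  total_employees = c(5L, 6L, 7L, 8L, 11L, 13L, 14L, 19L, 20L, 21L)
)

# Every county, sorted from fewest to most employees (ties are all kept)
ny_sorted <- ny_sample %>%
  arrange(total_employees)

# Just the extremes: the county (or counties, if tied) at each end
fewest <- ny_sample %>% slice_min(total_employees, n = 1)
most <- ny_sample %>% slice_max(total_employees, n = 1)

print(fewest)
print(most)
```

On the real data, countyrr_NY would take the place of ny_sample.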
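#There is a massive difference between the NY counties with the most and fewest RR employees: Suffolk County (3,685) and Lewis County (5). Despite that range, the average number of RR employees per county in NY is relatively low at 279.508, and the SD (standard deviation) of 590.779 is more than twice the mean, which suggests a few very large counties are pulling the average up. The variance came out as 349,019.9, which looks odd at first, but variance is just the SD squared (590.779^2 is about 349,020), so it is measured in squared employees and is much harder to read directly than the SD.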
countyrr_NY %>%
  summarize(mean = mean(total_employees, na.rm = TRUE), sd = sd(total_employees, na.rm = TRUE), var = var(total_employees, na.rm = TRUE)) 
      mean      sd      var
1 279.5082 590.779 349019.9
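The variance looks less strange once it is checked against the SD. As a small sanity check in base R, the sketch below squares the SD reported above and compares it to the reported variance, then confirms the same relationship on a toy vector (x holds the ten NY values printed earlier, not the full NY column).

```r
# Reported values from the summarize() output above
reported_sd <- 590.779
reported_var <- 349019.9

# Squaring the SD should reproduce the variance, up to rounding
sd_squared <- reported_sd^2
print(sd_squared)
print(abs(sd_squared - reported_var) < 1)

# Same relationship on a toy vector: the ten NY values printed earlier
x <- c(5, 6, 7, 8, 11, 13, 14, 19, 20, 21)
same <- isTRUE(all.equal(sd(x)^2, var(x)))
print(same)
```

Because variance is in squared units, the SD is usually the easier of the two to interpret next to the mean.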