challenge_2
Data wrangling: using group() and summarise()
Author

Paarth Tandon

Published

December 28, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in the Data

Code
# read in the data using readr
rail <- read_csv("_data/railroad_2012_clean_county.csv")
# view a few data points
head(rail)
state county total_employees
AE APO 2
AK ANCHORAGE 7
AK FAIRBANKS NORTH STAR 2
AK JUNEAU 3
AK MATANUSKA-SUSITNA 2
AK SITKA 1

Describe the data

This part is the same as in challenge 1.

As seen by using head, this csv has three columns: state <chr>, county <chr>, and total_employees <dbl>. It contains two character based columns and one double column. Essentially, it contains the number of employees at each state, county pair.

Code
u_states <- rail$state %>%
                unique() %>%
                length()
sprintf('# unique states: %s', u_states)
[1] "# unique states: 53"
Code
u_counties <- rail$county %>%
                unique() %>%
                length()
sprintf('# unique counties: %s', u_counties)
[1] "# unique counties: 1709"
Code
range_emp <- rail$total_employees %>%
                range()
sprintf('range of total employees: [%s, %s]', range_emp[1], range_emp[2])
[1] "range of total employees: [1, 8207]"

I believe that this data was collected from each railroad station in the United States. It is most likely collected for bookkeeping purposes, but I could see it being used for analysis of which railroad stations need more employees, and which are overstaffed. Of course, answering these questions would require more data to be combined with this dataset.

Provide Grouped Summary Statistics

Code
rail %>%
    group_by(`state`) %>%
    summarise(n_counties=n(),
              sum_emp=sum(`total_employees`),
              mean_emp=mean(`total_employees`),
              sd_emp=sd(`total_employees`)) %>%
    arrange(desc(`mean_emp`))
state n_counties sum_emp mean_emp sd_emp
DE 3 1495 498.333333 674.3236117
NJ 21 8329 396.619048 338.2173083
CT 8 2592 324.000000 520.1983138
MA 12 3379 281.583333 203.8299016
NY 61 17050 279.508197 590.7790231
DC 1 279 279.000000 NA
CA 55 13137 238.854546 549.4691518
AZ 15 3153 210.200000 227.7819132
PA 65 12769 196.446154 293.0664937
MD 24 4709 196.208333 233.2815017
IL 103 19131 185.737864 829.1464659
NE 89 13176 148.044944 511.5815697
WA 39 5222 133.897436 255.7140462
WY 22 2876 130.727273 168.9799035
FL 67 7419 110.731343 386.0113274
OH 88 9056 102.909091 147.9123410
RI 5 487 97.400000 129.0186033
IN 92 8537 92.793478 233.0627675
TX 221 19839 89.769231 350.1155344
VA 92 7551 82.076087 340.7355403
UT 25 1917 76.680000 142.5724143
MO 115 8419 73.208696 208.1161744
OR 33 2322 70.363636 108.4495327
NM 29 1958 67.517241 112.7198235
KS 95 6092 64.126316 167.3644413
CO 57 3650 64.035088 127.7507289
MN 86 5467 63.569767 122.3857489
AL 67 4257 63.537313 130.1652014
MT 53 3327 62.773585 122.9539691
NV 12 746 62.166667 94.7953138
LA 63 3915 62.142857 101.4812643
WV 53 3213 60.622642 85.7537230
GA 152 8605 56.611842 113.1291853
WI 69 3773 54.681159 82.1718953
TN 91 4952 54.417582 94.8181963
AR 72 3871 53.763889 131.1367948
MI 78 3932 50.410256 109.7598838
SC 46 2296 49.913044 53.9122027
ND 49 2204 44.979592 92.4696999
ID 36 1563 43.416667 95.5478564
ME 16 654 40.875000 38.1153950
IA 99 4019 40.595960 76.7957693
KY 119 4811 40.428571 76.9114141
NH 10 393 39.300000 54.3324131
NC 94 3143 33.436170 58.5875980
OK 73 2318 31.753425 55.8621271
MS 78 2111 27.064103 46.6866703
VT 14 259 18.500000 24.5443084
SD 52 949 18.250000 34.6041338
AK 6 103 17.166667 34.7644454
AE 1 2 2.000000 NA
HI 3 4 1.333333 0.5773503
AP 1 1 1.000000 NA

Above is a dataframe that has five columns. The first column is the state. The second column is how many counties are in that state. The third column is how many employees are in that state. The fourth column is the mean employees per county in that state. The final column is the standard deviation of employees per county in that state. Sometimes, the standard deviation cannot be calculated, as there is only one sample (county) in that state.

As seen in the dataframe, Delaware (DE) has the highest mean employees of all the states.

Code
rail %>%
    filter(`state`=="DE") %>%
    arrange(desc(`total_employees`))
state county total_employees
DE NEW CASTLE 1275
DE KENT 158
DE SUSSEX 62

Taking a look specifically at Delaware, it seems that their employees are mostly at New Castle, while Kent and Sussex have very few employees in comparison.

Code
rail %>%
    filter(`state`=="NJ") %>%
    arrange(desc(`total_employees`))
state county total_employees
NJ ESSEX 1097
NJ MIDDLESEX 955
NJ HUDSON 871
NJ MONMOUTH 862
NJ UNION 738
NJ OCEAN 589
NJ BERGEN 513
NJ BURLINGTON 464
NJ CAMDEN 427
NJ MERCER 361
NJ MORRIS 296
NJ GLOUCESTER 270
NJ PASSAIC 231
NJ SUSSEX 178
NJ SOMERSET 148
NJ WARREN 115
NJ HUNTERDON 68
NJ ATLANTIC 58
NJ CUMBERLAND 39
NJ SALEM 30
NJ CAPE MAY 19

In comparison, looking at New Jersey (NJ, second highest mean), tells a different story. New Jersey has 21 counties, and the employees are much more dispersed that in Delaware. This can also be noted by comparing their standard deviations, where New Jersey is half as dispersed as Delaware.

Explain and Interpret

I first chose to group by state, as it gives a bigger picture as to which states have a higher concentration of employees. After which, I focused on the top two states by mean employees, and compared them.