Challenge 2

challenge_2

Data wrangling: using group() and summarise()

Author

Paarth Tandon

Published

December 28, 2022

Code

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in the Data

Code

# read in the data using readr
rail <- read_csv("_data/railroad_2012_clean_county.csv")
# view a few data points
head(rail)

state	county	total_employees
AE	APO	2
AK	ANCHORAGE	7
AK	FAIRBANKS NORTH STAR	2
AK	JUNEAU	3
AK	MATANUSKA-SUSITNA	2
AK	SITKA	1

Describe the data

This part is the same as in challenge 1.

As seen by using head, this csv has three columns: state <chr>, county <chr>, and total_employees <dbl>. It contains two character based columns and one double column. Essentially, it contains the number of employees at each state, county pair.

Code

u_states <- rail$state %>%
                unique() %>%
                length()
sprintf('# unique states: %s', u_states)

[1] "# unique states: 53"

Code

u_counties <- rail$county %>%
                unique() %>%
                length()
sprintf('# unique counties: %s', u_counties)

[1] "# unique counties: 1709"

Code

range_emp <- rail$total_employees %>%
                range()
sprintf('range of total employees: [%s, %s]', range_emp[1], range_emp[2])

[1] "range of total employees: [1, 8207]"

I believe that this data was collected from each railroad station in the United States. It is most likely collected for bookkeeping purposes, but I could see it being used for analysis of which railroad stations need more employees, and which are overstaffed. Of course, answering these questions would require more data to be combined with this dataset.

Provide Grouped Summary Statistics

Code

rail %>%
    group_by(`state`) %>%
    summarise(n_counties=n(),
              sum_emp=sum(`total_employees`),
              mean_emp=mean(`total_employees`),
              sd_emp=sd(`total_employees`)) %>%
    arrange(desc(`mean_emp`))

state	n_counties	sum_emp	mean_emp	sd_emp
DE	3	1495	498.333333	674.3236117
NJ	21	8329	396.619048	338.2173083
CT	8	2592	324.000000	520.1983138
MA	12	3379	281.583333	203.8299016
NY	61	17050	279.508197	590.7790231
DC	1	279	279.000000	NA
CA	55	13137	238.854546	549.4691518
AZ	15	3153	210.200000	227.7819132
PA	65	12769	196.446154	293.0664937
MD	24	4709	196.208333	233.2815017
IL	103	19131	185.737864	829.1464659
NE	89	13176	148.044944	511.5815697
WA	39	5222	133.897436	255.7140462
WY	22	2876	130.727273	168.9799035
FL	67	7419	110.731343	386.0113274
OH	88	9056	102.909091	147.9123410
RI	5	487	97.400000	129.0186033
IN	92	8537	92.793478	233.0627675
TX	221	19839	89.769231	350.1155344
VA	92	7551	82.076087	340.7355403
UT	25	1917	76.680000	142.5724143
MO	115	8419	73.208696	208.1161744
OR	33	2322	70.363636	108.4495327
NM	29	1958	67.517241	112.7198235
KS	95	6092	64.126316	167.3644413
CO	57	3650	64.035088	127.7507289
MN	86	5467	63.569767	122.3857489
AL	67	4257	63.537313	130.1652014
MT	53	3327	62.773585	122.9539691
NV	12	746	62.166667	94.7953138
LA	63	3915	62.142857	101.4812643
WV	53	3213	60.622642	85.7537230
GA	152	8605	56.611842	113.1291853
WI	69	3773	54.681159	82.1718953
TN	91	4952	54.417582	94.8181963
AR	72	3871	53.763889	131.1367948
MI	78	3932	50.410256	109.7598838
SC	46	2296	49.913044	53.9122027
ND	49	2204	44.979592	92.4696999
ID	36	1563	43.416667	95.5478564
ME	16	654	40.875000	38.1153950
IA	99	4019	40.595960	76.7957693
KY	119	4811	40.428571	76.9114141
NH	10	393	39.300000	54.3324131
NC	94	3143	33.436170	58.5875980
OK	73	2318	31.753425	55.8621271
MS	78	2111	27.064103	46.6866703
VT	14	259	18.500000	24.5443084
SD	52	949	18.250000	34.6041338
AK	6	103	17.166667	34.7644454
AE	1	2	2.000000	NA
HI	3	4	1.333333	0.5773503
AP	1	1	1.000000	NA

Above is a dataframe that has five columns. The first column is the state. The second column is how many counties are in that state. The third column is how many employees are in that state. The fourth column is the mean employees per county in that state. The final column is the standard deviation of employees per county in that state. Sometimes, the standard deviation cannot be calculated, as there is only one sample (county) in that state.

As seen in the dataframe, Delaware (DE) has the highest mean employees of all the states.

Code

rail %>%
    filter(`state`=="DE") %>%
    arrange(desc(`total_employees`))

state	county	total_employees
DE	NEW CASTLE	1275
DE	KENT	158
DE	SUSSEX	62

Taking a look specifically at Delaware, it seems that their employees are mostly at New Castle, while Kent and Sussex have very few employees in comparison.

Code

rail %>%
    filter(`state`=="NJ") %>%
    arrange(desc(`total_employees`))

state	county	total_employees
NJ	ESSEX	1097
NJ	MIDDLESEX	955
NJ	HUDSON	871
NJ	MONMOUTH	862
NJ	UNION	738
NJ	OCEAN	589
NJ	BERGEN	513
NJ	BURLINGTON	464
NJ	CAMDEN	427
NJ	MERCER	361
NJ	MORRIS	296
NJ	GLOUCESTER	270
NJ	PASSAIC	231
NJ	SUSSEX	178
NJ	SOMERSET	148
NJ	WARREN	115
NJ	HUNTERDON	68
NJ	ATLANTIC	58
NJ	CUMBERLAND	39
NJ	SALEM	30
NJ	CAPE MAY	19

In comparison, looking at New Jersey (NJ, second highest mean), tells a different story. New Jersey has 21 counties, and the employees are much more dispersed that in Delaware. This can also be noted by comparing their standard deviations, where New Jersey is half as dispersed as Delaware.

Explain and Interpret

I first chose to group by state, as it gives a bigger picture as to which states have a higher concentration of employees. After which, I focused on the top two states by mean employees, and compared them.