Paarth Tandon
December 28, 2022
This part is the same as in challenge 1.
As seen by using `head`, this CSV has three columns: `state <chr>`, `county <chr>`, and `total_employees <dbl>`. It contains two character columns and one double column. Essentially, it contains the number of employees for each (`state`, `county`) pair.
[1] "# unique states: 53"
[1] "# unique counties: 1709"
[1] "range of total employees: [1, 8207]"
I believe that this data was collected from each railroad station in the United States. It was most likely collected for bookkeeping purposes, but I could see it being used to analyze which railroad stations need more employees and which are overstaffed. Of course, answering these questions would require combining this dataset with more data.
state | n_counties | sum_emp | mean_emp | sd_emp |
---|---|---|---|---|
DE | 3 | 1495 | 498.333333 | 674.3236117 |
NJ | 21 | 8329 | 396.619048 | 338.2173083 |
CT | 8 | 2592 | 324.000000 | 520.1983138 |
MA | 12 | 3379 | 281.583333 | 203.8299016 |
NY | 61 | 17050 | 279.508197 | 590.7790231 |
DC | 1 | 279 | 279.000000 | NA |
CA | 55 | 13137 | 238.854546 | 549.4691518 |
AZ | 15 | 3153 | 210.200000 | 227.7819132 |
PA | 65 | 12769 | 196.446154 | 293.0664937 |
MD | 24 | 4709 | 196.208333 | 233.2815017 |
IL | 103 | 19131 | 185.737864 | 829.1464659 |
NE | 89 | 13176 | 148.044944 | 511.5815697 |
WA | 39 | 5222 | 133.897436 | 255.7140462 |
WY | 22 | 2876 | 130.727273 | 168.9799035 |
FL | 67 | 7419 | 110.731343 | 386.0113274 |
OH | 88 | 9056 | 102.909091 | 147.9123410 |
RI | 5 | 487 | 97.400000 | 129.0186033 |
IN | 92 | 8537 | 92.793478 | 233.0627675 |
TX | 221 | 19839 | 89.769231 | 350.1155344 |
VA | 92 | 7551 | 82.076087 | 340.7355403 |
UT | 25 | 1917 | 76.680000 | 142.5724143 |
MO | 115 | 8419 | 73.208696 | 208.1161744 |
OR | 33 | 2322 | 70.363636 | 108.4495327 |
NM | 29 | 1958 | 67.517241 | 112.7198235 |
KS | 95 | 6092 | 64.126316 | 167.3644413 |
CO | 57 | 3650 | 64.035088 | 127.7507289 |
MN | 86 | 5467 | 63.569767 | 122.3857489 |
AL | 67 | 4257 | 63.537313 | 130.1652014 |
MT | 53 | 3327 | 62.773585 | 122.9539691 |
NV | 12 | 746 | 62.166667 | 94.7953138 |
LA | 63 | 3915 | 62.142857 | 101.4812643 |
WV | 53 | 3213 | 60.622642 | 85.7537230 |
GA | 152 | 8605 | 56.611842 | 113.1291853 |
WI | 69 | 3773 | 54.681159 | 82.1718953 |
TN | 91 | 4952 | 54.417582 | 94.8181963 |
AR | 72 | 3871 | 53.763889 | 131.1367948 |
MI | 78 | 3932 | 50.410256 | 109.7598838 |
SC | 46 | 2296 | 49.913044 | 53.9122027 |
ND | 49 | 2204 | 44.979592 | 92.4696999 |
ID | 36 | 1563 | 43.416667 | 95.5478564 |
ME | 16 | 654 | 40.875000 | 38.1153950 |
IA | 99 | 4019 | 40.595960 | 76.7957693 |
KY | 119 | 4811 | 40.428571 | 76.9114141 |
NH | 10 | 393 | 39.300000 | 54.3324131 |
NC | 94 | 3143 | 33.436170 | 58.5875980 |
OK | 73 | 2318 | 31.753425 | 55.8621271 |
MS | 78 | 2111 | 27.064103 | 46.6866703 |
VT | 14 | 259 | 18.500000 | 24.5443084 |
SD | 52 | 949 | 18.250000 | 34.6041338 |
AK | 6 | 103 | 17.166667 | 34.7644454 |
AE | 1 | 2 | 2.000000 | NA |
HI | 3 | 4 | 1.333333 | 0.5773503 |
AP | 1 | 1 | 1.000000 | NA |
Above is a dataframe with five columns. The first column is the state. The second column (`n_counties`) is the number of counties recorded for that state. The third column (`sum_emp`) is the total number of employees in that state. The fourth column (`mean_emp`) is the mean number of employees per county in that state. The final column (`sd_emp`) is the standard deviation of employees per county in that state. When a state has only one county in the data, the standard deviation cannot be calculated and is shown as `NA`.
As seen in the summary dataframe, Delaware (DE) has the highest mean number of employees per county of all the states.
state | county | total_employees |
---|---|---|
DE | NEW CASTLE | 1275 |
DE | KENT | 158 |
DE | SUSSEX | 62 |
Looking specifically at Delaware, most of its employees are in New Castle County, while Kent and Sussex have very few in comparison.
state | county | total_employees |
---|---|---|
NJ | ESSEX | 1097 |
NJ | MIDDLESEX | 955 |
NJ | HUDSON | 871 |
NJ | MONMOUTH | 862 |
NJ | UNION | 738 |
NJ | OCEAN | 589 |
NJ | BERGEN | 513 |
NJ | BURLINGTON | 464 |
NJ | CAMDEN | 427 |
NJ | MERCER | 361 |
NJ | MORRIS | 296 |
NJ | GLOUCESTER | 270 |
NJ | PASSAIC | 231 |
NJ | SUSSEX | 178 |
NJ | SOMERSET | 148 |
NJ | WARREN | 115 |
NJ | HUNTERDON | 68 |
NJ | ATLANTIC | 58 |
NJ | CUMBERLAND | 39 |
NJ | SALEM | 30 |
NJ | CAPE MAY | 19 |
In comparison, New Jersey (NJ, second-highest mean) tells a different story. New Jersey has 21 counties, and its employees are spread much more evenly across them than in Delaware. The standard deviations reflect this: New Jersey's (about 338) is roughly half of Delaware's (about 674).
I first chose to group by state, as it gives a bigger picture of which states have a higher concentration of employees. After that, I focused on the top two states by mean employees per county and compared them.
---
title: "Challenge 2"
author: "Paarth Tandon"
description: "Data wrangling: using group_by() and summarise()"
date: "12/28/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
    df-print: kable
categories:
  - challenge_2
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Read in the Data
```{r}
# read in the data using readr
rail <- read_csv("_data/railroad_2012_clean_county.csv")
# view a few data points
head(rail)
```
## Describe the data
This part is the same as in challenge 1.
As seen by using `head`, this CSV has three columns: `state <chr>`, `county <chr>`, and `total_employees <dbl>`. It contains two character columns and one double column. Essentially, it contains the number of employees for each (`state`, `county`) pair.
```{r}
#| label: summary
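# number of distinct state codes (includes DC and military codes such as AE and AP)
u_states <- rail$state %>%
  unique() %>%
  length()
sprintf('# unique states: %s', u_states)
# number of distinct county names; a name shared by several states is counted once
u_counties <- rail$county %>%
  unique() %>%
  length()
sprintf('# unique counties: %s', u_counties)
# smallest and largest employee count in any single row
range_emp <- rail$total_employees %>%
  range()
sprintf('range of total employees: [%s, %s]', range_emp[1], range_emp[2])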
```
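County names can repeat across states (SUSSEX appears under both DE and NJ in this data), so counting unique values of `county` alone gives the number of distinct names, not the number of distinct (`state`, `county`) pairs. The sketch below illustrates the difference on four rows copied from the tables in this post. The object names `toy`, `n_names`, and `n_pairs` are made up for this example.

```r
library(dplyr)

# Four rows copied from the DE and NJ tables; SUSSEX appears in both states
toy <- tibble(
  state  = c("DE", "DE", "NJ", "NJ"),
  county = c("KENT", "SUSSEX", "SUSSEX", "ESSEX"),
  total_employees = c(158, 62, 178, 1097)
)

# Distinct county names: SUSSEX is counted only once
n_names <- n_distinct(toy$county)

# Distinct (state, county) pairs: DE/SUSSEX and NJ/SUSSEX are counted separately
n_pairs <- toy %>%
  distinct(state, county) %>%
  nrow()

sprintf("distinct names: %s, distinct pairs: %s", n_names, n_pairs)
```

On these four rows there are 3 distinct names but 4 distinct pairs. The same `distinct(state, county)` call could be applied to `rail` if a count of pairs is wanted.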
I believe that this data was collected from each railroad station in the United States. It was most likely collected for bookkeeping purposes, but I could see it being used to analyze which railroad stations need more employees and which are overstaffed. Of course, answering these questions would require combining this dataset with more data.
## Provide Grouped Summary Statistics
```{r}
rail %>%
  group_by(state) %>%
  summarise(n_counties = n(),
            sum_emp = sum(total_employees),
            mean_emp = mean(total_employees),
            sd_emp = sd(total_employees)) %>%
  arrange(desc(mean_emp))
```
Above is a dataframe with five columns. The first column is the state. The second column (`n_counties`) is the number of counties recorded for that state. The third column (`sum_emp`) is the total number of employees in that state. The fourth column (`mean_emp`) is the mean number of employees per county in that state. The final column (`sd_emp`) is the standard deviation of employees per county in that state. When a state has only one county in the data, the standard deviation cannot be calculated and is shown as `NA`.
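The `NA` values come from how `sd()` works: the sample standard deviation divides by `n - 1`, which is zero when a group has a single row. A minimal sketch on four rows copied from this post's tables (DC has one county, DE has three) shows the effect. The names `toy` and `by_state` exist only for this illustration.

```r
library(dplyr)

# DC has a single county in the data; DE has three
toy <- tibble(
  state = c("DC", "DE", "DE", "DE"),
  total_employees = c(279, 1275, 158, 62)
)

by_state <- toy %>%
  group_by(state) %>%
  summarise(
    n_counties = n(),              # rows (counties) per state
    sd_emp = sd(total_employees)   # NA when the group has only one row
  )

by_state
```

Here `sd_emp` is `NA` for DC, while DE gets the same value as in the summary above.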
As seen in the summary dataframe, Delaware (DE) has the highest mean number of employees per county of all the states.
```{r}
rail %>%
  filter(state == "DE") %>%
  arrange(desc(total_employees))
```
Looking specifically at Delaware, most of its employees are in New Castle County, while Kent and Sussex have very few in comparison.
```{r}
rail %>%
  filter(state == "NJ") %>%
  arrange(desc(total_employees))
```
In comparison, New Jersey (NJ, second-highest mean) tells a different story. New Jersey has 21 counties, and its employees are spread much more evenly across them than in Delaware. The standard deviations reflect this: New Jersey's (about 338) is roughly half of Delaware's (about 674).
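Because the two states have different means, the raw standard deviations are not the only way to compare how concentrated the employees are. As one possible sketch (not part of the original analysis), the share of employees in the largest county and the coefficient of variation (standard deviation divided by mean) can be computed per state. The county totals below are copied from the two tables above. The names `de_nj`, `concentration`, `top_share`, and `cv` are made up for this example.

```r
library(dplyr)

# County totals copied from the Delaware and New Jersey tables in this post
de_nj <- tibble(
  state = c(rep("DE", 3), rep("NJ", 21)),
  total_employees = c(
    1275, 158, 62,
    1097, 955, 871, 862, 738, 589, 513, 464, 427, 361, 296,
    270, 231, 178, 148, 115, 68, 58, 39, 30, 19
  )
)

concentration <- de_nj %>%
  group_by(state) %>%
  summarise(
    # share of the state's employees located in its largest county
    top_share = max(total_employees) / sum(total_employees),
    # coefficient of variation: spread relative to the state's own mean
    cv = sd(total_employees) / mean(total_employees)
  )

concentration
```

On these numbers, New Castle holds roughly 85% of Delaware's employees, while Essex holds roughly 13% of New Jersey's, which is consistent with the comparison above.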
### Explain and Interpret
I first chose to group by state, as it gives a bigger picture of which states have a higher concentration of employees. After that, I focused on the top two states by mean employees per county and compared them.