library(tidyverse)
#library(readr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
Roy Yoon
August 16, 2022
# A tibble: 2,930 × 3
state county total_employees
<chr> <chr> <dbl>
1 AE APO 2
2 AK ANCHORAGE 7
3 AK FAIRBANKS NORTH STAR 2
4 AK JUNEAU 3
5 AK MATANUSKA-SUSITNA 2
6 AK SITKA 1
7 AK SKAGWAY MUNICIPALITY 88
8 AL AUTAUGA 102
9 AL BALDWIN 143
10 AL BARBOUR 1
# … with 2,920 more rows
# ℹ Use `print(n = ...)` to see more rows
[1] 2930 3
[1] "state" "county" "total_employees"
The dataset has 2,930 rows and three variables: ‘state’, ‘county’, and ‘total_employees’.
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
[1] "state" "county" "total_employees"
[1] 2930 3
# A tibble: 2,930 × 3
state county total_employees
<chr> <chr> <dbl>
1 AE APO 2
2 AK ANCHORAGE 7
3 AK FAIRBANKS NORTH STAR 2
4 AK JUNEAU 3
5 AK MATANUSKA-SUSITNA 2
6 AK SITKA 1
7 AK SKAGWAY MUNICIPALITY 88
8 AL AUTAUGA 102
9 AL BALDWIN 143
10 AL BARBOUR 1
# … with 2,920 more rows
# ℹ Use `print(n = ...)` to see more rows
First, I grouped the data by state to get a single total employee count per state, regardless of county.
# A tibble: 53 × 2
state total_employees
<chr> <dbl>
1 AE 2
2 AK 103
3 AL 4257
4 AP 1
5 AR 3871
6 AZ 3153
7 CA 13137
8 CO 3650
9 CT 2592
10 DC 279
# … with 43 more rows
# ℹ Use `print(n = ...)` to see more rows
# A tibble: 2,930 × 3
state county total_employees
<chr> <chr> <dbl>
1 AE APO 2
2 AK ANCHORAGE 7
3 AK FAIRBANKS NORTH STAR 2
4 AK JUNEAU 3
5 AK MATANUSKA-SUSITNA 2
6 AK SITKA 1
7 AK SKAGWAY MUNICIPALITY 88
8 AL AUTAUGA 102
9 AL BALDWIN 143
10 AL BARBOUR 1
# … with 2,920 more rows
# ℹ Use `print(n = ...)` to see more rows
# A tibble: 2,930 × 3
# Groups: state [53]
state county total_employees
<chr> <chr> <dbl>
1 AE APO 2
2 AK ANCHORAGE 7
3 AK FAIRBANKS NORTH STAR 2
4 AK JUNEAU 3
5 AK MATANUSKA-SUSITNA 2
6 AK SITKA 1
7 AK SKAGWAY MUNICIPALITY 88
8 AL AUTAUGA 102
9 AL BALDWIN 143
10 AL BARBOUR 1
# … with 2,920 more rows
# ℹ Use `print(n = ...)` to see more rows
# A tibble: 53 × 2
state total_employees
<chr> <dbl>
1 AE 2
2 AK 103
3 AL 4257
4 AP 1
5 AR 3871
6 AZ 3153
7 CA 13137
8 CO 3650
9 CT 2592
10 DC 279
# … with 43 more rows
# ℹ Use `print(n = ...)` to see more rows
# A tibble: 53 × 3
# Groups: state [53]
state county total_employees
<chr> <chr> <dbl>
1 AE APO 2
2 AK SKAGWAY MUNICIPALITY 88
3 AL JEFFERSON 990
4 AP APO 1
5 AR PULASKI 972
6 AZ PIMA 749
7 CA SAN BERNARDINO 2888
8 CO ADAMS 553
9 CT NEW HAVEN 1561
10 DC WASHINGTON DC 279
# … with 43 more rows
# ℹ Use `print(n = ...)` to see more rows
# A tibble: 170 × 3
# Groups: state [53]
state county total_employees
<chr> <chr> <dbl>
1 AE APO 2
2 AK SITKA 1
3 AL BARBOUR 1
4 AL HENRY 1
5 AP APO 1
6 AR NEWTON 1
7 AZ GREENLEE 3
8 CA MONO 1
9 CO BENT 1
10 CO CHEYENNE 1
# … with 160 more rows
# ℹ Use `print(n = ...)` to see more rows
Work is still in progress.
Thought process:
Examine which state has the most/least ‘total_employees’.
Identify which county has the most/least ‘total_employees’ within the state that has the most/least ‘total_employees’.
Look at which county has the most/least ‘total_employees’ overall, and how that compares to the state values.
Examine the average, min, and max across states and counties.
---
title: "Challenge 2 Instructions"
author: "Roy Yoon"
description: "Data wrangling: using group_by() and summarise()"
date: "08/16/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
  - railroad
  - question
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
#library(readr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Read in the Dataset "railroad_2012_clean_county.csv"
```{r initial inspection}
railroad <- read_csv("_data/railroad_2012_clean_county.csv")
railroad
dim(railroad)
colnames(railroad)
```
The dataset has 2,930 rows and three variables: 'state', 'county', and 'total_employees'.
## Describe the data
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
```{r}
#| label: summary
colnames(railroad)
dim(railroad)
railroad
```
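As a starting point for the description, the case count, the number of distinct states, and a missing-value check can be pulled out in one `summarise()` call. The sketch below is only illustrative: `railroad_toy` is a hypothetical name, and it holds just the seven AE and AK rows copied from the printed tibble, so the numbers do not describe the full file.

```r
library(dplyr)

# Toy subset: the AE and AK rows copied from the printed tibble.
# `railroad_toy` is a hypothetical name; the real analysis would use `railroad`.
railroad_toy <- tibble(
  state = c("AE", "AK", "AK", "AK", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
             "MATANUSKA-SUSITNA", "SITKA", "SKAGWAY MUNICIPALITY"),
  total_employees = c(2, 7, 2, 3, 2, 1, 88)
)

# One-row overview: number of cases, distinct states, and a missing-value check
overview <- railroad_toy %>%
  summarise(
    n_cases = n(),
    n_states = n_distinct(state),
    any_missing = any(is.na(total_employees))
  )
overview
```

Running the same `summarise()` on `railroad` should give the corresponding counts for the whole dataset.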
## Summary Statistics
First, I grouped the data by state to get a single total employee count per state, regardless of county.
```{r}
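# Sum total_employees within each state; tally() names the result column `n`
state_grouped_railroad <- railroad %>%
  select(state, total_employees) %>%
  group_by(state) %>%
  tally(total_employees)
# Rename `n` back to total_employees
state_grouped_railroad <- rename(state_grouped_railroad, total_employees = n)
state_grouped_railroad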
```
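The group, tally, and rename steps can probably be collapsed into a single `count()` call, since `count()` accepts a weight (`wt`) and an output column name (`name`). This is a sketch on a toy subset, not the author's method: `railroad_toy` and `state_totals` are hypothetical names, and the rows are the AE and AK rows copied from the printed tibble.

```r
library(dplyr)

# Toy subset: the AE and AK rows copied from the printed tibble (hypothetical name)
railroad_toy <- tibble(
  state = c("AE", "AK", "AK", "AK", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
             "MATANUSKA-SUSITNA", "SITKA", "SKAGWAY MUNICIPALITY"),
  total_employees = c(2, 7, 2, 3, 2, 1, 88)
)

# Weighted count: sums total_employees per state and names the result directly
state_totals <- railroad_toy %>%
  count(state, wt = total_employees, name = "total_employees")
state_totals
```

On this subset the AK total is 103, which agrees with the AK row in the state-level output.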
## A better way to produce the results above
```{r}
railroad
test_railroad <- railroad %>% group_by(state)
test_railroad
```
```{r}
# Find the total employees for each state
test_railroad %>%
  summarise(total_employees = sum(total_employees))
#trying to find descending order
#test_railroad <- test_railroad %>%summarise(
#total_employees = sum(total_employees)
#)
#test_railroad %>% arrange(desc(total_employees))
```
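One way the descending sort might be done without overwriting `test_railroad` is to store the summary under a new name and pipe it into `arrange(desc(...))`. The sketch below assumes a toy subset (`railroad_toy`, a hypothetical name, holding the AE and AK rows from the printed tibble) and a hypothetical result name `state_ranked`.

```r
library(dplyr)

# Toy subset: the AE and AK rows copied from the printed tibble (hypothetical name)
railroad_toy <- tibble(
  state = c("AE", "AK", "AK", "AK", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
             "MATANUSKA-SUSITNA", "SITKA", "SKAGWAY MUNICIPALITY"),
  total_employees = c(2, 7, 2, 3, 2, 1, 88)
)

# Summarise per state, then sort from most to fewest employees.
# The result goes into a new object, so the grouped data stays untouched.
state_ranked <- railroad_toy %>%
  group_by(state) %>%
  summarise(total_employees = sum(total_employees), .groups = "drop") %>%
  arrange(desc(total_employees))
state_ranked
```

Dropping `desc()` would sort from fewest to most instead.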
```{r}
# Find the county (or counties) in each state with the most employees
test_railroad %>% filter(total_employees == max(total_employees))
```
```{r}
# Find the county (or counties) in each state with the fewest employees.
# Several counties in a state can tie for the minimum, so a state may appear more than once.
test_railroad %>% filter(total_employees == min(total_employees))
```
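The two `filter()` calls above can also be written with `slice_max()` and `slice_min()`, which make the handling of ties explicit through `with_ties`. This is a sketch under assumptions: `railroad_toy`, `top_county`, and `bottom_county` are hypothetical names, and the data is only the AE and AK rows copied from the printed tibble.

```r
library(dplyr)

# Toy subset: the AE and AK rows copied from the printed tibble (hypothetical name)
railroad_toy <- tibble(
  state = c("AE", "AK", "AK", "AK", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
             "MATANUSKA-SUSITNA", "SITKA", "SKAGWAY MUNICIPALITY"),
  total_employees = c(2, 7, 2, 3, 2, 1, 88)
)

# County (or tied counties) with the most employees in each state
top_county <- railroad_toy %>%
  group_by(state) %>%
  slice_max(total_employees, n = 1, with_ties = TRUE) %>%
  ungroup()

# County (or tied counties) with the fewest employees in each state
bottom_county <- railroad_toy %>%
  group_by(state) %>%
  slice_min(total_employees, n = 1, with_ties = TRUE) %>%
  ungroup()

top_county
bottom_county
```

Setting `with_ties = FALSE` would return exactly one row per state, at the cost of hiding tied counties.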
### Explain and Interpret
Work is still in progress.
Thought process:
* Examine which state has the most/least 'total_employees'.
* Identify which county has the most/least 'total_employees' within the state that has the most/least 'total_employees'.
* Look at which county has the most/least 'total_employees' overall, and how that compares to the state values.
* Examine the average, min, and max across states and counties.
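
The steps in this list could be sketched roughly as follows. This is an illustration under assumptions, not a finished analysis: `railroad_toy`, `state_stats`, `top_state`, and `top_county_overall` are hypothetical names, and the data is only the AE and AK rows copied from the printed tibble, so the results say nothing about the full dataset.

```r
library(dplyr)

# Toy subset: the AE and AK rows copied from the printed tibble (hypothetical name)
railroad_toy <- tibble(
  state = c("AE", "AK", "AK", "AK", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
             "MATANUSKA-SUSITNA", "SITKA", "SKAGWAY MUNICIPALITY"),
  total_employees = c(2, 7, 2, 3, 2, 1, 88)
)

# Per-state summary: county count, total, mean, min, and max
state_stats <- railroad_toy %>%
  group_by(state) %>%
  summarise(
    n_counties = n(),
    total = sum(total_employees),
    mean_employees = mean(total_employees),
    min_employees = min(total_employees),
    max_employees = max(total_employees)
  )

# State with the largest total, and the single largest county overall
top_state <- state_stats %>% slice_max(total, n = 1)
top_county_overall <- railroad_toy %>% slice_max(total_employees, n = 1)

state_stats
top_state
top_county_overall
```

Swapping `slice_max()` for `slice_min()` gives the corresponding "least" versions of the last two steps.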