library(tidyverse)
library(readxl)
library(summarytools)
knitr::opts_chunk$set(echo = TRUE)
Sue-Ellen Duffy
February 23, 2023
I analyzed the “railroad_2012_clean_county.csv” data for Challenge 1. This data describes the total number of railroad employees by county and state in the United States in 2012. At first glance, the data contains 3 columns and 2,930 rows. The columns are state, county, and total_employees.
Rows: 2930 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): state, county
dbl (1): total_employees
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
A tibble: 2,930 × 3

| | state (chr) | county (chr) | total_employees (dbl) |
|---|---|---|---|
| 1 | AE | APO | 2 |
| 2 | AK | ANCHORAGE | 7 |
| 3 | AK | FAIRBANKS NORTH STAR | 2 |
| 4 | AK | JUNEAU | 3 |
| 5 | AK | MATANUSKA-SUSITNA | 2 |
| 6 | AK | SITKA | 1 |
| 7 | AK | SKAGWAY MUNICIPALITY | 88 |
| 8 | AL | AUTAUGA | 102 |
| 9 | AL | BALDWIN | 143 |
| 10 | AL | BARBOUR | 1 |

… with 2,920 more rows
Running dfSummary(data) shows us:
- The data is complete: there are no missing values.
- The ten states with the most counties are listed. Texas has more counties than any other state (221), accounting for 7.5% of all county records in the data.
- Many county names are repeated across states. The summary below lists the ten most common county names. (There are 31 counties named Washington in this data, far more than I thought there were in the United States!)
Data Frame Summary
data
Dimensions: 2930 x 3
Duplicates: 0

| No | Variable | Stats / Values | Freqs (% of Valid) | Valid | Missing |
|---|---|---|---|---|---|
| 1 | state [character] | 1. TX | 221 (7.5%) | 2930 (100.0%) | 0 (0.0%) |
| | | 2. GA | 152 (5.2%) | | |
| | | 3. KY | 119 (4.1%) | | |
| | | 4. MO | 115 (3.9%) | | |
| | | 5. IL | 103 (3.5%) | | |
| | | 6. IA | 99 (3.4%) | | |
| | | 7. KS | 95 (3.2%) | | |
| | | 8. NC | 94 (3.2%) | | |
| | | 9. IN | 92 (3.1%) | | |
| | | 10. VA | 92 (3.1%) | | |
| | | [ 43 others ] | 1748 (59.7%) | | |
| 2 | county [character] | 1. WASHINGTON | 31 (1.1%) | 2930 (100.0%) | 0 (0.0%) |
| | | 2. JEFFERSON | 26 (0.9%) | | |
| | | 3. FRANKLIN | 24 (0.8%) | | |
| | | 4. LINCOLN | 24 (0.8%) | | |
| | | 5. JACKSON | 22 (0.8%) | | |
| | | 6. MADISON | 19 (0.6%) | | |
| | | 7. MONTGOMERY | 18 (0.6%) | | |
| | | 8. CLAY | 17 (0.6%) | | |
| | | 9. MARION | 17 (0.6%) | | |
| | | 10. MONROE | 17 (0.6%) | | |
| | | [ 1699 others ] | 2715 (92.7%) | | |
| 3 | total_employees [numeric] | Mean (sd): 87.2 (283.6); min < med < max: 1 < 21 < 8207; IQR (CV): 58 (3.3) | 404 distinct values | 2930 (100.0%) | 0 (0.0%) |
The data contains 53 distinct state codes, but there are only 50 states, so we need to dig a little deeper to find out what the three additional ‘states’ represent.
[1] "AE" "AK" "AL" "AP" "AR" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "GA" "HI" "IA"
[16] "ID" "IL" "IN" "KS" "KY" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC"
[31] "ND" "NE" "NH" "NJ" "NM" "NV" "NY" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN"
[46] "TX" "UT" "VA" "VT" "WA" "WI" "WV" "WY"
AE, AP, and DC are the three non-state cases. AE and AP are military postal codes (Armed Forces Europe and Armed Forces Pacific), and DC is Washington, D.C.
---
title: "Railroad Employees Challenge 1"
author: "Sue-Ellen Duffy"
description: "Railroad Employees"
date: "02/23/2023"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_1
  - Sue-Ellen Duffy
  - Railroad Employee Dataset
---
```{r}
#| label: setup
#| warning: false
library(tidyverse)
library(readxl)
library(summarytools)
knitr::opts_chunk$set(echo = TRUE)
```
## Reading in the Data
I analyzed the "railroad_2012_clean_county.csv" data for Challenge 1. This data describes the total number of railroad employees by county and state in the United States in 2012. At first glance, the data contains 3 columns and 2,930 rows. The columns are `state`, `county`, and `total_employees`.
```{r}
# Read in railroad_2012_clean_county.csv and store it as `data`
data <- read_csv("_data/railroad_2012_clean_county.csv")
# Preview the data
data
```
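
Reading the file prints a column-specification message because the column types are guessed. A minimal sketch of declaring them explicitly, assuming the three columns described above: `sample_path` and `railroad_sample` are hypothetical names, and the temporary file holds only a few rows copied from the preview so the example runs without the real CSV.

```r
library(readr)

# Hypothetical stand-in for the real file: a few rows copied from the preview,
# written to a temporary CSV so the example runs on its own.
sample_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "state,county,total_employees",
    "AE,APO,2",
    "AK,ANCHORAGE,7",
    "AK,SKAGWAY MUNICIPALITY,88",
    "AL,AUTAUGA,102"
  ),
  sample_path
)

# Declaring the column types documents the expected schema and
# suppresses the column-specification message.
railroad_sample <- read_csv(
  sample_path,
  col_types = cols(
    state = col_character(),
    county = col_character(),
    total_employees = col_double()
  )
)

railroad_sample
```

If the types should still be guessed, passing `show_col_types = FALSE` to `read_csv()` quiets the message without fixing the schema.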
## Summary of Data
Running `dfSummary(data)` shows us:
- The data is complete: there are no missing values.
- The ten states with the most counties are listed. Texas has more counties than any other state (221), accounting for 7.5% of all county records in the data.
- Many county names are repeated across states. The summary below lists the ten most common county names. (There are 31 counties named Washington in this data, far more than I thought there were in the United States!)
```{r}
dfSummary(data)
```
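
The completeness and counties-per-state observations can also be cross-checked directly with dplyr. This is a rough sketch rather than part of the original analysis: `railroad_sample` is a hypothetical name holding only the ten rows shown in the data preview, so its counts differ from the full summary.

```r
library(dplyr)

# The ten rows shown in the data preview, used as a small stand-in
# for the full 2,930-row data set.
railroad_sample <- tibble::tribble(
  ~state, ~county,                ~total_employees,
  "AE",   "APO",                  2,
  "AK",   "ANCHORAGE",            7,
  "AK",   "FAIRBANKS NORTH STAR", 2,
  "AK",   "JUNEAU",               3,
  "AK",   "MATANUSKA-SUSITNA",    2,
  "AK",   "SITKA",                1,
  "AK",   "SKAGWAY MUNICIPALITY", 88,
  "AL",   "AUTAUGA",              102,
  "AL",   "BALDWIN",              143,
  "AL",   "BARBOUR",              1
)

# Missing values per column (all zeros means the data is complete).
missing_counts <- railroad_sample %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Rows (counties) per state, with each state's share of all rows.
state_counts <- railroad_sample %>%
  count(state, sort = TRUE) %>%
  mutate(pct = round(100 * n / sum(n), 1))

missing_counts
state_counts
```

Swapping `count(state, sort = TRUE)` for `count(county, sort = TRUE)` on the full data would give the ranking of repeated county names.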
```{r}
# How many distinct state codes are represented in the data?
data %>%
  select(state) %>%
  n_distinct()
```
The data contains 53 distinct state codes, but there are only 50 states, so we need to dig a little deeper to find out what the three additional 'states' represent.
```{r}
# Show the unique state codes
unique(data$state)
```
AE, AP, and DC are the three non-state cases. AE and AP are military postal codes (Armed Forces Europe and Armed Forces Pacific), and DC is Washington, D.C.
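
Rather than scanning the list by eye, the non-state codes can be picked out by comparing against R's built-in `state.abb` vector of the 50 state abbreviations. A small sketch, assuming a hypothetical `state_codes` tibble that holds only a handful of the codes listed above:

```r
library(dplyr)

# A handful of codes from the list of unique values, standing in for data$state.
state_codes <- tibble::tibble(
  state = c("AE", "AK", "AL", "AP", "DC", "TX", "WY")
)

# state.abb ships with base R and contains the 50 state abbreviations;
# any code not found in it is not a US state.
non_states <- state_codes %>%
  filter(!state %in% state.abb) %>%
  distinct(state) %>%
  pull(state)

non_states
```

On the real data, the same filter applied to `data` should leave only the rows for AE, AP, and DC.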