Railroad Employees Challenge 1

challenge_1

Sue-Ellen Duffy

Railroad Employee Dataset

Author

Sue-Ellen Duffy

Published

February 23, 2023

Code

library(tidyverse)
library(readxl)
library(summarytools)

knitr::opts_chunk$set(echo = TRUE)

Reading in the Data

I analyzed the “railroad_2012_clean_county.csv” data for Challenge 1. This data describes the total number of railroad employees by county and state in the United States in 2012. At first glance, the data contains 3 columns and 2,930 rows. The columns are: state, county, and total_employees.

Code

#Read in railroad_2012_clean_county.csv and store it as data
data <- read_csv("_data/railroad_2012_clean_county.csv")

Rows: 2930 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): state, county
dbl (1): total_employees

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
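
The message above suggests declaring the column types explicitly. A minimal sketch of that is shown here, assuming readr 2.0 or later (needed for literal input via I()). The two rows are copied from the data preview, and the names sample_csv and sample_data are illustrative only. With the real file, the same col_types argument would be passed alongside the file path.

```r
library(readr)

# Two rows taken from the data preview, supplied as literal text via I()
# so the sketch runs without the CSV on disk (assumes readr >= 2.0).
sample_csv <- I("state,county,total_employees\nAK,ANCHORAGE,7\nAL,AUTAUGA,102\n")

# Declaring the column types silences the column specification message
# and fails loudly if the file layout ever changes.
sample_data <- read_csv(
  sample_csv,
  col_types = cols(
    state = col_character(),
    county = col_character(),
    total_employees = col_double()
  )
)

sample_data
```

Setting show_col_types = FALSE would also quiet the message, but it would not check that the columns have the expected types.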

Code

#Preview data 
data

# A tibble: 2,930 × 3
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# … with 2,920 more rows

Summary of Data

Running the dfSummary(data) function shows us:

The data is complete: there are no missing values in any column.
The top ten states ranked by number of counties in the data. Texas has more counties than any other state, accounting for 7.5% of the 2,930 county rows in this dataset.
Many county names are repeated across states. The summary table below lists the ten most common county names in the data. (Washington appears 31 times, which is far more Washington counties than I thought there were in the United States!)

Code

dfSummary(data)

Data Frame Summary  
data  
Dimensions: 2930 x 3  
Duplicates: 0  

-----------------------------------------------------------------------------------------------------------------
No   Variable          Stats / Values             Freqs (% of Valid)    Graph                Valid      Missing  
---- ----------------- -------------------------- --------------------- -------------------- ---------- ---------
1    state             1. TX                       221 ( 7.5%)          I                    2930       0        
     [character]       2. GA                       152 ( 5.2%)          I                    (100.0%)   (0.0%)   
                       3. KY                       119 ( 4.1%)                                                   
                       4. MO                       115 ( 3.9%)                                                   
                       5. IL                       103 ( 3.5%)                                                   
                       6. IA                        99 ( 3.4%)                                                   
                       7. KS                        95 ( 3.2%)                                                   
                       8. NC                        94 ( 3.2%)                                                   
                       9. IN                        92 ( 3.1%)                                                   
                       10. VA                       92 ( 3.1%)                                                   
                       [ 43 others ]              1748 (59.7%)          IIIIIIIIIII                              

2    county            1. WASHINGTON                31 ( 1.1%)                               2930       0        
     [character]       2. JEFFERSON                 26 ( 0.9%)                               (100.0%)   (0.0%)   
                       3. FRANKLIN                  24 ( 0.8%)                                                   
                       4. LINCOLN                   24 ( 0.8%)                                                   
                       5. JACKSON                   22 ( 0.8%)                                                   
                       6. MADISON                   19 ( 0.6%)                                                   
                       7. MONTGOMERY                18 ( 0.6%)                                                   
                       8. CLAY                      17 ( 0.6%)                                                   
                       9. MARION                    17 ( 0.6%)                                                   
                       10. MONROE                   17 ( 0.6%)                                                   
                       [ 1699 others ]            2715 (92.7%)          IIIIIIIIIIIIIIIIII                       

3    total_employees   Mean (sd) : 87.2 (283.6)   404 distinct values   :                    2930       0        
     [numeric]         min < med < max:                                 :                    (100.0%)   (0.0%)   
                       1 < 21 < 8207                                    :                                        
                       IQR (CV) : 58 (3.3)                              :                                        
                                                                        :                                        
-----------------------------------------------------------------------------------------------------------------
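
The missing-value check and the state frequencies reported by dfSummary can also be reproduced by hand. The sketch below is only an illustration: it rebuilds the ten rows shown in the data preview as a small data frame (the name preview is an assumption), so its counts describe those ten rows, not the full dataset. Swapping in the full data object should give the figures in the table above.

```r
library(dplyr)

# The ten rows from the data preview; replace with the full `data` object
# to reproduce the dfSummary figures.
preview <- data.frame(
  state = c("AE", "AK", "AK", "AK", "AK", "AK", "AK", "AL", "AL", "AL"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
             "MATANUSKA-SUSITNA", "SITKA", "SKAGWAY MUNICIPALITY",
             "AUTAUGA", "BALDWIN", "BARBOUR"),
  total_employees = c(2, 7, 2, 3, 2, 1, 88, 102, 143, 1)
)

# Number of missing values in each column (the "Missing" column in dfSummary)
missing_per_column <- colSums(is.na(preview))
missing_per_column

# Rows per state, sorted from most to fewest, with each state's share of all rows
state_counts <- preview %>%
  count(state, sort = TRUE) %>%
  mutate(pct = round(100 * n / sum(n), 1))
state_counts
```

Replacing state with county in the count() call gives the equivalent ranking of repeated county names.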

Code

#How many states are represented in the data?
data %>%
  select(state) %>%
  n_distinct(.)

[1] 53

There are only 50 recognized states, so we need to dig a little deeper to find out what the three additional ‘states’ represent.

Code

#Show unique state data
unique(data$state)

 [1] "AE" "AK" "AL" "AP" "AR" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "GA" "HI" "IA"
[16] "ID" "IL" "IN" "KS" "KY" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC"
[31] "ND" "NE" "NH" "NJ" "NM" "NV" "NY" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN"
[46] "TX" "UT" "VA" "VT" "WA" "WI" "WV" "WY"

AE, AP, and DC are the three non-state cases. AE (Armed Forces Europe) and AP (Armed Forces Pacific) are military postal codes, and DC is the District of Columbia.
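
Rather than scanning the list by eye, the non-state codes can be found by comparing against state.abb, the vector of the 50 state abbreviations built into base R. This is a sketch of one way to do it: the state_codes vector is copied from the unique() output above, and the variable names are illustrative.

```r
# State codes copied from the unique(data$state) output
state_codes <- c(
  "AE", "AK", "AL", "AP", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL",
  "GA", "HI", "IA", "ID", "IL", "IN", "KS", "KY", "LA", "MA", "MD", "ME",
  "MI", "MN", "MO", "MS", "MT", "NC", "ND", "NE", "NH", "NJ", "NM", "NV",
  "NY", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VA",
  "VT", "WA", "WI", "WV", "WY"
)

# state.abb ships with base R (datasets package) and holds the 50 state codes;
# setdiff() keeps the codes in the data that are not among them.
non_states <- setdiff(state_codes, state.abb)
non_states
# [1] "AE" "AP" "DC"
```

With the full dataset loaded, setdiff(unique(data$state), state.abb) performs the same check directly.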