Challenge 1 - Railroad Employees

challenge_1
railroads
Joseph Vincent
Author: Joseph Vincent

Published: February 15, 2023

Code
library(tidyverse)
library(summarytools)
library(ggplot2)

knitr::opts_chunk$set(echo = TRUE)

Reading in the data

  • railroad_2012_clean_county.csv ⭐
Code
# loading in dataset and assigning to variable 'railroad'
# using head to preview the dataset

railroad <- read_csv("_data/railroad_2012_clean_county.csv")
Rows: 2930 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): state, county
dbl (1): total_employees

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
head(railroad)
# A tibble: 6 × 3
  state county               total_employees
  <chr> <chr>                          <dbl>
1 AE    APO                                2
2 AK    ANCHORAGE                          7
3 AK    FAIRBANKS NORTH STAR               2
4 AK    JUNEAU                             3
5 AK    MATANUSKA-SUSITNA                  2
6 AK    SITKA                              1

Describing the dataset

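This data set consists of 2,930 rows and 3 columns: state, county and total_employees. Each row appears to give the number of railroad employees in one county of the United States. The state column holds 53 distinct codes, which is more than the 50 states; the first row, for example, is AE/APO, an Armed Forces postal address rather than a county.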

The smallest number of railroad employees in a given county is 1, and the largest is 8,207.

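The mean is about 87 employees per county, with a large standard deviation (about 284). The median is only 21, so the distribution is strongly right-skewed: most counties have few railroad employees and a small number have very many.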
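The figures above are read off the dfSummary() output further down. As a minimal sketch, the same statistics could also be computed directly with dplyr's summarise(). To keep the example self-contained, it uses a stand-in tibble (the hypothetical name railroad_preview) built from the six rows shown by head(railroad); on the real data, the full railroad tibble would take its place.

```r
library(dplyr)

# Stand-in data: the six rows printed by head(railroad).
# Assumption: with the real file loaded, use `railroad` instead.
railroad_preview <- tibble(
  state = c("AE", "AK", "AK", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR",
             "JUNEAU", "MATANUSKA-SUSITNA", "SITKA"),
  total_employees = c(2, 7, 2, 3, 2, 1)
)

# One row of summary statistics for the employee counts
employee_stats <- railroad_preview %>%
  summarise(
    min_employees    = min(total_employees),
    median_employees = median(total_employees),
    mean_employees   = mean(total_employees),
    sd_employees     = sd(total_employees),
    max_employees    = max(total_employees)
  )

employee_stats
```

Comparing the mean with the median in this output is a quick way to spot skew without drawing a plot.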

I also learned that county names repeat across states: 31 counties are named “Washington” and 26 are named “Jefferson”.
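The name counts come from the dfSummary() frequency table, but a tally like this can also be sketched with dplyr's count(). In the six preview rows every county name is unique, so this example tallies the state column instead; the stand-in tibble railroad_preview is a hypothetical name built from the head(railroad) output. Swapping in county on the full railroad tibble would be expected to put WASHINGTON and JEFFERSON at the top.

```r
library(dplyr)

# Stand-in data: the six rows printed by head(railroad).
# Assumption: with the real file loaded, use `railroad` instead.
railroad_preview <- tibble(
  state = c("AE", "AK", "AK", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR",
             "JUNEAU", "MATANUSKA-SUSITNA", "SITKA"),
  total_employees = c(2, 7, 2, 3, 2, 1)
)

# Number of rows per state code, most frequent first
state_counts <- railroad_preview %>%
  count(state, sort = TRUE)

state_counts
```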

Data summary

Code
# finding the dimensions of 'railroad'
dim(railroad)
[1] 2930    3
Code
# finding the column names of 'railroad'
colnames(railroad)
[1] "state"           "county"          "total_employees"
Code
# using summarytools::dfSummary() to summarize every column
dfSummary(railroad)
Data Frame Summary  
railroad  
Dimensions: 2930 x 3  
Duplicates: 0  

-----------------------------------------------------------------------------------------------------------------
No   Variable          Stats / Values             Freqs (% of Valid)    Graph                Valid      Missing  
---- ----------------- -------------------------- --------------------- -------------------- ---------- ---------
1    state             1. TX                       221 ( 7.5%)          I                    2930       0        
     [character]       2. GA                       152 ( 5.2%)          I                    (100.0%)   (0.0%)   
                       3. KY                       119 ( 4.1%)                                                   
                       4. MO                       115 ( 3.9%)                                                   
                       5. IL                       103 ( 3.5%)                                                   
                       6. IA                        99 ( 3.4%)                                                   
                       7. KS                        95 ( 3.2%)                                                   
                       8. NC                        94 ( 3.2%)                                                   
                       9. IN                        92 ( 3.1%)                                                   
                       10. VA                       92 ( 3.1%)                                                   
                       [ 43 others ]              1748 (59.7%)          IIIIIIIIIII                              

2    county            1. WASHINGTON                31 ( 1.1%)                               2930       0        
     [character]       2. JEFFERSON                 26 ( 0.9%)                               (100.0%)   (0.0%)   
                       3. FRANKLIN                  24 ( 0.8%)                                                   
                       4. LINCOLN                   24 ( 0.8%)                                                   
                       5. JACKSON                   22 ( 0.8%)                                                   
                       6. MADISON                   19 ( 0.6%)                                                   
                       7. MONTGOMERY                18 ( 0.6%)                                                   
                       8. CLAY                      17 ( 0.6%)                                                   
                       9. MARION                    17 ( 0.6%)                                                   
                       10. MONROE                   17 ( 0.6%)                                                   
                       [ 1699 others ]            2715 (92.7%)          IIIIIIIIIIIIIIIIII                       

3    total_employees   Mean (sd) : 87.2 (283.6)   404 distinct values   :                    2930       0        
     [numeric]         min < med < max:                                 :                    (100.0%)   (0.0%)   
                       1 < 21 < 8207                                    :                                        
                       IQR (CV) : 58 (3.3)                              :                                        
                                                                        :                                        
-----------------------------------------------------------------------------------------------------------------
Code
# practice: select the total_employees column, then compute min/max manually
employees <- select(railroad, total_employees)
head(employees)
# A tibble: 6 × 1
  total_employees
            <dbl>
1               2
2               7
3               2
4               3
5               2
6               1
Code
min(employees)
[1] 1
Code
max(employees)
[1] 8207
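select() returns a one-column tibble, and min() and max() work on it here because its only column is numeric. As a hedged alternative, pull() extracts the column as a plain numeric vector, and range() then gives both extremes in one call. The sketch below again uses a stand-in tibble (hypothetical name railroad_preview) built from the six rows shown by head(railroad).

```r
library(dplyr)

# Stand-in data: the six rows printed by head(railroad).
# Assumption: with the real file loaded, use `railroad` instead.
railroad_preview <- tibble(
  state = c("AE", "AK", "AK", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR",
             "JUNEAU", "MATANUSKA-SUSITNA", "SITKA"),
  total_employees = c(2, 7, 2, 3, 2, 1)
)

# pull() returns a numeric vector rather than a tibble
employee_counts <- pull(railroad_preview, total_employees)

# range() returns c(minimum, maximum)
employee_range <- range(employee_counts)
employee_range
```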