challenge_2
railroads
Nisarg Shah
Data wrangling: using group() and summarise()
Author

Nisarg Shah

Published

August 16, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

  • railroad*.csv
Code
railroad <- read.csv("_data/railroad_2012_clean_county.csv")
head(railroad)
  state               county total_employees
1    AE                  APO               2
2    AK            ANCHORAGE               7
3    AK FAIRBANKS NORTH STAR               2
4    AK               JUNEAU               3
5    AK    MATANUSKA-SUSITNA               2
6    AK                SITKA               1

Describe the data

From the datafram summary we can see that there are 3 columns: state, county, total_employees. The data shows us the number of railroad employees for each county of that state.

state: This variable represents the state where the county is located. There are 53 states represented. county: This variable represents the county within each state. There are 1,709 counties represented in this dataset total_employees: This variable represents the total number of employees in each county. The mean value is 87.2, with a standard deviation of 283.6. The minimum value is 1 employee, the maximum value is 8,207 employees, and the median value is 21 employees.

In addition to the above information, there are also counts and percentages provided for each state and county. For example, 1748 (59.7%) of the counties are located in the 10th state (VA), and 2715 (92.7%) of the counties have a total of 0 employees.

Code
print(summarytools::dfSummary(railroad,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',table.classes = 'table-condensed')

Data Frame Summary

railroad

Dimensions: 2930 x 3
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
state [character]
1. TX
2. GA
3. KY
4. MO
5. IL
6. IA
7. KS
8. NC
9. IN
10. VA
[ 43 others ]
221 ( 7.5% )
152 ( 5.2% )
119 ( 4.1% )
115 ( 3.9% )
103 ( 3.5% )
99 ( 3.4% )
95 ( 3.2% )
94 ( 3.2% )
92 ( 3.1% )
92 ( 3.1% )
1748 ( 59.7% )
0 (0.0%)
county [character]
1. WASHINGTON
2. JEFFERSON
3. FRANKLIN
4. LINCOLN
5. JACKSON
6. MADISON
7. MONTGOMERY
8. CLAY
9. MARION
10. MONROE
[ 1699 others ]
31 ( 1.1% )
26 ( 0.9% )
24 ( 0.8% )
24 ( 0.8% )
22 ( 0.8% )
19 ( 0.6% )
18 ( 0.6% )
17 ( 0.6% )
17 ( 0.6% )
17 ( 0.6% )
2715 ( 92.7% )
0 (0.0%)
total_employees [integer]
Mean (sd) : 87.2 (283.6)
min ≤ med ≤ max:
1 ≤ 21 ≤ 8207
IQR (CV) : 58 (3.3)
404 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-03-27

Provide Grouped Summary Statistics

The data shows that California has the highest sum of employees with 13,137 and Illinois has the highest dispersion with a value of 829.15. On the other hand, AP has only one county with one employee, and AE has one county with two employees. Additionally, there are some states where the mean employees per county is quite high, such as Connecticut with 324 employees per county, and some where it is quite low, such as Hawaii with only 1.33 employees per county. Overall, the data provides information on the distribution of employees across states and their counties in the US.

Code
railroad %>%
  group_by(state) %>%
  select(state, county, total_employees) %>%
  summarize(county_count = n(), mean_employees = mean(total_employees), sum_employees = sum(total_employees), dispersion = sd(total_employees, na.rm = TRUE))
# A tibble: 53 × 5
   state county_count mean_employees sum_employees dispersion
   <chr>        <int>          <dbl>         <int>      <dbl>
 1 AE               1            2               2       NA  
 2 AK               6           17.2           103       34.8
 3 AL              67           63.5          4257      130. 
 4 AP               1            1               1       NA  
 5 AR              72           53.8          3871      131. 
 6 AZ              15          210.           3153      228. 
 7 CA              55          239.          13137      549. 
 8 CO              57           64.0          3650      128. 
 9 CT               8          324            2592      520. 
10 DC               1          279             279       NA  
# … with 43 more rows

When we group the data by county we see that some counties just have 1 state while other counties have multiple.

Code
railroad %>%
  group_by(county) %>%
  select(county, state, total_employees) %>%
  summarize(state_count = n(), mean_employees = mean(total_employees), sum_employees = sum(total_employees), dispersion = sd(total_employees, na.rm = TRUE))
# A tibble: 1,709 × 5
   county    state_count mean_employees sum_employees dispersion
   <chr>           <int>          <dbl>         <int>      <dbl>
 1 ABBEVILLE           1         124              124      NA   
 2 ACADIA              1          13               13      NA   
 3 ACCOMACK            1           4                4      NA   
 4 ADA                 1          81               81      NA   
 5 ADAIR               4           7.25            29       9.32
 6 ADAMS              12          73.2            878     155.  
 7 ADDISON             1           8                8      NA   
 8 AIKEN               1         193              193      NA   
 9 AITKIN              1          19               19      NA   
10 ALACHUA             1          22               22      NA   
# … with 1,699 more rows

Explain and Interpret

The data suggests that the distribution of total employees is highly variable, with a large standard deviation relative to the mean. The middle 50% of the data falls within a relatively narrow range of 58, centered around a median value of 21. The maximum value in the dataset is very high at 8207, which suggests that there may be some extreme outliers in the data.