challenge_2
railroads
Matthew_Weiner
Data wrangling: using group_by() and summarise()
Author

Matthew Weiner

Published

March 20, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Introduction

For this challenge I chose to analyze the railroads dataset.

Getting Started

I first had to load in the needed package and then read in the CSV file. I then used the ‘head()’ command to quickly view the first few entries in the dataset.

Code
library(readr)
rr <- read_csv("_data/railroad_2012_clean_county.csv")
head(rr)
# A tibble: 6 × 3
  state county               total_employees
  <chr> <chr>                          <dbl>
1 AE    APO                                2
2 AK    ANCHORAGE                          7
3 AK    FAIRBANKS NORTH STAR               2
4 AK    JUNEAU                             3
5 AK    MATANUSKA-SUSITNA                  2
6 AK    SITKA                              1

Understand the Data

To get a better idea of the contents of the dataset, I looked at the names of the columns.

Code
colnames(rr)
[1] "state"           "county"          "total_employees"

This shows us that there are 3 columns: “state”, “county”, and “total_employees”.

Based on the names of the columns and the name of the dataset, I am concluding that each row represents one county and that “total_employees” is the number of railroad workers employed in that county in 2012.
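One way to sanity-check that reading is to confirm that no state/county pair appears more than once. The following is a minimal sketch, not a definitive check: it rebuilds the six rows printed by head(rr) as a small tibble so that it runs on its own, and rr_sample and duplicate_pairs are names made up for this illustration. On the full data, rr would take the place of rr_sample.

```r
library(dplyr)

# The six rows shown by head(rr), typed in by hand (illustration only,
# not the full dataset).
rr_sample <- tibble(
  state = c("AE", "AK", "AK", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR",
             "JUNEAU", "MATANUSKA-SUSITNA", "SITKA"),
  total_employees = c(2, 7, 2, 3, 2, 1)
)

# Count how often each state/county pair occurs and keep only repeats.
# An empty result is consistent with "one row per county".
duplicate_pairs <- rr_sample %>%
  count(state, county) %>%
  filter(n > 1)

nrow(duplicate_pairs)        # 0 for these six rows
n_distinct(rr_sample$state)  # 2 (AE and AK)
```

If the same check on rr returned any rows, the "one row per county" interpretation would need to be revisited before averaging by state.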

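This data was likely collected through some kind of government census or employment survey, although that is a guess: the three columns give no indication of how the data was gathered, and railroad workers in the United States are mostly employed by private companies rather than by the government.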

Grouping Results

Through the use of dplyr commands such as group_by() and summarize(), we are able to get a better understanding of some details about our data.

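The following code block computes, for each state, the mean number of railroad employees per county and sorts the result in ascending order. I chose to display the head and tail of the resulting table because it is interesting to see which states have the lowest and the highest county averages. Note that a high county average is not the same as a high statewide total, since states differ in how many counties they have.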

Code
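# Mean of the county-level employee counts within each state, smallest first
average_employees_per_state <- rr %>% 
  group_by(state) %>% 
  summarize(mean_employees = mean(total_employees)) %>%
  arrange(mean_employees)

# Six states with the lowest county averages
head(average_employees_per_state)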
# A tibble: 6 × 2
  state mean_employees
  <chr>          <dbl>
1 AP              1   
2 HI              1.33
3 AE              2   
4 AK             17.2 
5 SD             18.2 
6 VT             18.5 
Code
tail(average_employees_per_state)
# A tibble: 6 × 2
  state mean_employees
  <chr>          <dbl>
1 DC              279 
2 NY              280.
3 MA              282.
4 CT              324 
5 NJ              397.
6 DE              498.
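Because the mean above is taken over counties, it does not say which states employ the most railroad workers overall. A rough sketch of how the statewide total and the number of counties could be reported next to the mean is shown below. It again uses only the six rows printed by head(rr), so the AK figures here cover five counties and will not match the full-data mean of 17.2; rr_sample, state_summary, n_counties and total_employees_state are illustrative names, not part of the original analysis.

```r
library(dplyr)

# The six rows shown by head(rr), typed in by hand (illustration only).
rr_sample <- tibble(
  state = c("AE", "AK", "AK", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR",
             "JUNEAU", "MATANUSKA-SUSITNA", "SITKA"),
  total_employees = c(2, 7, 2, 3, 2, 1)
)

# One row per state: how many counties it has, how many employees in
# total, and the county average, sorted by the total (largest first).
state_summary <- rr_sample %>%
  group_by(state) %>%
  summarize(
    n_counties = n(),
    total_employees_state = sum(total_employees),
    mean_employees = mean(total_employees)
  ) %>%
  arrange(desc(total_employees_state))

state_summary
```

Seeing n_counties alongside the mean also makes it easier to spot states whose average rests on a single row, as AE does in this sample.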

We can then write a similar script to look at the median number of employees per state:

Code
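# Median of the county-level employee counts within each state, smallest first
median_employees_per_state <- rr %>% 
  group_by(state) %>% 
  summarize(med_employees = median(total_employees)) %>%
  arrange(med_employees)

# Six states with the lowest county medians
head(median_employees_per_state)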
# A tibble: 6 × 2
  state med_employees
  <chr>         <dbl>
1 AP              1  
2 HI              1  
3 AE              2  
4 AK              2.5
5 SD              5  
6 ND              8  
Code
tail(median_employees_per_state)
# A tibble: 6 × 2
  state med_employees
  <chr>         <dbl>
1 MD             108.
2 CT             125 
3 DE             158 
4 MA             271 
5 DC             279 
6 NJ             296 

Comparing the two sets of results, for every state that appears in both the median is less than or equal to the mean. In AP, AE, and DC the two values are equal, but elsewhere the gap can be large: AK has a mean of 17.2 and a median of 2.5, and DE has a mean of about 498 and a median of 158. A mean that sits well above the median suggests that the county counts within a state are skewed to the right, i.e., a few counties have substantially more employees than the rest, which inflates the mean.
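The mean and median can also be compared state by state in a single table rather than by reading two separate outputs. The following is a minimal sketch under the same assumption as before: it runs on the six rows printed by head(rr) rather than the full file, and rr_sample, skew_check and mean_minus_median are names invented for this example.

```r
library(dplyr)

# The six rows shown by head(rr), typed in by hand (illustration only).
rr_sample <- tibble(
  state = c("AE", "AK", "AK", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR",
             "JUNEAU", "MATANUSKA-SUSITNA", "SITKA"),
  total_employees = c(2, 7, 2, 3, 2, 1)
)

# Compute both statistics in one summarize() call, then take their
# difference. A large positive gap hints at right skew within a state.
skew_check <- rr_sample %>%
  group_by(state) %>%
  summarize(
    mean_employees = mean(total_employees),
    med_employees = median(total_employees)
  ) %>%
  mutate(mean_minus_median = mean_employees - med_employees) %>%
  arrange(desc(mean_minus_median))

skew_check
```

In this small sample AK has a mean of 3 and a median of 2, while AE, with a single row, has a gap of zero. A positive gap is only a rough indicator of skew, so a histogram of total_employees per state would be a reasonable follow-up.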

Conclusion

In this challenge, we took a look at the railroads dataset. Through our investigation we were able to determine what the data describes and to look at some interesting results regarding the mean and median number of employees per county in each state. For most of the states shown, the mean is larger than the median, which suggests that their county counts are skewed to the right: each has a few counties with employee numbers much larger than the rest.