Challenge 2 - Railroads

challenge_2

railroads

Pranav Komaravolu

dplyr

Data wrangling: using group() and summarise()

Author

Pranav Bharadwaj Komaravolu

Published

March 7, 2023

library(tidyverse)
library(readr)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in the Data

data <- read_csv("_data/railroad_2012_clean_county.csv")
data

# A tibble: 2,930 × 3
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# … with 2,920 more rows

Describe the data

The column names of the different columns are as follows:

colnames(data)

[1] "state"           "county"          "total_employees"

The classes of values in the different columns are as follows:

as.data.frame(sapply(data, class))

                sapply(data, class)
state                     character
county                    character
total_employees             numeric

Since we observe two columns with character type we can assume them to be categorical values and the number of distinct values in these columns are as follows:

data %>%
  select(state, county) %>%
  summarize_all(n_distinct)

# A tibble: 1 × 2
  state county
  <int>  <int>
1    53   1709

So we can see that there are 53 states and 1709 counties. ## Grouped Summary Statistics

We observed total_employees to be a numerical value, so we can get the measures of central tendency (mean) and dispersion(std) grouped on state as follows:

data %>%
  group_by(state) %>%
  select(state, county, total_employees) %>%
  summarize(county_count = n(), mean_employees = mean(total_employees), sum_employees = sum(total_employees), dispersion = sd(total_employees, na.rm = TRUE))

# A tibble: 53 × 5
   state county_count mean_employees sum_employees dispersion
   <chr>        <int>          <dbl>         <dbl>      <dbl>
 1 AE               1            2               2       NA  
 2 AK               6           17.2           103       34.8
 3 AL              67           63.5          4257      130. 
 4 AP               1            1               1       NA  
 5 AR              72           53.8          3871      131. 
 6 AZ              15          210.           3153      228. 
 7 CA              55          239.          13137      549. 
 8 CO              57           64.0          3650      128. 
 9 CT               8          324            2592      520. 
10 DC               1          279             279       NA  
# … with 43 more rows

Now we’ll try to observe the same when grouped by counties:

data %>%
  group_by(county) %>%
  select(county, state, total_employees) %>%
  summarize(state_count = n(), mean_employees = mean(total_employees), sum_employees = sum(total_employees), dispersion = sd(total_employees, na.rm = TRUE))

# A tibble: 1,709 × 5
   county    state_count mean_employees sum_employees dispersion
   <chr>           <int>          <dbl>         <dbl>      <dbl>
 1 ABBEVILLE           1         124              124      NA   
 2 ACADIA              1          13               13      NA   
 3 ACCOMACK            1           4                4      NA   
 4 ADA                 1          81               81      NA   
 5 ADAIR               4           7.25            29       9.32
 6 ADAMS              12          73.2            878     155.  
 7 ADDISON             1           8                8      NA   
 8 AIKEN               1         193              193      NA   
 9 AITKIN              1          19               19      NA   
10 ALACHUA             1          22               22      NA   
# … with 1,699 more rows

Interpretation

When we split the data based on state we observed that some of the states have only one county, while most of them have more than one county thus providing us with a dispersion measure, whereas when it comes to the counties we observe some common county names such as ADAMS are present in more than one state. Thus when we group by state we tend to observe more relevant data as to how many employees are present in a region than that of the counties.