challenge_2
State County dataset
Data wrangling: using group() and summarise()
Author

Mekhala Kumar

Published

August 16, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Reading in Data

The dataset used is State County (the excel file). While reading in the data, it was also cleaned by removing the first few rows and last few rows which had unnecessary text. Since the country Canada was also present in the last row of the dataset but is a different country, it has been removed as well.

Code
library(readr)
library(readxl)
library(stringr)
railroad_df <- read_excel("_data/StateCounty2012.xls",skip=3,col_names= c("state", "delete", "county", "delete", "total_employees"))%>%
select(!contains("delete"))%>%
  filter(!str_detect(state, "Total"))
tail(railroad_df, 10)
# A tibble: 10 × 3
   state                                               county     total_employ…¹
   <chr>                                               <chr>      <chr>         
 1 WY                                                  PLATTE     129           
 2 WY                                                  SHERIDAN   252           
 3 WY                                                  SUBLETTE   3             
 4 WY                                                  SWEETWATER 196           
 5 WY                                                  UINTA      49            
 6 WY                                                  WASHAKIE   10            
 7 WY                                                  WESTON     37            
 8 CANADA                                              <NA>       662           
 9 1  Military designation.                            <NA>       <NA>          
10 NOTE:  Excludes 2,896 employees without an address. <NA>       <NA>          
# … with abbreviated variable name ¹​total_employees
Code
railroad_df<-head(railroad_df, -2)
tail(railroad_df, 10)
# A tibble: 10 × 3
   state  county     total_employees
   <chr>  <chr>      <chr>          
 1 WY     NIOBRARA   51             
 2 WY     PARK       29             
 3 WY     PLATTE     129            
 4 WY     SHERIDAN   252            
 5 WY     SUBLETTE   3              
 6 WY     SWEETWATER 196            
 7 WY     UINTA      49             
 8 WY     WASHAKIE   10             
 9 WY     WESTON     37             
10 CANADA <NA>       662            
Code
railroad_df<-head(railroad_df, -1)
tail(railroad_df, 10)
# A tibble: 10 × 3
   state county     total_employees
   <chr> <chr>      <chr>          
 1 WY    NATRONA    92             
 2 WY    NIOBRARA   51             
 3 WY    PARK       29             
 4 WY    PLATTE     129            
 5 WY    SHERIDAN   252            
 6 WY    SUBLETTE   3              
 7 WY    SWEETWATER 196            
 8 WY    UINTA      49             
 9 WY    WASHAKIE   10             
10 WY    WESTON     37             
Code
railroad_df = railroad_df[-1,] 
railroad_df <- transform(railroad_df,employees = as.numeric(total_employees))
 View(railroad_df)

Data description

The data consists of 3 columns- state, county and employees. Since the employees column contained character datatype, total_employees was created with the double type of variable. There are 53 states included in the data and 1709 counties. The number of employees by state can also be seen in the third table.

Code
head(railroad_df)
  state               county total_employees employees
1    AE                  APO               2         2
2    AK            ANCHORAGE               7         7
3    AK FAIRBANKS NORTH STAR               2         2
4    AK               JUNEAU               3         3
5    AK    MATANUSKA-SUSITNA               2         2
6    AK                SITKA               1         1
Code
railroad_df%>%
  select(state)%>%
  n_distinct(.)
[1] 53
Code
railroad_df%>%
  select(county)%>%
  n_distinct(.)
[1] 1709
Code
railroad_df %>%
  group_by(state) %>%
  summarise(employees2 = sum(employees))
# A tibble: 53 × 2
   state employees2
   <chr>      <dbl>
 1 AE             2
 2 AK           103
 3 AL          4257
 4 AP             1
 5 AR          3871
 6 AZ          3153
 7 CA         13137
 8 CO          3650
 9 CT          2592
10 DC           279
# … with 43 more rows
# ℹ Use `print(n = ...)` to see more rows

Grouped Summary Statistics and Interpretation

First, the summary statistics were checked for a country wide comparison. Then a few states were selected in such a way that a few states had many employees (around 13000) and a few states selected had a small number of employees (100-200). This was done in order to check how the central tendency and dispersion varied across counties in a particular state.
Both California and Nebraska, which had a larger number of employees, had a higher mean for the total employees among the counties compared to the mean number of employees across states. This is possibly due to the fact that they had more employees. Similarly, it was observed that in the case where there were a lower number of employees, the mean for the total employees across counties were lower and the standard deviation across counties was lower (in Vermont and Alaska). As can be seen below, the District of Columbia cannot be analysed in this way as it consists of a single county.
Finally, the total number of employees in each county of a state were arranged in descending order because a county with a higher number of employees indicates that there are more job opportunities or perhaps a higher population in the area.

Code
library(dplyr)
library(summarytools)

railroad_df%>% summarise (mean.employees = mean(`employees`), median.employees = median(`employees`), min.employees = min(`employees`), max.employees = max(`employees`), sd.employees = sd(`employees`), var.employees = var(`employees`), IQR.employees = IQR(`employees`))
  mean.employees median.employees min.employees max.employees sd.employees
1       87.17816               21             1          8207     283.6359
  var.employees IQR.employees
1      80449.32            58
Code
railroad_df%>% filter(state == "CA")%>% summarise (mean.employees = mean(`employees`), median.employees = median(`employees`), min.employees = min(`employees`), max.employees = max(`employees`), sd.employees = sd(`employees`), var.employees = var(`employees`), IQR.employees = IQR(`employees`))
  mean.employees median.employees min.employees max.employees sd.employees
1       238.8545               61             1          2888     549.4692
  var.employees IQR.employees
1      301916.3           188
Code
railroad_df%>% filter(state == "CA")%>%group_by(county)%>%arrange(county, desc(employees))
# A tibble: 55 × 4
# Groups:   county [55]
   state county       total_employees employees
   <chr> <chr>        <chr>               <dbl>
 1 CA    ALAMEDA      346                   346
 2 CA    AMADOR       9                       9
 3 CA    BUTTE        69                     69
 4 CA    CALAVERAS    30                     30
 5 CA    COLUSA       2                       2
 6 CA    CONTRA COSTA 348                   348
 7 CA    EL DORADO    103                   103
 8 CA    FRESNO       341                   341
 9 CA    GLENN        4                       4
10 CA    HUMBOLDT     2                       2
# … with 45 more rows
# ℹ Use `print(n = ...)` to see more rows
Code
railroad_df%>% filter(state == "NE")%>% summarise (mean.employees = mean(`employees`), median.employees = median(`employees`), min.employees = min(`employees`), max.employees = max(`employees`), sd.employees = sd(`employees`), var.employees = var(`employees`), IQR.employees = IQR(`employees`))
  mean.employees median.employees min.employees max.employees sd.employees
1       148.0449               15             1          3797     511.5816
  var.employees IQR.employees
1      261715.7            66
Code
railroad_df%>% filter(state == "NE")%>%group_by(county)%>%arrange(county, desc(employees))
# A tibble: 89 × 4
# Groups:   county [89]
   state county    total_employees employees
   <chr> <chr>     <chr>               <dbl>
 1 NE    ADAMS     77                     77
 2 NE    ANTELOPE  2                       2
 3 NE    ARTHUR    2                       2
 4 NE    BANNER    8                       8
 5 NE    BLAINE    2                       2
 6 NE    BOONE     2                       2
 7 NE    BOX BUTTE 1168                 1168
 8 NE    BROWN     1                       1
 9 NE    BUFFALO   107                   107
10 NE    BURT      9                       9
# … with 79 more rows
# ℹ Use `print(n = ...)` to see more rows
Code
railroad_df%>% filter(state == "DC")%>% summarise (mean.employees = mean(`employees`), median.employees = median(`employees`), min.employees = min(`employees`), max.employees = max(`employees`), sd.employees = sd(`employees`), var.employees = var(`employees`), IQR.employees = IQR(`employees`))
  mean.employees median.employees min.employees max.employees sd.employees
1            279              279           279           279           NA
  var.employees IQR.employees
1            NA             0
Code
railroad_df%>% filter(state == "VT")%>% summarise (mean.employees = mean(`employees`), median.employees = median(`employees`), min.employees = min(`employees`), max.employees = max(`employees`), sd.employees = sd(`employees`), var.employees = var(`employees`), IQR.employees = IQR(`employees`))
  mean.employees median.employees min.employees max.employees sd.employees
1           18.5              8.5             3            83     24.54431
  var.employees IQR.employees
1      602.4231             8
Code
railroad_df%>% filter(state == "VT")%>%group_by(county)%>%arrange(county, desc(employees))
# A tibble: 14 × 4
# Groups:   county [14]
   state county     total_employees employees
   <chr> <chr>      <chr>               <dbl>
 1 VT    ADDISON    8                       8
 2 VT    BENNINGTON 8                       8
 3 VT    CALEDONIA  3                       3
 4 VT    CHITTENDEN 40                     40
 5 VT    ESSEX      4                       4
 6 VT    FRANKLIN   83                     83
 7 VT    GRAND ISLE 5                       5
 8 VT    LAMOILLE   4                       4
 9 VT    ORANGE     9                       9
10 VT    ORLEANS    13                     13
11 VT    RUTLAND    59                     59
12 VT    WASHINGTON 3                       3
13 VT    WINDHAM    10                     10
14 VT    WINDSOR    10                     10
Code
railroad_df%>% filter(state == "AK")%>% summarise (mean.employees = mean(`employees`), median.employees = median(`employees`), min.employees = min(`employees`), max.employees = max(`employees`), sd.employees = sd(`employees`), var.employees = var(`employees`), IQR.employees = IQR(`employees`))
  mean.employees median.employees min.employees max.employees sd.employees
1       17.16667              2.5             1            88     34.76445
  var.employees IQR.employees
1      1208.567             4
Code
railroad_df%>% filter(state == "AK")%>%group_by(county)%>%arrange(county, desc(employees))
# A tibble: 6 × 4
# Groups:   county [6]
  state county               total_employees employees
  <chr> <chr>                <chr>               <dbl>
1 AK    ANCHORAGE            7                       7
2 AK    FAIRBANKS NORTH STAR 2                       2
3 AK    JUNEAU               3                       3
4 AK    MATANUSKA-SUSITNA    2                       2
5 AK    SITKA                1                       1
6 AK    SKAGWAY MUNICIPALITY 88                     88