challenge_2
State County dataset
Author

Jyoti Rani

Published

August 22, 2022

Reading in Data

I have used the dataset State County (the excel file). While reading in the data, it was also cleaned by removing the first few rows and last few rows which had unnecessary text.

Code
railroad_df <- read_excel("_data/StateCounty2012.xls",skip=3,col_names= c("state", "delete", "county", "delete", "total_employees"))%>%
select(!contains("delete"))%>%
  filter(!str_detect(state, "Total"))
New names:
• `delete` -> `delete...2`
• `delete` -> `delete...4`
Code
tail(railroad_df, 10)
# A tibble: 10 × 3
   state                                               county     total_employ…¹
   <chr>                                               <chr>      <chr>         
 1 WY                                                  PLATTE     129           
 2 WY                                                  SHERIDAN   252           
 3 WY                                                  SUBLETTE   3             
 4 WY                                                  SWEETWATER 196           
 5 WY                                                  UINTA      49            
 6 WY                                                  WASHAKIE   10            
 7 WY                                                  WESTON     37            
 8 CANADA                                              <NA>       662           
 9 1  Military designation.                            <NA>       <NA>          
10 NOTE:  Excludes 2,896 employees without an address. <NA>       <NA>          
# … with abbreviated variable name ¹​total_employees
Code
railroad_df<-head(railroad_df, -2)
tail(railroad_df, 10)
# A tibble: 10 × 3
   state  county     total_employees
   <chr>  <chr>      <chr>          
 1 WY     NIOBRARA   51             
 2 WY     PARK       29             
 3 WY     PLATTE     129            
 4 WY     SHERIDAN   252            
 5 WY     SUBLETTE   3              
 6 WY     SWEETWATER 196            
 7 WY     UINTA      49             
 8 WY     WASHAKIE   10             
 9 WY     WESTON     37             
10 CANADA <NA>       662            
Code
railroad_df<-head(railroad_df, -1)
tail(railroad_df, 10)
# A tibble: 10 × 3
   state county     total_employees
   <chr> <chr>      <chr>          
 1 WY    NATRONA    92             
 2 WY    NIOBRARA   51             
 3 WY    PARK       29             
 4 WY    PLATTE     129            
 5 WY    SHERIDAN   252            
 6 WY    SUBLETTE   3              
 7 WY    SWEETWATER 196            
 8 WY    UINTA      49             
 9 WY    WASHAKIE   10             
10 WY    WESTON     37             
Code
railroad_df = railroad_df[-1,] 
railroad_df <- transform(railroad_df,employees = as.numeric(total_employees))
head(railroad_df)
  state               county total_employees employees
1    AE                  APO               2         2
2    AK            ANCHORAGE               7         7
3    AK FAIRBANKS NORTH STAR               2         2
4    AK               JUNEAU               3         3
5    AK    MATANUSKA-SUSITNA               2         2
6    AK                SITKA               1         1

Data description

The data consists of 3 columns- state, county and employees. There are 53 states included in the data and 1709 counties. The number of employees by state can also be seen in the third table.

Code
head(railroad_df)
  state               county total_employees employees
1    AE                  APO               2         2
2    AK            ANCHORAGE               7         7
3    AK FAIRBANKS NORTH STAR               2         2
4    AK               JUNEAU               3         3
5    AK    MATANUSKA-SUSITNA               2         2
6    AK                SITKA               1         1
Code
railroad_df%>%
  select(state)%>%
  n_distinct(.)
[1] 53
Code
railroad_df%>%
  select(county)%>%
  n_distinct(.)
[1] 1709
Code
railroad_df %>%
  group_by(state) %>%
  summarise(employees2 = sum(employees))
# A tibble: 53 × 2
   state employees2
   <chr>      <dbl>
 1 AE             2
 2 AK           103
 3 AL          4257
 4 AP             1
 5 AR          3871
 6 AZ          3153
 7 CA         13137
 8 CO          3650
 9 CT          2592
10 DC           279
# … with 43 more rows

Grouped Summary Statistics and Interpretation

We use summary statistics were checked for a country wide comparison. Then a few states were selected in such a way that a few states had many employees (around 13000) and a few states selected had a small number of employees (100-200). This was done in order to check how the central tendency and dispersion varied across counties in a particular state. Both California and Nebraska, which had a larger number of employees, had a higher mean for the total employees among the counties compared to the mean number of employees across states. This is possibly due to the fact that they had more employees. Similarly, it was observed that in the case where there were a lower number of employees, the mean for the total employees across counties were lower and the standard deviation across counties was lower (in Vermont and Alaska). Finally, the total number of employees in each county of a state were arranged in descending order because a county with a higher number of employees indicates that there are more job opportunities or perhaps a higher population in the area.

Code
library(dplyr)
library(summarytools)
Warning: package 'summarytools' was built under R version 4.2.2

Attaching package: 'summarytools'
The following object is masked from 'package:tibble':

    view
Code
railroad_df%>% summarise (mean.employees = mean(`employees`), median.employees = median(`employees`), min.employees = min(`employees`), max.employees = max(`employees`), sd.employees = sd(`employees`), var.employees = var(`employees`), IQR.employees = IQR(`employees`))
  mean.employees median.employees min.employees max.employees sd.employees
1       87.17816               21             1          8207     283.6359
  var.employees IQR.employees
1      80449.32            58
Code
railroad_df%>% filter(state == "CA")%>% summarise (mean.employees = mean(`employees`), median.employees = median(`employees`), min.employees = min(`employees`), max.employees = max(`employees`), sd.employees = sd(`employees`), var.employees = var(`employees`), IQR.employees = IQR(`employees`))
  mean.employees median.employees min.employees max.employees sd.employees
1       238.8545               61             1          2888     549.4692
  var.employees IQR.employees
1      301916.3           188
Code
railroad_df%>% filter(state == "CA")%>%group_by(county)%>%arrange(county, desc(employees))
# A tibble: 55 × 4
# Groups:   county [55]
   state county       total_employees employees
   <chr> <chr>        <chr>               <dbl>
 1 CA    ALAMEDA      346                   346
 2 CA    AMADOR       9                       9
 3 CA    BUTTE        69                     69
 4 CA    CALAVERAS    30                     30
 5 CA    COLUSA       2                       2
 6 CA    CONTRA COSTA 348                   348
 7 CA    EL DORADO    103                   103
 8 CA    FRESNO       341                   341
 9 CA    GLENN        4                       4
10 CA    HUMBOLDT     2                       2
# … with 45 more rows
Code
railroad_df%>% filter(state == "NE")%>% summarise (mean.employees = mean(`employees`), median.employees = median(`employees`), min.employees = min(`employees`), max.employees = max(`employees`), sd.employees = sd(`employees`), var.employees = var(`employees`), IQR.employees = IQR(`employees`))
  mean.employees median.employees min.employees max.employees sd.employees
1       148.0449               15             1          3797     511.5816
  var.employees IQR.employees
1      261715.7            66
Code
railroad_df%>% filter(state == "NE")%>%group_by(county)%>%arrange(county, desc(employees))
# A tibble: 89 × 4
# Groups:   county [89]
   state county    total_employees employees
   <chr> <chr>     <chr>               <dbl>
 1 NE    ADAMS     77                     77
 2 NE    ANTELOPE  2                       2
 3 NE    ARTHUR    2                       2
 4 NE    BANNER    8                       8
 5 NE    BLAINE    2                       2
 6 NE    BOONE     2                       2
 7 NE    BOX BUTTE 1168                 1168
 8 NE    BROWN     1                       1
 9 NE    BUFFALO   107                   107
10 NE    BURT      9                       9
# … with 79 more rows
Code
railroad_df%>% filter(state == "DC")%>% summarise (mean.employees = mean(`employees`), median.employees = median(`employees`), min.employees = min(`employees`), max.employees = max(`employees`), sd.employees = sd(`employees`), var.employees = var(`employees`), IQR.employees = IQR(`employees`))
  mean.employees median.employees min.employees max.employees sd.employees
1            279              279           279           279           NA
  var.employees IQR.employees
1            NA             0
Code
railroad_df%>% filter(state == "VT")%>% summarise (mean.employees = mean(`employees`), median.employees = median(`employees`), min.employees = min(`employees`), max.employees = max(`employees`), sd.employees = sd(`employees`), var.employees = var(`employees`), IQR.employees = IQR(`employees`))
  mean.employees median.employees min.employees max.employees sd.employees
1           18.5              8.5             3            83     24.54431
  var.employees IQR.employees
1      602.4231             8
Code
railroad_df%>% filter(state == "VT")%>%group_by(county)%>%arrange(county, desc(employees))
# A tibble: 14 × 4
# Groups:   county [14]
   state county     total_employees employees
   <chr> <chr>      <chr>               <dbl>
 1 VT    ADDISON    8                       8
 2 VT    BENNINGTON 8                       8
 3 VT    CALEDONIA  3                       3
 4 VT    CHITTENDEN 40                     40
 5 VT    ESSEX      4                       4
 6 VT    FRANKLIN   83                     83
 7 VT    GRAND ISLE 5                       5
 8 VT    LAMOILLE   4                       4
 9 VT    ORANGE     9                       9
10 VT    ORLEANS    13                     13
11 VT    RUTLAND    59                     59
12 VT    WASHINGTON 3                       3
13 VT    WINDHAM    10                     10
14 VT    WINDSOR    10                     10
Code
railroad_df%>% filter(state == "WY")%>% summarise (mean.employees = mean(`employees`), median.employees = median(`employees`), min.employees = min(`employees`), max.employees = max(`employees`), sd.employees = sd(`employees`), var.employees = var(`employees`), IQR.employees = IQR(`employees`))
  mean.employees median.employees min.employees max.employees sd.employees
1       130.7273             60.5             3           737     168.9799
  var.employees IQR.employees
1      28554.21           159
Code
railroad_df%>% filter(state == "WY")%>%group_by(county)%>%arrange(county, desc(employees))
# A tibble: 22 × 4
# Groups:   county [22]
   state county      total_employees employees
   <chr> <chr>       <chr>               <dbl>
 1 WY    ALBANY      90                     90
 2 WY    BIG HORN    70                     70
 3 WY    CAMPBELL    386                   386
 4 WY    CARBON      168                   168
 5 WY    CONVERSE    211                   211
 6 WY    CROOK       33                     33
 7 WY    FREMONT     42                     42
 8 WY    GOSHEN      244                   244
 9 WY    HOT SPRINGS 4                       4
10 WY    JOHNSON     18                     18
# … with 12 more rows