Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Mekhala Kumar
August 16, 2022
The dataset used is State County (the excel file). While reading in the data, it was also cleaned by removing the first few rows and last few rows which had unnecessary text. Since the country Canada was also present in the last row of the dataset but is a different country, it has been removed as well.
# A tibble: 10 × 3
state county total_employ…¹
<chr> <chr> <chr>
1 WY PLATTE 129
2 WY SHERIDAN 252
3 WY SUBLETTE 3
4 WY SWEETWATER 196
5 WY UINTA 49
6 WY WASHAKIE 10
7 WY WESTON 37
8 CANADA <NA> 662
9 1 Military designation. <NA> <NA>
10 NOTE: Excludes 2,896 employees without an address. <NA> <NA>
# … with abbreviated variable name ¹total_employees
# A tibble: 10 × 3
state county total_employees
<chr> <chr> <chr>
1 WY NIOBRARA 51
2 WY PARK 29
3 WY PLATTE 129
4 WY SHERIDAN 252
5 WY SUBLETTE 3
6 WY SWEETWATER 196
7 WY UINTA 49
8 WY WASHAKIE 10
9 WY WESTON 37
10 CANADA <NA> 662
# A tibble: 10 × 3
state county total_employees
<chr> <chr> <chr>
1 WY NATRONA 92
2 WY NIOBRARA 51
3 WY PARK 29
4 WY PLATTE 129
5 WY SHERIDAN 252
6 WY SUBLETTE 3
7 WY SWEETWATER 196
8 WY UINTA 49
9 WY WASHAKIE 10
10 WY WESTON 37
The data consists of 3 columns- state, county and employees. Since the employees column contained character datatype, total_employees was created with the double type of variable. There are 53 states included in the data and 1709 counties. The number of employees by state can also be seen in the third table.
state county total_employees employees
1 AE APO 2 2
2 AK ANCHORAGE 7 7
3 AK FAIRBANKS NORTH STAR 2 2
4 AK JUNEAU 3 3
5 AK MATANUSKA-SUSITNA 2 2
6 AK SITKA 1 1
[1] 53
[1] 1709
# A tibble: 53 × 2
state employees2
<chr> <dbl>
1 AE 2
2 AK 103
3 AL 4257
4 AP 1
5 AR 3871
6 AZ 3153
7 CA 13137
8 CO 3650
9 CT 2592
10 DC 279
# … with 43 more rows
# ℹ Use `print(n = ...)` to see more rows
First, the summary statistics were checked for a country wide comparison. Then a few states were selected in such a way that a few states had many employees (around 13000) and a few states selected had a small number of employees (100-200). This was done in order to check how the central tendency and dispersion varied across counties in a particular state.
Both California and Nebraska, which had a larger number of employees, had a higher mean for the total employees among the counties compared to the mean number of employees across states. This is possibly due to the fact that they had more employees. Similarly, it was observed that in the case where there were a lower number of employees, the mean for the total employees across counties were lower and the standard deviation across counties was lower (in Vermont and Alaska). As can be seen below, the District of Columbia cannot be analysed in this way as it consists of a single county.
Finally, the total number of employees in each county of a state were arranged in descending order because a county with a higher number of employees indicates that there are more job opportunities or perhaps a higher population in the area.
library(dplyr)
library(summarytools)
railroad_df%>% summarise (mean.employees = mean(`employees`), median.employees = median(`employees`), min.employees = min(`employees`), max.employees = max(`employees`), sd.employees = sd(`employees`), var.employees = var(`employees`), IQR.employees = IQR(`employees`))
mean.employees median.employees min.employees max.employees sd.employees
1 87.17816 21 1 8207 283.6359
var.employees IQR.employees
1 80449.32 58
mean.employees median.employees min.employees max.employees sd.employees
1 238.8545 61 1 2888 549.4692
var.employees IQR.employees
1 301916.3 188
# A tibble: 55 × 4
# Groups: county [55]
state county total_employees employees
<chr> <chr> <chr> <dbl>
1 CA ALAMEDA 346 346
2 CA AMADOR 9 9
3 CA BUTTE 69 69
4 CA CALAVERAS 30 30
5 CA COLUSA 2 2
6 CA CONTRA COSTA 348 348
7 CA EL DORADO 103 103
8 CA FRESNO 341 341
9 CA GLENN 4 4
10 CA HUMBOLDT 2 2
# … with 45 more rows
# ℹ Use `print(n = ...)` to see more rows
mean.employees median.employees min.employees max.employees sd.employees
1 148.0449 15 1 3797 511.5816
var.employees IQR.employees
1 261715.7 66
# A tibble: 89 × 4
# Groups: county [89]
state county total_employees employees
<chr> <chr> <chr> <dbl>
1 NE ADAMS 77 77
2 NE ANTELOPE 2 2
3 NE ARTHUR 2 2
4 NE BANNER 8 8
5 NE BLAINE 2 2
6 NE BOONE 2 2
7 NE BOX BUTTE 1168 1168
8 NE BROWN 1 1
9 NE BUFFALO 107 107
10 NE BURT 9 9
# … with 79 more rows
# ℹ Use `print(n = ...)` to see more rows
mean.employees median.employees min.employees max.employees sd.employees
1 279 279 279 279 NA
var.employees IQR.employees
1 NA 0
mean.employees median.employees min.employees max.employees sd.employees
1 18.5 8.5 3 83 24.54431
var.employees IQR.employees
1 602.4231 8
# A tibble: 14 × 4
# Groups: county [14]
state county total_employees employees
<chr> <chr> <chr> <dbl>
1 VT ADDISON 8 8
2 VT BENNINGTON 8 8
3 VT CALEDONIA 3 3
4 VT CHITTENDEN 40 40
5 VT ESSEX 4 4
6 VT FRANKLIN 83 83
7 VT GRAND ISLE 5 5
8 VT LAMOILLE 4 4
9 VT ORANGE 9 9
10 VT ORLEANS 13 13
11 VT RUTLAND 59 59
12 VT WASHINGTON 3 3
13 VT WINDHAM 10 10
14 VT WINDSOR 10 10
mean.employees median.employees min.employees max.employees sd.employees
1 17.16667 2.5 1 88 34.76445
var.employees IQR.employees
1 1208.567 4
# A tibble: 6 × 4
# Groups: county [6]
state county total_employees employees
<chr> <chr> <chr> <dbl>
1 AK ANCHORAGE 7 7
2 AK FAIRBANKS NORTH STAR 2 2
3 AK JUNEAU 3 3
4 AK MATANUSKA-SUSITNA 2 2
5 AK SITKA 1 1
6 AK SKAGWAY MUNICIPALITY 88 88
---
title: "Challenge 2 "
author: "Mekhala Kumar"
description: "Data wrangling: using group() and summarise()"
date: "08/16/2022"
format:
html:
print: paged
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- State County dataset
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Reading in Data
The dataset used is State County (the excel file). While reading in the data, it was also cleaned by removing the first few rows and last few rows which had unnecessary text. Since the country Canada was also present in the last row of the dataset but is a different country, it has been removed as well.
```{r}
library(readr)
library(readxl)
library(stringr)
railroad_df <- read_excel("_data/StateCounty2012.xls",skip=3,col_names= c("state", "delete", "county", "delete", "total_employees"))%>%
select(!contains("delete"))%>%
filter(!str_detect(state, "Total"))
tail(railroad_df, 10)
railroad_df<-head(railroad_df, -2)
tail(railroad_df, 10)
railroad_df<-head(railroad_df, -1)
tail(railroad_df, 10)
railroad_df = railroad_df[-1,]
railroad_df <- transform(railroad_df,employees = as.numeric(total_employees))
View(railroad_df)
```
## Data description
The data consists of 3 columns- state, county and employees. Since the employees column contained character datatype, total_employees was created with the double type of variable. There are 53 states included in the data and 1709 counties. The number of employees by state can also be seen in the third table.
```{r}
#| label: summary
head(railroad_df)
railroad_df%>%
select(state)%>%
n_distinct(.)
railroad_df%>%
select(county)%>%
n_distinct(.)
railroad_df %>%
group_by(state) %>%
summarise(employees2 = sum(employees))
```
## Grouped Summary Statistics and Interpretation
First, the summary statistics were checked for a country wide comparison. Then a few states were selected in such a way that a few states had many employees (around 13000) and a few states selected had a small number of employees (100-200). This was done in order to check how the central tendency and dispersion varied across counties in a particular state.\
Both California and Nebraska, which had a larger number of employees, had a higher mean for the total employees among the counties compared to the mean number of employees across states. This is possibly due to the fact that they had more employees. Similarly, it was observed that in the case where there were a lower number of employees, the mean for the total employees across counties were lower and the standard deviation across counties was lower (in Vermont and Alaska). As can be seen below, the District of Columbia cannot be analysed in this way as it consists of a single county.\
Finally, the total number of employees in each county of a state were arranged in descending order because a county with a higher number of employees indicates that there are more job opportunities or perhaps a higher population in the area.
```{r}
library(dplyr)
library(summarytools)
railroad_df%>% summarise (mean.employees = mean(`employees`), median.employees = median(`employees`), min.employees = min(`employees`), max.employees = max(`employees`), sd.employees = sd(`employees`), var.employees = var(`employees`), IQR.employees = IQR(`employees`))
railroad_df%>% filter(state == "CA")%>% summarise (mean.employees = mean(`employees`), median.employees = median(`employees`), min.employees = min(`employees`), max.employees = max(`employees`), sd.employees = sd(`employees`), var.employees = var(`employees`), IQR.employees = IQR(`employees`))
railroad_df%>% filter(state == "CA")%>%group_by(county)%>%arrange(county, desc(employees))
railroad_df%>% filter(state == "NE")%>% summarise (mean.employees = mean(`employees`), median.employees = median(`employees`), min.employees = min(`employees`), max.employees = max(`employees`), sd.employees = sd(`employees`), var.employees = var(`employees`), IQR.employees = IQR(`employees`))
railroad_df%>% filter(state == "NE")%>%group_by(county)%>%arrange(county, desc(employees))
railroad_df%>% filter(state == "DC")%>% summarise (mean.employees = mean(`employees`), median.employees = median(`employees`), min.employees = min(`employees`), max.employees = max(`employees`), sd.employees = sd(`employees`), var.employees = var(`employees`), IQR.employees = IQR(`employees`))
railroad_df%>% filter(state == "VT")%>% summarise (mean.employees = mean(`employees`), median.employees = median(`employees`), min.employees = min(`employees`), max.employees = max(`employees`), sd.employees = sd(`employees`), var.employees = var(`employees`), IQR.employees = IQR(`employees`))
railroad_df%>% filter(state == "VT")%>%group_by(county)%>%arrange(county, desc(employees))
railroad_df%>% filter(state == "AK")%>% summarise (mean.employees = mean(`employees`), median.employees = median(`employees`), min.employees = min(`employees`), max.employees = max(`employees`), sd.employees = sd(`employees`), var.employees = var(`employees`), IQR.employees = IQR(`employees`))
railroad_df%>% filter(state == "AK")%>%group_by(county)%>%arrange(county, desc(employees))
```