I have used the dataset State County (the excel file). While reading in the data, it was also cleaned by removing the first few rows and last few rows which had unnecessary text.
state county total_employees employees
1 AE APO 2 2
2 AK ANCHORAGE 7 7
3 AK FAIRBANKS NORTH STAR 2 2
4 AK JUNEAU 3 3
5 AK MATANUSKA-SUSITNA 2 2
6 AK SITKA 1 1
Data description
The data consists of 3 columns- state, county and employees. There are 53 states included in the data and 1709 counties. The number of employees by state can also be seen in the third table.
Code
head(railroad_df)
state county total_employees employees
1 AE APO 2 2
2 AK ANCHORAGE 7 7
3 AK FAIRBANKS NORTH STAR 2 2
4 AK JUNEAU 3 3
5 AK MATANUSKA-SUSITNA 2 2
6 AK SITKA 1 1
# A tibble: 53 × 2
state employees2
<chr> <dbl>
1 AE 2
2 AK 103
3 AL 4257
4 AP 1
5 AR 3871
6 AZ 3153
7 CA 13137
8 CO 3650
9 CT 2592
10 DC 279
# … with 43 more rows
Grouped Summary Statistics and Interpretation
We use summary statistics were checked for a country wide comparison. Then a few states were selected in such a way that a few states had many employees (around 13000) and a few states selected had a small number of employees (100-200). This was done in order to check how the central tendency and dispersion varied across counties in a particular state. Both California and Nebraska, which had a larger number of employees, had a higher mean for the total employees among the counties compared to the mean number of employees across states. This is possibly due to the fact that they had more employees. Similarly, it was observed that in the case where there were a lower number of employees, the mean for the total employees across counties were lower and the standard deviation across counties was lower (in Vermont and Alaska). Finally, the total number of employees in each county of a state were arranged in descending order because a county with a higher number of employees indicates that there are more job opportunities or perhaps a higher population in the area.
Code
library(dplyr)library(summarytools)
Warning: package 'summarytools' was built under R version 4.2.2
Attaching package: 'summarytools'
The following object is masked from 'package:tibble':
view
# A tibble: 55 × 4
# Groups: county [55]
state county total_employees employees
<chr> <chr> <chr> <dbl>
1 CA ALAMEDA 346 346
2 CA AMADOR 9 9
3 CA BUTTE 69 69
4 CA CALAVERAS 30 30
5 CA COLUSA 2 2
6 CA CONTRA COSTA 348 348
7 CA EL DORADO 103 103
8 CA FRESNO 341 341
9 CA GLENN 4 4
10 CA HUMBOLDT 2 2
# … with 45 more rows
# A tibble: 89 × 4
# Groups: county [89]
state county total_employees employees
<chr> <chr> <chr> <dbl>
1 NE ADAMS 77 77
2 NE ANTELOPE 2 2
3 NE ARTHUR 2 2
4 NE BANNER 8 8
5 NE BLAINE 2 2
6 NE BOONE 2 2
7 NE BOX BUTTE 1168 1168
8 NE BROWN 1 1
9 NE BUFFALO 107 107
10 NE BURT 9 9
# … with 79 more rows
# A tibble: 22 × 4
# Groups: county [22]
state county total_employees employees
<chr> <chr> <chr> <dbl>
1 WY ALBANY 90 90
2 WY BIG HORN 70 70
3 WY CAMPBELL 386 386
4 WY CARBON 168 168
5 WY CONVERSE 211 211
6 WY CROOK 33 33
7 WY FREMONT 42 42
8 WY GOSHEN 244 244
9 WY HOT SPRINGS 4 4
10 WY JOHNSON 18 18
# … with 12 more rows
Source Code
---title: "Challenge2_Jyoti"author: "Jyoti Rani"date: "2022-08-22"format: html: print: paged toc: true code-fold: true code-copy: true code-tools: truecategories: - challenge_2 - State County dataset---```{r setup, include=FALSE}library(tidyverse)library(readr)library(readxl)library(stringr)knitr::opts_chunk$set(echo =TRUE)```## Reading in DataI have used the dataset State County (the excel file). While reading in the data, it was also cleaned by removing the first few rows and last few rows which had unnecessary text.```{r}railroad_df <-read_excel("_data/StateCounty2012.xls",skip=3,col_names=c("state", "delete", "county", "delete", "total_employees"))%>%select(!contains("delete"))%>%filter(!str_detect(state, "Total"))tail(railroad_df, 10)railroad_df<-head(railroad_df, -2)tail(railroad_df, 10)railroad_df<-head(railroad_df, -1)tail(railroad_df, 10)railroad_df = railroad_df[-1,] railroad_df <-transform(railroad_df,employees =as.numeric(total_employees))head(railroad_df)```## Data descriptionThe data consists of 3 columns- state, county and employees. There are 53 states included in the data and 1709 counties. The number of employees by state can also be seen in the third table.```{r}head(railroad_df)railroad_df%>%select(state)%>%n_distinct(.)railroad_df%>%select(county)%>%n_distinct(.)railroad_df %>%group_by(state) %>%summarise(employees2 =sum(employees))```## Grouped Summary Statistics and InterpretationWe use summary statistics were checked for a country wide comparison. Then a few states were selected in such a way that a few states had many employees (around 13000) and a few states selected had a small number of employees (100-200). This was done in order to check how the central tendency and dispersion varied across counties in a particular state.Both California and Nebraska, which had a larger number of employees, had a higher mean for the total employees among the counties compared to the mean number of employees across states. This is possibly due to the fact that they had more employees. Similarly, it was observed that in the case where there were a lower number of employees, the mean for the total employees across counties were lower and the standard deviation across counties was lower (in Vermont and Alaska). Finally, the total number of employees in each county of a state were arranged in descending order because a county with a higher number of employees indicates that there are more job opportunities or perhaps a higher population in the area.```{r}library(dplyr)library(summarytools)railroad_df%>%summarise (mean.employees =mean(`employees`), median.employees =median(`employees`), min.employees =min(`employees`), max.employees =max(`employees`), sd.employees =sd(`employees`), var.employees =var(`employees`), IQR.employees =IQR(`employees`))railroad_df%>%filter(state =="CA")%>%summarise (mean.employees =mean(`employees`), median.employees =median(`employees`), min.employees =min(`employees`), max.employees =max(`employees`), sd.employees =sd(`employees`), var.employees =var(`employees`), IQR.employees =IQR(`employees`))railroad_df%>%filter(state =="CA")%>%group_by(county)%>%arrange(county, desc(employees))railroad_df%>%filter(state =="NE")%>%summarise (mean.employees =mean(`employees`), median.employees =median(`employees`), min.employees =min(`employees`), max.employees =max(`employees`), sd.employees =sd(`employees`), var.employees =var(`employees`), IQR.employees =IQR(`employees`))railroad_df%>%filter(state =="NE")%>%group_by(county)%>%arrange(county, desc(employees))railroad_df%>%filter(state =="DC")%>%summarise (mean.employees =mean(`employees`), median.employees =median(`employees`), min.employees =min(`employees`), max.employees =max(`employees`), sd.employees =sd(`employees`), var.employees =var(`employees`), IQR.employees =IQR(`employees`))railroad_df%>%filter(state =="VT")%>%summarise (mean.employees =mean(`employees`), median.employees =median(`employees`), min.employees =min(`employees`), max.employees =max(`employees`), sd.employees =sd(`employees`), var.employees =var(`employees`), IQR.employees =IQR(`employees`))railroad_df%>%filter(state =="VT")%>%group_by(county)%>%arrange(county, desc(employees))railroad_df%>%filter(state =="WY")%>%summarise (mean.employees =mean(`employees`), median.employees =median(`employees`), min.employees =min(`employees`), max.employees =max(`employees`), sd.employees =sd(`employees`), var.employees =var(`employees`), IQR.employees =IQR(`employees`))railroad_df%>%filter(state =="WY")%>%group_by(county)%>%arrange(county, desc(employees))```