library(tidyverse)
library(readr)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Challenge 2 - Railroads
Read in the Data
data <- read_csv("_data/railroad_2012_clean_county.csv")
data
# A tibble: 2,930 × 3
state county total_employees
<chr> <chr> <dbl>
1 AE APO 2
2 AK ANCHORAGE 7
3 AK FAIRBANKS NORTH STAR 2
4 AK JUNEAU 3
5 AK MATANUSKA-SUSITNA 2
6 AK SITKA 1
7 AK SKAGWAY MUNICIPALITY 88
8 AL AUTAUGA 102
9 AL BALDWIN 143
10 AL BARBOUR 1
# … with 2,920 more rows
Describe the data
The column names are as follows:
colnames(data)
[1] "state" "county" "total_employees"
The class of each column is as follows:
as.data.frame(sapply(data, class))
sapply(data, class)
state character
county character
total_employees numeric
Two columns are of character type, so we can treat them as categorical. The number of distinct values in each is as follows:
data %>%
  select(state, county) %>%
  summarize_all(n_distinct)
# A tibble: 1 × 2
state county
<int> <int>
1 53 1709
So there are 53 distinct state codes and 1,709 distinct county names. The state codes include non-state entries such as AE, AP and DC, and because the table has 2,930 rows, some county names must occur in more than one state.
Grouped Summary Statistics
total_employees is numeric, so we can compute a measure of central tendency (mean) and a measure of dispersion (standard deviation) for each state:
data %>%
  group_by(state) %>%
  select(state, county, total_employees) %>%
  summarize(county_count = n(), mean_employees = mean(total_employees), sum_employees = sum(total_employees), dispersion = sd(total_employees, na.rm = TRUE))
# A tibble: 53 × 5
state county_count mean_employees sum_employees dispersion
<chr> <int> <dbl> <dbl> <dbl>
1 AE 1 2 2 NA
2 AK 6 17.2 103 34.8
3 AL 67 63.5 4257 130.
4 AP 1 1 1 NA
5 AR 72 53.8 3871 131.
6 AZ 15 210. 3153 228.
7 CA 55 239. 13137 549.
8 CO 57 64.0 3650 128.
9 CT 8 324 2592 520.
10 DC 1 279 279 NA
# … with 43 more rows
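The NA entries in the dispersion column come from groups with a single row. The sample standard deviation divides by n - 1, so it is undefined for one observation, and na.rm = TRUE does not change that. A minimal sketch in base R, using the AE and AK values from the first rows of the data shown earlier (the vector names are only illustrative):

```r
# AE has a single county in the data (APO, 2 employees)
ae <- c(2)
# AK has six counties, as listed in the first rows of the data
ak <- c(7, 2, 3, 2, 1, 88)

# sd() on one observation is undefined and returns NA
sd_ae <- sd(ae, na.rm = TRUE)
# sd() on six observations is defined; compare with the AK row of the table
sd_ak <- sd(ak, na.rm = TRUE)

print(sd_ae)
print(round(sd_ak, 1))
```

The AK result rounds to the 34.8 reported in the grouped table, while the AE result is NA for the reason given above.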
Next, we compute the same statistics grouped by county name:
data %>%
  group_by(county) %>%
  select(county, state, total_employees) %>%
  summarize(state_count = n(), mean_employees = mean(total_employees), sum_employees = sum(total_employees), dispersion = sd(total_employees, na.rm = TRUE))
# A tibble: 1,709 × 5
county state_count mean_employees sum_employees dispersion
<chr> <int> <dbl> <dbl> <dbl>
1 ABBEVILLE 1 124 124 NA
2 ACADIA 1 13 13 NA
3 ACCOMACK 1 4 4 NA
4 ADA 1 81 81 NA
5 ADAIR 4 7.25 29 9.32
6 ADAMS 12 73.2 878 155.
7 ADDISON 1 8 8 NA
8 AIKEN 1 193 193 NA
9 AITKIN 1 19 19 NA
10 ALACHUA 1 22 22 NA
# … with 1,699 more rows
Interpretation
When we group by state, a few groups (AE, AP and DC in the output above) contain only one row, so their standard deviation is NA. Most states contain several counties and therefore yield a dispersion measure. When we group by county name instead, many groups contain a single row, and names such as ADAMS appear in more than one state, so those groups mix unrelated counties. Grouping by state therefore gives a more meaningful picture of how many railroad employees are present in a region than grouping by county name.
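One way to check for county names shared across states, and to count actual counties rather than names, is sketched below. It assumes dplyr is installed and uses a small toy table whose states, counties and employee counts are made up for illustration. The same two pipelines should apply to data unchanged.

```r
library(dplyr)

# Toy data: every value below is hypothetical, chosen only to show the idea
toy <- tibble(
  state = c("IL", "IN", "OH", "IL"),
  county = c("ADAMS", "ADAMS", "ADAMS", "COOK"),
  total_employees = c(10, 20, 30, 40)
)

# County names that occur in more than one state
repeated_names <- toy %>%
  count(county, name = "state_count") %>%
  filter(state_count > 1)

# Number of distinct (state, county) pairs, i.e. actual counties
n_counties <- toy %>%
  distinct(state, county) %>%
  nrow()

print(repeated_names)
print(n_counties)
```

Treating the (state, county) pair as the unit of analysis avoids pooling unrelated counties that happen to share a name.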