Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Kris Smole
March 1, 2023
Railroads dataset contains distinct observations of total employee counts, by counties within US states/principalities.
Let’s review some of the basics of this dataset: it’s overall size, its content in its columns and rows (which are the variables and observations, respectively, and how this dataset might have been compiled. We can do this by looking at the dataset itself:
# A tibble: 2,930 × 3
state county total_employees
<chr> <chr> <dbl>
1 AE APO 2
2 AK ANCHORAGE 7
3 AK FAIRBANKS NORTH STAR 2
4 AK JUNEAU 3
5 AK MATANUSKA-SUSITNA 2
6 AK SITKA 1
7 AK SKAGWAY MUNICIPALITY 88
8 AL AUTAUGA 102
9 AL BALDWIN 143
10 AL BARBOUR 1
# … with 2,920 more rows
We can see in the results above, and confirmed in the results showing the entire dataframe contained in the file, that this dataset contains a count of rows (2930) and columns (3).
We also see in the left-most column, familiar state abbreviations, as well as some other abbreviations such as AE, AP, and DC, representing 53 state and military/governmental principalities of the US. Next column to the right, we see specific county names that relate to the identified states and principalities. The right-most column provides the total count of employees for the stated county and state.
The observations of this dataset are the rows of data, which record the number of employees within each county that the railroad operates, while the variable is the county and state combination for each railroad employment site.
How many counties have distinct names within this dataset?
The answer of 1709 distinct county names indicates that some county names repeat across various states, because the count of county names are not equal to 2930, the total number of distinct observations by counties within states/principalities. We are able to use a function named “distinct” on the county name column to find this answer.
So, we know this dataset represents information about a railroad or railroads that operate in the stated counties of the United States. In terms of the source of the data, It is likely the data was compiled by an individual railroad company, or some trade or regulatory or governmental body about one railroad, or about several railroads. The data does not include definitive data to know exactly which of these possibilities are true. Unless or until we know the exact source of the data, we can make suppositions through verifying the employee counts and locations of operations by railroad(s), if relevant and important to identify the specific data source.
Although the dataframe within the dataset file is a solid starting point, the dataframe in its current state contains minimalcomparative or analytical information about the employees of the railroad(s) working within various counties across the states/principalities. Thankfully, we have the tools to use the existing data in the dataset and find more information using sorting, arithmetic and statistical calculations.
Let’s start with some basic descriptive statistics of this dataset.
First, we see the total of all employees by county and state.
# A tibble: 1 × 1
`sum(total_employees)`
<dbl>
1 255432
The overall total count of employees for the railroad(s) in the counties and states of the US in this dataframe is 255,432 employees.
Below is the mean, or average count of railroad employees by county.
# A tibble: 1 × 1
`mean(total_employees)`
<dbl>
1 87.2
The counties of the US states in which these railroad operations exist have an average employee count of 87.2 employees. Because we speak in terms of individuals, we will use a whole number of 87 as the approximate average count of employees by county.
Next is the median of railroad employees by county.
# A tibble: 1 × 1
`median(total_employees)`
<dbl>
1 21
The counties of the US states in which these railroad operations exist have a median employee count of 21 employees. This statistical calculation results in a whole number because the median is the most frequently occurring observation within the dataframe, not a calculation per se, as the mean/average is.
And, finally, we have the mode of railroad employees by county
The counties of the US states in which these railroad operations exist have a mode of employee count of 1 employee. This statistical result in a whole number, like the result for the median, because the mode and median are the statistical occurrences of actual dataframe observations, not a calculation per se. The mode is the most frequently occurring observation within the dataframe. Remember this as we move forward to view the standard deviation and quantiles of this dataset.
Now we’ll look at the lowest employee count among the counties. The mode tells us that the count of 1 employee is the most frequent count of employees by county, so we have a good guess as to the lowest count of employees among the counties (ie: 1).
We see the lowest employee count among the counties is 1 employee.
We want to see the highest count of employees among the counties.
# A tibble: 1 × 1
`max(total_employees)`
<dbl>
1 8207
The highest count of employees among the counties is 8,207. We knew that of the 2930 counties, some of them had to have larger numbers of employee counts to get to an overall employee total count of 255,432, since employee counts of one would not bring us to 255,432 over 2930 counties if the employee count was 87, or 21, as we saw in the mean and median of the dataset.
In a bit, we will see in the quantiles and standard deviation of the dataframe that the employee count of 1 has many occurrences within the dataframe, as previously indicated by the mode. With over 255,432 employees across 53 states/principalities, a mode of 1, an average of 87, and a median of 21, and the largest county count of employees of 8,207, the dispersion of employee counts will likely be skewed closer to employee counts of 1 than to the employee count of 8,207. Does that mean more counties have employee counts of less than 87 than employee counts of 88 or more? Yes.
What are the quantiles of the dataframe?
# A tibble: 5 × 1
`quantile(total_employees)`
<dbl>
1 1
2 7
3 21
4 65
5 8207
The 1st quantile of the count of employees by county begins at 1, and ends with 6.
The 2nd quantile of the count of employees by county begins at 7 and ends with 20.
The 3rd quantile of the count of employees by county begins at 21 and ends with 64.
The 4th quantile of the count of employees by county begins at 65 and ends with 8207.
We are seeing that the dataset has some very large counts of employees, and some very low counts of employees. From what we see in the ranges of employee counts in the quantile output, the 4th quantile of counts of employees by county has a very large range of employee counts, from 65 employees to 8,207 employees. The range of 65 employees to 8,207 employees is a very large range of employee counts, given the other 3 quantiles of occurrences of employee counts by county span only the count of 1 to the count of 64. The 4th quantile requires more examination, that will likely be best done through creating a new dataframe of only the 4th quantile to allow statistical calculations to examine the larger range of the 4th quantile, to provide more meaningful analysis of all of the original dataframe, since the first 3 quantiles have reltaively little variation in employee counts in comparison to the 4th quantile.
This makes us curious about the standard deviation of the individual counts of employees by county. Like the name conveys, the standard deviation ‘standardizes’ the variance of each observation’s difference from the dataset mean of 87.17816. Or, said differently, the standard deviation helps make the variances of each employee count from the employee mean more comparable.
# A tibble: 2,930 × 1
total_employees
<dbl>
1 2
2 7
3 2
4 3
5 2
6 1
7 88
8 102
9 143
10 1
# … with 2,920 more rows
# A tibble: 1 × 1
`sd(total_employees)`
<dbl>
1 284.
We see the standard deviation of all employee counts by county in the dataset is 284 from the mean of 87.2. Standard deviation is different than the description of the variance between each employee count observation and the mean of the overall dataset employee count. Each employee count varies, or has a variance from the mean that is different from 283.6359. The standard deviation is a mathematical standarizing of the variances to help make the collection of variances more general, and more simple to understand. Examining the 4th quantile independently of the first 3 quantiles will provide a different standard deviation, mean, median, mode and set of quantiles. This may lead to yet another, new, separate dataframe so as to examine the resulting new quantiles, in pursuit of finding a more evenly distributed set of quantiles - quantiles that might approximate a normal distribution. That’s ahead in our work…
Often, viewers of data like this dataset, want to know information by state, which this dataset doesn’t provide in summary. We can now look at some descriptive statistical measures of the states. We look at the highest employee count by state, the lowest employee count by state, the mean (or average) of each state in alphbetical order, and median of each state in alphbetical order.
Below we have the number of employees summarizing the county employee counts of each state, in alphabetical order of state.
# A tibble: 53 × 2
state total_employees
<chr> <dbl>
1 AE 2
2 AK 103
3 AL 4257
4 AP 1
5 AR 3871
6 AZ 3153
7 CA 13137
8 CO 3650
9 CT 2592
10 DC 279
# … with 43 more rows
Knowing the states with the highest number of employees and the lowest number of employees might be of interest. What follows are those two summarized listings of the highest and lowest counts of employees among the counties by state, in alphabetic order of states. A rank order of the highest and lowest employee counts is in the works for the future.
# A tibble: 53 × 2
state total_employees
<chr> <dbl>
1 AE 2
2 AK 88
3 AL 990
4 AP 1
5 AR 972
6 AZ 749
7 CA 2888
8 CO 553
9 CT 1561
10 DC 279
# … with 43 more rows
# A tibble: 53 × 2
state total_employees
<chr> <dbl>
1 AE 2
2 AK 1
3 AL 1
4 AP 1
5 AR 1
6 AZ 3
7 CA 1
8 CO 1
9 CT 26
10 DC 279
# … with 43 more rows
The mean of each state’s county employee counts, or, the average number of employees of all the counties within each state. The list is in alphabetical order of the states.
# A tibble: 53 × 2
state total_employees
<chr> <dbl>
1 AE 2
2 AK 17.2
3 AL 63.5
4 AP 1
5 AR 53.8
6 AZ 210.
7 CA 239.
8 CO 64.0
9 CT 324
10 DC 279
# … with 43 more rows
The median of each state’s county employee counts, or, the median number of employees among the counties within each state, listed in alphabetical order of states.
We’ve uncovered more information of this dataset in the use of the tools of R Studio performed above. More statistical analysis of this dataset is ahead, using the information summarized by state, and by creating new dataframes from subsets of the original dataset.The questions could be a simple as: 1) How many counties have more than 87 employees? The count of 87 employees was selected because 87 is the overall mean of the dataset of employee count by county.
OR
OR
---
title: "Challenge 2 Using dataset 'railroads’"
author: "Kris Smole"
date: "03/01/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- railroads
- faostat
- hotel_bookings
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Description of the dataset 'Railroads'
Railroads dataset contains distinct observations of total employee counts, by counties within US states/principalities.
Let's review some of the basics of this dataset: it’s overall size, its content in its columns and rows (which are the variables and observations, respectively, and how this dataset might have been compiled. We can do this by looking at the dataset itself:
```{r}
railroad <-read_csv("_data/railroad_2012_clean_county.csv")
library(modeest)
```
```{r}
select(railroad, state, county, total_employees)
```
We can see in the results above, and confirmed in the results showing the entire dataframe contained in the file, that this dataset contains a count of rows (2930) and columns (3).
We also see in the left-most column, familiar state abbreviations, as well as some other abbreviations such as AE, AP, and DC, representing 53 state and military/governmental principalities of the US. Next column to the right, we see specific county names that relate to the identified states and principalities. The right-most column provides the total count of employees for the stated county and state.
The observations of this dataset are the rows of data, which record the number of employees within each county that the railroad operates, while the variable is the county and state combination for each railroad employment site.
How many counties have distinct names within this dataset?
```{r}
railroad%>%
select(county)%>%
n_distinct(.)
```
The answer of 1709 distinct county names indicates that some county names repeat across various states, because the count of county names are not equal to 2930, the total number of distinct observations by counties within states/principalities.
We are able to use a function named “distinct” on the county name column to find this answer.
So, we know this dataset represents information about a railroad or railroads that operate in the stated counties of the United States. In terms of the source of the data, It is likely the data was compiled by an individual railroad company, or some trade or regulatory or governmental body about one railroad, or about several railroads. The data does not include definitive data to know exactly which of these possibilities are true. Unless or until we know the exact source of the data, we can make suppositions through verifying the employee counts and locations of operations by railroad(s), if relevant and important to identify the specific data source.
Although the dataframe within the dataset file is a solid starting point, the dataframe in its current state contains minimalcomparative or analytical information about the employees of the railroad(s) working within various counties across the states/principalities. Thankfully, we have the tools to use the existing data in the dataset and find more information using sorting, arithmetic and statistical calculations.
Let's start with some basic descriptive statistics of this dataset.
First, we see the total of all employees by county and state.
```{r}
summarize(railroad,sum(`total_employees`))
```
The overall total count of employees for the railroad(s) in the counties and states of the US in this dataframe is 255,432 employees.
Below is the mean, or average count of railroad employees by county.
```{r}
summarize(railroad,mean(`total_employees`))
```
The counties of the US states in which these railroad operations exist have an average employee count of 87.2 employees. Because we speak in terms of individuals, we will use a whole number of 87 as the approximate average count of employees by county.
Next is the median of railroad employees by county.
```{r}
summarize(railroad, median(`total_employees`))
```
The counties of the US states in which these railroad operations exist have a median employee count of 21 employees. This statistical calculation results in a whole number because the median is the most frequently occurring observation within the dataframe, not a calculation per se, as the mean/average is.
And, finally, we have the mode of railroad employees by county
```{r}
mode=mfv(railroad$total_employees)
print (mode)
```
The counties of the US states in which these railroad operations exist have a mode of employee count of 1 employee. This statistical result in a whole number, like the result for the median, because the mode and median are the statistical occurrences of actual dataframe observations, not a calculation per se. The mode is the most frequently occurring observation within the dataframe. Remember this as we move forward to view the standard deviation and quantiles of this dataset.
Now we'll look at the lowest employee count among the counties. The mode tells us that the count of 1 employee is the most frequent count of employees by county, so we have a good guess as to the lowest count of employees among the counties (ie: 1).
```{r}
summarize(railroad, min(`total_employees`))
```
We see the lowest employee count among the counties is 1 employee.
We want to see the highest count of employees among the counties.
```{r}
summarize(railroad, max(`total_employees`))
```
The highest count of employees among the counties is 8,207. We knew that of the 2930 counties, some of them had to have larger numbers of employee counts to get to an overall employee total count of 255,432, since employee counts of one would not bring us to 255,432 over 2930 counties if the employee count was 87, or 21, as we saw in the mean and median of the dataset.
In a bit, we will see in the quantiles and standard deviation of the dataframe that the employee count of 1 has many occurrences within the dataframe, as previously indicated by the mode. With over 255,432 employees across 53 states/principalities, a mode of 1, an average of 87, and a median of 21, and the largest county count of employees of 8,207, the dispersion of employee counts will likely be skewed closer to employee counts of 1 than to the employee count of 8,207. Does that mean more counties have employee counts of less than 87 than employee counts of 88 or more? Yes.
What are the quantiles of the dataframe?
```{r}
summarize(railroad,quantile(`total_employees`))
```
The 1st quantile of the count of employees by county begins at 1, and ends with 6.
The 2nd quantile of the count of employees by county begins at 7 and ends with 20.
The 3rd quantile of the count of employees by county begins at 21 and ends with 64.
The 4th quantile of the count of employees by county begins at 65 and ends with 8207.
We are seeing that the dataset has some very large counts of employees, and some very low counts of employees. From what we see in the ranges of employee counts in the quantile output, the 4th quantile of counts of employees by county has a very large range of employee counts, from 65 employees to 8,207 employees. The range of 65 employees to 8,207 employees is a very large range of employee counts, given the other 3 quantiles of occurrences of employee counts by county span only the count of 1 to the count of 64. The 4th quantile requires more examination, that will likely be best done through creating a new dataframe of only the 4th quantile to allow statistical calculations to examine the larger range of the 4th quantile, to provide more meaningful analysis of all of the original dataframe, since the first 3 quantiles have reltaively little variation in employee counts in comparison to the 4th quantile.
This makes us curious about the standard deviation of the individual counts of employees by county. Like the name conveys, the standard deviation 'standardizes' the variance of each observation's difference from the dataset mean of 87.17816. Or, said differently, the standard deviation helps make the variances of each employee count from the employee mean more comparable.
```{r}
select(railroad,`total_employees`)
summarize(railroad,sd(`total_employees`))
```
We see the standard deviation of all employee counts by county in the dataset is 284 from the mean of 87.2. Standard deviation is different than the description of the variance between each employee count observation and the mean of the overall dataset employee count. Each employee count varies, or has a variance from the mean that is different from 283.6359. The standard deviation is a mathematical standarizing of the variances to help make the collection of variances more general, and more simple to understand. Examining the 4th quantile independently of the first 3 quantiles will provide a different standard deviation, mean, median, mode and set of quantiles. This may lead to yet another, new, separate dataframe so as to examine the resulting new quantiles, in pursuit of finding a more evenly distributed set of quantiles - quantiles that might approximate a normal distribution. That's ahead in our work...
## Grouped Summary Statistics - Looking at our Dataset by State totals
Often, viewers of data like this dataset, want to know information by state, which this dataset doesn’t provide in summary. We can now look at some descriptive statistical measures of the states. We look at the highest employee count by state, the lowest employee count by state, the mean (or average) of each state in alphbetical order, and median of each state in alphbetical order.
Below we have the number of employees summarizing the county employee counts of each state, in alphabetical order of state.
```{r}
railroad%>%
group_by(state)%>%
select (`total_employees`)%>%
summarize_all(sum, na.rm=TRUE)
```
Knowing the states with the highest number of employees and the lowest number of employees might be of interest. What follows are those two summarized listings of the highest and lowest counts of employees among the counties by state, in alphabetic order of states. A rank order of the highest and lowest employee counts is in the works for the future.
```{r}
railroad%>%
group_by(state)%>%
select (`total_employees`)%>%
summarize_all(max, na.rm=TRUE)
railroad%>%
group_by(state)%>%
select (`total_employees`)%>%
summarize_all(min, na.rm=TRUE)
```
The mean of each state’s county employee counts, or, the average number of employees of all the counties within each state. The list is in alphabetical order of the states.
```{r}
railroad%>%
group_by(state)%>%
select (`total_employees`)%>%
summarize_all(mean, na.rm=TRUE)
```
The median of each state’s county employee counts, or, the median number of employees among the counties within each state, listed in alphabetical order of states.
```{r}
railroad%>%
group_by(state)%>%
select (`total_employees`)%>%
summarize_all(median, na.rm=TRUE)
```
## CONCLUSION
We've uncovered more information of this dataset in the use of the tools of R Studio performed above. More statistical analysis of this dataset is ahead, using the information summarized by state, and by creating new dataframes from subsets of the original dataset.The questions could be a simple as: 1) How many counties have more than 87 employees? The count of 87 employees was selected because 87 is the overall mean of the dataset of employee count by county.
OR
2) How many counties have more than 65 employees? The count of 65 employees was selected because 65 is the count of employees of the beginning of the 4th quantile of the dataset.
OR
3) What is the standard deviation of the new dataframe of the 4th quantile? Does this standard deviation indicate we might want to further explore subsets of the 4th quantile and perform statistical analysis on that new dataframe?