Code
library(tidyverse)
library(dplyr)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Poobigan Murugesan
May 9, 2023
Today’s challenge is to
Read in one (or more) of the following data sets, available in the posts/_data
folder, using the correct R package and command.
# A tibble: 2,930 × 3
state county total_employees
<chr> <chr> <dbl>
1 AE APO 2
2 AK ANCHORAGE 7
3 AK FAIRBANKS NORTH STAR 2
4 AK JUNEAU 3
5 AK MATANUSKA-SUSITNA 2
6 AK SITKA 1
7 AK SKAGWAY MUNICIPALITY 88
8 AL AUTAUGA 102
9 AL BALDWIN 143
10 AL BARBOUR 1
# ℹ 2,920 more rows
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
The dataset consists of 2930 rows and 3 columns.
By using the ‘str’ function, we can obtain information about the data types, lengths, contents, and other relevant information pertaining to each column. The dataset is comprised of three distinct columns: ‘State’ (which is of character type), ‘County’ (also of character type), and ‘Total Employees’ (which is of numeric type). From the dataset, we can infer that it pertains to railroad employees from various states and counties.
spc_tbl_ [2,930 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ state : chr [1:2930] "AE" "AK" "AK" "AK" ...
$ county : chr [1:2930] "APO" "ANCHORAGE" "FAIRBANKS NORTH STAR" "JUNEAU" ...
$ total_employees: num [1:2930] 2 7 2 3 2 1 88 102 143 1 ...
- attr(*, "spec")=
.. cols(
.. state = col_character(),
.. county = col_character(),
.. total_employees = col_double()
.. )
- attr(*, "problems")=<externalptr>
Conduct some exploratory data analysis, using dplyr commands such as group_by()
, select()
, filter()
, and summarise()
. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
Number of unique states in the data
The dataset contains information on the total number of railroad employees across 53 distinct states.
Which county has the maximum number of employees?
# A tibble: 1 × 2
employee_count maxCounty
<dbl> <chr>
1 8207 COOK
The county that has the highest count of railroad workers in this dataset is COOK, with a total of 8207 employees.
Which county has the minimum number of employees?
# A tibble: 145 × 2
employee_count minCounty
<dbl> <chr>
1 1 SITKA
2 1 BARBOUR
3 1 HENRY
4 1 APO
5 1 NEWTON
6 1 MONO
7 1 BENT
8 1 CHEYENNE
9 1 COSTILLA
10 1 DOLORES
# ℹ 135 more rows
Based on the above output, we can determine that the minimum number of railroad employees in any county is 1, and there are a total of 145 counties with only one employee.
Grouping railroad employees by state to figure out which states have the most railroad employees.
# A tibble: 53 × 2
state employees
<chr> <dbl>
1 TX 19839
2 IL 19131
3 NY 17050
4 NE 13176
5 CA 13137
6 PA 12769
7 OH 9056
8 GA 8605
9 IN 8537
10 MO 8419
# ℹ 43 more rows
Based on the results shown above, we can see that the state of Texas has the largest number of railroad employees, with a count of 19839 followed by Illinois and New York with 19131 and 17050 employees respectively. Also, from the dimensions of the group dataset we can conclude that there are 53 distinct states where people are employed to work railroads.
Grouping the data by state and calculating the mean of total employees in each state based on the number of counties.
# A tibble: 53 × 2
state mean
<chr> <dbl>
1 DE 498.
2 NJ 397.
3 CT 324
4 MA 282.
5 NY 280.
6 DC 279
7 CA 239.
8 AZ 210.
9 PA 196.
10 MD 196.
# ℹ 43 more rows
Based on the results above, we can conclude that the state of DE has the highest average number of railroad employees per county, which is 498
Grouping the data by state and calculating the median of total employees in each state based on the number of counties.
# A tibble: 53 × 2
state median
<chr> <dbl>
1 NJ 296
2 DC 279
3 MA 271
4 DE 158
5 CT 125
6 MD 108.
7 AZ 94
8 PA 85
9 NY 71
10 CA 61
# ℹ 43 more rows
Mode - Extracting the state with highest number of counties with railroad employees.
# A tibble: 1 × 2
state counties
<chr> <int>
1 TX 221
The result of this specific code shows that Texas (TX) has the highest number of counties with railroad employees, with a count of 221.
Extracting information about railroad employees specifically from the counties in TX using a filter operation.
# A tibble: 221 × 3
state county total_employees
<chr> <chr> <dbl>
1 TX ANDERSON 241
2 TX ANDREWS 3
3 TX ANGELINA 53
4 TX ARANSAS 6
5 TX ARCHER 8
6 TX ARMSTRONG 12
7 TX ATASCOSA 64
8 TX AUSTIN 35
9 TX BAILEY 5
10 TX BANDERA 15
# ℹ 211 more rows
Double-checking the number of counties in TX, the filter and count commands confirm that there are indeed 221 counties with railroad employees in TX as previously determined.
Grouping the data by state and calculating the median of total employees in each state based on the number of counties.
# A tibble: 53 × 2
state std
<chr> <dbl>
1 IL 829.
2 DE 674.
3 NY 591.
4 CA 549.
5 CT 520.
6 NE 512.
7 FL 386.
8 TX 350.
9 VA 341.
10 NJ 338.
# ℹ 43 more rows
Grouping data by state and analyzing its summary
state county total_employees
Length:2930 Length:2930 Min. : 1.00
Class :character Class :character 1st Qu.: 7.00
Mode :character Mode :character Median : 21.00
Mean : 87.18
3rd Qu.: 65.00
Max. :8207.00
Based on the quartile distributions, we can conclude that many states have a low number of employees, indicating that the distribution of employees among states is not evenly distributed and is instead skewed.
I selected this specific group to analyze the distribution of railroad employees by state. The data shows that there are employees in 53 states, and the county with the highest number of employees is COOK in IL with 8207 employees, while 145 counties have only one railroad employee.
When we grouped the data by state, we found that TX has the highest number of railroad employees with 19839, while AP has the lowest with only one employee. DE has the highest mean number of employees with 498 per county, and AP has the lowest mean number with only one employee. NJ has the highest median number of employees per county with 296. Additionally, we found that TX has the maximum number of entries, which is the mode of the number of counties with 221 counties.
---
title: "Challenge 2"
author: "Poobigan Murugesan"
description: "Data wrangling: using group() and summarise()"
date: "05/09/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- railroads
- poobigan murugesan
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.
- railroad\*.csv or StateCounty2012.xls ⭐
- FAOstat\*.csv or birds.csv ⭐⭐⭐
- hotel_bookings.csv ⭐⭐⭐⭐
```{r}
df<- read_csv("_data/railroad_2012_clean_county.csv")
print(df,show_col_types=FALSE)
```
## Describe the data
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
```{r}
#| label: summary
dim(df)
```
The dataset consists of 2930 rows and 3 columns.
By using the 'str' function, we can obtain information about the data types, lengths, contents, and other relevant information pertaining to each column. The dataset is comprised of three distinct columns: 'State' (which is of character type), 'County' (also of character type), and 'Total Employees' (which is of numeric type). From the dataset, we can infer that it pertains to railroad employees from various states and counties.
```{r}
str(df)
```
## Provide Grouped Summary Statistics
Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
Number of unique states in the data
```{r}
count(distinct(select(df, state)))
```
The dataset contains information on the total number of railroad employees across 53 distinct states.
Which county has the maximum number of employees?
```{r}
df %>% summarise(employee_count = max(total_employees), maxCounty = county[which.max(total_employees)])
```
The county that has the highest count of railroad workers in this dataset is COOK, with a total of 8207 employees.
Which county has the minimum number of employees?
```{r}
df %>% summarize(employee_count = min(total_employees), minCounty = county[which(total_employees == min(total_employees))])
```
Based on the above output, we can determine that the minimum number of railroad employees in any county is 1, and there are a total of 145 counties with only one employee.
Grouping railroad employees by state to figure out which states have the most railroad employees.
```{r}
group <- df %>%
group_by(state) %>%
summarize(employees=sum(total_employees)) %>%
arrange(desc(employees))
group
```
Based on the results shown above, we can see that the state of Texas has the largest number of railroad employees, with a count of 19839 followed by Illinois and New York with 19131 and 17050 employees respectively. Also, from the dimensions of the group dataset we can conclude that there are 53 distinct states where people are employed to work railroads.
Grouping the data by state and calculating the mean of total employees in each state based on the number of counties.
```{r}
group <- df %>%
group_by(state) %>%
summarize(mean=mean(total_employees)) %>%
arrange(desc(mean))
group
```
Based on the results above, we can conclude that the state of DE has the highest average number of railroad employees per county, which is 498
Grouping the data by state and calculating the median of total employees in each state based on the number of counties.
```{r}
group <- df %>%
group_by(state) %>%
summarize(median=median(total_employees)) %>%
arrange(desc(median))
group
```
Mode - Extracting the state with highest number of counties with railroad employees.
```{r}
df %>%
group_by(state) %>%
summarize(counties = n()) %>%
arrange(desc(counties)) %>%
slice(1)
```
The result of this specific code shows that Texas (TX) has the highest number of counties with railroad employees, with a count of 221.
Extracting information about railroad employees specifically from the counties in TX using a filter operation.
```{r}
counties_tx <-filter(df, state=='TX')
counties_tx
```
Double-checking the number of counties in TX, the filter and count commands confirm that there are indeed 221 counties with railroad employees in TX as previously determined.
Grouping the data by state and calculating the median of total employees in each state based on the number of counties.
```{r}
group <- df %>%
group_by(state) %>%
summarize(std=sd(total_employees)) %>%
arrange(desc(std))
group
```
Grouping data by state and analyzing its summary
```{r}
group <- df %>%
group_by(state)
summary(group)
```
Based on the quartile distributions, we can conclude that many states have a low number of employees, indicating that the distribution of employees among states is not evenly distributed and is instead skewed.
### Explain and Interpret
I selected this specific group to analyze the distribution of railroad employees by state. The data shows that there are employees in 53 states, and the county with the highest number of employees is COOK in IL with 8207 employees, while 145 counties have only one railroad employee.
When we grouped the data by state, we found that TX has the highest number of railroad employees with 19839, while AP has the lowest with only one employee. DE has the highest mean number of employees with 498 per county, and AP has the lowest mean number with only one employee. NJ has the highest median number of employees per county with 296. Additionally, we found that TX has the maximum number of entries, which is the mode of the number of counties with 221 counties.