Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Maanusri Balasubramanian
May 3, 2023
Today’s challenge is to
Read in one (or more) of the following data sets, available in the posts/_data
folder, using the correct R package and command.
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
[1] 2930 3
spc_tbl_ [2,930 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ state : chr [1:2930] "AE" "AK" "AK" "AK" ...
$ county : chr [1:2930] "APO" "ANCHORAGE" "FAIRBANKS NORTH STAR" "JUNEAU" ...
$ total_employees: num [1:2930] 2 7 2 3 2 1 88 102 143 1 ...
- attr(*, "spec")=
.. cols(
.. state = col_character(),
.. county = col_character(),
.. total_employees = col_double()
.. )
- attr(*, "problems")=<externalptr>
From the above results, we know that the dataset has 2930 rows and 3 columns.
The 3 columns in the dataset are: state, county and total_employees.
# A tibble: 6 × 3
state county total_employees
<chr> <chr> <dbl>
1 AE APO 2
2 AK ANCHORAGE 7
3 AK FAIRBANKS NORTH STAR 2
4 AK JUNEAU 3
5 AK MATANUSKA-SUSITNA 2
6 AK SITKA 1
We can infer that the dataset gives us the total number of railroad employees in a county in a state in 2012 from this.
Conduct some exploratory data analysis, using dplyr commands such as group_by()
, select()
, filter()
, and summarise()
. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
# A tibble: 1 × 1
n
<int>
1 53
Total railroad employee information about 53 unique states is present in this dataset.
# A tibble: 1 × 2
employee_count maxCounty
<dbl> <chr>
1 8207 COOK
COOK is the county with the maximum number of railroad employees (8207).
# A tibble: 145 × 2
employee_count minCounty
<dbl> <chr>
1 1 SITKA
2 1 BARBOUR
3 1 HENRY
4 1 APO
5 1 NEWTON
6 1 MONO
7 1 BENT
8 1 CHEYENNE
9 1 COSTILLA
10 1 DOLORES
# ℹ 135 more rows
From the above result, we know that there are 145 counties with only 1 railroad employee (which is the minimum employee count in any county).
# A tibble: 53 × 2
state state_employees
<chr> <dbl>
1 TX 19839
2 IL 19131
3 NY 17050
4 NE 13176
5 CA 13137
6 PA 12769
7 OH 9056
8 GA 8605
9 IN 8537
10 MO 8419
# ℹ 43 more rows
From the above results we know that TX has the highest number of rail road employees (19839) and AP has the least number of railroad employees (1).
# A tibble: 53 × 2
state mean_state_employees
<chr> <dbl>
1 DE 498.
2 NJ 397.
3 CT 324
4 MA 282.
5 NY 280.
6 DC 279
7 CA 239.
8 AZ 210.
9 PA 196.
10 MD 196.
# ℹ 43 more rows
From the above results we know that DE has the highest mean number of rail road employees (498.33) and AP has the least mean number of railroad employees (1). That is in DE, the mean number of employees in each county is 498.33.
# Grouping in terms of state to summarise median total_employees in each state wrt number of counties and arranging accordingly
grouped_rr_state_median <- rr%>%
group_by(state)%>%
summarize(median_state_employees = median(total_employees))%>%
arrange(desc(`median_state_employees`))
grouped_rr_state_median
# A tibble: 53 × 2
state median_state_employees
<chr> <dbl>
1 NJ 296
2 DC 279
3 MA 271
4 DE 158
5 CT 125
6 MD 108.
7 AZ 94
8 PA 85
9 NY 71
10 CA 61
# ℹ 43 more rows
From the above results we can infer the median number of railroad employees in each state wrt the counties in the state with employees.
# A tibble: 1 × 2
state n
<chr> <int>
1 TX 221
We are finding the mode of the number of counties in any state with the above command. We are counting the number of entries for a particular state, which gives us the number of counties in the state with railroad employees. We then find the state with the maximum number of entries (mode).
We know that TX has the highest number of counties with railroad employees. There are 221 counties in TX with railroad employees.
# A tibble: 1 × 1
n
<int>
1 221
Verifying the number of counties in TX with the filter and count commands again. There are 221 counties as returned earlier.
# Grouping in terms of state to summarise standard deviation of total_employees in each state wrt number of counties and arranging accordingly
grouped_rr_state_sd <- rr%>%
group_by(state)%>%
summarize(sd_state_employees = sd(total_employees))%>%
arrange(desc(`sd_state_employees`))
grouped_rr_state_sd
# A tibble: 53 × 2
state sd_state_employees
<chr> <dbl>
1 IL 829.
2 DE 674.
3 NY 591.
4 CA 549.
5 CT 520.
6 NE 512.
7 FL 386.
8 TX 350.
9 VA 341.
10 NJ 338.
# ℹ 43 more rows
From the above result, we rank the states according to the standard deviation of railroad employee count.
state county total_employees
Length:2930 Length:2930 Min. : 1.00
Class :character Class :character 1st Qu.: 7.00
Mode :character Mode :character Median : 21.00
Mean : 87.18
3rd Qu.: 65.00
Max. :8207.00
From the quartile distributions we can infer that lot of the states have less number of employees, i.e the distribution of employees state-wise is skewed.
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
I chose this specific group to understand the distribution of railroad employees statewise. We can infer that there are railroad employees in 53 unique states, ‘COOK’ in IL is the county with the highest number of railroad employees (8207) and there are 145 counties with only 1 railroad employee.
From this grouping analysis based on states, we got to know that TX has the highest number of rail road employees (19839) and AP has the least number of railroad employees (1). And DE has the highest mean number of rail road employees (498.33) and AP has the least mean number of railroad employees (1) in its counties. NJ is the state with the highest median of employees county-wise: 296. We also find that TX is the state with maximum number of entries (mode of number of entries), which is in-turn the number of counties. So TX is the state with the maximum number of counties, which is 221.
---
title: "Challenge 2"
author: "Maanusri Balasubramanian"
description: "Data wrangling: using group() and summarise()"
date: "05/03/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- maanusri balasubramanian
- railroads
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.
- railroad\*.csv or StateCounty2012.xls ⭐
- FAOstat\*.csv or birds.csv ⭐⭐⭐
- hotel_bookings.csv ⭐⭐⭐⭐
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
```{r}
# reading the CSV file
rr <- read_csv("_data/railroad_2012_clean_county.csv")
```
## Describe the data
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
```{r}
#| label: summary
# printing number of rows and columns in the dataset
dim(rr)
# description of the dataset
str(rr)
```
From the above results, we know that the dataset has 2930 rows and 3 columns.
```{r}
colnames(rr)
```
The 3 columns in the dataset are: state, county and total_employees.
```{r}
head(rr)
```
We can infer that the dataset gives us the total number of railroad employees in a county in a state in 2012 from this.
## Provide Grouped Summary Statistics
Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
```{r}
# Count of unique states in the dataset
states <- select(rr, `state`)
# library("dplyr")
count(distinct(states))
```
Total railroad employee information about 53 unique states is present in this dataset.
```{r}
# retrieving information of the county with the maximum number of railroad employees
summarize(rr, employee_count = max(total_employees), maxCounty = county[which.max(total_employees)])
```
COOK is the county with the maximum number of railroad employees (8207).
```{r}
# retrieving information of the county with the maximum number of railroad employees
reframe(rr, employee_count = min(total_employees), minCounty = county[which(total_employees == min(total_employees))])
```
From the above result, we know that there are 145 counties with only 1 railroad employee (which is the minimum employee count in any county).
```{r}
# Grouping in terms of state to summarise total_employees in each state and arranging accordingly
grouped_rr_state <- rr%>%
group_by(state)%>%
summarize(state_employees = sum(total_employees))%>%
arrange(desc(`state_employees`))
grouped_rr_state
```
From the above results we know that TX has the highest number of rail road employees (19839) and AP has the least number of railroad employees (1).
```{r}
# Grouping in terms of state to summarise mean total_employees in each state wrt number of counties and arranging accordingly
grouped_rr_state_mean <- rr%>%
group_by(state)%>%
summarize(mean_state_employees = mean(total_employees))%>%
arrange(desc(`mean_state_employees`))
grouped_rr_state_mean
```
From the above results we know that DE has the highest mean number of rail road employees (498.33) and AP has the least mean number of railroad employees (1). That is in DE, the mean number of employees in each county is 498.33.
```{r}
# Grouping in terms of state to summarise median total_employees in each state wrt number of counties and arranging accordingly
grouped_rr_state_median <- rr%>%
group_by(state)%>%
summarize(median_state_employees = median(total_employees))%>%
arrange(desc(`median_state_employees`))
grouped_rr_state_median
```
From the above results we can infer the median number of railroad employees in each state wrt the counties in the state with employees.
```{r}
# Mode of entries - statewise i.e the state with the maximum number of counties with railroad employees
rr %>%
count(state, sort=TRUE)%>%
slice(1)
```
We are finding the mode of the number of counties in any state with the above command. We are counting the number of entries for a particular state, which gives us the number of counties in the state with railroad employees. We then find the state with the maximum number of entries (mode).
We know that TX has the highest number of counties with railroad employees. There are 221 counties in TX with railroad employees.
```{r}
# Filtering the employee information of counties in TX
texas_counties <- filter(rr, state == 'TX')
count(texas_counties)
```
Verifying the number of counties in TX with the filter and count commands again. There are 221 counties as returned earlier.
```{r}
# Grouping in terms of state to summarise standard deviation of total_employees in each state wrt number of counties and arranging accordingly
grouped_rr_state_sd <- rr%>%
group_by(state)%>%
summarize(sd_state_employees = sd(total_employees))%>%
arrange(desc(`sd_state_employees`))
grouped_rr_state_sd
```
From the above result, we rank the states according to the standard deviation of railroad employee count.
```{r}
# grouping dataset based on the state
grouped_rr_state <- rr%>%
group_by(state)
# summary of this subset gives us info like min, max, median, mean and quantiles
summary(grouped_rr_state)
```
From the quartile distributions we can infer that lot of the states have less number of employees, i.e the distribution of employees state-wise is skewed.
### Explain and Interpret
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
I chose this specific group to understand the distribution of railroad employees statewise. We can infer that there are railroad employees in 53 unique states, 'COOK' in IL is the county with the highest number of railroad employees (8207) and there are 145 counties with only 1 railroad employee.
From this grouping analysis based on states, we got to know that TX has the highest number of rail road employees (19839) and AP has the least number of railroad employees (1). And DE has the highest mean number of rail road employees (498.33) and AP has the least mean number of railroad employees (1) in its counties. NJ is the state with the highest median of employees county-wise: 296. We also find that TX is the state with maximum number of entries (mode of number of entries), which is in-turn the number of counties. So TX is the state with the maximum number of counties, which is 221.