library(tidyverse)
options(readr.show_col_types = FALSE)
knitr::opts_chunk$set(echo = TRUE)
Pradhakshya Dhanakumar
March 2, 2023
Read the data from a .csv file
# A tibble: 2,930 × 3
state county total_employees
<chr> <chr> <dbl>
1 AE APO 2
2 AK ANCHORAGE 7
3 AK FAIRBANKS NORTH STAR 2
4 AK JUNEAU 3
5 AK MATANUSKA-SUSITNA 2
6 AK SITKA 1
7 AK SKAGWAY MUNICIPALITY 88
8 AL AUTAUGA 102
9 AL BALDWIN 143
10 AL BARBOUR 1
# … with 2,920 more rows
Dimensions of the dataset
We can see that the dataset has 2930 rows and 3 columns.
Summary of data variables
Using the ‘str’ function, we can get the data type and other information, such as the length and contents, for each column. The output confirms that the dataset has 2930 rows and 3 columns: state (character), county (character), and total_employees (numeric). The data describes the number of railroad employees in each state and county.
spc_tbl_ [2,930 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ state : chr [1:2930] "AE" "AK" "AK" "AK" ...
$ county : chr [1:2930] "APO" "ANCHORAGE" "FAIRBANKS NORTH STAR" "JUNEAU" ...
$ total_employees: num [1:2930] 2 7 2 3 2 1 88 102 143 1 ...
- attr(*, "spec")=
.. cols(
.. state = col_character(),
.. county = col_character(),
.. total_employees = col_double()
.. )
- attr(*, "problems")=<externalptr>
Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.
Select
We can filter the data down to CHILTON county, which is in AL, by using the filter() function.
# A tibble: 1 × 3
state county total_employees
<chr> <chr> <dbl>
1 AL CHILTON 72
Group By, Mean, Standard Deviation
We can find the average and standard deviation of the employee count for each state, and for each county, by using the group_by(), summarise(), mean() and sd() functions.
Group by State:
# A tibble: 53 × 3
state MeanEmployees StandardDeviation
<chr> <dbl> <dbl>
1 AE 2 NA
2 AK 17.2 34.8
3 AL 63.5 130.
4 AP 1 NA
5 AR 53.8 131.
6 AZ 210. 228.
7 CA 239. 549.
8 CO 64.0 128.
9 CT 324 520.
10 DC 279 NA
# … with 43 more rows
Group by County:
# A tibble: 1,709 × 3
county MeanEmployees StandardDeviation
<chr> <dbl> <dbl>
1 ABBEVILLE 124 NA
2 ACADIA 13 NA
3 ACCOMACK 4 NA
4 ADA 81 NA
5 ADAIR 7.25 9.32
6 ADAMS 73.2 155.
7 ADDISON 8 NA
8 AIKEN 193 NA
9 AITKIN 19 NA
10 ALACHUA 22 NA
# … with 1,699 more rows
We can count the number of counties in each state using the group_by() and summarise() functions with n().
From the summaries above, we can see that not all states have multiple counties. Some states, such as AE, AP and DC, have just 1 county. The standard deviation of a single value is undefined, so sd() returns NA for those states and for any county name that appears in only one row.
---
title: "Challenge 2"
author: "Pradhakshya Dhanakumar"
description: "Worked with Rail Roads Data"
date: "03/02/2023"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
- Challenge 2
- Pradhakshya Dhanakumar
- RailRoads
---
```{r}
#| label: setup
#| warning: false
library(tidyverse)
options(readr.show_col_types = FALSE)
knitr::opts_chunk$set(echo = TRUE)
```
## Reading Data
Read the data from a .csv file
```{r}
data <- read_csv("_data/railroad_2012_clean_county.csv")
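# show_col_types is a read_csv() argument, not a print() argument;
# the readr.show_col_types option set above already silences the message
print(data)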
```
## Dataset Information
Dimensions of the dataset
```{r}
dim(data)
```
We can see that the dataset has 2930 rows and 3 columns.
Summary of data variables
Using the 'str' function, we can get the data type and other information, such as the length and contents, for each column. The output confirms that the dataset has 2930 rows and 3 columns: state (character), county (character), and total_employees (numeric). The data describes the number of railroad employees in each state and county.
```{r}
str(data)
```
## Group Summary Statistics
Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.
Select
```{r}
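# Keep only the state and employee-count columns
df <- select(data, state, total_employees)
# View() needs an interactive viewer and fails when rendering; head() prints the first rows instead
head(df)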
```
We can filter the data down to CHILTON county, which is in AL, by using the filter() function.
```{r}
df <- data %>%
  filter(county == "CHILTON")
print(df)
```
Group By, Mean, Standard Deviation
We can find the average and standard deviation of the employee count for each state, and for each county, by using the group_by(), summarise(), mean() and sd() functions.
Group by State:
```{r}
data %>%
  group_by(state) %>%
  summarise(
    MeanEmployees = mean(total_employees, na.rm = TRUE),
    StandardDeviation = sd(total_employees, na.rm = TRUE)
  )
```
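The task description also asks for the median and for dispersion measures beyond the standard deviation. The following is a minimal sketch of how those could be added with the same group_by() and summarise() pattern. `toy` is a hypothetical six-row stand-in built from rows shown at the top of the data, and the output column names (MedianEmployees, MinEmployees, MaxEmployees, Q1, Q3) are illustrative choices, not part of the original analysis.

```r
library(dplyr)

# Hypothetical stand-in: six rows taken from the first rows of the railroad data
toy <- tibble(
  state = c("AE", "AK", "AK", "AK", "AL", "AL"),
  county = c("APO", "ANCHORAGE", "JUNEAU", "SITKA", "AUTAUGA", "BALDWIN"),
  total_employees = c(2, 7, 3, 1, 102, 143)
)

# Median, range and quartiles of the employee count for each state
state_stats <- toy %>%
  group_by(state) %>%
  summarise(
    MedianEmployees = median(total_employees),
    MinEmployees = min(total_employees),
    MaxEmployees = max(total_employees),
    # names = FALSE drops the "25%" / "75%" labels that quantile() attaches
    Q1 = quantile(total_employees, 0.25, names = FALSE),
    Q3 = quantile(total_employees, 0.75, names = FALSE)
  )

print(state_stats)
```

In the document itself, `toy` would be replaced by `data` to run the same summary on all 2930 rows.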
Group by County:
```{r}
data %>%
  group_by(county) %>%
  summarise(
    MeanEmployees = mean(total_employees, na.rm = TRUE),
    StandardDeviation = sd(total_employees, na.rm = TRUE)
  )
```
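The mode is also part of the task, but base R has no built-in function for the statistical mode (`mode()` reports an object's storage type instead). One possible approach is a small helper like the one sketched here. The name `stat_mode` is hypothetical, and when several values tie for most frequent it returns only the one that appears first, which is an assumption about the desired behaviour.

```r
# Hypothetical helper: most frequent non-missing value of x
# (on ties, the value that appears first in x is returned)
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  # match() maps each value to its position in ux; tabulate() counts those positions
  ux[which.max(tabulate(match(x, ux)))]
}

# The first ten total_employees values shown in the str() output
first_ten <- c(2, 7, 2, 3, 2, 1, 88, 102, 143, 1)
stat_mode(first_ten)
```

Because the helper returns a single value, it could be used inside summarise() in the same way as mean() or sd().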
We can count the number of counties in each state using the group_by() and summarise() functions with n().
```{r}
data %>%
  group_by(state) %>%
  summarise(CountOfCounty = n())
```
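dplyr also provides count(), which is shorthand for the group_by() plus summarise(n()) pattern. The sketch below compares the two forms on a hypothetical six-row stand-in (`toy`) built from rows shown at the top of the data; the `name` argument is used only to give the count column the same CountOfCounty label.

```r
library(dplyr)

# Hypothetical stand-in: six rows taken from the first rows of the railroad data
toy <- tibble(
  state = c("AE", "AK", "AK", "AK", "AL", "AL"),
  county = c("APO", "ANCHORAGE", "JUNEAU", "SITKA", "AUTAUGA", "BALDWIN")
)

# Long form: group, then count the rows in each group
by_summarise <- toy %>%
  group_by(state) %>%
  summarise(CountOfCounty = n())

# Short form: count() groups and tallies in one step
by_count <- toy %>%
  count(state, name = "CountOfCounty")

print(by_count)
```

Each row of the data is one county, so the row count per state is the county count per state.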
## Interpretation
From the summaries above, we can see that not all states have multiple counties. Some states, such as AE, AP and DC, have just 1 county. The standard deviation of a single value is undefined, so sd() returns NA for those states and for any county name that appears in only one row.
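The NA values can be reproduced on a small scale. This is a minimal sketch, assuming a hypothetical six-row stand-in (`toy`) built from rows shown at the top of the data; the column names n_counties and sd_employees are illustrative.

```r
library(dplyr)

# sd() needs at least two values; with one value it returns NA
single_value_sd <- sd(2)
is.na(single_value_sd)

# Hypothetical stand-in: six rows taken from the first rows of the railroad data
toy <- tibble(
  state = c("AE", "AK", "AK", "AK", "AL", "AL"),
  total_employees = c(2, 7, 3, 1, 102, 143)
)

# Reporting the group size next to the standard deviation shows
# that NA appears exactly where a group has a single row
sd_by_state <- toy %>%
  group_by(state) %>%
  summarise(
    n_counties = n(),
    sd_employees = sd(total_employees)
  )

print(sd_by_state)
```

If the single-county states are not of interest, adding `filter(n_counties > 1)` after the summarise() step would drop them.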