Challenge 2
Pradhakshya Dhanakumar
RailRoads
Author

Pradhakshya Dhanakumar

Published

March 2, 2023

Code
library(tidyverse)
options(readr.show_col_types = FALSE)
knitr::opts_chunk$set(echo = TRUE)

Reading Data

Read the data from a .csv file

Code
data <- read_csv("_data/railroad_2012_clean_county.csv")
print(data)
# A tibble: 2,930 × 3
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# … with 2,920 more rows

Dataset Information

Dimensions of the dataset

Code
dim(data)
[1] 2930    3

We can see that the dataset has 2930 rows and 3 columns.

Summary of data variables

The dataset has 2930 rows and 3 columns in total. Using the ‘str’ function, we can get the data type and other information, such as the length and a preview of the contents, for each column. The data has 3 columns: state (character type), county (character type), and total_employees (numeric type). We can see that this data is about the number of railroad employees in different states and counties.

Code
str(data)
spc_tbl_ [2,930 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ state          : chr [1:2930] "AE" "AK" "AK" "AK" ...
 $ county         : chr [1:2930] "APO" "ANCHORAGE" "FAIRBANKS NORTH STAR" "JUNEAU" ...
 $ total_employees: num [1:2930] 2 7 2 3 2 1 88 102 143 1 ...
 - attr(*, "spec")=
  .. cols(
  ..   state = col_character(),
  ..   county = col_character(),
  ..   total_employees = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

Group Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.

Select

Code
# View() opens an interactive viewer and fails when rendering without X11, so print the first rows instead
df <- select(data, state, total_employees)
head(df)
# A tibble: 6 × 2
  state total_employees
  <chr>           <dbl>
1 AE                  2
2 AK                  7
3 AK                  2
4 AK                  3
5 AK                  2
6 AK                  1

We can filter the data down to a single county, here Chilton (in AL), by using the filter() function

Code
df<- data %>% 
  filter(county == "CHILTON")
print(df)
# A tibble: 1 × 3
  state county  total_employees
  <chr> <chr>             <dbl>
1 AL    CHILTON              72
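
The same filter() call can also select every county in one state, or combine several conditions. The sketch below is only an illustration: it rebuilds a small tibble from the AE and AK rows printed earlier (the name sample_rows and the threshold of 5 employees are assumptions for the example, not part of the original analysis).

```r
library(dplyr)

# Rows for AE and AK copied from the printed tibble above
# (a small sample, not the full data set)
sample_rows <- tibble::tribble(
  ~state, ~county,                ~total_employees,
  "AE",   "APO",                  2,
  "AK",   "ANCHORAGE",            7,
  "AK",   "FAIRBANKS NORTH STAR", 2,
  "AK",   "JUNEAU",               3,
  "AK",   "MATANUSKA-SUSITNA",    2,
  "AK",   "SITKA",                1,
  "AK",   "SKAGWAY MUNICIPALITY", 88
)

# All counties in one state
ak_rows <- sample_rows %>%
  filter(state == "AK")

# Conditions separated by commas are combined with AND
ak_large <- sample_rows %>%
  filter(state == "AK", total_employees > 5)

print(ak_large)
```

With these sample rows, the second filter keeps only the AK counties that have more than 5 employees.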

Group By, Mean, Standard Deviation

We can find the average and standard deviation of the employee count for each state by using the group_by(), summarise(), mean() and sd() functions.

Group by State:

Code
data %>% 
  group_by(state) %>% 
  summarise(
    MeanEmployees = mean(total_employees, na.rm = TRUE),
    StandardDeviation = sd(total_employees, na.rm = TRUE)
  )
# A tibble: 53 × 3
   state MeanEmployees StandardDeviation
   <chr>         <dbl>             <dbl>
 1 AE              2                NA  
 2 AK             17.2              34.8
 3 AL             63.5             130. 
 4 AP              1                NA  
 5 AR             53.8             131. 
 6 AZ            210.              228. 
 7 CA            239.              549. 
 8 CO             64.0             128. 
 9 CT            324               520. 
10 DC            279                NA  
# … with 43 more rows
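
The prompt also asks for the median, mode, and min/max/quantile values, which the summary above does not cover. A minimal sketch of how these could be added to the same summarise() call is shown here. It uses only the AE and AK rows copied from the printed data, and stat_mode() is a hypothetical helper written for this example, because base R has no built-in function for the statistical mode.

```r
library(dplyr)

# Rows for AE and AK copied from the printed tibble above
# (a small sample, not the full data set)
sample_rows <- tibble::tribble(
  ~state, ~county,                ~total_employees,
  "AE",   "APO",                  2,
  "AK",   "ANCHORAGE",            7,
  "AK",   "FAIRBANKS NORTH STAR", 2,
  "AK",   "JUNEAU",               3,
  "AK",   "MATANUSKA-SUSITNA",    2,
  "AK",   "SITKA",                1,
  "AK",   "SKAGWAY MUNICIPALITY", 88
)

# Hypothetical helper: returns the most frequent value.
# Ties resolve to the smallest value, because table() sorts the values.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

state_summary <- sample_rows %>%
  group_by(state) %>%
  summarise(
    MedianEmployees = median(total_employees, na.rm = TRUE),
    ModeEmployees   = stat_mode(total_employees),
    MinEmployees    = min(total_employees, na.rm = TRUE),
    MaxEmployees    = max(total_employees, na.rm = TRUE),
    Q25             = quantile(total_employees, 0.25, na.rm = TRUE, names = FALSE),
    Q75             = quantile(total_employees, 0.75, na.rm = TRUE, names = FALSE),
    .groups = "drop"
  )

print(state_summary)
```

For AK the median (2.5) is far below the mean (17.2), which suggests that one large county, Skagway Municipality with 88 employees, pulls the mean up.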

Group by County:

Code
data %>% 
  group_by(county) %>% 
  summarise(
    MeanEmployees = mean(total_employees, na.rm = TRUE),
    StandardDeviation = sd(total_employees, na.rm = TRUE)
  )
# A tibble: 1,709 × 3
   county    MeanEmployees StandardDeviation
   <chr>             <dbl>             <dbl>
 1 ABBEVILLE        124                NA   
 2 ACADIA            13                NA   
 3 ACCOMACK           4                NA   
 4 ADA               81                NA   
 5 ADAIR              7.25              9.32
 6 ADAMS             73.2             155.  
 7 ADDISON            8                NA   
 8 AIKEN            193                NA   
 9 AITKIN            19                NA   
10 ALACHUA           22                NA   
# … with 1,699 more rows
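
Note that the county summary has 1,709 groups while the data has 2,930 rows, so some county names (for example ADAIR and ADAMS, which have a standard deviation) appear in more than one row, presumably in different states. Grouping by county name alone pools those rows together. The sketch below shows the difference between grouping by county and grouping by state and county. The rows are toy values with placeholder names (S1, S2, ALPHA, BETA), made up only to show the mechanics.

```r
library(dplyr)

# Toy rows with made-up values: the county name ALPHA exists in two states
toy <- tibble::tribble(
  ~state, ~county, ~total_employees,
  "S1",   "ALPHA", 10,
  "S2",   "ALPHA", 30,
  "S1",   "BETA",  5
)

# Grouping by county name alone pools ALPHA from both states
by_name <- toy %>%
  group_by(county) %>%
  summarise(Rows = n(), MeanEmployees = mean(total_employees))

# Grouping by state and county keeps each county separate
by_state_county <- toy %>%
  group_by(state, county) %>%
  summarise(Rows = n(), MeanEmployees = mean(total_employees), .groups = "drop")

print(by_name)
print(by_state_county)
```

If each state and county pair occurs only once in the real data, the second grouping gives one row per county, and every standard deviation computed on it would be NA.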

We can count the number of counties in each state using the group_by() and summarise() functions together with n()

Code
data %>% 
  group_by(state) %>% 
  summarise(CountOfCounty = n())
# A tibble: 53 × 2
   state CountOfCounty
   <chr>         <int>
 1 AE                1
 2 AK                6
 3 AL               67
 4 AP                1
 5 AR               72
 6 AZ               15
 7 CA               55
 8 CO               57
 9 CT                8
10 DC                1
# … with 43 more rows
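
dplyr also provides count(), which combines the group_by() and summarise(n()) steps into one call. A short sketch, again using only the AE and AK rows copied from the printed data:

```r
library(dplyr)

# Rows for AE and AK copied from the printed tibble above
# (a small sample, not the full data set)
sample_rows <- tibble::tribble(
  ~state, ~county,                ~total_employees,
  "AE",   "APO",                  2,
  "AK",   "ANCHORAGE",            7,
  "AK",   "FAIRBANKS NORTH STAR", 2,
  "AK",   "JUNEAU",               3,
  "AK",   "MATANUSKA-SUSITNA",    2,
  "AK",   "SITKA",                1,
  "AK",   "SKAGWAY MUNICIPALITY", 88
)

# count() groups by state and counts the rows in each group;
# the name argument sets the name of the count column
county_counts <- sample_rows %>%
  count(state, name = "CountOfCounty")

print(county_counts)
```

For these two states the counts are 1 for AE and 6 for AK, the same values as in the table above.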

Interpretation

From the summaries above, we can see that not all states have multiple counties in the data. There are states like AE, AP and DC with just 1 county. The standard deviation of a single value is undefined, so sd() returns NA for these states, and also for county names that appear in only one row.
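
This behaviour of sd() can be checked directly. The sketch below uses the single DC value (279) and the six AK values taken from the outputs above. The variable names are chosen only for this example.

```r
# DC has a single row, so its mean equals that row's value (279)
dc_values <- c(279)

# The six AK values from the printed data
ak_values <- c(7, 2, 3, 2, 1, 88)

# The sample standard deviation divides by (n - 1), so it is NA when n = 1
sd_dc <- sd(dc_values)
sd_ak <- sd(ak_values)

print(sd_dc)
print(round(sd_ak, 1))
```

The AK result rounds to 34.8, matching the grouped summary, while the single-value case gives NA.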