Challenge 2
Pradhakshya Dhanakumar

March 2, 2023

library(tidyverse)
options(readr.show_col_types = FALSE)
knitr::opts_chunk$set(echo = TRUE)

Reading Data

Read the data from a .csv file

data <- read_csv("_data/railroad_2012_clean_county.csv")
print(data)
# A tibble: 2,930 × 3
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# … with 2,920 more rows

Dataset Information

Dimensions of the dataset

dim(data)
[1] 2930    3

We can see that the dataset has 2,930 rows and 3 columns.

Summary of data variables

The data has 3 columns: state (character), county (character), and total_employees (numeric). Each row gives the number of railroad employees in one county of one state.

str(data)
spc_tbl_ [2,930 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ state          : chr [1:2930] "AE" "AK" "AK" "AK" ...
 $ county         : chr [1:2930] "APO" "ANCHORAGE" "FAIRBANKS NORTH STAR" "JUNEAU" ...
 $ total_employees: num [1:2930] 2 7 2 3 2 1 88 102 143 1 ...
 - attr(*, "spec")=
  .. cols(
  ..   state = col_character(),
  ..   county = col_character(),
  ..   total_employees = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

Group Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.


df <- select(data, state, total_employees)

We can filter the data down to Chilton County, AL, by using the filter() function.

df<- data %>% 
  filter(county == "CHILTON")
# A tibble: 1 × 3
  state county  total_employees
  <chr> <chr>             <dbl>
1 AL    CHILTON              72
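
The filter above matches on the county name only. County names can repeat across states, so a minimal sketch of combining both conditions may be useful. The data frame `toy` and all of its values are hypothetical and are not taken from the railroad file.

```r
library(dplyr)

# Hypothetical toy rows (not from the railroad file): the county name
# "ADAMS" appears under two different states.
toy <- data.frame(
  state = c("CO", "IL", "OH"),
  county = c("ADAMS", "ADAMS", "BROWN"),
  total_employees = c(10, 20, 30)
)

# Filtering on the county name alone returns every state's "ADAMS" row.
by_name <- toy %>%
  filter(county == "ADAMS")

# Adding a state condition narrows the result to a single county.
one_county <- toy %>%
  filter(state == "IL", county == "ADAMS")

print(by_name)
print(one_county)
```

Passing several conditions to filter() combines them with a logical AND, so only rows that satisfy both are kept.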

Group By, Mean, Standard Deviation

We can find the mean and standard deviation of the employee count for each state by using group_by() together with summarise(), mean(), and sd().

Group by State:

data %>% 
  group_by(state) %>% 
  summarise(
    MeanEmployees = mean(total_employees, na.rm = TRUE),
    StandardDeviation = sd(total_employees, na.rm = TRUE)
  )
# A tibble: 53 × 3
   state MeanEmployees StandardDeviation
   <chr>         <dbl>             <dbl>
 1 AE              2                NA  
 2 AK             17.2              34.8
 3 AL             63.5             130. 
 4 AP              1                NA  
 5 AR             53.8             131. 
 6 AZ            210.              228. 
 7 CA            239.              549. 
 8 CO             64.0             128. 
 9 CT            324               520. 
10 DC            279                NA  
# … with 43 more rows
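
The same group_by() and summarise() pattern can, in principle, cover the other statistics named in the prompt (median, min, max, quantiles). The following is only a sketch on a hypothetical data frame `toy` with made-up state codes and values; the output column names are illustrative choices.

```r
library(dplyr)

# Hypothetical toy data (not from the railroad file).
toy <- data.frame(
  state = c("AA", "AA", "AA", "BB", "BB"),
  total_employees = c(1, 3, 8, 10, 30)
)

# One row per state with additional central tendency and dispersion measures.
toy_stats <- toy %>%
  group_by(state) %>%
  summarise(
    MedianEmployees = median(total_employees, na.rm = TRUE),
    MinEmployees = min(total_employees, na.rm = TRUE),
    MaxEmployees = max(total_employees, na.rm = TRUE),
    # names = FALSE drops the "25%" / "75%" labels so each value is a plain number
    Q1 = quantile(total_employees, 0.25, na.rm = TRUE, names = FALSE),
    Q3 = quantile(total_employees, 0.75, na.rm = TRUE, names = FALSE)
  )

print(toy_stats)
```

Replacing `toy` with `data` would apply the same summary to the railroad counts.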

Group by County:

data %>% 
  group_by(county) %>% 
  summarise(
    MeanEmployees = mean(total_employees, na.rm = TRUE),
    StandardDeviation = sd(total_employees, na.rm = TRUE)
  )
# A tibble: 1,709 × 3
   county    MeanEmployees StandardDeviation
   <chr>             <dbl>             <dbl>
 1 ABBEVILLE        124                NA   
 2 ACADIA            13                NA   
 3 ACCOMACK           4                NA   
 4 ADA               81                NA   
 5 ADAIR              7.25              9.32
 6 ADAMS             73.2             155.  
 7 ADDISON            8                NA   
 8 AIKEN            193                NA   
 9 AITKIN            19                NA   
10 ALACHUA           22                NA   
# … with 1,699 more rows

We can count the number of counties in each state by using group_by() with summarise() and n().

data %>% 
  group_by(state) %>% 
  summarise(CountOfCounty = n())
# A tibble: 53 × 2
   state CountOfCounty
   <chr>         <int>
 1 AE                1
 2 AK                6
 3 AL               67
 4 AP                1
 5 AR               72
 6 AZ               15
 7 CA               55
 8 CO               57
 9 CT                8
10 DC                1
# … with 43 more rows
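
dplyr also provides count(), which wraps the group-and-tally step into one call. A minimal sketch on a hypothetical data frame `toy` (values made up for illustration) compares the two forms:

```r
library(dplyr)

# Hypothetical toy data (not from the railroad file).
toy <- data.frame(
  state = c("AA", "AA", "BB"),
  county = c("X", "Y", "Z"),
  total_employees = c(5, 7, 9)
)

# Explicit form: group, then count the rows in each group with n().
via_summarise <- toy %>%
  group_by(state) %>%
  summarise(CountOfCounty = n())

# Shorthand: count() groups by state and tallies rows in one step.
via_count <- toy %>%
  count(state, name = "CountOfCounty")

print(via_summarise)
print(via_count)
```

Both forms give one row per state, so the choice is mostly a matter of readability.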


From the counts above, we can see that not all states have multiple counties in the data. States such as AE, AP, and DC have just 1 county. The standard deviation is undefined for a single observation, so sd() returns NA for those states, and likewise for county names that appear only once.
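
The missing standard deviations can be reproduced in isolation. R's sd() uses the sample formula with n - 1 in the denominator, which is zero when there is only one value. The vectors below are hypothetical and serve only to illustrate the behaviour.

```r
# Hypothetical values, chosen only for illustration.
one_value <- c(5)
two_values <- c(5, 7)

# A single observation has no sample standard deviation, so sd() gives NA.
sd_one <- sd(one_value)

# With two or more observations, sd() returns a number.
sd_two <- sd(two_values)

print(is.na(sd_one))
print(sd_two)
```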