Challenge 2: Data Wrangling

challenge2
dataset
Data Wrangling
Data wrangling: summary statistics of dataset
Author

Surya Praneeth Reddy Chirasani

Published

January 6, 2023

Code
library(tidyverse)
library("readxl")
knitr::opts_chunk$set(echo = TRUE)
Code
railroad_employment <-read_excel("_data/StateCounty2012.xls",
                     skip = 4,
                     col_names= c("state", "2nd_column",  "county",
                                  "4th_column", "employees"))%>%
  select(!contains("column"))%>%
  filter(!str_detect(state, "Total"))
head(railroad_employment)
# A tibble: 6 × 3
  state county               employees
  <chr> <chr>                    <dbl>
1 AE    APO                          2
2 AK    ANCHORAGE                    7
3 AK    FAIRBANKS NORTH STAR         2
4 AK    JUNEAU                       3
5 AK    MATANUSKA-SUSITNA            2
6 AK    SITKA                        1

The dataset has some intial rows that are not relevant for analysis to form a dataframe which I have skipped and remaining rows can be explained in 3 columns with State, county and employees. Each row has information on total number of employees working in a particular state and a particular county in that county. I have deleted rows which has information of total number of employees in a given state, which is redundant and can be calculated by summary statistics analysis

Code
railroad_employment<-head(railroad_employment, -2)
tail(railroad_employment)
# A tibble: 6 × 3
  state  county     employees
  <chr>  <chr>          <dbl>
1 WY     SUBLETTE           3
2 WY     SWEETWATER       196
3 WY     UINTA             49
4 WY     WASHAKIE          10
5 WY     WESTON            37
6 CANADA <NA>             662
Code
railroad_employment <- mutate(railroad_employment, county = ifelse(state=="CANADA", "CANADA", county))
tail(railroad_employment)
# A tibble: 6 × 3
  state  county     employees
  <chr>  <chr>          <dbl>
1 WY     SUBLETTE           3
2 WY     SWEETWATER       196
3 WY     UINTA             49
4 WY     WASHAKIE          10
5 WY     WESTON            37
6 CANADA CANADA           662
Code
railroad_employment %>% select(state, county) %>% summarise_all(list(~n_distinct(.)))
# A tibble: 1 × 2
  state county
  <int>  <int>
1    54   1710

There are total of 54 states of which 50 are US states, 1 federal district(Washington DC), 2 military designation places and last one is Canada. Total counties are 1710 including those of Canada and minitary designation places

Code
sum(railroad_employment[, "employees"])
[1] 256094

Total number of railway employees employed are 256094 which is about 0.08% of the total population of US(2011 Census Total population of 310 million)

Code
railroad_employment %>%
  group_by(state) %>%
  summarise(total_employees = sum(employees),
            num_counties = n())%>%
  arrange(desc(total_employees))
# A tibble: 54 × 3
   state total_employees num_counties
   <chr>           <dbl>        <int>
 1 TX              19839          221
 2 IL              19131          103
 3 NY              17050           61
 4 NE              13176           89
 5 CA              13137           55
 6 PA              12769           65
 7 OH               9056           88
 8 GA               8605          152
 9 IN               8537           92
10 MO               8419          115
# … with 44 more rows

The top 5 states with highest employees are states which has geographically large area and large number of counties which makes sense since you need more employees to canter to the area and the population