Code
library(tidyverse)
library(readxl)
::opts_chunk$set(echo = TRUE) knitr
Shaunak Padhye
May 11, 2023
We will load the required packages.
The “StateCounty2012.xls” dataset will be used for this challenge. The dataset will be loaded using the read_excel() function from the readxl package. When the data was first loaded, it was observed that the first two rows did not contain useful information and hence those two rows were skipped.
STATE ...2 COUNTY ...4 TOTAL
1 AE NA APO NA 2
2 AE Total1 NA <NA> NA 2
3 AK NA ANCHORAGE NA 7
4 AK NA FAIRBANKS NORTH STAR NA 2
5 AK NA JUNEAU NA 3
6 AK NA MATANUSKA-SUSITNA NA 2
We can also observe that the second and the fourth columns contain null values and those columns can be dropped.
This dataset provides information on the number of railroad employees in each county and state for the year 2012. Each row represents a county, and the columns display the state, county, and number of employees. There is no information about the origin of the data.
Now our dataset is ready for further inspection. We can start by creating an initial summary of the data by using the summary() and str() functions. These functions provide some basic statistics for the dataset.
STATE COUNTY TOTAL
Length:2990 Length:2990 Min. : 1.0
Class :character Class :character 1st Qu.: 7.0
Mode :character Mode :character Median : 22.0
Mean : 256.9
3rd Qu.: 71.0
Max. :255432.0
NA's :5
'data.frame': 2990 obs. of 3 variables:
$ STATE : chr "AE" "AE Total1" "AK" "AK" ...
$ COUNTY: chr "APO" NA "ANCHORAGE" "FAIRBANKS NORTH STAR" ...
$ TOTAL : num 2 2 7 2 3 2 1 88 103 102 ...
We can observe that the STATE column has 2990 values. However, the number of states in the United States is much less. We can print out the unique values for this column to check.
[1] "AE"
[2] "AE Total1"
[3] "AK"
[4] "AK Total"
[5] "AL"
[6] "AL Total"
[7] "AP"
[8] "AP Total1"
[9] "AR"
[10] "AR Total"
[11] "AZ"
[12] "AZ Total"
[13] "CA"
[14] "CA Total"
[15] "CO"
[16] "CO Total"
[17] "CT"
[18] "CT Total"
[19] "DC"
[20] "DC Total"
[21] "DE"
[22] "DE Total"
[23] "FL"
[24] "FL Total"
[25] "GA"
[26] "GA Total"
[27] "HI"
[28] "HI Total"
[29] "IA"
[30] "IA Total"
[31] "ID"
[32] "ID Total"
[33] "IL"
[34] "IL Total"
[35] "IN"
[36] "IN Total"
[37] "KS"
[38] "KS Total"
[39] "KY"
[40] "KY Total"
[41] "LA"
[42] "LA Total"
[43] "MA"
[44] "MA Total"
[45] "MD"
[46] "MD Total"
[47] "ME"
[48] "ME Total"
[49] "MI"
[50] "MI Total"
[51] "MN"
[52] "MN Total"
[53] "MO"
[54] "MO Total"
[55] "MS"
[56] "MS Total"
[57] "MT"
[58] "MT Total"
[59] "NC"
[60] "NC Total"
[61] "ND"
[62] "ND Total"
[63] "NE"
[64] "NE Total"
[65] "NH"
[66] "NH Total"
[67] "NJ"
[68] "NJ Total"
[69] "NM"
[70] "NM Total"
[71] "NV"
[72] "NV Total"
[73] "NY"
[74] "NY Total"
[75] "OH"
[76] "OH Total"
[77] "OK"
[78] "OK Total"
[79] "OR"
[80] "OR Total"
[81] "PA"
[82] "PA Total"
[83] "RI"
[84] "RI Total"
[85] "SC"
[86] "SC Total"
[87] "SD"
[88] "SD Total"
[89] "TN"
[90] "TN Total"
[91] "TX"
[92] "TX Total"
[93] "UT"
[94] "UT Total"
[95] "VA"
[96] "VA Total"
[97] "VT"
[98] "VT Total"
[99] "WA"
[100] "WA Total"
[101] "WI"
[102] "WI Total"
[103] "WV"
[104] "WV Total"
[105] "WY"
[106] "WY Total"
[107] "Grand Total"
[108] NA
[109] "CANADA"
[110] "1 Military designation."
[111] "NOTE: Excludes 2,896 employees without an address."
To understand this dataset in more detail lets isolate the state “MA” and check the dataset.
COUNTY TOTAL
1137 BARNSTABLE 44
1138 BERKSHIRE 50
1139 BRISTOL 232
1140 ESSEX 314
1141 FRANKLIN 113
1142 HAMPDEN 202
1143 HAMPSHIRE 68
1144 MIDDLESEX 673
1145 NORFOLK 386
1146 PLYMOUTH 429
1147 SUFFOLK 558
1148 WORCESTER 310
As we can see above, the dataset contains the total for each county in “MA”.
We can also see the total number of emoloyees in “MA” by checking the “MA Total” state.
COUNTY TOTAL
1149 <NA> 3379
Based on this we understand that this dataset contains the total number of employees for each county in each state as well as the total number of employees for each state. We can split the dataset into two. One dataframe will contain the values for the states and their counties, while one dataframe will contain values for the entire states.
STATE COUNTY TOTAL
2 AE Total1 <NA> 2
9 AK Total <NA> 103
77 AL Total <NA> 4257
79 AP Total1 <NA> 1
152 AR Total <NA> 3871
168 AZ Total <NA> 3153
224 CA Total <NA> 13137
282 CO Total <NA> 3650
291 CT Total <NA> 2592
293 DC Total <NA> 279
297 DE Total <NA> 1495
365 FL Total <NA> 7419
518 GA Total <NA> 8605
522 HI Total <NA> 4
622 IA Total <NA> 4019
659 ID Total <NA> 1563
763 IL Total <NA> 19131
856 IN Total <NA> 8537
952 KS Total <NA> 6092
1072 KY Total <NA> 4811
1136 LA Total <NA> 3915
1149 MA Total <NA> 3379
1174 MD Total <NA> 4709
1191 ME Total <NA> 654
1270 MI Total <NA> 3932
1357 MN Total <NA> 5467
1473 MO Total <NA> 8419
1552 MS Total <NA> 2111
1606 MT Total <NA> 3327
1701 NC Total <NA> 3143
1751 ND Total <NA> 2204
1841 NE Total <NA> 13176
1852 NH Total <NA> 393
1874 NJ Total <NA> 8329
1904 NM Total <NA> 1958
1917 NV Total <NA> 746
1979 NY Total <NA> 17050
2068 OH Total <NA> 9056
2142 OK Total <NA> 2318
2176 OR Total <NA> 2322
2242 PA Total <NA> 12769
2248 RI Total <NA> 487
2295 SC Total <NA> 2296
2348 SD Total <NA> 949
2440 TN Total <NA> 4952
2662 TX Total <NA> 19839
2688 UT Total <NA> 1917
2781 VA Total <NA> 7551
2796 VT Total <NA> 259
2836 WA Total <NA> 5222
2906 WI Total <NA> 3773
2960 WV Total <NA> 3213
2983 WY Total <NA> 2876
2984 Grand Total <NA> 255432
STATE COUNTY TOTAL
1 AE APO 2
3 AK ANCHORAGE 7
4 AK FAIRBANKS NORTH STAR 2
5 AK JUNEAU 3
6 AK MATANUSKA-SUSITNA 2
7 AK SITKA 1
Finally, we remove all the null values from both the datasets.
We end up with two datasets: 1) One dataset contains the total number of employees for each county in each state 2) One dataset contains the total number of employees for each state
---
title: "Challenge 1 - Rail roads"
author: "Shaunak Padhye"
desription: "Data Reading and Data Cleaning"
date: "05/11/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_1
- Shaunak Padhye
- railroads
---
## Setup
We will load the required packages.
```{r}
#| label: setup
#| warning: false
library(tidyverse)
library(readxl)
knitr::opts_chunk$set(echo = TRUE)
```
## Reading in data files
The "StateCounty2012.xls" dataset will be used for this challenge. The dataset will be loaded using the read_excel() function from the readxl package. When the data was first loaded, it was observed that the first two rows did not contain useful information and hence those two rows were skipped.
```{r}
#| label: data import
#| warning: false
state_county <- read_excel("_data/StateCounty2012.xls", skip=2)
state_county <- data.frame(state_county)
head(state_county)
```
We can also observe that the second and the fourth columns contain null values and those columns can be dropped.
```{r}
#| label: drop columns
#| warning: false
state_county <- state_county[, -c(2,4)]
head(state_county)
```
## Data Description
This dataset provides information on the number of railroad employees in each county and state for the year 2012. Each row represents a county, and the columns display the state, county, and number of employees. There is no information about the origin of the data.
## Data Cleaning
Now our dataset is ready for further inspection. We can start by creating an initial summary of the data by using the summary() and str() functions. These functions provide some basic statistics for the dataset.
```{r}
summary(state_county)
str(state_county)
```
We can observe that the STATE column has 2990 values. However, the number of states in the United States is much less. We can print out the unique values for this column to check.
```{r}
unique(state_county$STATE)
```
To understand this dataset in more detail lets isolate the state "MA" and check the dataset.
```{r}
subset(state_county, STATE == "MA", select = c(COUNTY,TOTAL))
```
As we can see above, the dataset contains the total for each county in "MA".
We can also see the total number of emoloyees in "MA" by checking the "MA Total" state.
```{r}
subset(state_county, STATE == "MA Total", select = c(COUNTY,TOTAL))
```
Based on this we understand that this dataset contains the total number of employees for each county in each state as well as the total number of employees for each state. We can split the dataset into two. One dataframe will contain the values for the states and their counties, while one dataframe will contain values for the entire states.
```{r}
match_pattern <- grepl("Total",state_county$STATE)
state_total <- state_county[match_pattern, ]
state_total
```
```{r}
state_county <- state_county[!match_pattern, ]
head(state_county)
```
Finally, we remove all the null values from both the datasets.
```{r}
state_total <- na.omit(state_total)
state_county <- na.omit(state_county)
```
## Summary
We end up with two datasets:
1) One dataset contains the total number of employees for each county in each state
2) One dataset contains the total number of employees for each state