Code
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Justine Shakespeare
February 15, 2023
First we’ll load the tidyverse and readxl packages and use the read_excel() function to read in the dataset. When I first loaded this dataset I noted that the data started on the third row, so I’ll specify that R should skip the first two lines when I import the data.
This dataset contains information about the number of railroad employees by county and state in 2012. In these data, rows are individual counties and columns are the state, country, and number of employees. It is not clear from the dataset itself where this data is from/who published it.
Before summarizing the data, we’ll tidy it up a bit. Using the head() function we can see that the dataset includes 5 columns, but data is stored in only three.
# A tibble: 6 × 5
STATE ...2 COUNTY ...4 TOTAL
<chr> <lgl> <chr> <lgl> <dbl>
1 AE NA APO NA 2
2 AE Total1 NA <NA> NA 2
3 AK NA ANCHORAGE NA 7
4 AK NA FAIRBANKS NORTH STAR NA 2
5 AK NA JUNEAU NA 3
6 AK NA MATANUSKA-SUSITNA NA 2
We can use the select() function to isolate only the columns with data and assign these to a new dataframe in R. We’ll use head()
to check our work.
# A tibble: 6 × 3
STATE COUNTY TOTAL
<chr> <chr> <dbl>
1 AE APO 2
2 AE Total1 <NA> 2
3 AK ANCHORAGE 7
4 AK FAIRBANKS NORTH STAR 2
5 AK JUNEAU 3
6 AK MATANUSKA-SUSITNA 2
Using the n_distinct()
function we can see that there are 111 distinct values in the STATE column. Since we know there are 50 states in the US, this indicates there are some additional values that we may not have expected or want to keep for analysis.
We can use piping and the functions group_by()
, summarize()
, and print()
to inspect the values of the STATE column more closely to get an idea of what is stored here.
# A tibble: 111 × 2
STATE `mean(TOTAL)`
<chr> <dbl>
1 1 Military designation. NA
2 AE 2
3 AE Total1 2
4 AK 17.2
5 AK Total 103
6 AL 63.5
7 AL Total 4257
8 AP 1
9 AP Total1 1
10 AR 53.8
11 AR Total 3871
12 AZ 210.
13 AZ Total 3153
14 CA 239.
15 CA Total 13137
16 CANADA 662
17 CO 64.0
18 CO Total 3650
19 CT 324
20 CT Total 2592
21 DC 279
22 DC Total 279
23 DE 498.
24 DE Total 1495
25 FL 111.
26 FL Total 7419
27 GA 56.6
28 GA Total 8605
29 Grand Total 255432
30 HI 1.33
31 HI Total 4
32 IA 40.6
33 IA Total 4019
34 ID 43.4
35 ID Total 1563
36 IL 186.
37 IL Total 19131
38 IN 92.8
39 IN Total 8537
40 KS 64.1
41 KS Total 6092
42 KY 40.4
43 KY Total 4811
44 LA 62.1
45 LA Total 3915
46 MA 282.
47 MA Total 3379
48 MD 196.
49 MD Total 4709
50 ME 40.9
51 ME Total 654
52 MI 50.4
53 MI Total 3932
54 MN 63.6
55 MN Total 5467
56 MO 73.2
57 MO Total 8419
58 MS 27.1
59 MS Total 2111
60 MT 62.8
61 MT Total 3327
62 NC 33.4
63 NC Total 3143
64 ND 45.0
65 ND Total 2204
66 NE 148.
67 NE Total 13176
68 NH 39.3
69 NH Total 393
70 NJ 397.
71 NJ Total 8329
72 NM 67.5
73 NM Total 1958
74 NOTE: Excludes 2,896 employees without an address. NA
75 NV 62.2
76 NV Total 746
77 NY 280.
78 NY Total 17050
79 OH 103.
80 OH Total 9056
81 OK 31.8
82 OK Total 2318
83 OR 70.4
84 OR Total 2322
85 PA 196.
86 PA Total 12769
87 RI 97.4
88 RI Total 487
89 SC 49.9
90 SC Total 2296
91 SD 18.2
92 SD Total 949
93 TN 54.4
94 TN Total 4952
95 TX 89.8
96 TX Total 19839
97 UT 76.7
98 UT Total 1917
99 VA 82.1
100 VA Total 7551
101 VT 18.5
102 VT Total 259
103 WA 134.
104 WA Total 5222
105 WI 54.7
106 WI Total 3773
107 WV 60.6
108 WV Total 3213
109 WY 131.
110 WY Total 2876
111 <NA> NA
There are a few issues we see here: It looks as if there are Total rows in this dataset - that is, there are rows that add up the total for each state. We don’t need those in our data, since we can use R functions to total employees by state. It also looks as if CANADA is listed as a state. Assuming we are only interested in the US, let’s remove that as well.
We can use the filter()
and grepl()
function to remove both the total rows and Canada from this dataset.
Finally, we noticed we had some NAs in both the STATE and the TOTAL columns. We can use the na.omit()
function to remove these from the dataset.
# A tibble: 2,930 × 3
STATE COUNTY TOTAL
<chr> <chr> <dbl>
1 AE APO 2
2 AK ANCHORAGE 7
3 AK FAIRBANKS NORTH STAR 2
4 AK JUNEAU 3
5 AK MATANUSKA-SUSITNA 2
6 AK SITKA 1
7 AK SKAGWAY MUNICIPALITY 88
8 AL AUTAUGA 102
9 AL BALDWIN 143
10 AL BARBOUR 1
# … with 2,920 more rows
Let’s run the same code chunk from earlier to check that our edits have been made and the data is clean.
# A tibble: 53 × 2
STATE `mean(TOTAL)`
<chr> <dbl>
1 AE 2
2 AK 17.2
3 AL 63.5
4 AP 1
5 AR 53.8
6 AZ 210.
7 CA 239.
8 CO 64.0
9 CT 324
10 DC 279
11 DE 498.
12 FL 111.
13 GA 56.6
14 HI 1.33
15 IA 40.6
16 ID 43.4
17 IL 186.
18 IN 92.8
19 KS 64.1
20 KY 40.4
21 LA 62.1
22 MA 282.
23 MD 196.
24 ME 40.9
25 MI 50.4
26 MN 63.6
27 MO 73.2
28 MS 27.1
29 MT 62.8
30 NC 33.4
31 ND 45.0
32 NE 148.
33 NH 39.3
34 NJ 397.
35 NM 67.5
36 NV 62.2
37 NY 280.
38 OH 103.
39 OK 31.8
40 OR 70.4
41 PA 196.
42 RI 97.4
43 SC 49.9
44 SD 18.2
45 TN 54.4
46 TX 89.8
47 UT 76.7
48 VA 82.1
49 VT 18.5
50 WA 134.
51 WI 54.7
52 WV 60.6
53 WY 131.
This is more like it! Now we have only 53 distinct values listed in the STATE columns: the 50 states, Washington DC (listed as DC), and two overseas military designations (listed as AE and AP)
Let’s move on to summarizing the data.
Using the dim()
function we can see below that the dataframe has 2,930 rows/observations and 3 columns/variables.
We can use the head()
function to see more information about this dataset, including the column names, and the type of data stored in each column. The first two columns, STATE and COUNTY, are character variables, while the last column, TOTAL, is a double variable, which means it contains numeric data.
# A tibble: 6 × 3
STATE COUNTY TOTAL
<chr> <chr> <dbl>
1 AE APO 2
2 AK ANCHORAGE 7
3 AK FAIRBANKS NORTH STAR 2
4 AK JUNEAU 3
5 AK MATANUSKA-SUSITNA 2
6 AK SITKA 1
Using the summary()
function we can see that the minimum number of employees in county is 1, the maximum is 8207, the median is 21 employees, and the mean is a little over 87 employees.
STATE COUNTY TOTAL
Length:2930 Length:2930 Min. : 1.00
Class :character Class :character 1st Qu.: 7.00
Mode :character Mode :character Median : 21.00
Mean : 87.18
3rd Qu.: 65.00
Max. :8207.00
Let’s take a closer look at this dataset to get a sense of which county and which state each have the most railraod employees. We can use piping and the arrange()
function in R to see which counties have the most railroad employees:
# A tibble: 2,930 × 3
STATE COUNTY TOTAL
<chr> <chr> <dbl>
1 IL COOK 8207
2 TX TARRANT 4235
3 NE DOUGLAS 3797
4 NY SUFFOLK 3685
5 VA INDEPENDENT CITY 3249
6 FL DUVAL 3073
7 CA SAN BERNARDINO 2888
8 CA LOS ANGELES 2545
9 TX HARRIS 2535
10 NE LINCOLN 2289
# … with 2,920 more rows
This shows that Cook County in Illinois has the most railroad employees. Given that the major city, Chicago is located in Cook County, this is not surprising. It is interesting that Cook county has almost twice as many railroad employees as the next county on the list, Tarrant county in Texas.
This data can also shed light on which states have the most railroad employees. We can use piping and the group_by()
, summarize()
, sum()
and arrange()
functions to show the states with the most railroad employees:
# A tibble: 53 × 2
STATE total_state_employees
<chr> <dbl>
1 TX 19839
2 IL 19131
3 NY 17050
4 NE 13176
5 CA 13137
6 PA 12769
7 OH 9056
8 GA 8605
9 IN 8537
10 MO 8419
# … with 43 more rows
This shows that Texas is the state with the most railroad employees.
---
title: "Challenge 1 - Railroad Employment"
author: "Justine Shakespeare"
desription: "Analysis of 2012 Railroad Employment Data"
date: "02/15/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_1
- railroads
- Justine Shakespeare
---
```{r}
#| label: setup
#| warning: false
#| message: false
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Reading in the Data
First we'll load the tidyverse and readxl packages and use the read_excel() function to read in the dataset. When I first loaded this dataset I noted that the data started on the third row, so I'll specify that R should skip the first two lines when I import the data.
```{r}
library(tidyverse)
library(readxl)
RailroadEmployment <- read_excel("_data/StateCounty2012.xls", skip=2)
```
## Data Description
This dataset contains information about the number of railroad employees by county and state in 2012. In these data, rows are individual counties and columns are the state, country, and number of employees. It is not clear from the dataset itself where this data is from/who published it.
## Data Cleaning
Before summarizing the data, we'll tidy it up a bit. Using the head() function we can see that the dataset includes 5 columns, but data is stored in only three.
```{r}
head(RailroadEmployment)
```
We can use the select() function to isolate only the columns with data and assign these to a new dataframe in R. We'll use `head()` to check our work.
```{r}
RailroadEmployment <- select(RailroadEmployment, STATE, COUNTY, TOTAL)
head(RailroadEmployment)
```
Using the `n_distinct()` function we can see that there are 111 distinct values in the STATE column. Since we know there are 50 states in the US, this indicates there are some additional values that we may not have expected or want to keep for analysis.
```{r}
RailroadEmployment %>%
select(STATE)%>%
n_distinct(.)
```
We can use piping and the functions `group_by()`, `summarize()`, and `print()` to inspect the values of the **STATE** column more closely to get an idea of what is stored here.
```{r}
RailroadEmployment %>%
group_by(STATE) %>%
summarize(mean(TOTAL)) %>%
print(n=111)
```
There are a few issues we see here: It looks as if there are *Total* rows in this dataset - that is, there are rows that add up the total for each state. We don't need those in our data, since we can use R functions to total employees by state. It also looks as if **CANADA** is listed as a state. Assuming we are only interested in the US, let's remove that as well.
We can use the `filter()` and `grepl()` function to remove both the total rows and Canada from this dataset.
```{r}
RailroadEmployment <- RailroadEmployment %>%
filter(!grepl("Total", STATE))
RailroadEmployment <- RailroadEmployment %>%
filter(!grepl("CANADA", STATE))
```
Finally, we noticed we had some NAs in both the **STATE** and the **TOTAL** columns. We can use the `na.omit()` function to remove these from the dataset.
```{r}
RailroadEmployment <- print(na.omit(RailroadEmployment))
```
Let's run the same code chunk from earlier to check that our edits have been made and the data is clean.
```{r}
RailroadEmployment %>%
group_by(STATE) %>%
summarize(mean(TOTAL)) %>%
print(n=111)
```
This is more like it! Now we have only 53 distinct values listed in the **STATE** columns: the 50 states, Washington DC (listed as DC), and two overseas military designations (listed as AE and AP)
Let's move on to summarizing the data.
------------------------------------------------------------------------
## Data Summary
Using the `dim()` function we can see below that the dataframe has 2,930 rows/observations and 3 columns/variables.
```{r}
dim(RailroadEmployment)
```
We can use the `head()` function to see more information about this dataset, including the column names, and the type of data stored in each column. The first two columns, **STATE** and **COUNTY**, are character variables, while the last column, **TOTAL**, is a double variable, which means it contains numeric data.
```{r}
head(RailroadEmployment)
```
Using the `summary()` function we can see that the **minimum number of employees in county is 1**, the **maximum is 8207**, the **median is 21 employees**, and the **mean is a little over 87 employees**.
```{r}
summary(RailroadEmployment)
```
Let's take a closer look at this dataset to get a sense of which **county** and which **state** each have the most railraod employees. We can use piping and the `arrange()` function in R to see which **counties** have the most railroad employees:
```{r}
RailroadEmployment %>%
arrange(desc(TOTAL))
```
This shows that ***Cook County in Illinois*** has the most railroad employees. Given that the major city, Chicago is located in Cook County, this is not surprising. It is interesting that Cook county has almost twice as many railroad employees as the next county on the list, Tarrant county in Texas.
This data can also shed light on which **states** have the most railroad employees. We can use piping and the `group_by()`, `summarize()`, `sum()` and `arrange()` functions to show the states with the most railroad employees:
```{r}
employees_by_state <- RailroadEmployment %>%
group_by(STATE) %>%
summarize(total_state_employees=sum(TOTAL))
employees_by_state %>%
arrange(desc(total_state_employees))
```
This shows that ***Texas*** is the state with the most railroad employees.