challenge_1
Author

Pavan Datta Abbineni

Published

August 15, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

I’ve decided to use the railroad_2012_clean_county.csv dataset.

Code
railroadCompleteData <- read_csv("_data/railroad_2012_clean_county.csv")


Describe the data

Code
head(railroadCompleteData)
# A tibble: 6 × 3
  state county               total_employees
  <chr> <chr>                          <dbl>
1 AE    APO                                2
2 AK    ANCHORAGE                          7
3 AK    FAIRBANKS NORTH STAR               2
4 AK    JUNEAU                             3
5 AK    MATANUSKA-SUSITNA                  2
6 AK    SITKA                              1
Code
tail(railroadCompleteData)
# A tibble: 6 × 3
  state county     total_employees
  <chr> <chr>                <dbl>
1 WY    SHERIDAN               252
2 WY    SUBLETTE                 3
3 WY    SWEETWATER             196
4 WY    UINTA                   49
5 WY    WASHAKIE                10
6 WY    WESTON                  37

For a dataset to be in tidy format, it needs to satisfy the following conditions:
1) Each variable has its own column.
2) Each value is in its own cell.
3) Each observation is located in its own row.

From the head() and tail() output above, the dataset appears to already be in tidy format: each row is a single state-county observation, and state, county, and total_employees each have their own column.
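The tidy-format conditions can also be checked in code rather than by eye. The following is a minimal sketch, assuming that one observation means one state-county pair. To stay self-contained it uses a small hand-copied sample of the rows printed above (railroadSample is a name introduced here for illustration); substituting railroadCompleteData would run the same check on the full file.

```r
library(dplyr)
library(tibble)

# Sample rows copied from the head()/tail() output above (illustrative subset,
# not the full dataset).
railroadSample <- tribble(
  ~state, ~county,                ~total_employees,
  "AE",   "APO",                  2,
  "AK",   "ANCHORAGE",            7,
  "AK",   "FAIRBANKS NORTH STAR", 2,
  "WY",   "UINTA",                49,
  "WY",   "WASHAKIE",             10,
  "WY",   "WESTON",               37
)

# Each observation in its own row: no state-county pair should repeat.
duplicatePairs <- railroadSample %>%
  count(state, county) %>%
  filter(n > 1)

# Each value in its own cell: count missing entries per column.
missingPerColumn <- colSums(is.na(railroadSample))

nrow(duplicatePairs)  # zero rows here means every pair is unique
missingPerColumn
```

If duplicatePairs comes back empty and no column has missing values, that supports (but does not prove) the conclusion that the data is tidy.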

Code
nrow(railroadCompleteData)
[1] 2930

Our dataset has a total of 2930 rows.

Code
ncol(railroadCompleteData)
[1] 3
Code
colnames(railroadCompleteData)
[1] "state"           "county"          "total_employees"

The dataset has 3 columns: state, county, and total_employees.
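The row count, column count, and column names can also be obtained in fewer calls. This is a small sketch using dim() and dplyr's glimpse() on an illustrative sample copied from the output above (railroadSample is a hypothetical name, not part of the original analysis).

```r
library(dplyr)
library(tibble)

# Sample rows copied from the head() output above (illustrative subset).
railroadSample <- tribble(
  ~state, ~county,                ~total_employees,
  "AE",   "APO",                  2,
  "AK",   "ANCHORAGE",            7,
  "AK",   "FAIRBANKS NORTH STAR", 2,
  "AK",   "JUNEAU",               3,
  "AK",   "MATANUSKA-SUSITNA",    2,
  "AK",   "SITKA",                1
)

# dim() returns the number of rows and columns together.
sampleDims <- dim(railroadSample)
sampleDims

# glimpse() lists every column with its type and first few values.
glimpse(railroadSample)
```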

Code
stateNames <- railroadCompleteData$state
countyNames <- railroadCompleteData$county
unique(stateNames)
 [1] "AE" "AK" "AL" "AP" "AR" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "GA" "HI" "IA"
[16] "ID" "IL" "IN" "KS" "KY" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC"
[31] "ND" "NE" "NH" "NJ" "NM" "NV" "NY" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN"
[46] "TX" "UT" "VA" "VT" "WA" "WI" "WV" "WY"
Code
length(unique(stateNames))
[1] 53

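There are 53 unique values in the state column. That is more than 50 because, in addition to the 50 states, the column includes DC and the military postal codes AE and AP.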
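One way to see which of the 53 codes are not among the 50 states is to compare them with state.abb, base R's built-in vector of state abbreviations. The sketch below copies the codes from the output above into a vector (stateCodes is a name introduced here); with the full data loaded, unique(railroadCompleteData$state) could be used instead.

```r
# The 53 codes printed by unique(stateNames) above, copied verbatim.
stateCodes <- c(
  "AE", "AK", "AL", "AP", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL",
  "GA", "HI", "IA", "ID", "IL", "IN", "KS", "KY", "LA", "MA", "MD", "ME",
  "MI", "MN", "MO", "MS", "MT", "NC", "ND", "NE", "NH", "NJ", "NM", "NV",
  "NY", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VA",
  "VT", "WA", "WI", "WV", "WY"
)

# state.abb ships with base R (datasets package) and holds the 50 state codes.
nonStateCodes <- setdiff(stateCodes, state.abb)
nonStateCodes

# Number of codes that are actual states.
length(stateCodes) - length(nonStateCodes)
```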

Code
# Number of rows (county records) per state, in ascending order
tableOfCompleteData <- table(railroadCompleteData$state)
tableOfCompleteData[order(tableOfCompleteData)]

 AE  AP  DC  DE  HI  RI  AK  CT  NH  MA  NV  VT  AZ  ME  NJ  WY  MD  UT  NM  OR 
  1   1   1   3   3   5   6   8  10  12  12  14  15  16  21  22  24  25  29  33 
 ID  WA  SC  ND  SD  MT  WV  CA  CO  NY  LA  PA  AL  FL  WI  AR  OK  MI  MS  MN 
 36  39  46  49  52  53  53  55  57  61  63  65  67  67  69  72  73  78  78  86 
 OH  NE  TN  IN  VA  NC  KS  IA  IL  MO  KY  GA  TX 
 88  89  91  92  92  94  95  99 103 115 119 152 221 

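Texas (221) and Georgia (152) have the most county records, whereas eight state codes (AE, AP, DC, DE, HI, RI, AK, and CT) have fewer than 10. Note that this table counts rows per state, that is, counties with railroad employees, not employees. Ranking states by employment would require summing total_employees within each state.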
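Because table() counts rows rather than employees, comparing states by employment needs a grouped sum of total_employees. The following is a minimal sketch with dplyr, run on the twelve rows printed by head() and tail() above. The names railroadSample and stateSummary are introduced here for illustration, and the totals cover only those sample rows, not the full dataset.

```r
library(dplyr)
library(tibble)

# The twelve rows shown in the head()/tail() output above (partial data only).
railroadSample <- tribble(
  ~state, ~county,                ~total_employees,
  "AE",   "APO",                  2,
  "AK",   "ANCHORAGE",            7,
  "AK",   "FAIRBANKS NORTH STAR", 2,
  "AK",   "JUNEAU",               3,
  "AK",   "MATANUSKA-SUSITNA",    2,
  "AK",   "SITKA",                1,
  "WY",   "SHERIDAN",             252,
  "WY",   "SUBLETTE",             3,
  "WY",   "SWEETWATER",           196,
  "WY",   "UINTA",                49,
  "WY",   "WASHAKIE",             10,
  "WY",   "WESTON",               37
)

# Per state: number of county rows and total employees, largest employer first.
stateSummary <- railroadSample %>%
  group_by(state) %>%
  summarise(
    counties  = n(),
    employees = sum(total_employees)
  ) %>%
  arrange(desc(employees))

stateSummary
```

Replacing railroadSample with railroadCompleteData would give the per-state employee totals for the whole file, which is the comparison the county counts alone cannot provide.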

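This data was likely compiled from official railroad employment or payroll records, since the number of employees currently on payroll is information the railroads themselves hold. The source is not stated in the file, so this is an inference.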