Challenge 1

challenge_1

Author

Pavan Datta Abbineni

Published

August 15, 2022

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

I’ve decided to use the railroad_2012_clean_county.csv dataset.

railroadCompleteData <- read_csv("_data/railroad_2012_clean_county.csv")

Describe the data

head(railroadCompleteData)

# A tibble: 6 × 3
  state county               total_employees
  <chr> <chr>                          <dbl>
1 AE    APO                                2
2 AK    ANCHORAGE                          7
3 AK    FAIRBANKS NORTH STAR               2
4 AK    JUNEAU                             3
5 AK    MATANUSKA-SUSITNA                  2
6 AK    SITKA                              1

tail(railroadCompleteData)

# A tibble: 6 × 3
  state county     total_employees
  <chr> <chr>                <dbl>
1 WY    SHERIDAN               252
2 WY    SUBLETTE                 3
3 WY    SWEETWATER             196
4 WY    UINTA                   49
5 WY    WASHAKIE                10
6 WY    WESTON                  37

For a dataset to be in tidy format, it needs to satisfy the following conditions:
1) Each variable has its own column, 2) each value has its own cell, and 3) each observation has its own row.

From the head() and tail() output above, each row is a single state-county observation, each column is one variable, and each cell holds one value, so the dataset appears to be in tidy format already.
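One part of the tidy conditions can be checked in code: if each row is one observation, no state-county pair should appear twice. The sketch below runs that check on a small excerpt (the six rows printed by head() above); `railroadSample` and `duplicatePairs` are illustrative names, not part of the original analysis. On the full data, the same pipeline would start from `railroadCompleteData`.

```r
library(tidyverse)

# Excerpt only: the six rows printed by head() above
railroadSample <- tribble(
  ~state, ~county,                ~total_employees,
  "AE",   "APO",                  2,
  "AK",   "ANCHORAGE",            7,
  "AK",   "FAIRBANKS NORTH STAR", 2,
  "AK",   "JUNEAU",               3,
  "AK",   "MATANUSKA-SUSITNA",    2,
  "AK",   "SITKA",                1
)

# Count how often each (state, county) pair occurs and keep any repeats
duplicatePairs <- railroadSample %>%
  count(state, county) %>%
  filter(n > 1)

# Zero rows here means no pair is repeated in the excerpt
nrow(duplicatePairs)
```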

nrow(railroadCompleteData)

[1] 2930

Our dataset has a total of 2930 rows.

ncol(railroadCompleteData)

[1] 3

colnames(railroadCompleteData)

[1] "state"           "county"          "total_employees"

The dataset has 3 columns: state, county, and total_employees.

stateNames <- railroadCompleteData$state
countyNames <- railroadCompleteData$county
unique(stateNames)

 [1] "AE" "AK" "AL" "AP" "AR" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "GA" "HI" "IA"
[16] "ID" "IL" "IN" "KS" "KY" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC"
[31] "ND" "NE" "NH" "NJ" "NM" "NV" "NY" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN"
[46] "TX" "UT" "VA" "VT" "WA" "WI" "WV" "WY"

length(unique(stateNames))

[1] 53

We can see that the state column has 53 unique values. These are not all states: they cover the 50 states plus DC and the military postal codes AE and AP.
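To see which of the 53 codes are not among the 50 states, the codes can be compared against `state.abb`, the built-in base R vector of state abbreviations. This is a minimal sketch: `stateCodes` is copied by hand from the unique() output above, and `nonStateCodes` is an illustrative name.

```r
# The 53 codes printed by unique(stateNames) above
stateCodes <- c(
  "AE", "AK", "AL", "AP", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL",
  "GA", "HI", "IA", "ID", "IL", "IN", "KS", "KY", "LA", "MA", "MD", "ME",
  "MI", "MN", "MO", "MS", "MT", "NC", "ND", "NE", "NH", "NJ", "NM", "NV",
  "NY", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VA",
  "VT", "WA", "WI", "WV", "WY"
)

# state.abb ships with base R and holds the 50 state abbreviations
nonStateCodes <- setdiff(stateCodes, state.abb)
nonStateCodes
# [1] "AE" "AP" "DC"
```

With the full dataset loaded, `setdiff(unique(railroadCompleteData$state), state.abb)` would do the same without the hand-copied vector.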

tableOfCompleteData <- table(railroadCompleteData$state)
tableOfCompleteData[order(tableOfCompleteData)]


 AE  AP  DC  DE  HI  RI  AK  CT  NH  MA  NV  VT  AZ  ME  NJ  WY  MD  UT  NM  OR 
  1   1   1   3   3   5   6   8  10  12  12  14  15  16  21  22  24  25  29  33 
 ID  WA  SC  ND  SD  MT  WV  CA  CO  NY  LA  PA  AL  FL  WI  AR  OK  MI  MS  MN 
 36  39  46  49  52  53  53  55  57  61  63  65  67  67  69  72  73  78  78  86 
 OH  NE  TN  IN  VA  NC  KS  IA  IL  MO  KY  GA  TX 
 88  89  91  92  92  94  95  99 103 115 119 152 221

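Note that table() counts rows, so these figures are the number of counties listed for each state, not the number of employees. Texas (221) and Georgia (152) have the most counties in the data, whereas eight codes (AE, AP, DC, DE, HI, RI, AK and CT) have fewer than 10.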
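Since table() counts rows rather than employees, ranking states by employment needs a sum of total_employees per state. The sketch below shows one way to do that with group_by() and summarise(), run on a 12-row excerpt (the rows printed by head() and tail() earlier), so the totals it produces describe the excerpt only and not the full states. `railroadSample` and `stateSummary` are illustrative names.

```r
library(tidyverse)

# Excerpt only: the rows printed by head() and tail() earlier
railroadSample <- tribble(
  ~state, ~county,                ~total_employees,
  "AE",   "APO",                  2,
  "AK",   "ANCHORAGE",            7,
  "AK",   "FAIRBANKS NORTH STAR", 2,
  "AK",   "JUNEAU",               3,
  "AK",   "MATANUSKA-SUSITNA",    2,
  "AK",   "SITKA",                1,
  "WY",   "SHERIDAN",             252,
  "WY",   "SUBLETTE",             3,
  "WY",   "SWEETWATER",           196,
  "WY",   "UINTA",                49,
  "WY",   "WASHAKIE",             10,
  "WY",   "WESTON",               37
)

# Per state: number of county rows and the summed employee count
stateSummary <- railroadSample %>%
  group_by(state) %>%
  summarise(
    counties = n(),
    totalEmployees = sum(total_employees)
  ) %>%
  arrange(desc(totalEmployees))

stateSummary
```

Replacing `railroadSample` with `railroadCompleteData` would give the per-state totals for the whole dataset.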

The data were most likely gathered from official railroad employment records, since the number of employees on payroll is information the railroads themselves hold. The 2012 in the file name suggests the counts refer to that year.