Challenge 1

challenge_1

railroads

faostat

wildbirds

Reading in data and creating a post

Author

Sai Venkatesh

Published

April 12, 2023

Code

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

read in a dataset, and
describe the dataset using both words and any supporting information (e.g., tables)

Read in the Data

We are going to load the railroad data from the county-level file railroad_2012_clean_county.csv.

Code

railroad <- read.csv('_data/railroad_2012_clean_county.csv')

print("Let's load the data and see the dimensions and columns of the data.")

[1] "Let's load the data and see the dimensions and columns of the data."

Code

# The Dimensions 
dim(railroad)

[1] 2930    3

Code

# The Column Names 
colnames(railroad)

[1] "state"           "county"          "total_employees"

From the above, we can see that the railroad data has 2930 rows and 3 columns: state, county, and total_employees.

Code

print("The top rows of the data are:")

[1] "The top rows of the data are:"

Code

head(railroad)

  state               county total_employees
1    AE                  APO               2
2    AK            ANCHORAGE               7
3    AK FAIRBANKS NORTH STAR               2
4    AK               JUNEAU               3
5    AK    MATANUSKA-SUSITNA               2
6    AK                SITKA               1
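As a compact alternative to calling dim() and colnames() separately, str() reports the dimensions, column names, and column types in one call. The sketch below is only an illustration: railroad_sample is a hypothetical stand-in built from the six rows printed above, not the full 2930-row file.

```r
# Hypothetical stand-in for the full data: the six rows printed by head(railroad).
railroad_sample <- data.frame(
  state = c("AE", "AK", "AK", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
             "MATANUSKA-SUSITNA", "SITKA"),
  total_employees = c(2L, 7L, 2L, 3L, 2L, 1L)
)

# Dimensions, column names, and column types in a single call.
str(railroad_sample)

# Count missing values per column, then summarize the numeric column.
colSums(is.na(railroad_sample))
summary(railroad_sample$total_employees)
```

Running the same three calls on the full railroad data frame would show whether any column contains missing values before the counts are aggregated.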

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

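The data appears to record the number of railroad employees by county; the file name suggests the counts are for 2012. Each case (row) is one county, and the variables are state (a two-letter code), county (the county name in upper case), and total_employees (the number of railroad employees in that county). The state column is not limited to the 50 states: the first row has the code AE with county APO, which is a military mail address rather than a state and county.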
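One way to see which values in the state column are not U.S. state abbreviations is to compare them against state.abb, a vector of the 50 state abbreviations that ships with base R. This is a minimal sketch: state_codes is a hypothetical handful of codes taken from this post's output, and on the real data it would be replaced by railroad$state.

```r
# Hypothetical sample of codes that appear in this post's output;
# with the full data this would be railroad$state.
state_codes <- c("AE", "AK", "TX", "IL", "NY")

# state.abb is built into base R and holds the 50 state abbreviations
# (it does not include DC or military mail codes).
non_states <- setdiff(unique(state_codes), state.abb)

print(non_states)
# [1] "AE"
```

Any code returned here is worth a closer look when interpreting per-state totals, since it represents something other than one of the 50 states.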

Code

# Number of Employees Per State
print("The total number of employees in each state, ordered by the count:")

[1] "The total number of employees in each state, ordered by the count:"

Code

railroad %>%
  group_by(state) %>%
  summarize(state_total = sum(total_employees)) %>%
  arrange(desc(state_total))

# A tibble: 53 × 2
   state state_total
   <chr>       <int>
 1 TX          19839
 2 IL          19131
 3 NY          17050
 4 NE          13176
 5 CA          13137
 6 PA          12769
 7 OH           9056
 8 GA           8605
 9 IN           8537
10 MO           8419
# … with 43 more rows

Code

# Counties with more than 1000 employees
print("The counties with more than 1000 employees, ordered by the count:")

[1] "The counties with more than 1000 employees, ordered by the count:"

Code

railroad %>%
  filter(total_employees > 1000) %>%
  arrange(desc(total_employees))

   state           county total_employees
1     IL             COOK            8207
2     TX          TARRANT            4235
3     NE          DOUGLAS            3797
4     NY          SUFFOLK            3685
5     VA INDEPENDENT CITY            3249
6     FL            DUVAL            3073
7     CA   SAN BERNARDINO            2888
8     CA      LOS ANGELES            2545
9     TX           HARRIS            2535
10    NE          LINCOLN            2289
11    NY           NASSAU            2076
12    MO          JACKSON            2055
13    IN             LAKE            1999
14    IL             WILL            1784
15    PA     PHILADELPHIA            1649
16    NE        LANCASTER            1619
17    CA        RIVERSIDE            1567
18    CT        NEW HAVEN            1561
19    NY           QUEENS            1470
20    KS          JOHNSON            1286
21    DE       NEW CASTLE            1275
22    NE        BOX BUTTE            1168
23    NY         DUTCHESS            1157
24    PA            BUCKS            1106
25    NJ            ESSEX            1097
26    NY      WESTCHESTER            1040
27    WA             KING            1039

We can see that Texas has the most railroad employees of any state (19839), while Cook County, Illinois, has the most of any single county (8207).
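The two headline values can also be pulled out programmatically rather than read off the tables. The sketch below uses base R as a cross-check on the tidyverse pipeline. It assumes a hypothetical five-row sample, county_sample, copied from the table of large counties above, so its state totals cover only those rows and the top state in the sample (IL) differs from the top state in the full data (TX).

```r
# Hypothetical sample: five rows copied from the table of counties
# with more than 1000 employees. Not the full data set.
county_sample <- data.frame(
  state = c("IL", "IL", "TX", "TX", "NE"),
  county = c("COOK", "WILL", "TARRANT", "HARRIS", "DOUGLAS"),
  total_employees = c(8207L, 1784L, 4235L, 2535L, 3797L)
)

# Sum employees within each state (base R equivalent of group_by + summarize).
state_totals <- aggregate(total_employees ~ state, data = county_sample, FUN = sum)

# Pick the single largest state total and the single largest county row.
top_state <- state_totals[which.max(state_totals$total_employees), ]
top_county <- county_sample[which.max(county_sample$total_employees), ]

print(top_state)
print(top_county)
```

Applied to the full railroad data frame, the same two which.max() lookups should return the Texas total and the Cook County row reported above.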