Challenge 1 Submission

challenge_1
railroads
faostat
wildbirds
Reading in data and creating a post
Author

Matt Zambetti

Published

May 31, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a dataset, and

  2. describe the dataset using both words and any supporting information (e.g., tables)

Read in the Data

Read in one (or more) of the following data sets, using the correct R package and command.

  • railroad_2012_clean_county.csv ⭐
  • birds.csv ⭐⭐
  • FAOstat*.csv ⭐⭐
  • wild_bird_data.xlsx ⭐⭐⭐
  • StateCounty2012.xls ⭐⭐⭐⭐

Find the _data folder, located inside the posts folder. Then you can read in the data, using either one of the readr standard tidy read commands, or a specialized package such as readxl.

Code
# getting the data with the read.csv command
railroad_data <- read.csv("_data/railroad_2012_clean_county.csv")

# printing the first 10 rows
head(railroad_data, 10)
   state               county total_employees
1     AE                  APO               2
2     AK            ANCHORAGE               7
3     AK FAIRBANKS NORTH STAR               2
4     AK               JUNEAU               3
5     AK    MATANUSKA-SUSITNA               2
6     AK                SITKA               1
7     AK SKAGWAY MUNICIPALITY              88
8     AL              AUTAUGA             102
9     AL              BALDWIN             143
10    AL              BARBOUR               1
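
The instructions mention readr's tidy read commands, while the chunk above uses base R's read.csv. As a rough sketch of the readr route, the block below reads a small stand-in file with read_csv. The temporary file and the name railroad_tbl are assumptions made for illustration; the rows are copied from the output above so the example runs without the _data folder.

```r
library(readr)

# Hypothetical stand-in file: a few rows copied from the output above,
# written to a temporary path so the sketch runs on its own.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "state,county,total_employees",
  "AE,APO,2",
  "AK,ANCHORAGE,7",
  "AK,FAIRBANKS NORTH STAR,2",
  "AL,AUTAUGA,102"
), tmp)

# read_csv() returns a tibble; col_types pins each column's type
# instead of leaving it to be guessed from the data.
railroad_tbl <- read_csv(
  tmp,
  col_types = cols(
    state = col_character(),
    county = col_character(),
    total_employees = col_integer()
  )
)

railroad_tbl
```

With the real file, the same call would presumably take "_data/railroad_2012_clean_county.csv" in place of tmp.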

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

From what we can see using the “head” command, there are three columns: state, county, and total_employees. Each row is one county, and total_employees is the number of railroad employees in that county.
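
A few base R commands can back up this description with the dimensions, column types, and a numeric summary. The sketch below runs on railroad_head, a hypothetical stand-in built from the ten rows printed by head above; on the full railroad_data object the same calls should apply, though the results will differ.

```r
# Stand-in for railroad_data: the ten rows printed by head() above
railroad_head <- data.frame(
  state = c("AE", "AK", "AK", "AK", "AK", "AK", "AK", "AL", "AL", "AL"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
             "MATANUSKA-SUSITNA", "SITKA", "SKAGWAY MUNICIPALITY",
             "AUTAUGA", "BALDWIN", "BARBOUR"),
  total_employees = c(2L, 7L, 2L, 3L, 2L, 1L, 88L, 102L, 143L, 1L)
)

# Number of rows (cases) and columns (variables)
dim(railroad_head)

# Column names and types
str(railroad_head)

# Range and quartiles of the employee counts
summary(railroad_head$total_employees)

# How many distinct state codes appear
n_states <- length(unique(railroad_head$state))
n_states
```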

I suspect the data was gathered either from an online registry kept by the Department of Labor or a similar agency, or through a survey. In the survey case, the surveyor would ask each station to report how many employees it had, and the responses would then be totaled by county.

Code
# Let's sort the data by employee count, largest first
sorted <- railroad_data[order(railroad_data$total_employees, decreasing = TRUE),]

# 10 counties with the most employees
head(sorted, 10)
     state           county total_employees
659     IL             COOK            8207
2585    TX          TARRANT            4235
1747    NE          DOUGLAS            3797
1932    NY          SUFFOLK            3685
2685    VA INDEPENDENT CITY            3249
301     FL            DUVAL            3073
195     CA   SAN BERNARDINO            2888
178     CA      LOS ANGELES            2545
2484    TX           HARRIS            2535
1773    NE          LINCOLN            2289
Code
# 10 counties with the fewest employees
tail(sorted, 10)
     state    county total_employees
2582    TX  STEPHENS               1
2583    TX STONEWALL               1
2609    TX   WILLACY               1
2625    UT     GRAND               1
2634    UT    SEVIER               1
2691    VA LANCASTER               1
2715    VA  RICHMOND               1
2757    WA     FERRY               1
2759    WA  GARFIELD               1
2865    WV    GILMER               1

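The first table above lists the 10 counties with the most railroad employees, and the second lists 10 counties with the fewest. Every row in the second table has a single employee, and other counties (for example SITKA, AK and BARBOUR, AL in the first preview) also have only one, so the second table shows just 10 of the counties tied at the minimum.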
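
Since the post already loads the tidyverse, the same top and bottom views could also be sketched with dplyr's slice_max and slice_min, which make the handling of ties explicit. The block below is only an illustration: railroad_sample is a hypothetical stand-in built from rows copied out of the tables above, not the full dataset.

```r
library(dplyr)

# Small stand-in sample: rows copied from the tables above
railroad_sample <- tibble(
  state = c("IL", "TX", "NE", "AK", "AK", "AL", "WV"),
  county = c("COOK", "TARRANT", "DOUGLAS", "ANCHORAGE",
             "SITKA", "BARBOUR", "GILMER"),
  total_employees = c(8207, 4235, 3797, 7, 1, 1, 1)
)

# Three counties with the most employees, exactly three rows
top_counties <- railroad_sample %>%
  slice_max(total_employees, n = 3, with_ties = FALSE)

# Every county tied at the minimum; with_ties = TRUE keeps all ties
bottom_counties <- railroad_sample %>%
  slice_min(total_employees, n = 1, with_ties = TRUE)

top_counties
bottom_counties
```

Keeping the ties in the bottom group avoids singling out an arbitrary subset of the counties that all report one employee.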

These two tables could help with the challenge of distributing funds. For example, a county with more railroad employees could receive more funding, since its payroll is presumably higher than that of a county with fewer.

Also, counties with more employees are likely to be more populous, leading to more wear and tear on their stations. So, the upkeep costs in these counties are probably greater as well.