library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Matt Zambetti
May 31, 2023
Today’s challenge is to
1) read in a dataset, and
2) describe the dataset using both words and any supporting information (e.g., tables, etc)
Read in one (or more) of the following data sets, using the correct R package and command.
Find the _data folder, located inside the posts folder. Then you can read in the data, using either one of the readr standard tidy read commands, or a specialized package such as readxl.
   state               county total_employees
1     AE                  APO               2
2     AK            ANCHORAGE               7
3     AK FAIRBANKS NORTH STAR               2
4     AK               JUNEAU               3
5     AK    MATANUSKA-SUSITNA               2
6     AK                SITKA               1
7     AK SKAGWAY MUNICIPALITY              88
8     AL              AUTAUGA             102
9     AL              BALDWIN             143
10    AL              BARBOUR               1
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
From the output of head(), the data has three columns: state (a two-letter postal abbreviation), county, and total_employees, the number of railroad employees in that county. Each row (case) is one county within a state.
I suspect the data was gathered either from an online registry kept by the Department of Labor or a similar agency, or through a survey in which the surveyor asked each station to report how many employees worked there.
     state           county total_employees
659     IL             COOK            8207
2585    TX          TARRANT            4235
1747    NE          DOUGLAS            3797
1932    NY          SUFFOLK            3685
2685    VA INDEPENDENT CITY            3249
301     FL            DUVAL            3073
195     CA   SAN BERNARDINO            2888
178     CA      LOS ANGELES            2545
2484    TX           HARRIS            2535
1773    NE          LINCOLN            2289

     state    county total_employees
2582    TX  STEPHENS               1
2583    TX STONEWALL               1
2609    TX   WILLACY               1
2625    UT     GRAND               1
2634    UT    SEVIER               1
2691    VA LANCASTER               1
2715    VA  RICHMOND               1
2757    WA     FERRY               1
2759    WA  GARFIELD               1
2865    WV    GILMER               1
Of the two tables above, the first contains the 10 counties with the most railroad employees, and the second shows 10 counties with the fewest. Every row in the second table has a single employee, and other counties (for example, SITKA in AK and BARBOUR in AL) also have just one, so these ten are only some of the counties tied at the minimum.
These two tables could help with the challenge of distributing funds. For example, a county with more railroad employees could receive more funding, since its payroll is higher than that of a county with fewer.
Also, counties with more employees are likely to have busier stations, leading to more wear and tear, so the upkeep of those stations is probably greater as well.
---
title: "Challenge 1 Submission"
author: "Matt Zambetti"
description: "Reading in data and creating a post"
date: "5/31/2023"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
- challenge_1
- railroads
- faostat
- wildbirds
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a dataset, and
2) describe the dataset using both words and any supporting information (e.g., tables, etc)
## Read in the Data
Read in one (or more) of the following data sets, using the correct R package and command.
- railroad_2012_clean_county.csv ⭐
- birds.csv ⭐⭐
- FAOstat\*.csv ⭐⭐
- wild_bird_data.xlsx ⭐⭐⭐
- StateCounty2012.xls ⭐⭐⭐⭐
Find the `_data` folder, located inside the `posts` folder. Then you can read in the data, using either one of the `readr` standard tidy read commands, or a specialized package such as `readxl`.
```{r}
#| warning: false
#| message: false
# getting the data with the read.csv command
railroad_data <- read.csv("_data/railroad_2012_clean_county.csv")
# printing the first 10 rows
head(railroad_data, 10)
```
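The chunk above uses base R's `read.csv()`. Since the challenge text points to `readr`, here is a minimal sketch of the equivalent `read_csv()` call. So that it runs without the `_data` folder, it writes three rows copied from the output above to a temporary file; the temporary path and the names `tmp_csv` and `railroad_tbl` are illustrative assumptions, not part of the original post.

```{r}
library(readr)

# Illustrative stand-in for the real CSV: three rows copied from the output above
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "state,county,total_employees",
  "AK,ANCHORAGE,7",
  "AL,AUTAUGA,102",
  "AL,BALDWIN,143"
), tmp_csv)

# read_csv() returns a tibble; col_types states the expected column types explicitly
railroad_tbl <- read_csv(tmp_csv, col_types = cols(
  state = col_character(),
  county = col_character(),
  total_employees = col_double()
))

railroad_tbl
```

On the real file, the same call would presumably take the path used in the chunk above in place of `tmp_csv`.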
## Describe the data
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
From the output of `head()`, the data has three columns: `state` (a two-letter postal abbreviation), `county`, and `total_employees`, the number of railroad employees in that county. Each row (case) is one county within a state.
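A quick structural check can back up the column description. The sketch below rebuilds the ten rows printed by `head()` as a small data frame, so it runs without the CSV; the name `railroad_sample` is made up for illustration. On the full data, the same calls would be applied to `railroad_data`.

```{r}
# Hypothetical sample: the ten rows shown by head(), typed in by hand
railroad_sample <- data.frame(
  state = c("AE", "AK", "AK", "AK", "AK", "AK", "AK", "AL", "AL", "AL"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
             "MATANUSKA-SUSITNA", "SITKA", "SKAGWAY MUNICIPALITY",
             "AUTAUGA", "BALDWIN", "BARBOUR"),
  total_employees = c(2, 7, 2, 3, 2, 1, 88, 102, 143, 1)
)

# Number of rows (cases) and columns (variables)
dim(railroad_sample)

# Type of each column
str(railroad_sample)

# Number of distinct state codes in the sample
length(unique(railroad_sample$state))
```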
I suspect the data was gathered either from an online registry kept by the Department of Labor or a similar agency, or through a survey in which the surveyor asked each station to report how many employees worked there.
```{r}
#| label: summary
# Sort the data by employee count, largest first
sorted <- railroad_data[order(railroad_data$total_employees, decreasing = TRUE), ]
# 10 counties with the most employees
head(sorted, 10)
# 10 of the counties with the fewest employees
tail(sorted, 10)
```
Of the two tables above, the first contains the 10 counties with the most railroad employees, and the second shows 10 counties with the fewest. Every row in the second table has a single employee, and other counties (for example, SITKA in AK and BARBOUR in AL) also have just one, so these ten are only some of the counties tied at the minimum.
These two tables could help with the challenge of distributing funds. For example, a county with more railroad employees could receive more funding, since its payroll is higher than that of a county with fewer.
Also, counties with more employees are likely to have busier stations, leading to more wear and tear, so the upkeep of those stations is probably greater as well.
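Because several counties share the lowest employee count, `tail()` on the sorted data returns only some of them. One way to handle this, sketched below, is to filter for every row at the minimum instead. The data frame `railroad_subset` is a hand-typed sample of rows from the tables above and is an assumption for illustration; the real analysis would use `railroad_data`.

```{r}
# Hypothetical sample: a few rows copied from the tables above
railroad_subset <- data.frame(
  state = c("IL", "TX", "NE", "AK", "AL", "TX", "UT"),
  county = c("COOK", "TARRANT", "DOUGLAS", "SITKA", "BARBOUR", "STEPHENS", "GRAND"),
  total_employees = c(8207, 4235, 3797, 1, 1, 1, 1)
)

# Sort largest first, the same way as the summary chunk
sorted_subset <- railroad_subset[order(railroad_subset$total_employees,
                                       decreasing = TRUE), ]

# Top three counties by employee count
head(sorted_subset, 3)

# Every county tied at the minimum, instead of a fixed-size tail()
min_employees <- min(railroad_subset$total_employees)
tied_at_min <- railroad_subset[railroad_subset$total_employees == min_employees, ]
nrow(tied_at_min)
```

Filtering on the minimum keeps the result independent of how the sort happens to order tied rows.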