DACSS 601: Data Science Fundamentals - FALL 2022

Challenge 1 Abby Balint


Categories: challenge_1, railroads, abby_balint
Author

Abby Balint

Published

September 15, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a dataset, and

  2. describe the dataset using both words and any supporting information (e.g., tables)

Read in the Data

Read in one (or more) of the following data sets, using the correct R package and command.

  • railroad_2012_clean_county.csv ⭐
Code
# Preview the file; read_csv() comes from tidyverse, which the setup chunk already loaded
read_csv("_data/railroad_2012_clean_county.csv")
# A tibble: 2,930 × 3
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# … with 2,920 more rows
Code
# Store the data in an object for the steps that follow
railroad <- read_csv("_data/railroad_2012_clean_county.csv")

Add any comments or documentation as needed. More challenging data sets may require additional code chunks and documentation.
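One way to document the read step is to state the expected column types explicitly rather than letting readr guess them. The sketch below is only an illustration: it uses three rows copied from the preview above, supplied as literal text through `I()` (which assumes readr 2.0 or later), and the names `sample_csv` and `railroad_sample` are made up for this example. In the post itself the file path would be passed instead.

```r
library(readr)

# Three rows copied from the printed preview, passed as literal CSV text.
# In the post this would be the path "_data/railroad_2012_clean_county.csv".
sample_csv <- I("state,county,total_employees
AK,ANCHORAGE,7
AK,JUNEAU,3
AL,AUTAUGA,102
")

# Declare the column types up front instead of relying on type guessing.
railroad_sample <- read_csv(
  sample_csv,
  col_types = cols(
    state = col_character(),
    county = col_character(),
    total_employees = col_double()
  )
)

railroad_sample
```

Declaring the types this way makes the read fail loudly if a column is missing or has unexpected contents, which can be handy for messier files.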

Description

This data set contains the number of railroad employees in 2012, broken down by state and county. It is relatively limited: it has only three columns (state, county, and total_employees), which can be analyzed in conjunction with each other, and 2,930 rows, one per state and county combination. The range of employee counts across counties is quite wide, so comparing ranges and averages could be useful.

Code
colnames(railroad)
[1] "state"           "county"          "total_employees"
Code
dim(railroad)
[1] 2930    3
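To look at how wide the spread of county counts is, `range()` and `summary()` on the `total_employees` column are a natural next step. The sketch below runs them on the ten rows shown in the preview earlier in the post, typed in by hand so the example is self-contained; `railroad_head` and `employee_range` are names invented for this illustration, and the results describe only that sample, not the full 2,930 rows.

```r
library(tibble)

# The ten rows printed in the preview (illustration only).
railroad_head <- tribble(
  ~state, ~county,                ~total_employees,
  "AE",   "APO",                    2,
  "AK",   "ANCHORAGE",              7,
  "AK",   "FAIRBANKS NORTH STAR",   2,
  "AK",   "JUNEAU",                 3,
  "AK",   "MATANUSKA-SUSITNA",      2,
  "AK",   "SITKA",                  1,
  "AK",   "SKAGWAY MUNICIPALITY",  88,
  "AL",   "AUTAUGA",              102,
  "AL",   "BALDWIN",              143,
  "AL",   "BARBOUR",                1
)

# Smallest and largest county counts in the sample
employee_range <- range(railroad_head$total_employees)
employee_range

# Minimum, quartiles, median, mean, and maximum in one call
summary(railroad_head$total_employees)
```

On the full data, the same two calls would be run on `railroad$total_employees`.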

Filtering

Filtering the data to a single state makes it easier to look at the county-level breakdown of employees for that state. Using Kentucky as an example, I can now see that 119 counties (rows) are reported.

Code
# Preview the Kentucky rows; dplyr is already attached through tidyverse
filter(railroad, state == "KY")
# A tibble: 119 × 3
   state county   total_employees
   <chr> <chr>              <dbl>
 1 KY    ADAIR                  1
 2 KY    ALLEN                  5
 3 KY    ANDERSON               5
 4 KY    BALLARD                7
 5 KY    BARREN                 5
 6 KY    BATH                   3
 7 KY    BELL                  27
 8 KY    BOONE                236
 9 KY    BOURBON                8
10 KY    BOYD                 232
# … with 109 more rows
Code
railroadKY <- filter(railroad, `state` == "KY")
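Because each row is one county, the number of counties reported for every state can also be obtained in a single step, without filtering state by state. The sketch below assumes that reading of the data and uses the ten preview rows from the start of the post; `railroad_head` and `counties_per_state` are hypothetical names for this example.

```r
library(dplyr)
library(tibble)

# The ten rows printed in the first preview (illustration only).
railroad_head <- tribble(
  ~state, ~county,                ~total_employees,
  "AE",   "APO",                    2,
  "AK",   "ANCHORAGE",              7,
  "AK",   "FAIRBANKS NORTH STAR",   2,
  "AK",   "JUNEAU",                 3,
  "AK",   "MATANUSKA-SUSITNA",      2,
  "AK",   "SITKA",                  1,
  "AK",   "SKAGWAY MUNICIPALITY",  88,
  "AL",   "AUTAUGA",              102,
  "AL",   "BALDWIN",              143,
  "AL",   "BARBOUR",                1
)

# One row per county, so counting rows per state gives counties per state.
counties_per_state <- railroad_head %>%
  count(state, name = "n_counties")

counties_per_state
```

Applied to the full `railroad` object, the `KY` row of this table should match the 119 rows returned by the filter above.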

Average

Here I found the average number of railroad employees per county in Kentucky, and then listed the Kentucky counties with 200 or more employees.

Code
# Mean number of employees per Kentucky county
mean(railroadKY$total_employees)
[1] 40.42857
Code
# Kentucky counties with at least 200 employees
filter(railroadKY, total_employees >= 200)
# A tibble: 7 × 3
  state county    total_employees
  <chr> <chr>               <dbl>
1 KY    BOONE                 236
2 KY    BOYD                  232
3 KY    GREENUP               483
4 KY    JEFFERSON             413
5 KY    KENTON                244
6 KY    PIKE                  231
7 KY    WHITLEY               322
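Since a handful of counties have far more employees than the rest, it may help to report the median next to the mean. The sketch below does this with `summarise()` on the first ten Kentucky rows shown in the filtered preview; `ky_head` and `ky_summary` are names invented here, and the figures describe only those ten rows, not all 119 counties.

```r
library(dplyr)
library(tibble)

# First ten Kentucky rows from the filtered preview (illustration only).
ky_head <- tribble(
  ~county,    ~total_employees,
  "ADAIR",      1,
  "ALLEN",      5,
  "ANDERSON",   5,
  "BALLARD",    7,
  "BARREN",     5,
  "BATH",       3,
  "BELL",      27,
  "BOONE",    236,
  "BOURBON",    8,
  "BOYD",     232
)

# Mean and median side by side; a large gap between them signals skew.
ky_summary <- ky_head %>%
  summarise(
    n_counties       = n(),
    mean_employees   = mean(total_employees),
    median_employees = median(total_employees),
    max_employees    = max(total_employees)
  )

ky_summary
```

On these ten rows the mean (52.9) sits well above the median (6) because Boone and Boyd pull it up. Running the same `summarise()` call on `railroadKY` would show whether the full Kentucky data behaves the same way.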