DACSS 601: Data Science Fundamentals - FALL 2022

Challenge 1 Solutions

challenge_1
railroads
faostat
wildbirds
Author

Caitlin Rowley

Published

September 16, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

I will be using the “Railroad_2012” data set for this Challenge.

Code
# readr is loaded with the tidyverse above; install it once from the console if it is missing
# install.packages("readr")

# read in the data and assign it to `railroad`

railroad <- read_csv("_data/railroad_2012_clean_county.csv")
railroad
# A tibble: 2,930 × 3
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# … with 2,920 more rows
Code
# note: a backslash is an escape character in R strings, so Windows paths must use '\\' or '/'
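One way to sidestep the backslash issue is to build the path with base R's `file.path()`, which joins its arguments with forward slashes on every platform. This is a minimal sketch, assuming the file sits in a `_data` folder relative to the working directory, as in the chunk above:

```r
# build the path from its components; file.path() uses "/" as the separator
data_path <- file.path("_data", "railroad_2012_clean_county.csv")
data_path

# inside an R string a backslash starts an escape sequence,
# so a literal backslash has to be typed as "\\"
backslash_length <- nchar("\\")  # counts one character, not two
backslash_length
```

The resulting `data_path` could then be passed to `read_csv()` in place of the hand-typed string.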

This data set shows the number of railroad employees (‘total_employees’) in each county (‘county’) within each state (‘state’). The data were collected in 2012, most likely by a government entity.

The variables in this data set are (1) state, (2) county, and (3) total employees. Each case is one county within a state: the row records the state’s two-letter abbreviation, the name of the county, and the number of railroad employees in that county.
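If each case really is one county within one state, no state-county pair should appear twice. The sketch below checks this on the ten rows printed earlier, re-entered by hand for illustration. The object names `railroad_sample`, `n_pairs`, and `counties_per_state` are hypothetical, and the same check would need to be run on the full `railroad` tibble to confirm the claim.

```r
library(dplyr)

# the first ten rows of the printed tibble, typed in by hand (illustration only)
railroad_sample <- data.frame(
  state = c("AE", "AK", "AK", "AK", "AK", "AK", "AK", "AL", "AL", "AL"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
             "MATANUSKA-SUSITNA", "SITKA", "SKAGWAY MUNICIPALITY",
             "AUTAUGA", "BALDWIN", "BARBOUR"),
  total_employees = c(2, 7, 2, 3, 2, 1, 88, 102, 143, 1)
)

# count the distinct state-county pairs and compare with the number of rows
n_pairs <- railroad_sample %>%
  distinct(state, county) %>%
  nrow()
n_pairs == nrow(railroad_sample)

# number of counties recorded for each state in the sample
counties_per_state <- railroad_sample %>%
  count(state, name = "n_counties")
counties_per_state
```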

Code
# description of data
# include: (1) how data was collected, (2) cases and variables, (3) interpretation of data and any useful details.

# descriptive statistics
# run a summary function

summary(railroad)
    state              county          total_employees  
 Length:2930        Length:2930        Min.   :   1.00  
 Class :character   Class :character   1st Qu.:   7.00  
 Mode  :character   Mode  :character   Median :  21.00  
                                       Mean   :  87.18  
                                       3rd Qu.:  65.00  
                                       Max.   :8207.00  

There are 2,930 cases in this data set. The number of employees per county ranges from a minimum of 1 to a maximum of 8,207. The median is 21 and the mean is about 87, so the distribution is right-skewed: a small number of counties with many employees pull the mean well above the median. I will check the dimensions of the data frame to confirm the number of cases.
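The gap between the mean and the median can be illustrated on a small scale. The sketch below uses only the `total_employees` values from the ten rows printed earlier, so its numbers differ from the full-data summary. The variable names are hypothetical.

```r
# total_employees for the first ten printed rows (illustration only)
employees <- c(2, 7, 2, 3, 2, 1, 88, 102, 143, 1)

sample_mean <- mean(employees)
sample_median <- median(employees)

# a mean well above the median points to a right-skewed distribution
c(mean = sample_mean, median = sample_median)

# share of values that fall below the mean
share_below_mean <- mean(employees < sample_mean)
share_below_mean
```

Even in this small sample, a few large counties lift the mean far above the typical value, which is the same pattern the full summary shows.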

Code
# descriptive statistics
# check dimensions to be sure

dim(railroad)
[1] 2930    3
Code
# Confirmed that there are 2,930 cases in this data set and 3 variables. 

I will next generate a visualization.

Code
# generate scatterplot:
# note: the first call has no geom layer, so it draws only an empty canvas

ggplot(railroad, aes(total_employees))

Code
ggplot(railroad, aes(state, total_employees)) + geom_point()

The scatterplot shows that the county with 8,207 employees is a clear outlier.

Next, I will identify the 10 counties with the fewest employees and the 10 counties with the most employees.
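The visual impression can be backed by a simple numeric screen. The sketch below applies Tukey's 1.5 × IQR rule to the quartiles reported by `summary(railroad)`. The rule is an assumption brought in for illustration and is not part of the original analysis.

```r
# quartiles as reported by summary(railroad)
q1 <- 7
q3 <- 65

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as potential outliers
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
upper_fence

# the maximum reported by summary() lies far beyond the fence
max_employees <- 8207
max_employees > upper_fence
```

The rule flags every county above the fence, so it works as a coarse screen rather than a way to single out one county.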

Code
# sort in ascending order and keep the first 10 rows:

railroad %>%
  arrange(total_employees) %>%
  slice(1:10)
# A tibble: 10 × 3
   state county   total_employees
   <chr> <chr>              <dbl>
 1 AK    SITKA                  1
 2 AL    BARBOUR                1
 3 AL    HENRY                  1
 4 AP    APO                    1
 5 AR    NEWTON                 1
 6 CA    MONO                   1
 7 CO    BENT                   1
 8 CO    CHEYENNE               1
 9 CO    COSTILLA               1
10 CO    DOLORES                1

Sorting the data in ascending order shows the counties with the fewest railroad employees:

Sitka County, AK (1); Barbour County, AL (1); Henry County, AL (1); APO, AP (1), which appears to be an overseas military address rather than a county; Newton County, AR (1); Mono County, CA (1); Bent County, CO (1); Cheyenne County, CO (1); Costilla County, CO (1); and Dolores County, CO (1). All ten have a single employee, so this list is simply the first ten such rows in the sorted output, and other counties may share the minimum.
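Because every county in the output above has a single employee, `slice(1:10)` may cut off other counties tied at the same value. The sketch below shows `slice_min()`, which keeps all tied rows by default, assuming dplyr 1.0.0 or later. It runs on the ten rows printed at the start of the post, re-entered by hand, and the object names are hypothetical.

```r
library(dplyr)

# the first ten rows of the printed tibble, typed in by hand (illustration only)
railroad_sample <- data.frame(
  state = c("AE", "AK", "AK", "AK", "AK", "AK", "AK", "AL", "AL", "AL"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
             "MATANUSKA-SUSITNA", "SITKA", "SKAGWAY MUNICIPALITY",
             "AUTAUGA", "BALDWIN", "BARBOUR"),
  total_employees = c(2, 7, 2, 3, 2, 1, 88, 102, 143, 1)
)

# slice_min() keeps every row tied for the smallest value (with_ties = TRUE)
lowest <- railroad_sample %>%
  slice_min(total_employees, n = 1)
lowest

# the same count with an explicit filter on the minimum
n_at_min <- railroad_sample %>%
  filter(total_employees == min(total_employees)) %>%
  nrow()
n_at_min
```

Running the same filter on the full `railroad` tibble would show how many counties actually share the minimum.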

Code
# sort in descending order and keep the first 10 rows:

railroad %>%
  arrange(desc(total_employees)) %>%
  slice(1:10)
# A tibble: 10 × 3
   state county           total_employees
   <chr> <chr>                      <dbl>
 1 IL    COOK                        8207
 2 TX    TARRANT                     4235
 3 NE    DOUGLAS                     3797
 4 NY    SUFFOLK                     3685
 5 VA    INDEPENDENT CITY            3249
 6 FL    DUVAL                       3073
 7 CA    SAN BERNARDINO              2888
 8 CA    LOS ANGELES                 2545
 9 TX    HARRIS                      2535
10 NE    LINCOLN                     2289

Sorting the data in descending order shows the counties with the most railroad employees:

Cook County, IL (8,207); Tarrant County, TX (4,235); Douglas County, NE (3,797); Suffolk County, NY (3,685); Independent City, VA (3,249); Duval County, FL (3,073); San Bernardino County, CA (2,888); Los Angeles County, CA (2,545); Harris County, TX (2,535); and Lincoln County, NE (2,289). The Virginia entry is not a county name. Virginia has independent cities that are not part of any county, so this row most likely groups employees located in one or more of those cities, and no finer geographic detail is available for that value.
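The sort-then-slice pattern used above can also be written in one step with `slice_max()`, assuming dplyr 1.0.0 or later. The sketch below compares the two approaches on five of the rows from the output, re-entered by hand in shuffled order. The object names are hypothetical.

```r
library(dplyr)

# five rows from the printed top ten, in shuffled order (illustration only)
top_sample <- data.frame(
  state = c("NE", "IL", "VA", "TX", "NY"),
  county = c("DOUGLAS", "COOK", "INDEPENDENT CITY", "TARRANT", "SUFFOLK"),
  total_employees = c(3797, 8207, 3249, 4235, 3685)
)

# approach used in the post: sort descending, then take the first rows
by_arrange <- top_sample %>%
  arrange(desc(total_employees)) %>%
  slice(1:3)

# slice_max() sorts and selects in a single call
by_slice_max <- top_sample %>%
  slice_max(total_employees, n = 3)

# both approaches return the same counties in the same order
identical(by_arrange$county, by_slice_max$county)
```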

A closer look at the cases suggests that not every row corresponds to an existing U.S. state and county. Codes such as AE and AP, paired with the county name APO, appear to be overseas military addresses, and the data may also be incomplete.
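One way to find such rows is to compare the state codes against R's built-in `state.abb` vector of the 50 state abbreviations. The sketch below uses only the codes visible in the outputs above. Adding "DC" by hand is an assumption, since `state.abb` does not include it, and the variable names are hypothetical.

```r
# state codes that appear in the printed outputs (illustration only)
codes_seen <- c("AE", "AK", "AL", "AP", "AR", "CA", "CO",
                "FL", "IL", "NE", "NY", "TX", "VA")

# state.abb ships with base R; DC is appended by hand
known_codes <- c(state.abb, "DC")

# codes that do not match any state or DC
non_state_codes <- setdiff(codes_seen, known_codes)
non_state_codes
```

On the full data, `setdiff(unique(railroad$state), known_codes)` would list every code that falls outside this set.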
