DACSS 601: Data Science Fundamentals - FALL 2022

Sarah McAlpine - Challenge 1

challenge_1
railroads
birds
sarahmcalpine
Author

Sarah McAlpine

Published

September 12, 2022

Code
library(tidyverse)
library(summarytools)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE, results = 'asis')

#setup for data frame summary
st_options(plain.ascii = FALSE)

Reading in the birds.csv Data

Below I will read in the birds.csv data set and use a data frame summary (dfSummary) to summarize it.

Code
# load the summarytools library
library(summarytools)

# use read_csv to read in and assign the birds data
birds <- read_csv("_data/birds.csv")
# keep only the columns needed for the summary
simplebirds <- select(birds, Domain, Area, Item, Year, Value)
# summarize the selected columns as a grid-style table
dfSummary(simplebirds, style = "grid")

Data Frame Summary

simplebirds

Dimensions: 30977 x 5
Duplicates: 0

| No | Variable | Stats / Values | Freqs (% of Valid) | Valid | Missing |
|---|---|---|---|---|---|
| 1 | Domain [character] | 1. Live Animals | 30977 (100.0%) | 30977 (100.0%) | 0 (0.0%) |
| 2 | Area [character] | 1. Africa | 290 (0.9%) | 30977 (100.0%) | 0 (0.0%) |
| | | 2. Asia | 290 (0.9%) | | |
| | | 3. Eastern Asia | 290 (0.9%) | | |
| | | 4. Egypt | 290 (0.9%) | | |
| | | 5. Europe | 290 (0.9%) | | |
| | | 6. France | 290 (0.9%) | | |
| | | 7. Greece | 290 (0.9%) | | |
| | | 8. Myanmar | 290 (0.9%) | | |
| | | 9. Northern Africa | 290 (0.9%) | | |
| | | 10. South-eastern Asia | 290 (0.9%) | | |
| | | [ 238 others ] | 28077 (90.6%) | | |
| 3 | Item [character] | 1. Chickens | 13074 (42.2%) | 30977 (100.0%) | 0 (0.0%) |
| | | 2. Ducks | 6909 (22.3%) | | |
| | | 3. Geese and guinea fowls | 4136 (13.4%) | | |
| | | 4. Pigeons, other birds | 1165 (3.8%) | | |
| | | 5. Turkeys | 5693 (18.4%) | | |
| 4 | Year [numeric] | Mean (sd): 1990.6 (16.7) | 58 distinct values | 30977 (100.0%) | 0 (0.0%) |
| | | min < med < max: 1961 < 1992 < 2018 | | | |
| | | IQR (CV): 29 (0) | | | |
| 5 | Value [numeric] | Mean (sd): 99410.6 (720611.4) | 11495 distinct values | 29941 (96.7%) | 1036 (3.3%) |
| | | min < med < max: 0 < 1800 < 23707134 | | | |
| | | IQR (CV): 15233 (7.2) | | | |

Summary of birds.csv

The dataset includes annual counts of poultry (chickens, ducks, geese and guinea fowls, pigeons and other birds, and turkeys), reported in thousands, for 1961-2018 across 248 areas worldwide, including both countries and regional aggregates. About 35% of the records are official figures, 32% are FAO estimates, 21% are aggregates, 5% are unofficial figures, 4% are FAO data based on imputation methodology, and 3% are flagged as data not available. This seems to be a subset of a larger dataset, since many columns hold the same value in every row; Domain, for example, is always "Live Animals".
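The flag percentages and the single-valued columns described above can be checked with a few lines of dplyr. The sketch below is only an illustration on a made-up stand-in for `birds`: the `toy_birds` rows are invented, and the `Flag Description` column name is an assumption, since that column is not shown in the summary above.

```r
library(dplyr)

# Made-up stand-in for birds; `Flag Description` is an assumed column name
toy_birds <- data.frame(
  Domain = rep("Live Animals", 4),
  Item = c("Chickens", "Chickens", "Ducks", "Turkeys"),
  `Flag Description` = c("Official data", "FAO estimate",
                         "Official data", "Official data"),
  check.names = FALSE
)

# Share of rows under each flag, largest first
flag_shares <- toy_birds %>%
  count(`Flag Description`, sort = TRUE) %>%
  mutate(pct = round(100 * n / sum(n), 1))

# Columns that hold a single value in every row
constant_cols <- names(toy_birds)[sapply(toy_birds, n_distinct) == 1]

flag_shares
constant_cols
```

Running the same two steps on the real `birds` data frame, with the column name adjusted if needed, should show the flag shares and list the columns that never vary.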

Reading in the Railroad Data

Code
library(summarytools)
# read in the county-level railroad employment data
rr <- read_csv("_data/railroad_2012_clean_county.csv")
# summarize all three columns
dfSummary(rr)

Data Frame Summary

rr

Dimensions: 2930 x 3
Duplicates: 0

| No | Variable | Stats / Values | Freqs (% of Valid) | Valid | Missing |
|---|---|---|---|---|---|
| 1 | state [character] | 1. TX | 221 (7.5%) | 2930 (100.0%) | 0 (0.0%) |
| | | 2. GA | 152 (5.2%) | | |
| | | 3. KY | 119 (4.1%) | | |
| | | 4. MO | 115 (3.9%) | | |
| | | 5. IL | 103 (3.5%) | | |
| | | 6. IA | 99 (3.4%) | | |
| | | 7. KS | 95 (3.2%) | | |
| | | 8. NC | 94 (3.2%) | | |
| | | 9. IN | 92 (3.1%) | | |
| | | 10. VA | 92 (3.1%) | | |
| | | [ 43 others ] | 1748 (59.7%) | | |
| 2 | county [character] | 1. WASHINGTON | 31 (1.1%) | 2930 (100.0%) | 0 (0.0%) |
| | | 2. JEFFERSON | 26 (0.9%) | | |
| | | 3. FRANKLIN | 24 (0.8%) | | |
| | | 4. LINCOLN | 24 (0.8%) | | |
| | | 5. JACKSON | 22 (0.8%) | | |
| | | 6. MADISON | 19 (0.6%) | | |
| | | 7. MONTGOMERY | 18 (0.6%) | | |
| | | 8. CLAY | 17 (0.6%) | | |
| | | 9. MARION | 17 (0.6%) | | |
| | | 10. MONROE | 17 (0.6%) | | |
| | | [ 1699 others ] | 2715 (92.7%) | | |
| 3 | total_employees [numeric] | Mean (sd): 87.2 (283.6) | 404 distinct values | 2930 (100.0%) | 0 (0.0%) |
| | | min < med < max: 1 < 21 < 8207 | | | |
| | | IQR (CV): 58 (3.3) | | | |

Summary of Railroad Data

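This dataset gives the number of railroad employees in each county, by state. A case is one county within one state, but many county names recur across states (WASHINGTON, for example, appears 31 times), so I used mutate() to build a combined county-and-state column that identifies each case unambiguously. The new column repeats information already held in county and state, but it adds no rows, so employee totals are unaffected. Aside from the county name overlap, this is remarkably clean data, with no missing values and only three columns. The tibble preview below is not rendered as a table, most likely because the setup chunk sets results = 'asis' for every chunk, so the printed tibble is passed through as raw Markdown text rather than as formatted output.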

Code
# Create a new data set with a combined county-and-state column
rr_case <- mutate(rr, county_ST = paste(county, state, sep = "_"))
# Preview the data
head(select(rr_case, county_ST, total_employees))

A tibble: 6 × 2

county_ST total_employees 1 APO_AE 2 2 ANCHORAGE_AK 7 3 FAIRBANKS NORTH STAR_AK 2 4 JUNEAU_AK 3 5 MATANUSKA-SUSITNA_AK 2 6 SITKA_AK 1
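A quick way to see what the combined column does is to try it on a tiny example. The sketch below uses a made-up stand-in for `rr` (the `toy_rr` rows are invented for illustration) to check that the key is unique, that `mutate()` leaves the employee total unchanged, and that `knitr::kable()` is one possible way to get the preview into table form. It is an illustration under those assumptions, not a check of the real data.

```r
library(dplyr)

# Hypothetical stand-in for rr: WASHINGTON appears in two states
toy_rr <- data.frame(
  state = c("AL", "AR", "AR"),
  county = c("WASHINGTON", "WASHINGTON", "BENTON"),
  total_employees = c(5, 12, 3)
)

toy_case <- mutate(toy_rr, county_ST = paste(county, state, sep = "_"))

# County names alone repeat, but the combined key does not
county_repeats <- anyDuplicated(toy_case$county) > 0
key_is_unique <- anyDuplicated(toy_case$county_ST) == 0

# mutate() adds a column, not rows, so the total is unchanged
same_total <- sum(toy_case$total_employees) == sum(toy_rr$total_employees)

# knitr::kable() formats the preview as a Markdown table
preview <- knitr::kable(head(select(toy_case, county_ST, total_employees)))
preview
```

Applied to the real data, the same `anyDuplicated()` check on `rr_case$county_ST` would confirm whether every county-and-state pair is unique.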

Source Code
---
title: "Sarah McAlpine - Challenge 1"
author: "Sarah McAlpine"
description: "Reading in data and creating a post"
date: "9/12/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_1
  - railroads
  - birds
  - sarahmcalpine
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)
library(summarytools)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE, results = 'asis')

#setup for data frame summary
st_options(plain.ascii = FALSE)
```

## Reading in the birds.csv Data

Below I will read in the birds.csv data set and use a data frame summary (`dfSummary`) to summarize it. 

```{r}
# load the summarytools library
library(summarytools)

# use read_csv to read in and assign the birds data
birds <- read_csv("_data/birds.csv")
# keep only the columns needed for the summary
simplebirds <- select(birds, Domain, Area, Item, Year, Value)
# summarize the selected columns as a grid-style table
dfSummary(simplebirds, style = "grid")
```

### Summary of birds.csv

The dataset includes annual counts of poultry (chickens, ducks, geese and guinea fowls, pigeons and other birds, and turkeys), reported in thousands, for 1961-2018 across 248 areas worldwide, including both countries and regional aggregates. About 35% of the records are official figures, 32% are FAO estimates, 21% are aggregates, 5% are unofficial figures, 4% are FAO data based on imputation methodology, and 3% are flagged as data not available. This seems to be a subset of a larger dataset, since many columns hold the same value in every row; `Domain`, for example, is always "Live Animals".

## Reading in the Railroad Data


```{r}
#| label: summary
library(summarytools)
# read in the county-level railroad employment data
rr <- read_csv("_data/railroad_2012_clean_county.csv")
# summarize all three columns
dfSummary(rr)
```
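
### Summary of Railroad Data

This dataset gives the number of railroad employees in each county, by state. A case is one county within one state, but many county names recur across states (WASHINGTON, for example, appears 31 times), so I used `mutate()` to build a combined county-and-state column that identifies each case unambiguously. The new column repeats information already held in `county` and `state`, but it adds no rows, so employee totals are unaffected. Aside from the county name overlap, this is remarkably clean data, with no missing values and only three columns. The tibble preview below is not rendered as a table, most likely because the setup chunk sets `results = 'asis'` for every chunk, so the printed tibble is passed through as raw Markdown text rather than as formatted output.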

```{r}
# Create a new data set with a combined county-and-state column
rr_case <- mutate(rr, county_ST = paste(county, state, sep = "_"))
# Preview the data
head(select(rr_case, county_ST, total_employees))
```