Sarah McAlpine - Challenge 1

challenge_1

railroads

birds

sarahmcalpine

Author

Sarah McAlpine

Published

September 12, 2022

library(tidyverse)
library(summarytools)

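# show code, hide warnings and messages, and pass chunk output through as raw markdown
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, results = 'asis')

# set up summarytools so that dfSummary() renders as markdown
st_options(plain.ascii = FALSE)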

Reading in the birds.csv Data

Below I will read in the birds.csv data set and use a data frame summary (dfSummary) to summarize it.

# load the summary tools library
library(summarytools)

# use read_csv to read in and assign the birds data
birds <- read_csv("_data/birds.csv")
# keep only the columns needed for the summary
simplebirds <- select(birds, Domain, Area, Item, Year, Value)
# summarize each column in a grid-style table
dfSummary(simplebirds, style = "grid")

Data Frame Summary

simplebirds

Dimensions: 30977 x 5
Duplicates: 0

No	Variable	Stats / Values	Freqs (% of Valid)	Valid	Missing
1	Domain [character]	1. Live Animals	30977 (100.0%)	30977 (100.0%)	0 (0.0%)
2	Area [character]	1. Africa 2. Asia 3. Eastern Asia 4. Egypt 5. Europe 6. France 7. Greece 8. Myanmar 9. Northern Africa 10. South-eastern Asia [ 238 others ]	290 ( 0.9%) 290 ( 0.9%) 290 ( 0.9%) 290 ( 0.9%) 290 ( 0.9%) 290 ( 0.9%) 290 ( 0.9%) 290 ( 0.9%) 290 ( 0.9%) 290 ( 0.9%) 28077 (90.6%)	30977 (100.0%)	0 (0.0%)
3	Item [character]	1. Chickens 2. Ducks 3. Geese and guinea fowls 4. Pigeons, other birds 5. Turkeys	13074 (42.2%) 6909 (22.3%) 4136 (13.4%) 1165 ( 3.8%) 5693 (18.4%)	30977 (100.0%)	0 (0.0%)
4	Year [numeric]	Mean (sd) : 1990.6 (16.7) min < med < max: 1961 < 1992 < 2018 IQR (CV) : 29 (0)	58 distinct values	30977 (100.0%)	0 (0.0%)
5	Value [numeric]	Mean (sd) : 99410.6 (720611.4) min < med < max: 0 < 1800 < 23707134 IQR (CV) : 15233 (7.2)	11495 distinct values	29941 (96.7%)	1036 (3.3%)

Summary of birds.csv

The dataset includes annual counts, in thousands, of poultry (chickens, turkeys, ducks, geese and guinea fowls, pigeons and other birds) from 1961 to 2018 for countries and regions worldwide. About 35% of the values are official figures, 32% are FAO estimates, 21% are aggregates, 3% are marked as data not available, 5% are unofficial figures, and 4% are FAO data based on imputation methodology. This seems to be a subset of a larger dataset, since several columns hold the same value in every row; Domain, for example, is always "Live Animals".
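The two checks behind this summary (which columns never vary, and how the flag categories are distributed) can be sketched on a small stand-in data frame. The column name `Flag Description` and all of the toy values below are assumptions for illustration, not rows from birds.csv.

```r
library(dplyr)

# Toy stand-in for birds.csv; column names and values are assumed for illustration
toy_birds <- tibble(
  Domain = rep("Live Animals", 6),
  Item = c("Chickens", "Ducks", "Chickens", "Turkeys", "Chickens", "Ducks"),
  `Flag Description` = c("Official data", "FAO estimate", "Official data",
                         "Aggregate", "FAO estimate", "Official data")
)

# Count the distinct values in every column
distinct_counts <- toy_birds %>%
  summarise(across(everything(), n_distinct)) %>%
  unlist()

# Columns with exactly one distinct value are constant across all rows
constant_cols <- names(distinct_counts[distinct_counts == 1])
constant_cols

# Share of rows in each flag category, as a percentage
flag_share <- toy_birds %>%
  count(`Flag Description`, name = "n") %>%
  mutate(pct = round(100 * n / sum(n), 1))
flag_share
```

On the real data, the same two steps would be applied to `birds` rather than `toy_birds`, provided the flag column is named as assumed here.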

Reading in the Railroad Data

library(summarytools)
# read in the railroad data
rr <- read_csv("_data/railroad_2012_clean_county.csv")
# summarize each column
dfSummary(rr)

Data Frame Summary

rr

Dimensions: 2930 x 3
Duplicates: 0

No	Variable	Stats / Values	Freqs (% of Valid)	Valid
1	state [character]	1. TX 2. GA 3. KY 4. MO 5. IL 6. IA 7. KS 8. NC 9. IN 10. VA [ 43 others ]	221 ( 7.5%) 152 ( 5.2%) 119 ( 4.1%) 115 ( 3.9%) 103 ( 3.5%) 99 ( 3.4%) 95 ( 3.2%) 94 ( 3.2%) 92 ( 3.1%) 92 ( 3.1%) 1748 (59.7%)	2930 (100.0%)
2	county [character]	1. WASHINGTON 2. JEFFERSON 3. FRANKLIN 4. LINCOLN 5. JACKSON 6. MADISON 7. MONTGOMERY 8. CLAY 9. MARION 10. MONROE [ 1699 others ]	31 ( 1.1%) 26 ( 0.9%) 24 ( 0.8%) 24 ( 0.8%) 22 ( 0.8%) 19 ( 0.6%) 18 ( 0.6%) 17 ( 0.6%) 17 ( 0.6%) 17 ( 0.6%) 2715 (92.7%)	2930 (100.0%)
3	total_employees [numeric]	Mean (sd) : 87.2 (283.6) min < med < max: 1 < 21 < 8207 IQR (CV) : 58 (3.3)	404 distinct values	2930 (100.0%)

Summary of Railroad Data

This dataset includes the number of railroad employees by county within each state. To make each row a single, uniquely identified case, I used mutate() to combine county and state into one column, since county names such as WASHINGTON appear in multiple states; however, I recognize this could duplicate some values and possibly inflate overall figures. Aside from the county name overlap, this is remarkably clean data, as there are no missing values and only three columns. I’m not sure why my tibble below isn’t in table format.
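The concern about duplicated values can be checked directly. The sketch below uses invented rows (not taken from the railroad file) to test whether a combined county-and-state key is unique and whether adding it changes the row count or the employee total.

```r
library(dplyr)

# Toy stand-in for the railroad data; these rows are invented for illustration
toy_rr <- tibble(
  state = c("GA", "KY", "TX", "TX"),
  county = c("WASHINGTON", "WASHINGTON", "WASHINGTON", "TRAVIS"),
  total_employees = c(5, 12, 3, 40)
)

# Add the combined key, mirroring the county_ST column
toy_case <- toy_rr %>%
  mutate(county_ST = paste(county, state, sep = "_"))

# County names alone repeat across states
county_repeats <- n_distinct(toy_rr$county) < nrow(toy_rr)

# The combined key is unique per row
key_is_unique <- n_distinct(toy_case$county_ST) == nrow(toy_case)

# Adding a column changes neither the row count nor the employee total
same_rows <- nrow(toy_case) == nrow(toy_rr)
same_total <- sum(toy_case$total_employees) == sum(toy_rr$total_employees)

c(county_repeats, key_is_unique, same_rows, same_total)
```

In this toy example mutate() only adds a column, so the totals stay the same. Running the same comparisons on `rr` and `rr_case` would confirm whether that holds for the full data.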

# create a new data set with a combined county-and-state column
rr_case <- mutate(rr, county_ST = paste(county, state, sep = "_"))
# preview the data
head(select(rr_case, county_ST, total_employees))

A tibble: 6 × 2

county_ST total_employees 1 APO_AE 2 2 ANCHORAGE_AK 7 3 FAIRBANKS NORTH STAR_AK 2 4 JUNEAU_AK 3 5 MATANUSKA-SUSITNA_AK 2 6 SITKA_AK 1
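A likely reason the preview above runs together, assuming the global `results = 'asis'` option from the setup chunk applies here, is that the printed tibble is passed through as raw markdown, so its line breaks collapse into a single paragraph. One possible workaround is to print the preview with `knitr::kable()`, which emits a markdown table. The sketch below rebuilds three rows of the preview by hand; `preview` and `tbl` are names I made up for the example.

```r
library(knitr)

# Three rows copied by hand from the preview above
preview <- data.frame(
  county_ST = c("APO_AE", "ANCHORAGE_AK", "JUNEAU_AK"),
  total_employees = c(2, 7, 3)
)

# kable() with format = "pipe" returns the lines of a markdown pipe table,
# which still renders as a table when output is treated as raw markdown
tbl <- kable(preview, format = "pipe")
tbl
```

Another option, which I have not tested in this post, would be to set `results = 'markup'` for that one chunk so the tibble prints as preformatted console output.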