library(tidyverse)
library(summarytools)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, results = 'asis')
#setup for data frame summary
st_options(plain.ascii = FALSE)
Sarah McAlpine
September 12, 2022
Below I will read in the birds.csv data set and use a data frame summary (dfSummary) to summarize it.
Dimensions: 30977 x 5
Duplicates: 0
No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
---|---|---|---|---|---|---|
1 | Domain [character] | 1. Live Animals | 30977 (100.0%) | IIIIIIIIIIIIIIIIIIII | 30977 (100.0%) | 0 (0.0%) |
2 | Area [character] | 1. Africa 2. Asia 3. Eastern Asia 4. Egypt 5. Europe 6. France 7. Greece 8. Myanmar 9. Northern Africa 10. South-eastern Asia [ 238 others ] | 290 ( 0.9%) 290 ( 0.9%) 290 ( 0.9%) 290 ( 0.9%) 290 ( 0.9%) 290 ( 0.9%) 290 ( 0.9%) 290 ( 0.9%) 290 ( 0.9%) 290 ( 0.9%) 28077 (90.6%) | IIIIIIIIIIIIIIIIII | 30977 (100.0%) | 0 (0.0%) |
3 | Item [character] | 1. Chickens 2. Ducks 3. Geese and guinea fowls 4. Pigeons, other birds 5. Turkeys | 13074 (42.2%) 6909 (22.3%) 4136 (13.4%) 1165 ( 3.8%) 5693 (18.4%) | IIIIIIII IIII II III | 30977 (100.0%) | 0 (0.0%) |
4 | Year [numeric] | Mean (sd) : 1990.6 (16.7) min < med < max: 1961 < 1992 < 2018 IQR (CV) : 29 (0) | 58 distinct values | | 30977 (100.0%) | 0 (0.0%) |
5 | Value [numeric] | Mean (sd) : 99410.6 (720611.4) min < med < max: 0 < 1800 < 23707134 IQR (CV) : 15233 (7.2) | 11495 distinct values | : : : : : | 29941 (96.7%) | 1036 (3.3%) |
The dataset includes annual counts of poultry (chickens, ducks, geese and guinea fowls, pigeons and other birds, and turkeys), in thousands, from 1961 to 2018 worldwide. About 35% of the values are official figures, 32% are FAO estimates, 21% are aggregates, 5% are unofficial figures, 4% are FAO data based on imputation methodology, and 3% are marked as data not available. This appears to be a subset of a larger dataset, since some columns hold the same value in every row (for example, Domain is always "Live Animals").
Dimensions: 2930 x 3
Duplicates: 0
No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
---|---|---|---|---|---|---|
1 | state [character] | 1. TX 2. GA 3. KY 4. MO 5. IL 6. IA 7. KS 8. NC 9. IN 10. VA [ 43 others ] | 221 ( 7.5%) 152 ( 5.2%) 119 ( 4.1%) 115 ( 3.9%) 103 ( 3.5%) 99 ( 3.4%) 95 ( 3.2%) 94 ( 3.2%) 92 ( 3.1%) 92 ( 3.1%) 1748 (59.7%) | I I IIIIIIIIIII | 2930 (100.0%) | 0 (0.0%) |
2 | county [character] | 1. WASHINGTON 2. JEFFERSON 3. FRANKLIN 4. LINCOLN 5. JACKSON 6. MADISON 7. MONTGOMERY 8. CLAY 9. MARION 10. MONROE [ 1699 others ] | 31 ( 1.1%) 26 ( 0.9%) 24 ( 0.8%) 24 ( 0.8%) 22 ( 0.8%) 19 ( 0.6%) 18 ( 0.6%) 17 ( 0.6%) 17 ( 0.6%) 17 ( 0.6%) 2715 (92.7%) | IIIIIIIIIIIIIIIIII | 2930 (100.0%) | 0 (0.0%) |
3 | total_employees [numeric] | Mean (sd) : 87.2 (283.6) min < med < max: 1 < 21 < 8207 IQR (CV) : 58 (3.3) | 404 distinct values | : : : : : | 2930 (100.0%) | 0 (0.0%) |
This dataset includes the number of railroad employees by county and state. To make each row a single case, I used mutate() to combine county and state into one column, which disambiguates county names that appear in multiple states; however, I recognize this would duplicate some values and possibly inflate overall figures. Aside from the county name overlap, this is remarkably clean data, as there are no missing values and only three columns. I'm not sure why my tibble below isn't in table format.
county_ST total_employees
---
title: "Sarah McAlpine - Challenge 1"
author: "Sarah McAlpine"
description: "Reading in data and creating a post"
date: "9/12/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_1
  - railroads
  - birds
  - sarahmcalpine
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(summarytools)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE, results = 'asis')
#setup for data frame summary
st_options(plain.ascii = FALSE)
```
## Reading in the birds.csv Data
Below I will read in the birds.csv data set and use a data frame summary (`dfSummary`) to summarize it.
```{r}
# summarytools is already loaded in the setup chunk
# use read_csv to read in and assign the birds data
birds <- read_csv("_data/birds.csv")
# keep only the columns needed for the summary
simplebirds <- select(birds, Domain, Area, Item, Year, Value)
dfSummary(simplebirds, style = "grid")
```
### Summary of birds.csv
The dataset includes annual counts of poultry (chickens, ducks, geese and guinea fowls, pigeons and other birds, and turkeys), in thousands, from 1961 to 2018 worldwide. About 35% of the values are official figures, 32% are FAO estimates, 21% are aggregates, 5% are unofficial figures, 4% are FAO data based on imputation methodology, and 3% are marked as data not available. This appears to be a subset of a larger dataset, since some columns hold the same value in every row (for example, `Domain` is always "Live Animals").
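As a rough sketch of how percentages like these can be computed, the chunk below counts rows per flag and converts the counts to shares. It runs on a made-up toy tibble, and `flag_description` is a hypothetical column name, since the flag column is not part of the `simplebirds` selection above.

```{r}
library(dplyr)

# Toy stand-in for the birds data. The rows are invented and
# `flag_description` is a hypothetical column name (an assumption).
toy_birds <- tibble::tibble(
  Item = c("Chickens", "Chickens", "Ducks", "Turkeys"),
  flag_description = c("Official data", "FAO estimate",
                       "Official data", "Official data")
)

# Count rows per flag, then express each count as a percentage of all rows
flag_share <- toy_birds %>%
  count(flag_description, name = "n_rows") %>%
  mutate(pct = round(100 * n_rows / sum(n_rows), 1))

# kable() prints the result as a markdown table
knitr::kable(flag_share)
```

On the real data, the same `count()` and `mutate()` steps would apply to whichever column holds the flag text.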
## Reading in the Railroad Data
```{r}
#| label: summary
# read in the railroad data and summarize it
rr <- read_csv("_data/railroad_2012_clean_county.csv")
dfSummary(rr)
```
### Summary of Railroad Data
This dataset includes the number of railroad employees by county and state. To make each row a single case, I used `mutate()` to combine county and state into one column, which disambiguates county names that appear in multiple states; however, I recognize this would duplicate some values and possibly inflate overall figures. Aside from the county name overlap, this is remarkably clean data, as there are no missing values and only three columns. I'm not sure why my tibble below isn't in table format.
```{r}
# Name a new dataset with a combined county-and-state column
rr_case <- mutate(rr, county_ST = paste(county, state, sep = "_"))
# preview the data
head(select(rr_case, county_ST, total_employees))
```
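One likely reason the preview above does not render as a table is that the setup chunk sets `results = 'asis'` for every chunk, so the printed tibble is passed through as raw text. A minimal sketch of a workaround, using a made-up toy tibble with the same column names, is to hand the preview to `knitr::kable()`. The same sketch checks that the combined key identifies one row each and that `mutate()` leaves the employee total unchanged, since it adds a column rather than rows.

```{r}
library(dplyr)

# Toy stand-in for the railroad data (invented rows, same column names)
toy_rr <- tibble::tibble(
  state = c("TX", "GA", "KY"),
  county = c("WASHINGTON", "WASHINGTON", "CLAY"),
  total_employees = c(10, 4, 7)
)

# Combine county and state into a single identifying column
toy_case <- mutate(toy_rr, county_ST = paste(county, state, sep = "_"))

# TRUE when every combined key appears exactly once
keys_unique <- n_distinct(toy_case$county_ST) == nrow(toy_case)

# TRUE when adding the column did not change the employee total
total_unchanged <- sum(toy_case$total_employees) == sum(toy_rr$total_employees)

# kable() emits a markdown table, which suits results = 'asis'
knitr::kable(head(select(toy_case, county_ST, total_employees)))
```

Another option, under the same assumption about the cause, would be to set `results = 'markup'` on that one chunk so the tibble prints as ordinary console output.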