library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
Tanmay Agrawal
December 20, 2022
Today's challenge is to
1) read in a dataset, and
2) describe the dataset using both words and any supporting information (e.g., tables, etc.)
Read in one (or more) of the following data sets, using the correct R package and command.
Find the _data folder, located inside the posts folder. Then you can read in the data, using either one of the readr standard tidy read commands, or a specialized package such as readxl.
Add any comments or documentation as needed. More challenging data sets may require additional code chunks and documentation.
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
spc_tbl_ [2,930 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ state : chr [1:2930] "AE" "AK" "AK" "AK" ...
$ county : chr [1:2930] "APO" "ANCHORAGE" "FAIRBANKS NORTH STAR" "JUNEAU" ...
$ total_employees: num [1:2930] 2 7 2 3 2 1 88 102 143 1 ...
- attr(*, "spec")=
.. cols(
.. state = col_character(),
.. county = col_character(),
.. total_employees = col_double()
.. )
- attr(*, "problems")=<externalptr>
# A tibble: 6 × 3
state county total_employees
<chr> <chr> <dbl>
1 AE APO 2
2 AK ANCHORAGE 7
3 AK FAIRBANKS NORTH STAR 2
4 AK JUNEAU 3
5 AK MATANUSKA-SUSITNA 2
6 AK SITKA 1
# From a cursory look, the dataset records the number of railroad employees in each county, together with the county's state.
# Show the three highest `total_employees` values with their states.
# Note: distinct(state, total_employees) drops the county column, so county names are not displayed here.
data %>%
  distinct(state, total_employees) %>%
  slice_max(total_employees, n = 3)
# A tibble: 3 × 2
state total_employees
<chr> <dbl>
1 IL 8207
2 TX 4235
3 NE 3797
# A tibble: 3 × 2
state total_employees
<chr> <dbl>
1 AK 1
2 AL 1
3 AP 1
We can also look at the distinct values in the state column. There are 53 of them, more than the 50 US states, so the column contains some entries that are not states. The rows shown include DC (the District of Columbia) as well as AE and AP, which are US military postal codes (Armed Forces Europe and Armed Forces Pacific) rather than overseas territories; the AE row lists its county as APO, a military post office designation.
# A tibble: 53 × 1
state
<chr>
1 AE
2 AK
3 AL
4 AP
5 AR
6 AZ
7 CA
8 CO
9 CT
10 DC
# … with 43 more rows
Overall, the dataset is a simple record of railroad employee counts by state and county, which could be used to allocate resources to states according to their needs.
---
title: "Challenge 1 Submission"
author: "Tanmay Agrawal"
description: "Reading in data and creating a post"
date: "12/20/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_1
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a dataset, and
2) describe the dataset using both words and any supporting information (e.g., tables, etc.)
## Read in the Data
Read in one (or more) of the following data sets, using the correct R package and command.
- railroad_2012_clean_county.csv ⭐
- birds.csv ⭐⭐
- FAOstat\*.csv ⭐⭐
- wild_bird_data.xlsx ⭐⭐⭐
- StateCounty2012.xlsx ⭐⭐⭐⭐
Find the `_data` folder, located inside the `posts` folder. Then you can read in the data, using either one of the `readr` standard tidy read commands, or a specialized package such as `readxl`.
```{r}
# load the packages
library(readr)
library(readxl)
# read the CSV file with readr's read_csv()
data <- read_csv("../posts/_data/railroad_2012_clean_county.csv")
```
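The column types can also be declared up front instead of being guessed by `read_csv()`. The following is a minimal sketch, assuming the file has the three columns `state`, `county`, and `total_employees`. The inline rows are copied from the first lines of the data and stand in for the real file, and the names `sample_csv` and `sample_data` are illustrative only.

```{r}
library(readr)

# Inline stand-in for the CSV file (assumption: same header as the real file)
sample_csv <- I("state,county,total_employees
AE,APO,2
AK,ANCHORAGE,7
AK,FAIRBANKS NORTH STAR,2
")

# Declare the expected type of each column instead of relying on type guessing
sample_data <- read_csv(
  sample_csv,
  col_types = cols(
    state = col_character(),
    county = col_character(),
    total_employees = col_double()
  )
)

sample_data
```

With the real file, the same `col_types` argument could be passed alongside the file path used above.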
Add any comments or documentation as needed. More challenging data sets may require additional code chunks and documentation.
## Describe the data
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
```{r}
#| label: summary
# Describe the data with str(): it gives a brief summary of the columns, their types, and sizes.
# The output shows 3 columns and 2,930 rows.
str(data)
# show the first few entries using the head command
head(data)
# From a cursory look, the dataset records the number of railroad employees in each county, together with the county's state.
# Show the three highest `total_employees` values with their states.
# Note: distinct(state, total_employees) drops the county column, so county names are not displayed here.
data %>%
  distinct(state, total_employees) %>%
  slice_max(total_employees, n = 3)
# Similarly, look at the bottom 3. Several state/count pairs share the minimum value of 1,
# so head(3) simply returns the first three of them.
data %>%
  distinct(state, total_employees) %>%
  arrange(total_employees) %>%
  head(3)
```
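Because `distinct(state, total_employees)` discards the `county` column, the county names never appear in the ranking. The sketch below shows one possible alternative that keeps the county and also sums employees per state. It runs on a six-row stand-in copied from `head(data)`, not on the full dataset, and the names `sample_data`, `top_counties`, and `state_totals` are illustrative.

```{r}
library(tidyverse)

# Stand-in rows copied from head(data); the real data has 2,930 rows
sample_data <- tibble(
  state = c("AE", "AK", "AK", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
             "MATANUSKA-SUSITNA", "SITKA"),
  total_employees = c(2, 7, 2, 3, 2, 1)
)

# Three largest counties, keeping the county name;
# with_ties = FALSE returns exactly n rows even when counts tie
top_counties <- sample_data %>%
  slice_max(total_employees, n = 3, with_ties = FALSE)

# Employees summed per state, largest first
state_totals <- sample_data %>%
  group_by(state) %>%
  summarise(total_employees = sum(total_employees), .groups = "drop") %>%
  arrange(desc(total_employees))

top_counties
state_totals
```

Replacing `sample_data` with `data` would apply the same steps to the full file.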
We can also look at the distinct values in the state column. There are 53 of them, more than the 50 US states, so the column contains some entries that are not states. The first rows of the output include DC (the District of Columbia) as well as AE and AP, which are US military postal codes (Armed Forces Europe and Armed Forces Pacific) rather than overseas territories; the AE row lists its county as APO, a military post office designation.
```{r}
data %>%
distinct(state)
```
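One way to see which codes are not US states is to compare them with `state.abb`, the vector of 50 state abbreviations that ships with base R. This is a small sketch using only the ten codes visible in the printed output of this chunk, so the result is illustrative rather than a full list; `visible_codes` and `non_state_codes` are made-up names.

```{r}
# The ten state codes visible in the printed output (43 more rows are not shown)
visible_codes <- c("AE", "AK", "AL", "AP", "AR", "AZ", "CA", "CO", "CT", "DC")

# state.abb is built into R and holds the 50 US state abbreviations
non_state_codes <- setdiff(visible_codes, state.abb)
non_state_codes
```

On the full dataset, the same comparison could be written as `setdiff(unique(data$state), state.abb)`.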
Overall, the dataset is a simple record of railroad employee counts by state and county, which could be used to allocate resources to states according to their needs.