Challenge 1 Submission

challenge_1
Reading in data and creating a post
Author

Tanmay Agrawal

Published

December 20, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a dataset, and

  2. describe the dataset using both words and any supporting information (e.g., tables)

Read in the Data

Read in one (or more) of the following data sets, using the correct R package and command.

  • railroad_2012_clean_county.csv ⭐
  • birds.csv ⭐⭐
  • FAOstat*.csv ⭐⭐
  • wild_bird_data.xlsx ⭐⭐⭐
  • StateCounty2012.xlsx ⭐⭐⭐⭐

Find the _data folder, located inside the posts folder. Then you can read in the data, using either one of the readr standard tidy read commands, or a specialized package such as readxl.

Code
# load the libs
library(readr)
library(readxl)

# read the data using standard csv loading function
data <- read_csv("../posts/_data/railroad_2012_clean_county.csv")

Add any comments or documentation as needed. More challenging data sets may require additional code chunks and documentation.
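When a file is less tidy than this one, it can help to declare the column types instead of letting readr guess them. The following is a minimal sketch under stated assumptions: it writes the first six rows of the railroad file (the ones printed by head() further down) to a temporary file so it runs without the _data folder, and the names sample_path and railroad_sample are introduced here for illustration only. Using col_integer() for the counts is also an assumption; the post itself lets readr parse the column as a double.

```r
library(readr)

# Illustrative only: recreate the first six rows of the railroad file in a
# temporary CSV so the example does not depend on the _data folder.
sample_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "state,county,total_employees",
    "AE,APO,2",
    "AK,ANCHORAGE,7",
    "AK,FAIRBANKS NORTH STAR,2",
    "AK,JUNEAU,3",
    "AK,MATANUSKA-SUSITNA,2",
    "AK,SITKA,1"
  ),
  sample_path
)

# Declare the column types up front so a malformed value is reported as a
# parsing problem rather than silently changing a column's type.
railroad_sample <- read_csv(
  sample_path,
  col_types = cols(
    state = col_character(),
    county = col_character(),
    total_employees = col_integer()  # assumption: counts are whole numbers
  )
)

# Zero rows here means every value parsed as declared.
problems(railroad_sample)
```

The same cols() specification could be passed when reading the full file, in which case problems() would list any rows that did not match the declared types.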

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

Code
# Describe the data with str(): the summary of columns, data types, and sizes shows that there are 3 columns and 2,930 rows.
str(data)
spc_tbl_ [2,930 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ state          : chr [1:2930] "AE" "AK" "AK" "AK" ...
 $ county         : chr [1:2930] "APO" "ANCHORAGE" "FAIRBANKS NORTH STAR" "JUNEAU" ...
 $ total_employees: num [1:2930] 2 7 2 3 2 1 88 102 143 1 ...
 - attr(*, "spec")=
  .. cols(
  ..   state = col_character(),
  ..   county = col_character(),
  ..   total_employees = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
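Beyond str(), a few base R checks can help pin down what a case is in this data. The sketch below is illustrative rather than part of the original analysis: railroad_sample is a hypothetical name, and it holds only the first four rows visible in the str() output above, not the full file.

```r
# Illustrative sample: the first four rows shown in the str() output,
# rebuilt by hand so the example runs on its own.
railroad_sample <- data.frame(
  state = c("AE", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU"),
  total_employees = c(2, 7, 2, 3)
)

dim(railroad_sample)                      # number of rows, then columns
colSums(is.na(railroad_sample))           # missing values per column
summary(railroad_sample$total_employees)  # min, quartiles, mean, max

# Returns 0 when every state/county pair appears only once, which would
# support reading each row as one county within one state.
anyDuplicated(railroad_sample[c("state", "county")])
```

Running the same four calls on the full data object would show whether the one-row-per-county reading holds for all 2,930 rows.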
Code
# show the first few entries using the head command
head(data)
# A tibble: 6 × 3
  state county               total_employees
  <chr> <chr>                          <dbl>
1 AE    APO                                2
2 AK    ANCHORAGE                          7
3 AK    FAIRBANKS NORTH STAR               2
4 AK    JUNEAU                             3
5 AK    MATANUSKA-SUSITNA                  2
6 AK    SITKA                              1
Code
# From a cursory analysis, the dataset appears to record the number of railroad employees in each county, along with the county's state.


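# We can show the three highest county-level `total_employees` counts along with their states.
# Note that distinct(state, total_employees) drops the county column, so the county names themselves are not shown.
data %>%
  distinct(state, total_employees) %>%
  arrange(desc(total_employees)) %>%
  head(3)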
# A tibble: 3 × 2
  state total_employees
  <chr>           <dbl>
1 IL               8207
2 TX               4235
3 NE               3797
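The pipeline above keeps only state and total_employees, so the county names drop out of the result. One way to keep them is sketched below, assuming dplyr's slice_max() and a small sample rebuilt from the six rows printed by head(); railroad_sample and top_counties are names introduced here for illustration, and the sample is not the full file.

```r
library(dplyr)

# Illustrative sample: the six rows printed by head(), not the full file.
railroad_sample <- tibble(
  state = c("AE", "AK", "AK", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
             "MATANUSKA-SUSITNA", "SITKA"),
  total_employees = c(2, 7, 2, 3, 2, 1)
)

# Keep every column so each count stays attached to its county.
# with_ties = FALSE caps the result at exactly n rows even when counts tie.
top_counties <- railroad_sample %>%
  slice_max(total_employees, n = 3, with_ties = FALSE)

top_counties
```

Applied to the full data object, the same call would list the three largest counties by name alongside their states.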
Code
# Similarly, we can look at the bottom 3. At least three state/count pairs tie at the minimum of 1, so head(3) only shows the first of them.
data %>%
  distinct(state, total_employees) %>%
  arrange(total_employees) %>%
  head(3)
# A tibble: 3 × 2
  state total_employees
  <chr>           <dbl>
1 AK                  1
2 AL                  1
3 AP                  1

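We can also look at the distinct values in the state column. There are 53 of them, more than the 50 US states, so the column includes some entries that are not states but are treated as such for railroad employee data. Among the codes visible below, DC is the District of Columbia, while AE and AP are the USPS codes for Armed Forces Europe and Armed Forces Pacific, which matches the county value APO (a military post office designation) listed for AE. These look like military mail addresses rather than overseas territories.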

Code
data %>%
  distinct(state)
# A tibble: 53 × 1
   state
   <chr>
 1 AE   
 2 AK   
 3 AL   
 4 AP   
 5 AR   
 6 AZ   
 7 CA   
 8 CO   
 9 CT   
10 DC   
# … with 43 more rows
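To see which codes fall outside the 50 states, we can compare them against state.abb, a vector of state abbreviations that ships with base R. This sketch uses only the ten codes visible in the printed output (the other 43 are not shown, so they are left out), and the names visible_codes and non_state_codes are hypothetical.

```r
# The ten codes visible in the printed output; the remaining 43 are not shown.
visible_codes <- c("AE", "AK", "AL", "AP", "AR", "AZ", "CA", "CO", "CT", "DC")

# state.abb is a built-in character vector of the 50 US state abbreviations.
# setdiff() keeps the codes that are not among them, in their original order.
non_state_codes <- setdiff(visible_codes, state.abb)

non_state_codes  # "AE" "AP" "DC"
```

On the full data, setdiff(unique(data$state), state.abb) would give the complete list of non-state codes.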

Overall, the dataset is a simple record of railroad employees by state and county, which could be used to allocate resources to these states based on their needs and requirements.
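If the goal is a state-level view, the county rows can be rolled up with group_by() and summarise(). The following is a rough sketch on the six rows printed by head(), not the full file; railroad_sample, state_totals, and the output column names counties and employees are all introduced here for illustration.

```r
library(dplyr)

# Illustrative sample: the six rows printed by head(), not the full file.
railroad_sample <- tibble(
  state = c("AE", "AK", "AK", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
             "MATANUSKA-SUSITNA", "SITKA"),
  total_employees = c(2, 7, 2, 3, 2, 1)
)

# One row per state: how many counties it has in the data and the sum of
# their employee counts, sorted from the largest total to the smallest.
state_totals <- railroad_sample %>%
  group_by(state) %>%
  summarise(
    counties = n(),
    employees = sum(total_employees),
    .groups = "drop"
  ) %>%
  arrange(desc(employees))

state_totals
```

In this sample, AK comes first with 5 counties and 15 employees, followed by AE with 1 county and 2 employees.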