challenge_1
maanusri balasubramanian
railroads
Reading in data and creating a post
Author

Maanusri Balasubramanian

Published

May 3, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a dataset, and

  2. describe the dataset using both words and any supporting information (e.g., tables, etc)

Read in the Data

Read in one (or more) of the following data sets, using the correct R package and command.

  • railroad_2012_clean_county.csv ⭐
  • birds.csv ⭐⭐
  • FAOstat*.csv ⭐⭐
  • wild_bird_data.xlsx ⭐⭐⭐
  • StateCounty2012.xls ⭐⭐⭐⭐

Find the _data folder, located inside the posts folder. Then you can read in the data, using either one of the readr standard tidy read commands, or a specialized package such as readxl.

Add any comments or documentation as needed. More challenging data sets may require additional code chunks and documentation.

Code
# loading the data
rr <- read_csv("_data/railroad_2012_clean_county.csv")

# printing the first 6 rows of the dataset (head() shows 6 rows by default)
head(rr)
# A tibble: 6 × 3
  state county               total_employees
  <chr> <chr>                          <dbl>
1 AE    APO                                2
2 AK    ANCHORAGE                          7
3 AK    FAIRBANKS NORTH STAR               2
4 AK    JUNEAU                             3
5 AK    MATANUSKA-SUSITNA                  2
6 AK    SITKA                              1
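
read_csv() guesses the column types when none are given. As a minimal sketch, the types can also be declared up front so that an unexpected value is reported as a parsing problem instead of silently changing a column's type. The snippet below is self-contained: it reads a few rows copied from the output above as literal text (wrapped in I()) rather than the real file, and the name rr_sample is only illustrative. Declaring total_employees as an integer is an assumption that the counts are always whole numbers.

```r
library(readr)

# A few rows copied from the head() output above, used as literal CSV text
# so the example runs without the _data folder.
csv_text <- "state,county,total_employees
AE,APO,2
AK,ANCHORAGE,7
AK,FAIRBANKS NORTH STAR,2
"

# Declare the column types explicitly instead of letting readr guess them.
# Assumption: employee counts are whole numbers, so col_integer() is used.
rr_sample <- read_csv(
  I(csv_text),
  col_types = cols(
    state = col_character(),
    county = col_character(),
    total_employees = col_integer()
  )
)

rr_sample
```

For the full dataset, the same col_types argument could be passed alongside the file path used above.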

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

Code
# description of the dataset
str(rr)
spc_tbl_ [2,930 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ state          : chr [1:2930] "AE" "AK" "AK" "AK" ...
 $ county         : chr [1:2930] "APO" "ANCHORAGE" "FAIRBANKS NORTH STAR" "JUNEAU" ...
 $ total_employees: num [1:2930] 2 7 2 3 2 1 88 102 143 1 ...
 - attr(*, "spec")=
  .. cols(
  ..   state = col_character(),
  ..   county = col_character(),
  ..   total_employees = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
Code
# number of rows and columns in the dataset
dim(rr)
[1] 2930    3
Code
# column names
colnames(rr)
[1] "state"           "county"          "total_employees"

From the above commands we can see that “railroad_2012_clean_county.csv” gives the number of railroad employees in each county of each state in 2012. There are 2,930 rows in total, and each row is one case: a county within a state, together with its employee count. There are 3 columns: state and county (both character) and total_employees (numeric).
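
The reading that each row is one county within a state can be checked directly by looking for repeated state-county pairs and for missing values. The sketch below uses a small hand-built sample (rr_sample, an illustrative name) made from the rows printed by head() above, so it runs on its own; the same calls could be applied to rr.

```r
library(dplyr)

# Sample rows copied from the head() output above.
rr_sample <- tibble(
  state = c("AE", "AK", "AK", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR",
             "JUNEAU", "MATANUSKA-SUSITNA", "SITKA"),
  total_employees = c(2, 7, 2, 3, 2, 1)
)

# State-county pairs that occur more than once; an empty result means
# every row is a distinct county within a state.
duplicate_pairs <- rr_sample %>%
  count(state, county) %>%
  filter(n > 1)

# Number of missing values in each column.
missing_per_column <- colSums(is.na(rr_sample))

duplicate_pairs
missing_per_column
```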

Code
# Summarizing the data with summary
summary(rr)
    state              county          total_employees  
 Length:2930        Length:2930        Min.   :   1.00  
 Class :character   Class :character   1st Qu.:   7.00  
 Mode  :character   Mode  :character   Median :  21.00  
                                       Mean   :  87.18  
                                       3rd Qu.:  65.00  
                                       Max.   :8207.00  
Code
# Arranging entries by total employees (ascending order)
arrange(rr, total_employees)
# A tibble: 2,930 × 3
   state county   total_employees
   <chr> <chr>              <dbl>
 1 AK    SITKA                  1
 2 AL    BARBOUR                1
 3 AL    HENRY                  1
 4 AP    APO                    1
 5 AR    NEWTON                 1
 6 CA    MONO                   1
 7 CO    BENT                   1
 8 CO    CHEYENNE               1
 9 CO    COSTILLA               1
10 CO    DOLORES                1
# ℹ 2,920 more rows
Code
# Arranging entries by total employees in descending order
arrange(rr, desc(total_employees))
# A tibble: 2,930 × 3
   state county           total_employees
   <chr> <chr>                      <dbl>
 1 IL    COOK                        8207
 2 TX    TARRANT                     4235
 3 NE    DOUGLAS                     3797
 4 NY    SUFFOLK                     3685
 5 VA    INDEPENDENT CITY            3249
 6 FL    DUVAL                       3073
 7 CA    SAN BERNARDINO              2888
 8 CA    LOS ANGELES                 2545
 9 TX    HARRIS                      2535
10 NE    LINCOLN                     2289
# ℹ 2,920 more rows

From the above results we know that the county ‘COOK’ in IL has the highest number of employees (8,207), and that the minimum for any county is 1 employee (many counties have only 1 employee).
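
As an alternative to sorting the whole table, the extremes can be pulled out directly. This is a sketch, not the approach used in the post: slice_max() and slice_min() from dplyr keep tied rows by default, which makes the "many counties have only 1 employee" observation visible in the result. The sample tibble (rr_sample, an illustrative name) is built from rows shown in the two arrange() outputs above.

```r
library(dplyr)

# Sample rows copied from the arrange() outputs above.
rr_sample <- tibble(
  state = c("IL", "TX", "NE", "AK", "AL", "AL"),
  county = c("COOK", "TARRANT", "DOUGLAS", "SITKA", "BARBOUR", "HENRY"),
  total_employees = c(8207, 4235, 3797, 1, 1, 1)
)

# County with the most employees.
largest <- rr_sample %>%
  slice_max(total_employees, n = 1)

# Counties with the fewest employees; ties are kept (with_ties = TRUE is
# the default), so every county sharing the minimum is returned.
smallest <- rr_sample %>%
  slice_min(total_employees, n = 1)

largest
smallest
```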

Code
# Grouping in terms of state to summarise
grouped_rr_state <- rr %>%
  group_by(state) %>%
  summarize(state_employees = sum(total_employees)) %>%
  arrange(desc(state_employees))
grouped_rr_state
# A tibble: 53 × 2
   state state_employees
   <chr>           <dbl>
 1 TX              19839
 2 IL              19131
 3 NY              17050
 4 NE              13176
 5 CA              13137
 6 PA              12769
 7 OH               9056
 8 GA               8605
 9 IN               8537
10 MO               8419
# ℹ 43 more rows
Code
dim(grouped_rr_state)
[1] 53  2

From the above results we know that TX has the highest number of railroad employees (19,839) and AP has the fewest (1).

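From the dimensions of grouped_rr_state, we know that the state column has 53 distinct values. That is more than the 50 U.S. states because the column also contains other codes, such as AE and AP (the military postal codes that appear with county APO in the outputs above).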
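
One way to see which values in the state column are not among the 50 state abbreviations is to compare them with state.abb, a vector that ships with base R. The sketch below runs on a small hand-built sample (rr_sample, an illustrative name) made from rows shown in the outputs above; running the same filter on rr would list every extra code in the full data.

```r
library(dplyr)

# Sample rows copied from the outputs above.
rr_sample <- tibble(
  state = c("AE", "AK", "AP", "IL", "TX"),
  county = c("APO", "ANCHORAGE", "APO", "COOK", "TARRANT"),
  total_employees = c(2, 7, 1, 8207, 4235)
)

# state.abb is a built-in character vector with the 50 state abbreviations.
# Keep only the codes in the data that are not in that vector.
extra_codes <- rr_sample %>%
  distinct(state) %>%
  filter(!state %in% state.abb)

extra_codes
```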