Reading in data and creating a post
Author

Poobigan Murugesan

Published

May 9, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a dataset, and

  2. describe the dataset using both words and any supporting information (e.g., tables)

Read in the Data

Read in one (or more) of the following data sets, using the correct R package and command.

  • railroad_2012_clean_county.csv ⭐
  • birds.csv ⭐⭐
  • FAOstat*.csv ⭐⭐
  • wild_bird_data.xlsx ⭐⭐⭐
  • StateCounty2012.xls ⭐⭐⭐⭐

Find the _data folder, located inside the posts folder. Then you can read in the data, using either one of the readr standard tidy read commands, or a specialized package such as readxl.

Loading the data and printing the top few rows

Code
df <- read_csv("_data/railroad_2012_clean_county.csv") 
head(df)
# A tibble: 6 × 3
  state county               total_employees
  <chr> <chr>                          <dbl>
1 AE    APO                                2
2 AK    ANCHORAGE                          7
3 AK    FAIRBANKS NORTH STAR               2
4 AK    JUNEAU                             3
5 AK    MATANUSKA-SUSITNA                  2
6 AK    SITKA                              1
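The call above lets readr guess the column types. As a minimal sketch, the types can also be declared explicitly so that a parsing surprise fails loudly. The temporary file and the name `railroad_sample` are assumptions made only so the example runs on its own; the six rows are copied from the printed output above.

```r
library(readr)

# Write a tiny stand-in file so the sketch runs anywhere.
# (Hypothetical file; the real data lives in _data/railroad_2012_clean_county.csv.)
sample_path <- tempfile(fileext = ".csv")
writeLines(c(
  "state,county,total_employees",
  "AE,APO,2",
  "AK,ANCHORAGE,7",
  "AK,FAIRBANKS NORTH STAR,2",
  "AK,JUNEAU,3",
  "AK,MATANUSKA-SUSITNA,2",
  "AK,SITKA,1"
), sample_path)

# Declare the column types up front instead of relying on type guessing.
railroad_sample <- read_csv(
  sample_path,
  col_types = cols(
    state = col_character(),
    county = col_character(),
    total_employees = col_double()
  )
)

railroad_sample
```

The same `col_types` argument should work unchanged when `sample_path` is swapped for the real file path.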

Describe the data

Using a combination of words and results of R commands, can you provide a high-level description of the data? Describe as efficiently as possible where and how the data was (likely) gathered, and indicate the cases and variables (both their interpretation and any details the reader needs to fully understand your chosen data).

Description of the railroad dataset

Code
str(df)
spc_tbl_ [2,930 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ state          : chr [1:2930] "AE" "AK" "AK" "AK" ...
 $ county         : chr [1:2930] "APO" "ANCHORAGE" "FAIRBANKS NORTH STAR" "JUNEAU" ...
 $ total_employees: num [1:2930] 2 7 2 3 2 1 88 102 143 1 ...
 - attr(*, "spec")=
  .. cols(
  ..   state = col_character(),
  ..   county = col_character(),
  ..   total_employees = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

Dimensions of the dataset (rows, columns) and names of columns

Code
dim(df)
[1] 2930    3
Code
colnames(df)
[1] "state"           "county"          "total_employees"

The output above shows that “railroad_2012_clean_county” records the number of railroad employees by county for the year 2012. The dataset has 2,930 rows and three columns: state (a two-letter code), county (the county name), and total_employees (the number of employees). Each row represents one state-county combination.
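One way to check what a row represents is to confirm that no state-county pair repeats and that no values are missing. The sketch below uses a small hand-typed sample (rows taken from the outputs in this post) under the hypothetical name `railroad_sample`; with the full data, `df` would take its place.

```r
library(dplyr)

# Small stand-in sample; rows copied from outputs shown in this post.
railroad_sample <- tibble::tribble(
  ~state, ~county,     ~total_employees,
  "AE",   "APO",       2,
  "AP",   "APO",       1,
  "AK",   "ANCHORAGE", 7,
  "AK",   "JUNEAU",    3,
  "AK",   "SITKA",     1
)

# State-county pairs that occur more than once.
# Zero rows here means every row is a distinct case.
duplicate_pairs <- railroad_sample %>%
  count(state, county) %>%
  filter(n > 1)

# Number of missing values in each column.
missing_per_column <- railroad_sample %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

duplicate_pairs
missing_per_column
```

Note that the county name alone is not a unique key: "APO" appears under both AE and AP, so the pair of columns is what identifies a row.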

Summarizing the data

Code
summary(df)
    state              county          total_employees  
 Length:2930        Length:2930        Min.   :   1.00  
 Class :character   Class :character   1st Qu.:   7.00  
 Mode  :character   Mode  :character   Median :  21.00  
                                       Mean   :  87.18  
                                       3rd Qu.:  65.00  
                                       Max.   :8207.00  
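In the summary, the mean (87.18) is far above the median (21), which suggests a right-skewed distribution in which a few large counties pull the average up. A minimal sketch of putting these statistics side by side follows; `railroad_sample` is a hypothetical stand-in built from rows shown in this post, so its numbers will differ from the full dataset.

```r
library(dplyr)

# Small stand-in sample; rows copied from outputs shown in this post.
railroad_sample <- tibble::tribble(
  ~state, ~county,       ~total_employees,
  "AK",   "SITKA",       1,
  "AK",   "JUNEAU",      3,
  "AK",   "ANCHORAGE",   7,
  "TX",   "HARRIS",      2535,
  "CA",   "LOS ANGELES", 2545,
  "IL",   "COOK",        8207
)

# Mean well above the median is a quick sign of right skew.
skew_check <- railroad_sample %>%
  summarize(
    mean_employees   = mean(total_employees),
    median_employees = median(total_employees),
    max_employees    = max(total_employees)
  )

skew_check
```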

Sorting data based on the total number of employees in increasing order

Code
arrange(df, total_employees)
# A tibble: 2,930 × 3
   state county   total_employees
   <chr> <chr>              <dbl>
 1 AK    SITKA                  1
 2 AL    BARBOUR                1
 3 AL    HENRY                  1
 4 AP    APO                    1
 5 AR    NEWTON                 1
 6 CA    MONO                   1
 7 CO    BENT                   1
 8 CO    CHEYENNE               1
 9 CO    COSTILLA               1
10 CO    DOLORES                1
# ℹ 2,920 more rows

Sorting data based on the total number of employees in decreasing order

Code
arrange(df, desc(total_employees))
# A tibble: 2,930 × 3
   state county           total_employees
   <chr> <chr>                      <dbl>
 1 IL    COOK                        8207
 2 TX    TARRANT                     4235
 3 NE    DOUGLAS                     3797
 4 NY    SUFFOLK                     3685
 5 VA    INDEPENDENT CITY            3249
 6 FL    DUVAL                       3073
 7 CA    SAN BERNARDINO              2888
 8 CA    LOS ANGELES                 2545
 9 TX    HARRIS                      2535
10 NE    LINCOLN                     2289
# ℹ 2,920 more rows

These outputs show that Cook County, IL has the most railroad employees (8,207), nearly twice as many as the next county, Tarrant, TX (4,235). At the other end, at least ten counties (every row on the first page of the ascending sort) have a single employee, which is the minimum in the dataset.
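Instead of sorting the whole table and reading off the first rows, the extremes can be pulled out directly. This is a sketch using dplyr's `slice_max()` and a filter on the minimum; `railroad_sample` is a hypothetical stand-in with rows copied from the sorted outputs above.

```r
library(dplyr)

# Small stand-in sample; rows copied from the sorted outputs in this post.
railroad_sample <- tibble::tribble(
  ~state, ~county,   ~total_employees,
  "AK",   "SITKA",   1,
  "AL",   "BARBOUR", 1,
  "AL",   "HENRY",   1,
  "NE",   "DOUGLAS", 3797,
  "TX",   "TARRANT", 4235,
  "IL",   "COOK",    8207
)

# The single county with the most employees.
top_county <- railroad_sample %>%
  slice_max(total_employees, n = 1)

# How many counties sit at the minimum employee count.
n_at_minimum <- railroad_sample %>%
  filter(total_employees == min(total_employees)) %>%
  nrow()

top_county
n_at_minimum
```

Run against the full `df`, `n_at_minimum` would give the exact number of single-employee counties rather than the "at least ten" visible in the printed output.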

Grouping railroad employees by state

Code
# Total railroad employees per state, sorted from largest to smallest
group <- df %>%
  group_by(state) %>%
  summarize(employees = sum(total_employees)) %>%
  arrange(desc(employees))
group
# A tibble: 53 × 2
   state employees
   <chr>     <dbl>
 1 TX        19839
 2 IL        19131
 3 NY        17050
 4 NE        13176
 5 CA        13137
 6 PA        12769
 7 OH         9056
 8 GA         8605
 9 IN         8537
10 MO         8419
# ℹ 43 more rows

The results above show that Texas has the most railroad employees (19,839), followed by Illinois (19,131) and New York (17,050). The grouped table has 53 rows, so the state column holds 53 distinct codes. That is more than the 50 U.S. states: the column also includes non-state postal codes such as AE and AP, which appear in the earlier outputs with the county APO and correspond to Armed Forces addresses.
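To see which values in the state column are not actual states, the codes can be compared against R's built-in `state.abb` vector, which lists the 50 state abbreviations. The sketch below is one possible approach; `railroad_sample`, `state_totals`, and `non_state_codes` are hypothetical names, and the sample rows are copied from outputs in this post.

```r
library(dplyr)

# Small stand-in sample; rows copied from outputs shown in this post.
railroad_sample <- tibble::tribble(
  ~state, ~county,     ~total_employees,
  "AE",   "APO",       2,
  "AP",   "APO",       1,
  "AK",   "ANCHORAGE", 7,
  "IL",   "COOK",      8207,
  "TX",   "TARRANT",   4235,
  "TX",   "HARRIS",    2535
)

# Employees and number of counties per state code, largest first.
state_totals <- railroad_sample %>%
  group_by(state) %>%
  summarize(
    employees = sum(total_employees),
    counties  = n()
  ) %>%
  arrange(desc(employees))

# Codes in the data that are not among the 50 states in state.abb.
non_state_codes <- setdiff(state_totals$state, state.abb)

state_totals
non_state_codes
```

Applied to the full `df`, `non_state_codes` would list every code that accounts for the gap between 53 and 50.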