Challenge 1 Solution

challenge_1
railroads
faostat
wildbirds
Reading in data and creating a post
Author

Shreya Varma

Published

June 17, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a dataset, and

  2. describe the dataset using both words and any supporting information (e.g., tables)

Read in the Data

Solution: Because the file is a CSV, I use the read_csv() function to read the data, then view the first 10 rows to check that the data and headers were imported correctly.

Code
railroad_data <- read_csv("_data/railroad_2012_clean_county.csv")
head(railroad_data,10)
# A tibble: 10 × 3
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
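As a minimal sketch of the import step, the column types can be declared up front instead of being guessed. The inline text below is a hypothetical stand-in for the CSV file, built from a few rows of the output above so that it runs without the file; it assumes readr 2.0 or later, where literal data is wrapped in I().

```r
library(readr)

# Inline sample (assumption): a few rows copied from the output above,
# used in place of the real CSV so the sketch is self-contained.
sample_csv <- I("state,county,total_employees
AE,APO,2
AK,ANCHORAGE,7
AK,FAIRBANKS NORTH STAR,2
AL,AUTAUGA,102")

# Declaring column types skips type guessing and documents the expected layout.
railroad_sample <- read_csv(
  sample_csv,
  col_types = cols(
    state = col_character(),
    county = col_character(),
    total_employees = col_double()
  )
)

# stop_for_problems() raises an error if any value could not be parsed as declared.
stop_for_problems(railroad_sample)
railroad_sample
```

With the real file, the same col_types argument could be passed to the read_csv() call shown above.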

Describe the data

Solution: This dataset appears to record the number of railroad employees in the US at the state and county level, with 2,930 rows. Cook County, Illinois, has the most employees (8,207), while several counties have only 1 employee, which is the minimum. The mean number of employees per county is about 87, well above the median of 21, so a few large counties pull the average up. County names are duplicated across states (for example, APO appears under both AE and AP), so a county name alone is not unique and the state is needed to identify a row. In total there are 255,432 employees across 53 distinct state codes; that is more than 50 because the column includes non-state codes such as AE and AP. No column has missing values.

Code
spec(railroad_data)
cols(
  state = col_character(),
  county = col_character(),
  total_employees = col_double()
)
Code
summary(railroad_data)
    state              county          total_employees  
 Length:2930        Length:2930        Min.   :   1.00  
 Class :character   Class :character   1st Qu.:   7.00  
 Mode  :character   Mode  :character   Median :  21.00  
                                       Mean   :  87.18  
                                       3rd Qu.:  65.00  
                                       Max.   :8207.00  
Code
n_distinct(railroad_data$state)
[1] 53
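One way to see why the count exceeds 50 is to compare the codes against state.abb, the vector of 50 state abbreviations that ships with base R. The sketch below uses a hypothetical handful of codes taken from the outputs in this post, not the full column.

```r
# Sample state codes seen in this post's outputs (assumption: a small subset,
# not the full railroad_data$state column).
sample_states <- c("AE", "AK", "AK", "AL", "AP", "AR", "IL")

# state.abb is a base R dataset holding the 50 US state abbreviations.
# setdiff() keeps the codes that are not among them.
extra_codes <- setdiff(unique(sample_states), state.abb)
extra_codes
```

Running the same setdiff() on railroad_data$state would list every non-state code in the real data.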
Code
sum(railroad_data$total_employees)
[1] 255432
Code
colSums(is.na(railroad_data))
          state          county total_employees 
              0               0               0 
Code
any(duplicated(railroad_data$county))
[1] TRUE
Code
subset(railroad_data,duplicated(county))
# A tibble: 1,221 × 3
   state county    total_employees
   <chr> <chr>               <dbl>
 1 AP    APO                     1
 2 AR    CALHOUN                 5
 3 AR    CLAY                   13
 4 AR    CLEBURNE                8
 5 AR    DALLAS                 12
 6 AR    FRANKLIN                5
 7 AR    GREENE                 15
 8 AR    JACKSON                13
 9 AR    JEFFERSON             361
10 AR    LAWRENCE               32
# ℹ 1,211 more rows
Code
railroad_data[duplicated(railroad_data$county), ]
# A tibble: 1,221 × 3
   state county    total_employees
   <chr> <chr>               <dbl>
 1 AP    APO                     1
 2 AR    CALHOUN                 5
 3 AR    CLAY                   13
 4 AR    CLEBURNE                8
 5 AR    DALLAS                 12
 6 AR    FRANKLIN                5
 7 AR    GREENE                 15
 8 AR    JACKSON                13
 9 AR    JEFFERSON             361
10 AR    LAWRENCE               32
# ℹ 1,211 more rows
Code
railroad_data[railroad_data$county == 'APO',]
# A tibble: 2 × 3
  state county total_employees
  <chr> <chr>            <dbl>
1 AE    APO                  2
2 AP    APO                  1
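The duplicate check can also be sketched at the level of the state and county pair, which is what should identify a row. The tibble below is a hypothetical sample built from rows shown in this post; the names railroad_sample, repeated_counties, and pair_is_unique are illustrative only.

```r
library(dplyr)

# Sample rows copied from the outputs in this post (assumption: not the full data).
railroad_sample <- tibble(
  state = c("AE", "AP", "AL", "AR", "AR"),
  county = c("APO", "APO", "AUTAUGA", "CALHOUN", "CLAY"),
  total_employees = c(2, 1, 102, 5, 13)
)

# County names that occur in more than one row.
repeated_counties <- railroad_sample %>%
  count(county, name = "n_rows") %>%
  filter(n_rows > 1)

# TRUE when no state and county combination appears twice.
pair_is_unique <- !any(duplicated(railroad_sample[c("state", "county")]))

repeated_counties
pair_is_unique
```

If pair_is_unique is TRUE on the full data, state and county together can serve as a row key.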
Code
railroad_data %>% arrange(desc(total_employees))
# A tibble: 2,930 × 3
   state county           total_employees
   <chr> <chr>                      <dbl>
 1 IL    COOK                        8207
 2 TX    TARRANT                     4235
 3 NE    DOUGLAS                     3797
 4 NY    SUFFOLK                     3685
 5 VA    INDEPENDENT CITY            3249
 6 FL    DUVAL                       3073
 7 CA    SAN BERNARDINO              2888
 8 CA    LOS ANGELES                 2545
 9 TX    HARRIS                      2535
10 NE    LINCOLN                     2289
# ℹ 2,920 more rows
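The sorted view shows the largest counties, but the claim about counties with a single employee is easier to check with a filter. This is a minimal sketch on a hypothetical sample of rows from this post; smallest and largest are illustrative names.

```r
library(dplyr)

# Sample rows copied from the outputs in this post (assumption: not the full data).
railroad_sample <- tibble(
  state = c("IL", "TX", "AK", "AL", "AP"),
  county = c("COOK", "TARRANT", "SITKA", "BARBOUR", "APO"),
  total_employees = c(8207, 4235, 1, 1, 1)
)

# All rows tied for the minimum number of employees.
smallest <- railroad_sample %>%
  filter(total_employees == min(total_employees))

# The single row with the maximum number of employees.
largest <- railroad_sample %>%
  slice_max(total_employees, n = 1)

smallest
largest
```

On the full data, nrow(smallest) would give the number of counties with the minimum count.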