Challenge 1 Solution

challenge_1

railroads

faostat

wildbirds

Reading in data and creating a post

Author

Shreya Varma

Published

June 17, 2023

Code

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

Challenge Overview

Today’s challenge is to

read in a dataset, and
describe the dataset using both words and any supporting information (e.g., tables)

Read in the Data

Solution: Because the file is a CSV, I use the read_csv() function to read the data, then view the first 10 rows to check that the data and headers were imported correctly.

Code

railroad_data <- read_csv("_data/railroad_2012_clean_county.csv")
head(railroad_data, 10)

# A tibble: 10 × 3
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1

Describe the data

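Solution: This looks like a dataset of the number of railroad employees in the US at the state and county level; the file name suggests the figures are for 2012. It has 2,930 rows and three columns: state, county, and total_employees. Cook County, Illinois, has the most employees (8,207), and several counties have only 1 employee, which is the minimum. The mean number of employees per county is about 87, while the median is only 21, so a few large counties pull the mean up. County names are not unique: 1,221 rows repeat a county name that already appears earlier in the data (for example, APO appears under both AE and AP), so the state is needed to identify a county. In total there are 255,432 employees across 53 distinct state codes. That is more than the 50 US states because the column also contains codes such as AE and AP, which are military postal codes rather than states. There are no missing values in any of the columns.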

Code

spec(railroad_data)

cols(
  state = col_character(),
  county = col_character(),
  total_employees = col_double()
)

Code

summary(railroad_data)

    state              county          total_employees  
 Length:2930        Length:2930        Min.   :   1.00  
 Class :character   Class :character   1st Qu.:   7.00  
 Mode  :character   Mode  :character   Median :  21.00  
                                       Mean   :  87.18  
                                       3rd Qu.:  65.00  
                                       Max.   :8207.00

Code

n_distinct(railroad_data$state)

[1] 53
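To see why the count is higher than 50, one option is to compare the codes against base R's built-in state.abb vector, which holds the 50 US state abbreviations. The snippet below is only a sketch: sample_states is a hypothetical vector holding a few codes copied from the rows shown in this post, not the full column.

```r
# Illustrative sample only: a few state codes copied from rows shown in this post.
sample_states <- c("AE", "AK", "AL", "AP", "AR", "IL", "TX")

# state.abb is base R's built-in vector of the 50 US state abbreviations.
# setdiff() keeps the codes in the sample that are not among those 50.
extra_codes <- setdiff(sample_states, state.abb)
print(extra_codes)
```

On the full data the same idea would be setdiff(unique(railroad_data$state), state.abb), which should list whichever codes are not regular states.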

Code

sum(railroad_data$total_employees)

[1] 255432

Code

colSums(is.na(railroad_data))

          state          county total_employees 
              0               0               0

Code

any(duplicated(railroad_data$county))

[1] TRUE

Code

subset(railroad_data, duplicated(county))

# A tibble: 1,221 × 3
   state county    total_employees
   <chr> <chr>               <dbl>
 1 AP    APO                     1
 2 AR    CALHOUN                 5
 3 AR    CLAY                   13
 4 AR    CLEBURNE                8
 5 AR    DALLAS                 12
 6 AR    FRANKLIN                5
 7 AR    GREENE                 15
 8 AR    JACKSON                13
 9 AR    JEFFERSON             361
10 AR    LAWRENCE               32
# ℹ 1,211 more rows

Code

railroad_data[duplicated(railroad_data$county), ]

# A tibble: 1,221 × 3
   state county    total_employees
   <chr> <chr>               <dbl>
 1 AP    APO                     1
 2 AR    CALHOUN                 5
 3 AR    CLAY                   13
 4 AR    CLEBURNE                8
 5 AR    DALLAS                 12
 6 AR    FRANKLIN                5
 7 AR    GREENE                 15
 8 AR    JACKSON                13
 9 AR    JEFFERSON             361
10 AR    LAWRENCE               32
# ℹ 1,211 more rows

Code

railroad_data[railroad_data$county == "APO", ]

# A tibble: 2 × 3
  state county total_employees
  <chr> <chr>            <dbl>
1 AE    APO                  2
2 AP    APO                  1
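Since a county name alone does not identify a row, a natural follow-up is to check whether the combination of state and county does. The sketch below uses a hypothetical sample_rows tibble built from rows shown in this post, so it only demonstrates the check and says nothing about the full dataset.

```r
library(dplyr)

# Illustrative sample only: rows copied from the outputs shown in this post.
sample_rows <- tibble(
  state = c("AE", "AP", "AK", "AL"),
  county = c("APO", "APO", "ANCHORAGE", "AUTAUGA"),
  total_employees = c(2, 1, 7, 102)
)

# County names that occur more than once when the state is ignored.
county_repeats <- sample_rows %>%
  count(county) %>%
  filter(n > 1)

# State/county pairs that occur more than once (none in this sample).
pair_repeats <- sample_rows %>%
  count(state, county) %>%
  filter(n > 1)

print(county_repeats)
print(nrow(pair_repeats))
```

Running the same two count() calls on railroad_data would show whether state and county together form a unique key.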

Code

railroad_data %>% arrange(desc(total_employees))

# A tibble: 2,930 × 3
   state county           total_employees
   <chr> <chr>                      <dbl>
 1 IL    COOK                        8207
 2 TX    TARRANT                     4235
 3 NE    DOUGLAS                     3797
 4 NY    SUFFOLK                     3685
 5 VA    INDEPENDENT CITY            3249
 6 FL    DUVAL                       3073
 7 CA    SAN BERNARDINO              2888
 8 CA    LOS ANGELES                 2545
 9 TX    HARRIS                      2535
10 NE    LINCOLN                     2289
# ℹ 2,920 more rows
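Sorting shows the largest counties, but the counties at the minimum are easier to pull out with a filter. This is a minimal sketch on a hypothetical sample_rows tibble made from rows shown in this post; the same calls should work on railroad_data.

```r
library(dplyr)

# Illustrative sample only: rows copied from the outputs shown in this post.
sample_rows <- tibble(
  state = c("AK", "AK", "AL", "AL", "IL"),
  county = c("SITKA", "JUNEAU", "BARBOUR", "AUTAUGA", "COOK"),
  total_employees = c(1, 3, 1, 102, 8207)
)

# All rows tied for the smallest employee count.
smallest <- sample_rows %>%
  filter(total_employees == min(total_employees))

# The single row with the largest employee count.
largest <- sample_rows %>%
  slice_max(total_employees, n = 1)

print(smallest)
print(largest)
```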