Challenge-2

challenge_2

railroads

Author

Said Arslan

Published

September 20, 2022

Code

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2

Warning: package 'ggplot2' was built under R version 4.2.2

Warning: package 'stringr' was built under R version 4.2.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Code

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in the Data

I have picked the railroad data for this challenge. It records the number of railroad employees in each county in 2012.

Code

railroad <- read.csv("_data/railroad_2012_clean_county.csv")

Describe the data

Code

dim(railroad)

[1] 2930    3

Code

head(railroad)

  state               county total_employees
1    AE                  APO               2
2    AK            ANCHORAGE               7
3    AK FAIRBANKS NORTH STAR               2
4    AK               JUNEAU               3
5    AK    MATANUSKA-SUSITNA               2
6    AK                SITKA               1

Code

summary(railroad)

    state              county          total_employees  
 Length:2930        Length:2930        Min.   :   1.00  
 Class :character   Class :character   1st Qu.:   7.00  
 Mode  :character   Mode  :character   Median :  21.00  
                                       Mean   :  87.18  
                                       3rd Qu.:  65.00  
                                       Max.   :8207.00

In the dataset there are 2930 rows (observations) and 3 columns (variables). Each row gives the number of railroad employees in a county of a state.

Code

sum(is.na(railroad$state))

[1] 0

Code

sum(is.na(railroad$county))

[1] 0

Code

sum(is.na(railroad$total_employees))

[1] 0

There are no missing values.
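The three per-column checks above can also be collapsed into a single call. This is a minimal sketch, assuming a small stand-in data frame (`railroad_sample`, a hypothetical name) built from the six rows printed by head(railroad), so that it runs without the CSV file:

```r
# Stand-in for the full dataset: the six rows shown by head(railroad) above.
# `railroad_sample` is a hypothetical name used only for this sketch.
railroad_sample <- data.frame(
  state = c("AE", "AK", "AK", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
             "MATANUSKA-SUSITNA", "SITKA"),
  total_employees = c(2L, 7L, 2L, 3L, 2L, 1L)
)

# is.na() on a data frame returns a logical matrix;
# colSums() then counts the TRUE values (missing entries) in each column.
na_counts <- colSums(is.na(railroad_sample))
na_counts
```

On the full data, `colSums(is.na(railroad))` should reproduce the three zeros reported above in one step.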

Code

n_distinct(railroad$state)

[1] 53

Code

n_distinct(railroad$county)

[1] 1709

I would expect 51 distinct values in the state column (50 states plus the District of Columbia), but there are 53.

Code

unique(railroad$state)

 [1] "AE" "AK" "AL" "AP" "AR" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "GA" "HI" "IA"
[16] "ID" "IL" "IN" "KS" "KY" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC"
[31] "ND" "NE" "NH" "NJ" "NM" "NV" "NY" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN"
[46] "TX" "UT" "VA" "VT" "WA" "WI" "WV" "WY"

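All are state abbreviations (or “DC” for the District of Columbia) except “AE” and “AP”, which are the military postal codes for Armed Forces Europe and Armed Forces Pacific.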
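The unexpected codes can also be found programmatically instead of by eye. The sketch below compares the 53 codes printed above against `state.abb`, the vector of 50 state abbreviations that ships with base R, plus "DC" added by hand. The variable names `codes` and `non_state_codes` are assumptions made for this example.

```r
# The 53 codes printed by unique(railroad$state) above.
codes <- c("AE", "AK", "AL", "AP", "AR", "AZ", "CA", "CO", "CT", "DC", "DE",
           "FL", "GA", "HI", "IA", "ID", "IL", "IN", "KS", "KY", "LA", "MA",
           "MD", "ME", "MI", "MN", "MO", "MS", "MT", "NC", "ND", "NE", "NH",
           "NJ", "NM", "NV", "NY", "OH", "OK", "OR", "PA", "RI", "SC", "SD",
           "TN", "TX", "UT", "VA", "VT", "WA", "WI", "WV", "WY")

# state.abb (datasets package, loaded by default) holds the 50 state abbreviations.
# "DC" is appended manually because it is not a state.
non_state_codes <- setdiff(codes, c(state.abb, "DC"))
non_state_codes
```

With the full data loaded, the same check would be `setdiff(unique(railroad$state), c(state.abb, "DC"))`.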

On the other hand, there are 2930 observations but only 1709 distinct county names, which implies that many county names are shared across states.
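One way to check this inference is to count how many states each county name appears in, and to confirm that every (state, county) pair occurs only once. This is a minimal sketch on a hypothetical toy table (`toy`, with placeholder codes S1 to S3 and counties A to C, none of which are real data). On the full dataset, `railroad` would take the place of `toy`.

```r
library(dplyr)

# Hypothetical toy table: placeholder states and counties, not real data.
toy <- tibble(
  state = c("S1", "S1", "S2", "S2", "S3"),
  county = c("A", "B", "A", "C", "A"),
  total_employees = c(1L, 2L, 3L, 4L, 5L)
)

# County names that occur in more than one row (here: in more than one state).
shared <- toy %>%
  count(county, name = "n_states") %>%
  filter(n_states > 1)
shared

# TRUE when no (state, county) combination is repeated.
pairs_unique <- nrow(distinct(toy, state, county)) == nrow(toy)
pairs_unique
```

If `pairs_unique` is TRUE on the real data, the gap between 2930 rows and 1709 names comes entirely from names reused across states rather than from duplicated rows.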

Provide Grouped Summary Statistics

Code

sum(railroad$total_employees)

[1] 255432

In total, the dataset records 255,432 railroad employees for 2012.

Code

railroad %>% 
  group_by(state) %>% 
  # total per state, then that total as a percentage of all employees
  summarise(total_employees = sum(total_employees),
            proportion = round(total_employees / sum(railroad$total_employees) * 100, 1)) %>% 
  arrange(desc(total_employees))

# A tibble: 53 × 3
   state total_employees proportion
   <chr>           <int>      <dbl>
 1 TX              19839        7.8
 2 IL              19131        7.5
 3 NY              17050        6.7
 4 NE              13176        5.2
 5 CA              13137        5.1
 6 PA              12769        5  
 7 OH               9056        3.5
 8 GA               8605        3.4
 9 IN               8537        3.3
10 MO               8419        3.3
# … with 43 more rows

The three states with the most railroad employees are Texas, Illinois, and New York. Texas alone accounts for 7.8% of all railroad employees in the dataset.

Code

railroad %>% 
  group_by(state, county) %>% 
  summarise(total_employees= sum(total_employees)) %>% 
  arrange(desc(total_employees)) %>% 
  head()

# A tibble: 6 × 3
# Groups:   state [6]
  state county           total_employees
  <chr> <chr>                      <int>
1 IL    COOK                        8207
2 TX    TARRANT                     4235
3 NE    DOUGLAS                     3797
4 NY    SUFFOLK                     3685
5 VA    INDEPENDENT CITY            3249
6 FL    DUVAL                       3073

Cook County, Illinois, has the highest number of employees, with 8,207.

Explain and Interpret

Geographically large and populous states such as Texas and Illinois have more railroad employment, which makes sense. Merging this dataset with others that describe the states, for example their geography, population, or length of railroad track, would allow more interesting analysis.

Challenge Overview

Today’s challenge is to

read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

railroad*.csv or StateCounty2012.xls ⭐
FAOstat*.csv or birds.csv ⭐⭐⭐
hotel_bookings.csv ⭐⭐⭐⭐

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.
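As a rough illustration of the kind of grouped summary described here, the sketch below computes central tendency and dispersion of county-level employment per state. It assumes a small stand-in data frame (`railroad_sample`, a hypothetical name) made from the six rows printed by head(railroad), so the figures describe only those rows, not the full dataset.

```r
library(dplyr)

# Stand-in for the full dataset: the six rows shown by head(railroad).
# `railroad_sample` is a hypothetical name used only for this sketch.
railroad_sample <- data.frame(
  state = c("AE", "AK", "AK", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
             "MATANUSKA-SUSITNA", "SITKA"),
  total_employees = c(2L, 7L, 2L, 3L, 2L, 1L)
)

# One row per state: number of counties plus centre and spread
# of the county-level employee counts.
by_state <- railroad_sample %>%
  group_by(state) %>%
  summarise(
    n_counties = n(),
    mean_employees = mean(total_employees),
    median_employees = median(total_employees),
    sd_employees = sd(total_employees),   # NA when a state has a single county
    min_employees = min(total_employees),
    max_employees = max(total_employees),
    .groups = "drop"
  )
by_state
```

A mean well above the median within a state would suggest that a few large counties dominate, which matches the skew visible in the overall summary (mean 87.18 against a median of 21).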

Explain and Interpret

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.