Challenge 2 Submission

challenge_2

railroads

faostat

hotel_bookings

Data wrangling: using group() and summarise()

Author

Danny Holt

Published

May 30, 2023

Code

library(tidyverse)
library(dplyr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in the Data

For this challenge, I will read in the data set railroad_2012_clean_county.csv.

Code

  railroad <- readr::read_csv("_data/railroad_2012_clean_county.csv")
  railroad

# A tibble: 2,930 × 3
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# ℹ 2,920 more rows

Describe the data

We can see that there are three variables in the dataset: state, county, and total employees. Based on the name of the file, we can infer that the data shows the state, county, and number of employees of a location on a railroad (or multiple railroads) in 2012. I would guess that this data came directly from the railroad operator(s).

First, we can use the glimpse() function to take a quick look at the data.

Code

  glimpse(railroad)

Rows: 2,930
Columns: 3
$ state           <chr> "AE", "AK", "AK", "AK", "AK", "AK", "AK", "AL", "AL", …
$ county          <chr> "APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU", …
$ total_employees <dbl> 2, 7, 2, 3, 2, 1, 88, 102, 143, 1, 25, 154, 13, 29, 45…

To look at the distinct states in the data, we can see that there are 53 “states.” If we look at the data, this includes “states” beyond the 50 in the US, including “DC,” likely Washington, D.C., as wells as “AE” and AP.”

Code

  railroad %>%
    select(state) %>%
    n_distinct()

[1] 53

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

Let’s look at the central tendency and dispersion of employee data.

Let’s find the mean number of employees per railroad location:

Code

  summarize(railroad, mean.employees=mean(total_employees), na.rm=TRUE)

# A tibble: 1 × 2
  mean.employees na.rm
           <dbl> <lgl>
1           87.2 TRUE

Next, we’ll examine the median:

Code

  summarize(railroad, median.employees=median(total_employees), na.rm=TRUE)

# A tibble: 1 × 2
  median.employees na.rm
             <dbl> <lgl>
1               21 TRUE

Here, we see that the median number of employees is significantly lower than the median. Let’s look at the standard deviation and interquartile range for more clarity:

Code

  summarize(railroad, employees.sd=sd(total_employees), employees.IQR=IQR(total_employees), na.rm=TRUE)

# A tibble: 1 × 3
  employees.sd employees.IQR na.rm
         <dbl>         <dbl> <lgl>
1         284.            58 TRUE

The standard deviation of 284 suggests high variance in the numberof employees by location. The IQR of 58 tells us that 50% of the employee numer data is clustered within 29 employees more or less than our median of 21.

Next, let’s look at how many employees are in each state.

Code

  railroad %>%
  group_by(state) %>%
  summarize(employees_per_state=sum(total_employees))

# A tibble: 53 × 2
   state employees_per_state
   <chr>               <dbl>
 1 AE                      2
 2 AK                    103
 3 AL                   4257
 4 AP                      1
 5 AR                   3871
 6 AZ                   3153
 7 CA                  13137
 8 CO                   3650
 9 CT                   2592
10 DC                    279
# ℹ 43 more rows

Now, let’s look at the statewide average numbers of employees (from all locations within each state), from highest to lowest.

Code

  railroad %>%
  group_by(state) %>%
  summarize(avg_employees_per_state=(mean(total_employees)))

# A tibble: 53 × 2
   state avg_employees_per_state
   <chr>                   <dbl>
 1 AE                        2  
 2 AK                       17.2
 3 AL                       63.5
 4 AP                        1  
 5 AR                       53.8
 6 AZ                      210. 
 7 CA                      239. 
 8 CO                       64.0
 9 CT                      324  
10 DC                      279  
# ℹ 43 more rows

Finally, we can look at employees by region (note that the strange state designations like “AE” will not fit into regions):

Code

#mutate to code regions
railroad <- railroad %>%
  mutate(region=case_when(
    state == "CT" | state =="ME" | state =="MA" | state =="NH" | state =="RI" | state =="VT" | state =="NJ" | state =="NY" | state =="PA" ~ "Northeast",
    state == "IN" | state =="IL" | state =="MI" | state =="OH" | state =="WI" | state =="IA" | state =="KS" | state =="MN" | state =="MO" | state =="NE" | state =="ND" | state =="SD" ~ "Midwest",
    state=="DE" | state =="DC" | state =="FL" | state =="GA" | state =="MD" | state =="NC" | state =="SC" | state =="VA" | state =="WV" | state =="AL" | state =="KY" | state =="MS" | state =="TN" | state =="AR" | state =="LA" | state =="OK" | state =="TX" ~ "South",
    state=="AZ" | state =="CO" | state =="ID" | state =="NM" | state =="MT" | state =="UT" | state =="NV" | state =="WY" | state =="AK" | state =="CA" | state =="HI" | state =="OR" | state =="WA" ~ "West")
    )
#summarize to view regional breakdown
  railroad %>%
  group_by(region) %>%
  summarize(employees_per_region=sum(total_employees, na.rm=TRUE))

# A tibble: 5 × 2
  region    employees_per_region
  <chr>                    <dbl>
1 Midwest                  84755
2 Northeast                45912
3 South                    84784
4 West                     39978
5 <NA>                         3

This shows us that, relatively speaking, railroad employees are clustered more in the South and Midwest than in the Northeast and West of the US.

Looking at the data by state and region allows us to look at the data on a medium level between its zoomed-out original state and the zoomed-in county level. This approach gives the viewer a stronger sense of the geographical distribution of employees across the country.