Challenge 2 Submission

challenge_2

Data wrangling: using group() and summarise()

Author

Tanmay Agrawal

Published

December 21, 2022

Code

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

railroad*.csv or StateCounty2012.xlsx ⭐
FAOstat*.csv ⭐⭐⭐
hotel_bookings ⭐⭐⭐⭐

Code

# load the libs
library(readr)
library(readxl)

# read the data using standard csv loading function
data = read_csv("../posts/_data/railroad_2012_clean_county.csv")

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

Code

# describe the data using str, brief summary of the columns, datatypes, sizes tell us that there are 3 columns with 2930 rows.
str(data)

spc_tbl_ [2,930 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ state          : chr [1:2930] "AE" "AK" "AK" "AK" ...
 $ county         : chr [1:2930] "APO" "ANCHORAGE" "FAIRBANKS NORTH STAR" "JUNEAU" ...
 $ total_employees: num [1:2930] 2 7 2 3 2 1 88 102 143 1 ...
 - attr(*, "spec")=
  .. cols(
  ..   state = col_character(),
  ..   county = col_character(),
  ..   total_employees = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

Code

# show the first few entries using the head command
head(data)

# A tibble: 6 × 3
  state county               total_employees
  <chr> <chr>                          <dbl>
1 AE    APO                                2
2 AK    ANCHORAGE                          7
3 AK    FAIRBANKS NORTH STAR               2
4 AK    JUNEAU                             3
5 AK    MATANUSKA-SUSITNA                  2
6 AK    SITKA                              1

Code

# from a cursory analysis, it looks like the dataset describes the number of rail road employees by counties and their corresponding states.


# We can show the top-3 counties along with their states with the highest `total_employees` count
data %>%
  distinct(state, total_employees) %>%
  arrange(desc(total_employees)) %>%
  top_n(3)

# A tibble: 3 × 2
  state total_employees
  <chr>           <dbl>
1 IL               8207
2 TX               4235
3 NE               3797

Code

# Similarly we could also look at the bottom 3.
data %>%
  distinct(state, total_employees) %>%
  arrange(total_employees) %>%
  head(3)

# A tibble: 3 × 2
  state total_employees
  <chr>           <dbl>
1 AK                  1
2 AL                  1
3 AP                  1

We can also look at the distinct states, turns out they have more than 50 unique entries in the state column. This means that the state column has some additional entries that represent places that can be considered a state for all intents and purposes for railroad employee data. These could be overseas territories.

Code

data %>%
  distinct(state)

# A tibble: 53 × 1
   state
   <chr>
 1 AE   
 2 AK   
 3 AL   
 4 AP   
 5 AR   
 6 AZ   
 7 CA   
 8 CO   
 9 CT   
10 DC   
# … with 43 more rows

Overall the dataset is a simple record of railroad employee by state and counties which could be used to allocate resources to these states based on their needs and requirements.

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

Code

# We look at the states arranged by the standard deviation of total_employee count among the counties in that state alongside other central tendencies.

data %>%
  group_by(state) %>%
  summarise(mean_=mean(total_employees),
            median_=median(total_employees),
            sd_=sd(`total_employees`)) %>%
              arrange(desc(sd_)
            )

# A tibble: 53 × 4
   state mean_ median_   sd_
   <chr> <dbl>   <dbl> <dbl>
 1 IL    186.     42    829.
 2 DE    498.    158    674.
 3 NY    280.     71    591.
 4 CA    239.     61    549.
 5 CT    324     125    520.
 6 NE    148.     15    512.
 7 FL    111.     20    386.
 8 TX     89.8    17    350.
 9 VA     82.1    25.5  341.
10 NJ    397.    296    338.
# … with 43 more rows

We can use an arbitrary threshold of sd>300 to check which states have the biggest disparity in terms of employee counts in different counties. We also only look at the states with a mean employee count of at most 100 to better investigate states where the high spread is a function of very high variability in the employee counts of the counties of those states.

Code

data %>%
  group_by(state) %>%
  summarise(mean_=mean(total_employees),
            median_=median(total_employees),
            sd_=sd(`total_employees`)) %>%
              filter(mean_<100, sd_>300) %>%
              arrange(desc(sd_))

# A tibble: 2 × 4
  state mean_ median_   sd_
  <chr> <dbl>   <dbl> <dbl>
1 TX     89.8    17    350.
2 VA     82.1    25.5  341.

We get 2 such states, Texas (TX) and Virginia (VA). We can do further analysis on these two states.

Code

data %>%
  filter(`state`=='TX' | `state`=='VA') %>%
  arrange(total_employees) # 313 total counties

# A tibble: 313 × 3
   state county   total_employees
   <chr> <chr>              <dbl>
 1 TX    BAYLOR                 1
 2 TX    CONCHO                 1
 3 TX    CROCKETT               1
 4 TX    DIMMIT                 1
 5 TX    DONLEY                 1
 6 TX    DUVAL                  1
 7 TX    EDWARDS                1
 8 TX    FOARD                  1
 9 TX    HANSFORD               1
10 TX    KNOX                   1
# … with 303 more rows

Code

data %>%
  filter(`state`=='TX') %>%
  arrange(desc(total_employees)) # 221 total counties with "Tarrant" and "Harris" accounting for most of the disparity

# A tibble: 221 × 3
   state county     total_employees
   <chr> <chr>                <dbl>
 1 TX    TARRANT               4235
 2 TX    HARRIS                2535
 3 TX    BEXAR                  950
 4 TX    POTTER                 883
 5 TX    EL PASO                863
 6 TX    MONTGOMERY             474
 7 TX    JOHNSON                429
 8 TX    BELL                   413
 9 TX    DALLAS                 406
10 TX    DENTON                 394
# … with 211 more rows

Code

data %>%
  filter(`state`=='VA') %>%
  arrange(desc(total_employees)) # 92 counties with "Independent City" being the anomaly. This could mean that these employees aren't counted as part of any of the counties, hence the high spread.

# A tibble: 92 × 3
   state county           total_employees
   <chr> <chr>                      <dbl>
 1 VA    INDEPENDENT CITY            3249
 2 VA    BEDFORD                      395
 3 VA    BOTETOURT                    288
 4 VA    CHESTERFIELD                 228
 5 VA    FAIRFAX                      225
 6 VA    FRANKLIN                     202
 7 VA    TAZEWELL                     197
 8 VA    ROANOKE                      193
 9 VA    RUSSELL                      172
10 VA    HENRICO                      128
# … with 82 more rows

Explain and Interpret

We observed some interesting things in this data. We started by looking at all of the data as in challenge 1 and learn that the dataset is a record of employee counts who work in the railroad industry grouped by counties and their corresponding states, an interesting observation was that the state variable has 53 unique entries which means that the term is used loosely here and it also includes overseas employees.

After that we used grouped summary statistics to look at the states with the highest and lowest employee counts along with the mean/median employee counts. We observed that some states have a high dispersion of employee count values across counties. This could be a useful metric to determine which counties are understaffed compared to what they need. Texas and Virginia jumped out to me during the analysis because the mean employee count was low and the standard deviation was very high. This could imply disproportionate allocation of resources, which seems true in the case of Texas but further analysis showed that Virginia uses a catchall term called Independent City to count employees that don’t necessarily belong to a county. This accounts for most of the spread.