challenge_2
railroads
poobigan murugesan
Data wrangling: using group() and summarise()
Author

Poobigan Murugesan

Published

May 9, 2023

Code
library(tidyverse)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

  • railroad*.csv or StateCounty2012.xls ⭐
  • FAOstat*.csv or birds.csv ⭐⭐⭐
  • hotel_bookings.csv ⭐⭐⭐⭐
Code
df<- read_csv("_data/railroad_2012_clean_county.csv")
print(df,show_col_types=FALSE)
# A tibble: 2,930 × 3
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# ℹ 2,920 more rows

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

Code
dim(df)
[1] 2930    3

The dataset consists of 2930 rows and 3 columns.

By using the ‘str’ function, we can obtain information about the data types, lengths, contents, and other relevant information pertaining to each column. The dataset is comprised of three distinct columns: ‘State’ (which is of character type), ‘County’ (also of character type), and ‘Total Employees’ (which is of numeric type). From the dataset, we can infer that it pertains to railroad employees from various states and counties.

Code
str(df)
spc_tbl_ [2,930 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ state          : chr [1:2930] "AE" "AK" "AK" "AK" ...
 $ county         : chr [1:2930] "APO" "ANCHORAGE" "FAIRBANKS NORTH STAR" "JUNEAU" ...
 $ total_employees: num [1:2930] 2 7 2 3 2 1 88 102 143 1 ...
 - attr(*, "spec")=
  .. cols(
  ..   state = col_character(),
  ..   county = col_character(),
  ..   total_employees = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

Number of unique states in the data

Code
count(distinct(select(df, state)))
# A tibble: 1 × 1
      n
  <int>
1    53

The dataset contains information on the total number of railroad employees across 53 distinct states.

Which county has the maximum number of employees?

Code
df %>% summarise(employee_count = max(total_employees), maxCounty = county[which.max(total_employees)])
# A tibble: 1 × 2
  employee_count maxCounty
           <dbl> <chr>    
1           8207 COOK     

The county that has the highest count of railroad workers in this dataset is COOK, with a total of 8207 employees.

Which county has the minimum number of employees?

Code
df %>% summarize(employee_count = min(total_employees), minCounty = county[which(total_employees == min(total_employees))])
# A tibble: 145 × 2
   employee_count minCounty
            <dbl> <chr>    
 1              1 SITKA    
 2              1 BARBOUR  
 3              1 HENRY    
 4              1 APO      
 5              1 NEWTON   
 6              1 MONO     
 7              1 BENT     
 8              1 CHEYENNE 
 9              1 COSTILLA 
10              1 DOLORES  
# ℹ 135 more rows

Based on the above output, we can determine that the minimum number of railroad employees in any county is 1, and there are a total of 145 counties with only one employee.

Grouping railroad employees by state to figure out which states have the most railroad employees.

Code
group <- df %>%
  group_by(state) %>%
  summarize(employees=sum(total_employees)) %>%
  arrange(desc(employees))
group
# A tibble: 53 × 2
   state employees
   <chr>     <dbl>
 1 TX        19839
 2 IL        19131
 3 NY        17050
 4 NE        13176
 5 CA        13137
 6 PA        12769
 7 OH         9056
 8 GA         8605
 9 IN         8537
10 MO         8419
# ℹ 43 more rows

Based on the results shown above, we can see that the state of Texas has the largest number of railroad employees, with a count of 19839 followed by Illinois and New York with 19131 and 17050 employees respectively. Also, from the dimensions of the group dataset we can conclude that there are 53 distinct states where people are employed to work railroads.

Grouping the data by state and calculating the mean of total employees in each state based on the number of counties.

Code
group <- df %>%
  group_by(state) %>%
  summarize(mean=mean(total_employees)) %>%
  arrange(desc(mean))
group
# A tibble: 53 × 2
   state  mean
   <chr> <dbl>
 1 DE     498.
 2 NJ     397.
 3 CT     324 
 4 MA     282.
 5 NY     280.
 6 DC     279 
 7 CA     239.
 8 AZ     210.
 9 PA     196.
10 MD     196.
# ℹ 43 more rows

Based on the results above, we can conclude that the state of DE has the highest average number of railroad employees per county, which is 498

Grouping the data by state and calculating the median of total employees in each state based on the number of counties.

Code
group <- df %>%
  group_by(state) %>%
  summarize(median=median(total_employees)) %>%
  arrange(desc(median))
group
# A tibble: 53 × 2
   state median
   <chr>  <dbl>
 1 NJ      296 
 2 DC      279 
 3 MA      271 
 4 DE      158 
 5 CT      125 
 6 MD      108.
 7 AZ       94 
 8 PA       85 
 9 NY       71 
10 CA       61 
# ℹ 43 more rows

Mode - Extracting the state with highest number of counties with railroad employees.

Code
df %>%
  group_by(state) %>%
  summarize(counties = n()) %>%
  arrange(desc(counties)) %>%
  slice(1)
# A tibble: 1 × 2
  state counties
  <chr>    <int>
1 TX         221

The result of this specific code shows that Texas (TX) has the highest number of counties with railroad employees, with a count of 221.

Extracting information about railroad employees specifically from the counties in TX using a filter operation.

Code
counties_tx <-filter(df, state=='TX')
counties_tx
# A tibble: 221 × 3
   state county    total_employees
   <chr> <chr>               <dbl>
 1 TX    ANDERSON              241
 2 TX    ANDREWS                 3
 3 TX    ANGELINA               53
 4 TX    ARANSAS                 6
 5 TX    ARCHER                  8
 6 TX    ARMSTRONG              12
 7 TX    ATASCOSA               64
 8 TX    AUSTIN                 35
 9 TX    BAILEY                  5
10 TX    BANDERA                15
# ℹ 211 more rows

Double-checking the number of counties in TX, the filter and count commands confirm that there are indeed 221 counties with railroad employees in TX as previously determined.

Grouping the data by state and calculating the median of total employees in each state based on the number of counties.

Code
group <- df %>%
  group_by(state) %>%
  summarize(std=sd(total_employees)) %>%
  arrange(desc(std))
group
# A tibble: 53 × 2
   state   std
   <chr> <dbl>
 1 IL     829.
 2 DE     674.
 3 NY     591.
 4 CA     549.
 5 CT     520.
 6 NE     512.
 7 FL     386.
 8 TX     350.
 9 VA     341.
10 NJ     338.
# ℹ 43 more rows

Grouping data by state and analyzing its summary

Code
group <- df %>%
  group_by(state)

summary(group)
    state              county          total_employees  
 Length:2930        Length:2930        Min.   :   1.00  
 Class :character   Class :character   1st Qu.:   7.00  
 Mode  :character   Mode  :character   Median :  21.00  
                                       Mean   :  87.18  
                                       3rd Qu.:  65.00  
                                       Max.   :8207.00  

Based on the quartile distributions, we can conclude that many states have a low number of employees, indicating that the distribution of employees among states is not evenly distributed and is instead skewed.

Explain and Interpret

I selected this specific group to analyze the distribution of railroad employees by state. The data shows that there are employees in 53 states, and the county with the highest number of employees is COOK in IL with 8207 employees, while 145 counties have only one railroad employee.

When we grouped the data by state, we found that TX has the highest number of railroad employees with 19839, while AP has the lowest with only one employee. DE has the highest mean number of employees with 498 per county, and AP has the lowest mean number with only one employee. NJ has the highest median number of employees per county with 296. Additionally, we found that TX has the maximum number of entries, which is the mode of the number of counties with 221 counties.