challenge_2
maanusri balasubramanian
railroads
Data wrangling: using group() and summarise()
Author

Maanusri Balasubramanian

Published

May 3, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

  • railroad*.csv or StateCounty2012.xls ⭐
  • FAOstat*.csv or birds.csv ⭐⭐⭐
  • hotel_bookings.csv ⭐⭐⭐⭐

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

Code
# reading the CSV file
rr <- read_csv("_data/railroad_2012_clean_county.csv")

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

Code
# printing number of rows and columns in the dataset
dim(rr)
[1] 2930    3
Code
# description of the dataset
str(rr)
spc_tbl_ [2,930 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ state          : chr [1:2930] "AE" "AK" "AK" "AK" ...
 $ county         : chr [1:2930] "APO" "ANCHORAGE" "FAIRBANKS NORTH STAR" "JUNEAU" ...
 $ total_employees: num [1:2930] 2 7 2 3 2 1 88 102 143 1 ...
 - attr(*, "spec")=
  .. cols(
  ..   state = col_character(),
  ..   county = col_character(),
  ..   total_employees = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

From the above results, we know that the dataset has 2930 rows and 3 columns.

Code
colnames(rr)
[1] "state"           "county"          "total_employees"

The 3 columns in the dataset are: state, county and total_employees.

Code
head(rr)
# A tibble: 6 × 3
  state county               total_employees
  <chr> <chr>                          <dbl>
1 AE    APO                                2
2 AK    ANCHORAGE                          7
3 AK    FAIRBANKS NORTH STAR               2
4 AK    JUNEAU                             3
5 AK    MATANUSKA-SUSITNA                  2
6 AK    SITKA                              1

We can infer that the dataset gives us the total number of railroad employees in a county in a state in 2012 from this.

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

Code
# Count of unique states in the dataset
states <- select(rr, `state`)

# library("dplyr")
count(distinct(states))
# A tibble: 1 × 1
      n
  <int>
1    53

Total railroad employee information about 53 unique states is present in this dataset.

Code
# retrieving information of the county with the maximum number of railroad employees
summarize(rr, employee_count = max(total_employees), maxCounty = county[which.max(total_employees)])
# A tibble: 1 × 2
  employee_count maxCounty
           <dbl> <chr>    
1           8207 COOK     

COOK is the county with the maximum number of railroad employees (8207).

Code
# retrieving information of the county with the maximum number of railroad employees
reframe(rr, employee_count = min(total_employees), minCounty = county[which(total_employees == min(total_employees))])
# A tibble: 145 × 2
   employee_count minCounty
            <dbl> <chr>    
 1              1 SITKA    
 2              1 BARBOUR  
 3              1 HENRY    
 4              1 APO      
 5              1 NEWTON   
 6              1 MONO     
 7              1 BENT     
 8              1 CHEYENNE 
 9              1 COSTILLA 
10              1 DOLORES  
# ℹ 135 more rows

From the above result, we know that there are 145 counties with only 1 railroad employee (which is the minimum employee count in any county).

Code
# Grouping in terms of state to summarise total_employees in each state and arranging accordingly
grouped_rr_state <- rr%>%
  group_by(state)%>%
  summarize(state_employees = sum(total_employees))%>%
  arrange(desc(`state_employees`))
grouped_rr_state
# A tibble: 53 × 2
   state state_employees
   <chr>           <dbl>
 1 TX              19839
 2 IL              19131
 3 NY              17050
 4 NE              13176
 5 CA              13137
 6 PA              12769
 7 OH               9056
 8 GA               8605
 9 IN               8537
10 MO               8419
# ℹ 43 more rows

From the above results we know that TX has the highest number of rail road employees (19839) and AP has the least number of railroad employees (1).

Code
# Grouping in terms of state to summarise mean total_employees in each state wrt number of counties and arranging accordingly 
grouped_rr_state_mean <- rr%>%
  group_by(state)%>%
  summarize(mean_state_employees = mean(total_employees))%>%
  arrange(desc(`mean_state_employees`))
grouped_rr_state_mean
# A tibble: 53 × 2
   state mean_state_employees
   <chr>                <dbl>
 1 DE                    498.
 2 NJ                    397.
 3 CT                    324 
 4 MA                    282.
 5 NY                    280.
 6 DC                    279 
 7 CA                    239.
 8 AZ                    210.
 9 PA                    196.
10 MD                    196.
# ℹ 43 more rows

From the above results we know that DE has the highest mean number of rail road employees (498.33) and AP has the least mean number of railroad employees (1). That is in DE, the mean number of employees in each county is 498.33.

Code
# Grouping in terms of state to summarise median total_employees in each state wrt number of counties and arranging accordingly 
grouped_rr_state_median <- rr%>%
  group_by(state)%>%
  summarize(median_state_employees = median(total_employees))%>%
  arrange(desc(`median_state_employees`))
grouped_rr_state_median
# A tibble: 53 × 2
   state median_state_employees
   <chr>                  <dbl>
 1 NJ                      296 
 2 DC                      279 
 3 MA                      271 
 4 DE                      158 
 5 CT                      125 
 6 MD                      108.
 7 AZ                       94 
 8 PA                       85 
 9 NY                       71 
10 CA                       61 
# ℹ 43 more rows

From the above results we can infer the median number of railroad employees in each state wrt the counties in the state with employees.

Code
# Mode of entries - statewise i.e the state with the maximum number of counties with railroad employees
rr %>%
  count(state, sort=TRUE)%>%
  slice(1)
# A tibble: 1 × 2
  state     n
  <chr> <int>
1 TX      221

We are finding the mode of the number of counties in any state with the above command. We are counting the number of entries for a particular state, which gives us the number of counties in the state with railroad employees. We then find the state with the maximum number of entries (mode).

We know that TX has the highest number of counties with railroad employees. There are 221 counties in TX with railroad employees.

Code
# Filtering the employee information of counties in TX
texas_counties <- filter(rr, state == 'TX')
count(texas_counties)
# A tibble: 1 × 1
      n
  <int>
1   221

Verifying the number of counties in TX with the filter and count commands again. There are 221 counties as returned earlier.

Code
# Grouping in terms of state to summarise standard deviation of total_employees in each state wrt number of counties and arranging accordingly 
grouped_rr_state_sd <- rr%>%
  group_by(state)%>%
  summarize(sd_state_employees = sd(total_employees))%>%
  arrange(desc(`sd_state_employees`))
grouped_rr_state_sd
# A tibble: 53 × 2
   state sd_state_employees
   <chr>              <dbl>
 1 IL                  829.
 2 DE                  674.
 3 NY                  591.
 4 CA                  549.
 5 CT                  520.
 6 NE                  512.
 7 FL                  386.
 8 TX                  350.
 9 VA                  341.
10 NJ                  338.
# ℹ 43 more rows

From the above result, we rank the states according to the standard deviation of railroad employee count.

Code
# grouping dataset based on the state
grouped_rr_state <- rr%>%
  group_by(state)

# summary of this subset gives us info like min, max, median, mean and quantiles
summary(grouped_rr_state)
    state              county          total_employees  
 Length:2930        Length:2930        Min.   :   1.00  
 Class :character   Class :character   1st Qu.:   7.00  
 Mode  :character   Mode  :character   Median :  21.00  
                                       Mean   :  87.18  
                                       3rd Qu.:  65.00  
                                       Max.   :8207.00  

From the quartile distributions we can infer that lot of the states have less number of employees, i.e the distribution of employees state-wise is skewed.

Explain and Interpret

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.

I chose this specific group to understand the distribution of railroad employees statewise. We can infer that there are railroad employees in 53 unique states, ‘COOK’ in IL is the county with the highest number of railroad employees (8207) and there are 145 counties with only 1 railroad employee.

From this grouping analysis based on states, we got to know that TX has the highest number of rail road employees (19839) and AP has the least number of railroad employees (1). And DE has the highest mean number of rail road employees (498.33) and AP has the least mean number of railroad employees (1) in its counties. NJ is the state with the highest median of employees county-wise: 296. We also find that TX is the state with maximum number of entries (mode of number of entries), which is in-turn the number of counties. So TX is the state with the maximum number of counties, which is 221.