Challenge 2 Instructions

challenge_2
railroads
Abhinav Reddy Yadatha
Data wrangling: using group() and summarise()
Author

Abhinav Reddy Yadatha

Published

May 3, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

  • railroad*.csv or StateCounty2012.xls ⭐
  • FAOstat*.csv or birds.csv ⭐⭐⭐
  • hotel_bookings.csv ⭐⭐⭐⭐
Code
library(tidyverse)
railroads <- read_csv('_data/railroad_2012_clean_county.csv', show_col_types = FALSE)
railroads <- rename(railroads, counties = county)
railroads
# A tibble: 2,930 × 3
   state counties             total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# ℹ 2,920 more rows

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

Description : The dataset involved gathering information on the total number of employees in each county by taking trains from specific states. The data revealed that the average number of employees per county was 87.2, with a standard deviation of 283.6. The lowest number of employees was 1, the middle value was 21, and the highest was 8207. Here’s how I obtained and analyzed this data

Code
# Printing the first few rows of the dataset.
head(railroads)
# A tibble: 6 × 3
  state counties             total_employees
  <chr> <chr>                          <dbl>
1 AE    APO                                2
2 AK    ANCHORAGE                          7
3 AK    FAIRBANKS NORTH STAR               2
4 AK    JUNEAU                             3
5 AK    MATANUSKA-SUSITNA                  2
6 AK    SITKA                              1
Code
library(summarytools)
dfSummary(railroads)
Data Frame Summary  
railroads  
Dimensions: 2930 x 3  
Duplicates: 0  

-----------------------------------------------------------------------------------------------------------------
No   Variable          Stats / Values             Freqs (% of Valid)    Graph                Valid      Missing  
---- ----------------- -------------------------- --------------------- -------------------- ---------- ---------
1    state             1. TX                       221 ( 7.5%)          I                    2930       0        
     [character]       2. GA                       152 ( 5.2%)          I                    (100.0%)   (0.0%)   
                       3. KY                       119 ( 4.1%)                                                   
                       4. MO                       115 ( 3.9%)                                                   
                       5. IL                       103 ( 3.5%)                                                   
                       6. IA                        99 ( 3.4%)                                                   
                       7. KS                        95 ( 3.2%)                                                   
                       8. NC                        94 ( 3.2%)                                                   
                       9. IN                        92 ( 3.1%)                                                   
                       10. VA                       92 ( 3.1%)                                                   
                       [ 43 others ]              1748 (59.7%)          IIIIIIIIIII                              

2    counties          1. WASHINGTON                31 ( 1.1%)                               2930       0        
     [character]       2. JEFFERSON                 26 ( 0.9%)                               (100.0%)   (0.0%)   
                       3. FRANKLIN                  24 ( 0.8%)                                                   
                       4. LINCOLN                   24 ( 0.8%)                                                   
                       5. JACKSON                   22 ( 0.8%)                                                   
                       6. MADISON                   19 ( 0.6%)                                                   
                       7. MONTGOMERY                18 ( 0.6%)                                                   
                       8. CLAY                      17 ( 0.6%)                                                   
                       9. MARION                    17 ( 0.6%)                                                   
                       10. MONROE                   17 ( 0.6%)                                                   
                       [ 1699 others ]            2715 (92.7%)          IIIIIIIIIIIIIIIIII                       

3    total_employees   Mean (sd) : 87.2 (283.6)   404 distinct values   :                    2930       0        
     [numeric]         min < med < max:                                 :                    (100.0%)   (0.0%)   
                       1 < 21 < 8207                                    :                                        
                       IQR (CV) : 58 (3.3)                              :                                        
                                                                        :                                        
-----------------------------------------------------------------------------------------------------------------

Check dimensions of the dataset

Code
dim(railroads)
[1] 2930    3

It can be observed that there are 2930 rows and 3 columns

Printing the column names of the dataset :

Code
colnames(railroads)
[1] "state"           "counties"        "total_employees"

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

Code
#IQR for railroads
railroads %>%
  summarize_all(IQR, na.rm = TRUE)
# A tibble: 1 × 3
  state counties total_employees
  <dbl>    <dbl>           <dbl>
1    NA       NA              58
Code
#Mean for railroads
railroads %>%
  summarize_all(mean, na.rm = TRUE)
# A tibble: 1 × 3
  state counties total_employees
  <dbl>    <dbl>           <dbl>
1    NA       NA            87.2
Code
#Median for railroads
railroads %>%
  summarize_all(median, na.rm = TRUE)
# A tibble: 1 × 3
  state counties total_employees
  <dbl>    <dbl>           <dbl>
1    NA       NA              21

Total count of employees for each state:

Code
state_wise_ct = select(railroads, state, total_employees)
state_wise_ct %>% 
  group_by(state) %>%
  summarize(totalEmployees=sum(total_employees))
# A tibble: 53 × 2
   state totalEmployees
   <chr>          <dbl>
 1 AE                 2
 2 AK               103
 3 AL              4257
 4 AP                 1
 5 AR              3871
 6 AZ              3153
 7 CA             13137
 8 CO              3650
 9 CT              2592
10 DC               279
# ℹ 43 more rows
Code
state_wise_ct
# A tibble: 2,930 × 2
   state total_employees
   <chr>           <dbl>
 1 AE                  2
 2 AK                  7
 3 AK                  2
 4 AK                  3
 5 AK                  2
 6 AK                  1
 7 AK                 88
 8 AL                102
 9 AL                143
10 AL                  1
# ℹ 2,920 more rows

Median, Mean and standard deviation of employee counts in every state:

Code
state_wise_ct = select(railroads, state, total_employees)
state_wise_ct %>% 
  group_by(state) %>%
  summarize(meanEmployees=mean(total_employees),medianEmployees=median(total_employees),standardDeviation = sd(total_employees))
# A tibble: 53 × 4
   state meanEmployees medianEmployees standardDeviation
   <chr>         <dbl>           <dbl>             <dbl>
 1 AE              2               2                NA  
 2 AK             17.2             2.5              34.8
 3 AL             63.5            26               130. 
 4 AP              1               1                NA  
 5 AR             53.8            16.5             131. 
 6 AZ            210.             94               228. 
 7 CA            239.             61               549. 
 8 CO             64.0            10               128. 
 9 CT            324             125               520. 
10 DC            279             279                NA  
# ℹ 43 more rows

State wise employees count arranged and displayed in descending order :

Code
state_wise_grouped_cts <- state_wise_ct %>% 
  group_by(state) %>%
  summarize(Sum = sum(total_employees))

sorted_counts <- state_wise_grouped_cts %>%
  arrange(desc(Sum))

sorted_counts
# A tibble: 53 × 2
   state   Sum
   <chr> <dbl>
 1 TX    19839
 2 IL    19131
 3 NY    17050
 4 NE    13176
 5 CA    13137
 6 PA    12769
 7 OH     9056
 8 GA     8605
 9 IN     8537
10 MO     8419
# ℹ 43 more rows

Explain and Interpret

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.

We have categorized the total number of employees according to the states, which means that we have calculated the average, middle value, and sum of all the counties for various states.

There are some states that have only one county, which makes the standard deviation undefined or null.

After sorting the data based on the overall number of employees, we observed that TX and IL have the greatest number of employees while AP has only one employee.