challenge_2
railroads
faostat
hotel_bookings
Data wrangling: using group() and summarise()
Author

Sai Venkatesh

Published

April 12, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

We are going to load and use the railroad data.

Code
  library(readr)
  railroaddata <-read_csv("_data/railroad_2012_clean_county.csv")

Describe the data

Code
  print("Lets load the data and see the dimensions and columns of the data.")
[1] "Lets load the data and see the dimensions and columns of the data."
Code
  # The Dimensions 
  dim(railroaddata)
[1] 2930    3
Code
  # The Column Names 
  colnames(railroaddata)
[1] "state"           "county"          "total_employees"
Code
  # THe total employees
  summarize(railroaddata, sum(`total_employees`))
# A tibble: 1 × 1
  `sum(total_employees)`
                   <dbl>
1                 255432

From the above, we can see that the Railroad data has 2930 rows and 3 columns. The 3 column names are state, county and total_employees. The data displayed shows that this is the railroad employees count distribution based on the state and county.

We can also see the total number of employees as 255432.

Provide Grouped Summary Statistics

Code
# Number of Employees Per State
print("The total number of employees in the states > 1000 ordered by the count:-")
[1] "The total number of employees in the states > 1000 ordered by the count:-"
Code
railroaddata %>%
  group_by(state) %>%
  summarize(state_total = sum(total_employees))  %>%
  filter(state_total > 1000)  %>%
  select(state, state_total)  %>%
  arrange(desc(state_total))
# A tibble: 42 × 2
   state state_total
   <chr>       <dbl>
 1 TX          19839
 2 IL          19131
 3 NY          17050
 4 NE          13176
 5 CA          13137
 6 PA          12769
 7 OH           9056
 8 GA           8605
 9 IN           8537
10 MO           8419
# … with 32 more rows

Grouping By State

We are going to consider state and group by it. We are then going to find the distribution of employees in each state and associated functions to get more sense of the underlying data.

Code
summarizedData <- railroaddata %>%
  group_by(state) %>%
  summarize(Mean = mean(total_employees, na.rm=TRUE), 
            SD = sd(total_employees, na.rm = TRUE), 
            Max = max(total_employees, na.rm = TRUE), 
            Min = min(total_employees, na.rm = TRUE), 
            Median = median(total_employees, na.rm = TRUE))

print(summarizedData, n=60)
# A tibble: 53 × 6
   state   Mean      SD   Max   Min Median
   <chr>  <dbl>   <dbl> <dbl> <dbl>  <dbl>
 1 AE      2     NA         2     2    2  
 2 AK     17.2   34.8      88     1    2.5
 3 AL     63.5  130.      990     1   26  
 4 AP      1     NA         1     1    1  
 5 AR     53.8  131.      972     1   16.5
 6 AZ    210.   228.      749     3   94  
 7 CA    239.   549.     2888     1   61  
 8 CO     64.0  128.      553     1   10  
 9 CT    324    520.     1561    26  125  
10 DC    279     NA       279   279  279  
11 DE    498.   674.     1275    62  158  
12 FL    111.   386.     3073     1   20  
13 GA     56.6  113.      878     1   15  
14 HI      1.33   0.577     2     1    1  
15 IA     40.6   76.8     609     1   14  
16 ID     43.4   95.5     538     1   12  
17 IL    186.   829.     8207     1   42  
18 IN     92.8  233.     1999     3   30  
19 KS     64.1  167.     1286     1   12  
20 KY     40.4   76.9     483     1   11  
21 LA     62.1  101.      546     1   20  
22 MA    282.   204.      673    44  271  
23 MD    196.   233.      809     1  108. 
24 ME     40.9   38.1     117     2   29  
25 MI     50.4  110.      849     1   13  
26 MN     63.6  122.      651     1   22  
27 MO     73.2  208.     2055     1   24  
28 MS     27.1   46.7     341     1   11.5
29 MT     62.8  123.      525     1   11  
30 NC     33.4   58.6     322     1   14  
31 ND     45.0   92.5     407     1    8  
32 NE    148.   512.     3797     1   15  
33 NH     39.3   54.3     146     2   15.5
34 NJ    397.   338.     1097    19  296  
35 NM     67.5  113.      431     2   26  
36 NV     62.2   94.8     269     1   19  
37 NY    280.   591.     3685     5   71  
38 OH    103.   148.      842     3   41  
39 OK     31.8   55.9     377     1   14  
40 OR     70.4  108.      467     2   30  
41 PA    196.   293.     1649     3   85  
42 RI     97.4  129.      318     8   48  
43 SC     49.9   53.9     220     1   25  
44 SD     18.2   34.6     167     1    5  
45 TN     54.4   94.8     621     1   26  
46 TX     89.8  350.     4235     1   17  
47 UT     76.7  143.      580     1   16  
48 VA     82.1  341.     3249     1   25.5
49 VT     18.5   24.5      83     3    8.5
50 WA    134.   256.     1039     1   29  
51 WI     54.7   82.2     465     2   23  
52 WV     60.6   85.8     406     1   33  
53 WY    131.   169.      737     3   60.5

Explain and Interpret

I choose state because we are able to analyse the distributions of employees across counties using this. We are able to notice that Texas has the largest number of employees. However it is not the counties in Texas that have the maximum number of employees. But instead Cook county in IL has the maximum number of employees. Similarly there are counties whose mean employee counts exceed Texas like IL, NY even though the total is largest in the latter. Infact Texas median is 17 and mean is around 89 which means that there are a lot of counties in Texas with less than 100 employees. Hence having more total need not correlate to larger counties/employee count. It could be a large number of small counties which skews the overall count like Texas.