challenge_2
railroads
faostat
hotel_bookings
Data wrangling: using group() and summarise()
Author

Sai Pranav Kurly

Published

March 31, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in the Data

I have decided to use Railroads dataset. The Railroads dataset includes separate observations of total employee counts by county within US states/principalities. First, we read the data:

Code
railroad <-read_csv("_data/railroad_2012_clean_county.csv")
print(railroad)
# A tibble: 2,930 × 3
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# … with 2,920 more rows

Describe the data

We see that the total number employees:

Code
summarize(railroad,sum(`total_employees`))
# A tibble: 1 × 1
  `sum(total_employees)`
                   <dbl>
1                 255432

We see that the mean number employees by county:

Code
summarize(railroad,mean(`total_employees`))
# A tibble: 1 × 1
  `mean(total_employees)`
                    <dbl>
1                    87.2

We see that the median number employees by county:

Code
summarize(railroad,median(`total_employees`))
# A tibble: 1 × 1
  `median(total_employees)`
                      <dbl>
1                        21

We see that the min number employees by county:

Code
summarize(railroad,min(`total_employees`))
# A tibble: 1 × 1
  `min(total_employees)`
                   <dbl>
1                      1

We see that the max number employees by county:

Code
summarize(railroad,max(`total_employees`))
# A tibble: 1 × 1
  `max(total_employees)`
                   <dbl>
1                   8207

We see that the max number employees by county:

Code
summarize(railroad,max(`total_employees`))
# A tibble: 1 × 1
  `max(total_employees)`
                   <dbl>
1                   8207

We could also summarize everything in one go and have some extra statistics like quantile using the summarry command

Code
summary(railroad)
    state              county          total_employees  
 Length:2930        Length:2930        Min.   :   1.00  
 Class :character   Class :character   1st Qu.:   7.00  
 Mode  :character   Mode  :character   Median :  21.00  
                                       Mean   :  87.18  
                                       3rd Qu.:  65.00  
                                       Max.   :8207.00  

Provide Grouped Summary Statistics

The group by function comes in handy since we can actually find some information based on state.

We can find the number of employees in the state. We see that Texas has the largest number.

Code
railroad %>%
  group_by(state) %>%
  select(total_employees) %>%
  summarize_all(sum, na.rm = TRUE) %>%
  arrange(desc(total_employees))
# A tibble: 53 × 2
   state total_employees
   <chr>           <dbl>
 1 TX              19839
 2 IL              19131
 3 NY              17050
 4 NE              13176
 5 CA              13137
 6 PA              12769
 7 OH               9056
 8 GA               8605
 9 IN               8537
10 MO               8419
# … with 43 more rows

We can also find the mean of the employees per state based on the counties:

Code
railroad%>%
  group_by(state)%>%
  select (`total_employees`)%>%
  summarize_all(mean, na.rm=TRUE) 
# A tibble: 53 × 2
   state total_employees
   <chr>           <dbl>
 1 AE                2  
 2 AK               17.2
 3 AL               63.5
 4 AP                1  
 5 AR               53.8
 6 AZ              210. 
 7 CA              239. 
 8 CO               64.0
 9 CT              324  
10 DC              279  
# … with 43 more rows

We also find the max number of employees in the county per state:

Code
railroad%>%
  group_by(state)%>%
  select (`total_employees`)%>%
  summarize_all(max, na.rm=TRUE)
# A tibble: 53 × 2
   state total_employees
   <chr>           <dbl>
 1 AE                  2
 2 AK                 88
 3 AL                990
 4 AP                  1
 5 AR                972
 6 AZ                749
 7 CA               2888
 8 CO                553
 9 CT               1561
10 DC                279
# … with 43 more rows

Explain and Interpret

Based on the data, Initially I thought that the number of em[ployees would be directly proportional to the size of the state. For example TX is a very large state and has the most number of employees. However, this does not hold true always. We can also find some more interesting statistics using the group by like finding the max number of employees in a particular state based on the county.