library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
Sai Pranav Kurly
March 31, 2023
I have decided to use the Railroads dataset. Each observation gives the total employee count for one county within a US state or other postal region (the state codes also include DC and the armed forces codes AE and AP). First, we read the data:
# A tibble: 2,930 × 3
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# … with 2,920 more rows
The total number of employees across all counties is:
# A tibble: 1 × 1
  `sum(total_employees)`
                   <dbl>
1                 255432
The mean number of employees per county is:
# A tibble: 1 × 1
  `mean(total_employees)`
                    <dbl>
1                    87.2
The median number of employees per county is:
# A tibble: 1 × 1
  `median(total_employees)`
                      <dbl>
1                        21
The minimum number of employees in a single county is:
The maximum number of employees in a single county is:
# A tibble: 1 × 1
  `max(total_employees)`
                   <dbl>
1                   8207
We could also summarize everything at once, and get extra statistics such as quartiles, using the summary() function:
The group_by() function is useful here because it lets us compute statistics separately for each state.
We can find the total number of employees in each state. Texas has the largest number.
# A tibble: 53 × 2
   state total_employees
   <chr>           <dbl>
 1 TX              19839
 2 IL              19131
 3 NY              17050
 4 NE              13176
 5 CA              13137
 6 PA              12769
 7 OH               9056
 8 GA               8605
 9 IN               8537
10 MO               8419
# … with 43 more rows
We can also find the mean number of employees per county within each state:
# A tibble: 53 × 2
   state total_employees
   <chr>           <dbl>
 1 AE                2  
 2 AK               17.2
 3 AL               63.5
 4 AP                1  
 5 AR               53.8
 6 AZ              210. 
 7 CA              239. 
 8 CO               64.0
 9 CT              324  
10 DC              279  
# … with 43 more rows
We can also find the largest county-level employee count in each state:
Initially, I expected the number of employees to be roughly proportional to the size of the state. For example, TX is a very large state and has the most employees (19,839). However, this does not always hold: NE ranks fourth with 13,176 employees, slightly ahead of CA with 13,137, even though Nebraska is much smaller than California. Grouping also yields other useful statistics, such as the largest county-level employee count within each state.
---
title: "Challenge 2"
author: "Sai Pranav Kurly"
description: "Data wrangling: using group_by() and summarise()"
date: "03/31/2023"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
  - railroads
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Read in the Data
I have decided to use the Railroads dataset. Each observation gives the total employee count for one county within a US state or other postal region (the state codes also include DC and the armed forces codes AE and AP). First, we read the data:
```{r}
# Read the cleaned county-level data; the path is relative to the project root.
railroad <- read_csv("_data/railroad_2012_clean_county.csv")
print(railroad)
```
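Before summarizing, it can help to confirm that the data really has one row per state/county pair and no missing values. The chunk below is a minimal sketch of such a check. It runs on `railroad_sample`, a hypothetical name for a few rows copied from the printed output above, so the results describe only that subset and not the full file.

```{r}
library(tidyverse)

# Illustrative subset copied from the printed rows above; not the full data set.
railroad_sample <- tribble(
  ~state, ~county,     ~total_employees,
  "AE",   "APO",       2,
  "AK",   "ANCHORAGE", 7,
  "AK",   "JUNEAU",    3,
  "AL",   "AUTAUGA",   102,
  "AL",   "BALDWIN",   143
)

# Any state/county pair that appears more than once would show up here.
duplicate_pairs <- railroad_sample %>%
  count(state, county) %>%
  filter(n > 1)

# Number of missing values in each column.
missing_counts <- railroad_sample %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

# Number of distinct state codes in the subset.
n_states <- n_distinct(railroad_sample$state)

duplicate_pairs
missing_counts
n_states
```

Replacing `railroad_sample` with `railroad` would run the same checks on the full data.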
## Describe the data
The total number of employees across all counties is:
```{r}
summarize(railroad, sum(total_employees))
```
The mean number of employees per county is:
```{r}
summarize(railroad, mean(total_employees))
```
The median number of employees per county is:
```{r}
summarize(railroad, median(total_employees))
```
The minimum number of employees in a single county is:
```{r}
summarize(railroad, min(total_employees))
```
The maximum number of employees in a single county is:
```{r}
summarize(railroad, max(total_employees))
```
We could also summarize everything at once, and get extra statistics such as quartiles, using the `summary()` function:
```{r}
summary(railroad)
```
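The separate `summarize()` calls above can also be combined into a single call with named columns, which keeps the result as one tidy row. This is a sketch rather than the original analysis: the column names and the quartile columns are illustrative choices, and `railroad_sample` is a hypothetical subset built from the first ten rows printed earlier, so its values differ from the full-data results.

```{r}
library(tidyverse)

# Illustrative subset: the first ten rows printed earlier, not the full data set.
railroad_sample <- tribble(
  ~state, ~county,                ~total_employees,
  "AE",   "APO",                  2,
  "AK",   "ANCHORAGE",            7,
  "AK",   "FAIRBANKS NORTH STAR", 2,
  "AK",   "JUNEAU",               3,
  "AK",   "MATANUSKA-SUSITNA",    2,
  "AK",   "SITKA",                1,
  "AK",   "SKAGWAY MUNICIPALITY", 88,
  "AL",   "AUTAUGA",              102,
  "AL",   "BALDWIN",              143,
  "AL",   "BARBOUR",              1
)

# Output names deliberately differ from `total_employees`, so later
# expressions still see the original column rather than a summarized value.
overall_stats <- railroad_sample %>%
  summarize(
    total             = sum(total_employees, na.rm = TRUE),
    mean_per_county   = mean(total_employees, na.rm = TRUE),
    median_per_county = median(total_employees, na.rm = TRUE),
    min_per_county    = min(total_employees, na.rm = TRUE),
    max_per_county    = max(total_employees, na.rm = TRUE),
    q25               = quantile(total_employees, 0.25, na.rm = TRUE),
    q75               = quantile(total_employees, 0.75, na.rm = TRUE)
  )

overall_stats
```

Unlike `summary()`, this returns a tibble, so the statistics can be piped into further steps.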
## Provide Grouped Summary Statistics
The `group_by()` function is useful here because it lets us compute statistics separately for each state.
We can find the total number of employees in each state. Texas has the largest number.
```{r}
# Total employees per state, largest first.
railroad %>%
  group_by(state) %>%
  summarize(total_employees = sum(total_employees, na.rm = TRUE)) %>%
  arrange(desc(total_employees))
```
We can also find the mean number of employees per county within each state:
```{r}
# Mean county-level employee count per state.
railroad %>%
  group_by(state) %>%
  summarize(total_employees = mean(total_employees, na.rm = TRUE))
```
We can also find the largest county-level employee count in each state:
```{r}
# Largest county-level employee count per state.
railroad %>%
  group_by(state) %>%
  summarize(total_employees = max(total_employees, na.rm = TRUE))
```
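The three grouped summaries above can also be computed in one pass, together with the number of counties per state. The chunk below is a minimal sketch under two assumptions: the output column names are illustrative, and `railroad_sample` is a hypothetical subset of the first ten printed rows. The AL figures therefore cover only three of its counties.

```{r}
library(tidyverse)

# Illustrative subset: the first ten rows printed earlier, not the full data set.
railroad_sample <- tribble(
  ~state, ~county,                ~total_employees,
  "AE",   "APO",                  2,
  "AK",   "ANCHORAGE",            7,
  "AK",   "FAIRBANKS NORTH STAR", 2,
  "AK",   "JUNEAU",               3,
  "AK",   "MATANUSKA-SUSITNA",    2,
  "AK",   "SITKA",                1,
  "AK",   "SKAGWAY MUNICIPALITY", 88,
  "AL",   "AUTAUGA",              102,
  "AL",   "BALDWIN",              143,
  "AL",   "BARBOUR",              1
)

# One row per state: county count, total, mean, and maximum.
state_stats <- railroad_sample %>%
  group_by(state) %>%
  summarize(
    counties        = n(),
    total           = sum(total_employees, na.rm = TRUE),
    mean_per_county = mean(total_employees, na.rm = TRUE),
    max_per_county  = max(total_employees, na.rm = TRUE)
  ) %>%
  arrange(desc(total))

state_stats
```

Keeping `counties` next to `total` makes it easier to see whether a large state total comes from many counties or from a few large ones.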
## Explain and Interpret
Initially, I expected the number of employees to be roughly proportional to the size of the state. For example, TX is a very large state and has the most employees (19,839). However, this does not always hold: NE ranks fourth with 13,176 employees, slightly ahead of CA with 13,137, even though Nebraska is much smaller than California. Grouping also yields other useful statistics, such as the largest county-level employee count within each state.
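
A per-state maximum shows the value but not which county it belongs to. One possible way to keep the county name is `slice_max()`, sketched below. `railroad_sample` is a hypothetical subset taken from the first rows printed earlier, so the result covers only the AK and AL rows shown there.

```{r}
library(tidyverse)

# Illustrative subset taken from the first printed rows; not the full data set.
railroad_sample <- tribble(
  ~state, ~county,                ~total_employees,
  "AK",   "ANCHORAGE",            7,
  "AK",   "FAIRBANKS NORTH STAR", 2,
  "AK",   "JUNEAU",               3,
  "AK",   "MATANUSKA-SUSITNA",    2,
  "AK",   "SITKA",                1,
  "AK",   "SKAGWAY MUNICIPALITY", 88,
  "AL",   "AUTAUGA",              102,
  "AL",   "BALDWIN",              143,
  "AL",   "BARBOUR",              1
)

# Keep the full row of the county with the most employees in each state.
# with_ties = FALSE returns exactly one row per state even if counts tie.
top_counties <- railroad_sample %>%
  group_by(state) %>%
  slice_max(total_employees, n = 1, with_ties = FALSE) %>%
  ungroup()

top_counties
```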