library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Sai Pranav Kurly
March 31, 2023
I have decided to use the Railroads dataset. Each row of the dataset is one county within a US state or territory, together with its total employee count. First, we read the data:
# A tibble: 2,930 × 3
state county total_employees
<chr> <chr> <dbl>
1 AE APO 2
2 AK ANCHORAGE 7
3 AK FAIRBANKS NORTH STAR 2
4 AK JUNEAU 3
5 AK MATANUSKA-SUSITNA 2
6 AK SITKA 1
7 AK SKAGWAY MUNICIPALITY 88
8 AL AUTAUGA 102
9 AL BALDWIN 143
10 AL BARBOUR 1
# … with 2,920 more rows
The total number of employees across all counties:
# A tibble: 1 × 1
`sum(total_employees)`
<dbl>
1 255432
The mean number of employees per county:
# A tibble: 1 × 1
`mean(total_employees)`
<dbl>
1 87.2
The median number of employees per county:
# A tibble: 1 × 1
`median(total_employees)`
<dbl>
1 21
The minimum number of employees in a county:
The maximum number of employees in a county:
# A tibble: 1 × 1
`max(total_employees)`
<dbl>
1 8207
We could also summarize everything in one go and get some extra statistics, such as quartiles, using the summary() function.
The group_by() function comes in handy, since it lets us compute statistics for each state.
We can find the total number of employees in each state. We see that Texas has the largest number.
# A tibble: 53 × 2
state total_employees
<chr> <dbl>
1 TX 19839
2 IL 19131
3 NY 17050
4 NE 13176
5 CA 13137
6 PA 12769
7 OH 9056
8 GA 8605
9 IN 8537
10 MO 8419
# … with 43 more rows
We can also find the mean number of employees per county within each state:
# A tibble: 53 × 2
state total_employees
<chr> <dbl>
1 AE 2
2 AK 17.2
3 AL 63.5
4 AP 1
5 AR 53.8
6 AZ 210.
7 CA 239.
8 CO 64.0
9 CT 324
10 DC 279
# … with 43 more rows
We can also find the largest county-level employee count in each state:
Based on the data, I initially thought that the number of employees would be directly proportional to the size of the state. For example, TX is a very large state and has the most employees. However, this does not always hold: NE ranks fourth in total employees, just ahead of the much larger CA. We can also find more interesting statistics with group_by(), such as the largest county-level employee count in each state.
---
title: "Challenge 2"
author: "Sai Pranav Kurly"
description: "Data wrangling: using group_by() and summarise()"
date: "03/31/2023"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
  - railroads
  - faostat
  - hotel_bookings
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Read in the Data
I have decided to use the Railroads dataset. Each row of the dataset is one county within a US state or territory, together with its total employee count. First, we read the data:
```{r}
# Read the cleaned county-level railroad employment data
railroad <- read_csv("_data/railroad_2012_clean_county.csv")
print(railroad)
```
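Before summarizing, it can help to confirm that each row really is a separate state/county observation. The chunk below is only a minimal sketch: `railroad_sample` is a hypothetical object holding just the first ten rows that `print(railroad)` displays, not the full dataset, and the same two checks could be run on `railroad` itself.

```{r}
library(tidyverse)

# Illustrative sample only: the first ten rows of the printed tibble
railroad_sample <- tribble(
  ~state, ~county,                ~total_employees,
  "AE",   "APO",                  2,
  "AK",   "ANCHORAGE",            7,
  "AK",   "FAIRBANKS NORTH STAR", 2,
  "AK",   "JUNEAU",               3,
  "AK",   "MATANUSKA-SUSITNA",    2,
  "AK",   "SITKA",                1,
  "AK",   "SKAGWAY MUNICIPALITY", 88,
  "AL",   "AUTAUGA",              102,
  "AL",   "BALDWIN",              143,
  "AL",   "BARBOUR",              1
)

# Number of distinct state codes in the sample
n_states <- n_distinct(railroad_sample$state)

# Any state/county pair that appears more than once would be listed here
duplicate_pairs <- railroad_sample %>%
  count(state, county) %>%
  filter(n > 1)

n_states
nrow(duplicate_pairs)
```

If `duplicate_pairs` has zero rows, every state/county combination is unique, which is what the per-county statistics below assume.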
## Describe the data
The total number of employees across all counties:
```{r}
summarize(railroad, sum(total_employees))
```
The mean number of employees per county:
```{r}
summarize(railroad, mean(total_employees))
```
The median number of employees per county:
```{r}
summarize(railroad, median(total_employees))
```
The minimum number of employees in a county:
```{r}
summarize(railroad, min(total_employees))
```
The maximum number of employees in a county:
```{r}
summarize(railroad, max(total_employees))
```
We could also summarize everything in one go and get some extra statistics, such as quartiles, using the summary() function:
```{r}
summary(railroad)
```
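The separate calls above could also be folded into a single `summarize()` call, which keeps all the statistics in one tibble. This is a minimal sketch rather than part of the original analysis: `railroad_sample` is a hypothetical object containing only the first ten printed rows, and the output column names are my own choice.

```{r}
library(tidyverse)

# Illustrative sample only: the first ten rows of the printed tibble
railroad_sample <- tribble(
  ~state, ~county,                ~total_employees,
  "AE",   "APO",                  2,
  "AK",   "ANCHORAGE",            7,
  "AK",   "FAIRBANKS NORTH STAR", 2,
  "AK",   "JUNEAU",               3,
  "AK",   "MATANUSKA-SUSITNA",    2,
  "AK",   "SITKA",                1,
  "AK",   "SKAGWAY MUNICIPALITY", 88,
  "AL",   "AUTAUGA",              102,
  "AL",   "BALDWIN",              143,
  "AL",   "BARBOUR",              1
)

# One row with every statistic as its own named column
overall_stats <- railroad_sample %>%
  summarize(
    total_emp  = sum(total_employees, na.rm = TRUE),
    mean_emp   = mean(total_employees, na.rm = TRUE),
    median_emp = median(total_employees, na.rm = TRUE),
    min_emp    = min(total_employees, na.rm = TRUE),
    q25_emp    = quantile(total_employees, 0.25, na.rm = TRUE, names = FALSE),
    q75_emp    = quantile(total_employees, 0.75, na.rm = TRUE, names = FALSE),
    max_emp    = max(total_employees, na.rm = TRUE)
  )

overall_stats
```

Swapping `railroad_sample` for `railroad` would give the same statistics for the full dataset.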
## Provide Grouped Summary Statistics
The group_by() function comes in handy, since it lets us compute statistics for each state.
We can find the total number of employees in each state. We see that Texas has the largest number.
```{r}
# Total employees per state, largest first
railroad %>%
  group_by(state) %>%
  summarize(total_employees = sum(total_employees, na.rm = TRUE)) %>%
  arrange(desc(total_employees))
```
We can also find the mean number of employees per county within each state:
```{r}
# Mean county-level employee count within each state
railroad %>%
  group_by(state) %>%
  summarize(total_employees = mean(total_employees, na.rm = TRUE))
```
We can also find the largest county-level employee count in each state:
```{r}
# Largest county-level employee count within each state
railroad %>%
  group_by(state) %>%
  summarize(total_employees = max(total_employees, na.rm = TRUE))
```
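The three grouped summaries above could be combined so that each state gets one row with several statistics side by side. The chunk below is a sketch under the same assumption as before: `railroad_sample` is a hypothetical object with only the first ten printed rows, so AL is incomplete here, and the column names are illustrative.

```{r}
library(tidyverse)

# Illustrative sample only: the first ten rows of the printed tibble
railroad_sample <- tribble(
  ~state, ~county,                ~total_employees,
  "AE",   "APO",                  2,
  "AK",   "ANCHORAGE",            7,
  "AK",   "FAIRBANKS NORTH STAR", 2,
  "AK",   "JUNEAU",               3,
  "AK",   "MATANUSKA-SUSITNA",    2,
  "AK",   "SITKA",                1,
  "AK",   "SKAGWAY MUNICIPALITY", 88,
  "AL",   "AUTAUGA",              102,
  "AL",   "BALDWIN",              143,
  "AL",   "BARBOUR",              1
)

# Count, total, mean and max per state in a single pass
state_stats <- railroad_sample %>%
  group_by(state) %>%
  summarize(
    n_counties = n(),
    total_emp  = sum(total_employees, na.rm = TRUE),
    mean_emp   = mean(total_employees, na.rm = TRUE),
    max_emp    = max(total_employees, na.rm = TRUE),
    .groups    = "drop"
  ) %>%
  arrange(desc(total_emp))

state_stats
```

Having the county count next to the total makes it easier to see whether a state's total comes from many small counties or a few large ones.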
## Explain and Interpret
Based on the data, I initially thought that the number of employees would be directly proportional to the size of the state. For example, TX is a very large state and has the most employees. However, this does not always hold: NE ranks fourth in total employees, just ahead of the much larger CA. We can also find more interesting statistics with group_by(), such as the largest county-level employee count in each state.
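The grouped maximum tells us the largest count in each state but not which county it belongs to. One way to keep the county name is `slice_max()`, sketched below. As a loud assumption, `railroad_sample` is a hypothetical object with only the first ten printed rows, so the AL result here is not the true state maximum.

```{r}
library(tidyverse)

# Illustrative sample only: the first ten rows of the printed tibble
railroad_sample <- tribble(
  ~state, ~county,                ~total_employees,
  "AE",   "APO",                  2,
  "AK",   "ANCHORAGE",            7,
  "AK",   "FAIRBANKS NORTH STAR", 2,
  "AK",   "JUNEAU",               3,
  "AK",   "MATANUSKA-SUSITNA",    2,
  "AK",   "SITKA",                1,
  "AK",   "SKAGWAY MUNICIPALITY", 88,
  "AL",   "AUTAUGA",              102,
  "AL",   "BALDWIN",              143,
  "AL",   "BARBOUR",              1
)

# Keep the single row with the most employees in each state
top_county <- railroad_sample %>%
  group_by(state) %>%
  slice_max(total_employees, n = 1, with_ties = FALSE) %>%
  ungroup()

top_county
```

Setting `with_ties = FALSE` keeps exactly one row per state even when two counties share the same maximum.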