Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Sai Venkatesh
April 12, 2023
Today’s challenge is to
We are going to load and use the railroad data.
[1] "Lets load the data and see the dimensions and columns of the data."
[1] 2930 3
[1] "state" "county" "total_employees"
# A tibble: 1 × 1
`sum(total_employees)`
<dbl>
1 255432
From the above, we can see that the Railroad data has 2930 rows and 3 columns. The 3 column names are state, county and total_employees. The data displayed shows that this is the railroad employees count distribution based on the state and county.
We can also see the total number of employees as 255432.
[1] "The total number of employees in the states > 1000 ordered by the count:-"
# A tibble: 42 × 2
state state_total
<chr> <dbl>
1 TX 19839
2 IL 19131
3 NY 17050
4 NE 13176
5 CA 13137
6 PA 12769
7 OH 9056
8 GA 8605
9 IN 8537
10 MO 8419
# … with 32 more rows
Grouping By State
We are going to consider state and group by it. We are then going to find the distribution of employees in each state and associated functions to get more sense of the underlying data.
summarizedData <- railroaddata %>%
group_by(state) %>%
summarize(Mean = mean(total_employees, na.rm=TRUE),
SD = sd(total_employees, na.rm = TRUE),
Max = max(total_employees, na.rm = TRUE),
Min = min(total_employees, na.rm = TRUE),
Median = median(total_employees, na.rm = TRUE))
print(summarizedData, n=60)
# A tibble: 53 × 6
state Mean SD Max Min Median
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AE 2 NA 2 2 2
2 AK 17.2 34.8 88 1 2.5
3 AL 63.5 130. 990 1 26
4 AP 1 NA 1 1 1
5 AR 53.8 131. 972 1 16.5
6 AZ 210. 228. 749 3 94
7 CA 239. 549. 2888 1 61
8 CO 64.0 128. 553 1 10
9 CT 324 520. 1561 26 125
10 DC 279 NA 279 279 279
11 DE 498. 674. 1275 62 158
12 FL 111. 386. 3073 1 20
13 GA 56.6 113. 878 1 15
14 HI 1.33 0.577 2 1 1
15 IA 40.6 76.8 609 1 14
16 ID 43.4 95.5 538 1 12
17 IL 186. 829. 8207 1 42
18 IN 92.8 233. 1999 3 30
19 KS 64.1 167. 1286 1 12
20 KY 40.4 76.9 483 1 11
21 LA 62.1 101. 546 1 20
22 MA 282. 204. 673 44 271
23 MD 196. 233. 809 1 108.
24 ME 40.9 38.1 117 2 29
25 MI 50.4 110. 849 1 13
26 MN 63.6 122. 651 1 22
27 MO 73.2 208. 2055 1 24
28 MS 27.1 46.7 341 1 11.5
29 MT 62.8 123. 525 1 11
30 NC 33.4 58.6 322 1 14
31 ND 45.0 92.5 407 1 8
32 NE 148. 512. 3797 1 15
33 NH 39.3 54.3 146 2 15.5
34 NJ 397. 338. 1097 19 296
35 NM 67.5 113. 431 2 26
36 NV 62.2 94.8 269 1 19
37 NY 280. 591. 3685 5 71
38 OH 103. 148. 842 3 41
39 OK 31.8 55.9 377 1 14
40 OR 70.4 108. 467 2 30
41 PA 196. 293. 1649 3 85
42 RI 97.4 129. 318 8 48
43 SC 49.9 53.9 220 1 25
44 SD 18.2 34.6 167 1 5
45 TN 54.4 94.8 621 1 26
46 TX 89.8 350. 4235 1 17
47 UT 76.7 143. 580 1 16
48 VA 82.1 341. 3249 1 25.5
49 VT 18.5 24.5 83 3 8.5
50 WA 134. 256. 1039 1 29
51 WI 54.7 82.2 465 2 23
52 WV 60.6 85.8 406 1 33
53 WY 131. 169. 737 3 60.5
I choose state because we are able to analyse the distributions of employees across counties using this. We are able to notice that Texas has the largest number of employees. However it is not the counties in Texas that have the maximum number of employees. But instead Cook county in IL has the maximum number of employees. Similarly there are counties whose mean employee counts exceed Texas like IL, NY even though the total is largest in the latter. Infact Texas median is 17 and mean is around 89 which means that there are a lot of counties in Texas with less than 100 employees. Hence having more total need not correlate to larger counties/employee count. It could be a large number of small counties which skews the overall count like Texas.
---
title: "Challenge 2"
author: "Sai Venkatesh"
description: "Data wrangling: using group() and summarise()"
date: "04/12/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- railroads
- faostat
- hotel_bookings
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
We are going to load and use the railroad data.
```{r}
library(readr)
railroaddata <-read_csv("_data/railroad_2012_clean_county.csv")
```
## Describe the data
```{r}
#| label: summary
print("Lets load the data and see the dimensions and columns of the data.")
# The Dimensions
dim(railroaddata)
# The Column Names
colnames(railroaddata)
# THe total employees
summarize(railroaddata, sum(`total_employees`))
```
From the above, we can see that the Railroad data has 2930 rows and 3 columns.
The 3 column names are state, county and total_employees.
The data displayed shows that this is the railroad employees count distribution based on the state and county.
We can also see the total number of employees as 255432.
## Provide Grouped Summary Statistics
```{r}
# Number of Employees Per State
print("The total number of employees in the states > 1000 ordered by the count:-")
railroaddata %>%
group_by(state) %>%
summarize(state_total = sum(total_employees)) %>%
filter(state_total > 1000) %>%
select(state, state_total) %>%
arrange(desc(state_total))
```
Grouping By State
We are going to consider state and group by it.
We are then going to find the distribution of employees in each state and associated functions to get more sense of the underlying data.
```{r}
summarizedData <- railroaddata %>%
group_by(state) %>%
summarize(Mean = mean(total_employees, na.rm=TRUE),
SD = sd(total_employees, na.rm = TRUE),
Max = max(total_employees, na.rm = TRUE),
Min = min(total_employees, na.rm = TRUE),
Median = median(total_employees, na.rm = TRUE))
print(summarizedData, n=60)
```
### Explain and Interpret
I choose state because we are able to analyse the distributions of employees across counties using this. We are able to notice that Texas has the largest number of employees. However it is not the counties in Texas that have the maximum number of employees. But instead Cook county in IL has the maximum number of employees. Similarly there are counties whose mean employee counts exceed Texas like IL, NY even though the total is largest in the latter. Infact Texas median is 17 and mean is around 89 which means that there are a lot of counties in Texas with less than 100 employees. Hence having more total need not correlate to larger counties/employee count. It could be a large number of small counties which skews the overall count like Texas.