library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
Xinpeng Liu
May 30, 2023
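Today’s challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics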
Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
We chose railroad_2012_clean_county.csv ⭐
spc_tbl_ [2,930 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ state : chr [1:2930] "AE" "AK" "AK" "AK" ...
$ county : chr [1:2930] "APO" "ANCHORAGE" "FAIRBANKS NORTH STAR" "JUNEAU" ...
$ total_employees: num [1:2930] 2 7 2 3 2 1 88 102 143 1 ...
- attr(*, "spec")=
.. cols(
.. state = col_character(),
.. county = col_character(),
.. total_employees = col_double()
.. )
- attr(*, "problems")=<externalptr>
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 7.00 21.00 87.18 65.00 8207.00
This data set records the total number of railroad employees in each county, by state, for 2012 (the year given in the file name). It has 2,930 rows and three variables: ‘state’, ‘county’, and ‘total_employees’.
- ‘state’ (chr): A character variable holding the two-letter postal abbreviation. It has no missing values and takes 53 distinct values: the 50 states, the District of Columbia (DC), and the military postal codes AE and AP.
- ‘county’ (chr): A character variable holding the county name. It has no missing values, and most states contain many counties, although AE, AP, and DC each have a single entry.
- ‘total_employees’ (num): A numeric variable (read in as a double, not an integer) giving the number of railroad employees in the county. Values range from 1 to 8207, with a median of 21 and a mean of 87.18. The mean is about four times the median, and the standard deviation (about 283.64) is more than three times the mean, so the distribution is strongly right-skewed: most counties have few employees and a handful have very many.
The data was likely compiled from employment records or surveys covering each county for 2012. Each row is one county within one state, so a case is a state-county pair.
Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.
data_grouped_by_state <- data_railroad %>%
  select(state, total_employees) %>%
  group_by(state) %>%
  summarize(
    total_employees_by_state = sum(total_employees, na.rm = TRUE),
    mean_number_of_employee_per_county = round(mean(total_employees, na.rm = TRUE), 2),
    median_number_of_employee_per_county = round(median(total_employees, na.rm = TRUE), 2),
    sd_employees_per_county = round(sd(total_employees, na.rm = TRUE), 2),
    IQR_employees_per_county = round(IQR(total_employees, na.rm = TRUE), 2),
    min = min(total_employees, na.rm = TRUE),
    max = max(total_employees, na.rm = TRUE)
  )
# convert to standard data frame
data_grouped_by_state_df <- as.data.frame(data_grouped_by_state)
# print full data frame
print(data_grouped_by_state_df)
state total_employees_by_state mean_number_of_employee_per_county
1 AE 2 2.00
2 AK 103 17.17
3 AL 4257 63.54
4 AP 1 1.00
5 AR 3871 53.76
6 AZ 3153 210.20
7 CA 13137 238.85
8 CO 3650 64.04
9 CT 2592 324.00
10 DC 279 279.00
11 DE 1495 498.33
12 FL 7419 110.73
13 GA 8605 56.61
14 HI 4 1.33
15 IA 4019 40.60
16 ID 1563 43.42
17 IL 19131 185.74
18 IN 8537 92.79
19 KS 6092 64.13
20 KY 4811 40.43
21 LA 3915 62.14
22 MA 3379 281.58
23 MD 4709 196.21
24 ME 654 40.88
25 MI 3932 50.41
26 MN 5467 63.57
27 MO 8419 73.21
28 MS 2111 27.06
29 MT 3327 62.77
30 NC 3143 33.44
31 ND 2204 44.98
32 NE 13176 148.04
33 NH 393 39.30
34 NJ 8329 396.62
35 NM 1958 67.52
36 NV 746 62.17
37 NY 17050 279.51
38 OH 9056 102.91
39 OK 2318 31.75
40 OR 2322 70.36
41 PA 12769 196.45
42 RI 487 97.40
43 SC 2296 49.91
44 SD 949 18.25
45 TN 4952 54.42
46 TX 19839 89.77
47 UT 1917 76.68
48 VA 7551 82.08
49 VT 259 18.50
50 WA 5222 133.90
51 WI 3773 54.68
52 WV 3213 60.62
53 WY 2876 130.73
median_number_of_employee_per_county sd_employees_per_county
1 2.0 NA
2 2.5 34.76
3 26.0 130.17
4 1.0 NA
5 16.5 131.14
6 94.0 227.78
7 61.0 549.47
8 10.0 127.75
9 125.0 520.20
10 279.0 NA
11 158.0 674.32
12 20.0 386.01
13 15.0 113.13
14 1.0 0.58
15 14.0 76.80
16 12.0 95.55
17 42.0 829.15
18 30.0 233.06
19 12.0 167.36
20 11.0 76.91
21 20.0 101.48
22 271.0 203.83
23 107.5 233.28
24 29.0 38.12
25 13.0 109.76
26 22.0 122.39
27 24.0 208.12
28 11.5 46.69
29 11.0 122.95
30 14.0 58.59
31 8.0 92.47
32 15.0 511.58
33 15.5 54.33
34 296.0 338.22
35 26.0 112.72
36 19.0 94.80
37 71.0 590.78
38 41.0 147.91
39 14.0 55.86
40 30.0 108.45
41 85.0 293.07
42 48.0 129.02
43 25.0 53.91
44 5.0 34.60
45 26.0 94.82
46 17.0 350.12
47 16.0 142.57
48 25.5 340.74
49 8.5 24.54
50 29.0 255.71
51 23.0 82.17
52 33.0 85.75
53 60.5 168.98
IQR_employees_per_county min max
1 0.00 2 2
2 4.00 1 88
3 47.00 1 990
4 0.00 1 1
5 33.75 1 972
6 296.00 3 749
7 188.00 1 2888
8 39.00 1 553
9 167.25 26 1561
10 0.00 279 279
11 606.50 62 1275
12 57.50 1 3073
13 42.50 1 878
14 0.50 1 2
15 28.50 1 609
16 35.50 1 538
17 84.00 1 8207
18 55.00 3 1999
19 47.50 1 1286
20 33.50 1 483
21 64.00 1 546
22 295.00 44 673
23 313.75 1 809
24 60.25 2 117
25 41.00 1 849
26 43.75 1 651
27 41.00 1 2055
28 26.25 1 341
29 41.00 1 525
30 25.00 1 322
31 32.00 1 407
32 66.00 1 3797
33 20.25 2 146
34 474.00 19 1097
35 49.00 2 431
36 49.50 1 269
37 169.00 5 3685
38 97.25 3 842
39 25.00 1 377
40 77.00 2 467
41 171.00 3 1649
42 91.00 8 318
43 56.25 1 220
44 9.00 1 167
45 49.00 1 621
46 50.00 1 4235
47 44.00 1 580
48 33.25 1 3249
49 8.00 3 83
50 65.50 1 1039
51 45.00 2 465
52 68.00 1 406
53 159.00 3 737
[1] "The mean of total employees is 87.178156996587"
[1] "The median of total employees is 21"
[1] "The mode of total employees is 1"
[1] "The standard deviation of total employees is 283.635890179709"
[1] "The minimum of total employees is 1"
[1] "The maximum of total employees is 8207"
[1] "The first quartile of total employees is 7"
[1] "The third quartile of total employees is 65"
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
[1] "The summary of Alabama"
state county total_employees
Length:67 Length:67 Min. : 1.00
Class :character Class :character 1st Qu.: 10.50
Mode :character Mode :character Median : 26.00
Mean : 63.54
3rd Qu.: 57.50
Max. :990.00
[1] "The summary of California"
state county total_employees
Length:55 Length:55 Min. : 1.0
Class :character Class :character 1st Qu.: 12.5
Mode :character Mode :character Median : 61.0
Mean : 238.9
3rd Qu.: 200.5
Max. :2888.0
Comparing Alabama and California shows how differently railroad employees are distributed across counties in two states. I chose these states because they differ in geographic size, population density, and region.
Alabama, known as “The Heart of Dixie,” has 4257 railroad employees across the 67 counties in the data, an average of about 63.54 per county. That average is modest, which is plausibly related to Alabama’s smaller population, although the data set contains no population figures to confirm this. The median is only 26, less than half the mean, which indicates a right-skewed distribution: most counties have few railroad employees, while a few have far more (the maximum is 990). The standard deviation of 130.17, roughly twice the mean, points to the same wide spread.
On the other side of the country, California, the “Golden State,” has a total of 13137 railroad employees across 55 counties, roughly three times Alabama’s total. This is not surprising, since California is the most populous U.S. state. The mean number of employees per county is approximately 238.85, and the median is 61, so the typical California county in the data has more than twice as many railroad employees as the typical Alabama county. Both states have counties with a single employee, however, so the smallest counties look alike in the two states. The standard deviation of 549.47 and the IQR of 188 (compared with 130.17 and 47 for Alabama) show that county-level employment varies much more in California.
The comparison between these two states brings to light how geographical and demographic factors may impact the distribution and concentration of employment in the railroad industry. Understanding these patterns could be a stepping stone for further research into factors influencing the job market and economic conditions in these states.
---
title: "Challenge 2 Submission"
author: "Xinpeng Liu"
description: "Data wrangling: using group_by() and summarise()"
date: "5/30/2023"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
  - railroads
  - faostat
  - hotel_bookings
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.
- railroad\*.csv or StateCounty2012.xls ⭐
- FAOstat\*.csv or birds.csv ⭐⭐⭐
- hotel_bookings.csv ⭐⭐⭐⭐
```{r}
data_railroad <- read_csv("_data/railroad_2012_clean_county.csv")
```
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
## Describe the data
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
We chose `railroad_2012_clean_county.csv` ⭐
```{r}
#| label: summary
library(dplyr)
str(data_railroad)
summary(data_railroad$total_employees)
```
This data set records the total number of railroad employees in each county, by state, for 2012 (the year given in the file name). It has 2,930 rows and three variables: `state`, `county`, and `total_employees`.
- `state` (chr): A character variable holding the two-letter postal abbreviation. It has no missing values and takes 53 distinct values: the 50 states, the District of Columbia (DC), and the military postal codes AE and AP.
- `county` (chr): A character variable holding the county name. It has no missing values, and most states contain many counties, although AE, AP, and DC each have a single entry.
- `total_employees` (num): A numeric variable (read in as a double, not an integer) giving the number of railroad employees in the county. Values range from 1 to 8207, with a median of 21 and a mean of 87.18. The mean is about four times the median, and the standard deviation (about 283.64) is more than three times the mean, so the distribution is strongly right-skewed: most counties have few employees and a handful have very many.
The data was likely compiled from employment records or surveys covering each county for 2012. Each row is one county within one state, so a case is a state-county pair.
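The statements above about missing values and the number of state codes can be checked directly. The sketch below is only an illustration: `railroad_sample` is a stand-in built from the first four rows shown by `str()`, and the same pipeline would be applied to `data_railroad` in practice.

```r
library(dplyr)

# Stand-in for data_railroad: the first four rows shown by str() above
railroad_sample <- data.frame(
  state = c("AE", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU"),
  total_employees = c(2, 7, 2, 3),
  stringsAsFactors = FALSE
)

# Number of missing values in every column
missing_per_column <- railroad_sample %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Number of distinct state codes, and how many counties fall under each code
n_state_codes <- n_distinct(railroad_sample$state)
counties_per_state <- railroad_sample %>%
  count(state, name = "n_counties")

print(missing_per_column)
print(n_state_codes)
print(counties_per_state)
```

On the full data, a column of zeros in `missing_per_column` would confirm that no values are missing.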
## Provide Grouped Summary Statistics
Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.
```{r}
data_grouped_by_state <- data_railroad %>%
  select(state, total_employees) %>%
  group_by(state) %>%
  summarize(
    total_employees_by_state = sum(total_employees, na.rm = TRUE),
    mean_number_of_employee_per_county = round(mean(total_employees, na.rm = TRUE), 2),
    median_number_of_employee_per_county = round(median(total_employees, na.rm = TRUE), 2),
    sd_employees_per_county = round(sd(total_employees, na.rm = TRUE), 2),
    IQR_employees_per_county = round(IQR(total_employees, na.rm = TRUE), 2),
    min = min(total_employees, na.rm = TRUE),
    max = max(total_employees, na.rm = TRUE)
  )
# convert to standard data frame
data_grouped_by_state_df <- as.data.frame(data_grouped_by_state)
# print full data frame
print(data_grouped_by_state_df)
```
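One way to read a state-level table like this is to compare each state's mean with its median: the larger the ratio, the more a few large counties pull the mean upward. The sketch below is a rough illustration only. It uses three rows (AL, CA, MA) copied from the printed summary, and the column names `mean_per_county`, `median_per_county`, and `mean_to_median` are my own shorthand, not names from the analysis.

```r
library(dplyr)

# Three rows copied from the printed state summary (AL, CA, MA)
state_summary_sample <- data.frame(
  state = c("AL", "CA", "MA"),
  mean_per_county = c(63.54, 238.85, 281.58),
  median_per_county = c(26, 61, 271),
  stringsAsFactors = FALSE
)

# Ratio of mean to median as a simple indicator of right skew,
# sorted so the most skewed state comes first
skew_ranking <- state_summary_sample %>%
  mutate(mean_to_median = round(mean_per_county / median_per_county, 2)) %>%
  arrange(desc(mean_to_median))

print(skew_ranking)
```

For these three rows California ranks first and Massachusetts last, with a ratio close to 1, which suggests that Massachusetts counties are spread far more evenly around their mean.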
```{r}
# Reuse the tibble read in above instead of reading the file a second time;
# dplyr is already attached by library(tidyverse) in the setup chunk
data <- data_railroad
# Calculate the mean
mean_total_employees <- mean(data$total_employees, na.rm = TRUE)
print(paste("The mean of total employees is", mean_total_employees))
# Calculate the median
median_total_employees <- median(data$total_employees, na.rm = TRUE)
print(paste("The median of total employees is", median_total_employees))
# Define a function to calculate the mode
# (if several values tie, the one that appears first in v is returned)
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
# Calculate the mode
mode_total_employees <- getmode(data$total_employees)
print(paste("The mode of total employees is", mode_total_employees))
```
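Because `which.max()` returns the first maximum, `getmode()` reports only one value even when several values are equally frequent. A possible alternative, sketched here on a made-up vector rather than the railroad data, counts every value with `count()` and keeps all ties. The names `employees`, `mode_table`, and `n_counties` are hypothetical.

```r
library(dplyr)

# Hypothetical toy vector in which 1 and 5 both occur twice
employees <- c(1, 5, 1, 5, 9)

# Count how often each value occurs and keep every value with the top count
mode_table <- data.frame(total_employees = employees) %>%
  count(total_employees, name = "n_counties") %>%
  filter(n_counties == max(n_counties))

print(mode_table)
```

The same pipeline could be applied to `data_railroad`, where it would also show how many counties share the modal value.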
```{r}
library(dplyr)
# Calculate the standard deviation
sd_total_employees <- sd(data$total_employees, na.rm = TRUE)
print(paste("The standard deviation of total employees is", sd_total_employees))
# Calculate the minimum
min_total_employees <- min(data$total_employees, na.rm = TRUE)
print(paste("The minimum of total employees is", min_total_employees))
# Calculate the maximum
max_total_employees <- max(data$total_employees, na.rm = TRUE)
print(paste("The maximum of total employees is", max_total_employees))
# Calculate the 1st quartile
first_quartile <- quantile(data$total_employees, 0.25, na.rm = TRUE)
print(paste("The first quartile of total employees is", first_quartile))
# Calculate the 3rd quartile
third_quartile <- quantile(data$total_employees, 0.75, na.rm = TRUE)
print(paste("The third quartile of total employees is", third_quartile))
```
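The minimum, quartiles, and maximum can also be requested in a single `quantile()` call instead of one call per statistic. This is a small sketch, not a replacement for the chunk above. The vector `employees` holds only the first ten values shown by `str()`, so its results differ from those of the full data.

```r
# First ten total_employees values shown by str() above
employees <- c(2, 7, 2, 3, 2, 1, 88, 102, 143, 1)

# Minimum, quartiles, and maximum in one call
five_number <- quantile(employees, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)

# Interquartile range, computed from the same quartiles
iqr_value <- five_number[["75%"]] - five_number[["25%"]]

print(five_number)
print(iqr_value)
```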
### Explain and Interpret
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
```{r}
# Filter for Alabama (AL) and California (CA)
alabama_data <- data_railroad[data_railroad$state == "AL",]
california_data <- data_railroad[data_railroad$state == "CA",]
# Display basic statistics for Alabama
print("The summary of Alabama")
summary(alabama_data)
# Display basic statistics for California
print("The summary of California")
summary(california_data)
```
Comparing Alabama and California shows how differently railroad employees are distributed across counties in two states. I chose these states because they differ in geographic size, population density, and region.
Alabama, known as "The Heart of Dixie," has 4257 railroad employees across the 67 counties in the data, an average of about 63.54 per county. That average is modest, which is plausibly related to Alabama's smaller population, although the data set contains no population figures to confirm this. The median is only 26, less than half the mean, which indicates a right-skewed distribution: most counties have few railroad employees, while a few have far more (the maximum is 990). The standard deviation of 130.17, roughly twice the mean, points to the same wide spread.
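Whether Alabama really has outliers can be checked with the common 1.5 × IQR rule, which is a convention I am assuming here and not something the challenge prescribes. The sketch uses the Alabama quartiles and maximum from the `summary()` output, and the variable names are hypothetical.

```r
# Alabama quartiles and maximum copied from the summary() output
al_q1 <- 10.5
al_q3 <- 57.5
al_max <- 990

# Interquartile range and the upper fence of the 1.5 * IQR rule
al_iqr <- al_q3 - al_q1
al_upper_fence <- al_q3 + 1.5 * al_iqr

# TRUE if at least one county lies above the fence
has_high_outlier <- al_max > al_upper_fence

print(c(iqr = al_iqr, upper_fence = al_upper_fence))
print(has_high_outlier)
```

The upper fence works out to 128 employees, far below the maximum of 990, so under this rule Alabama has at least one unusually large county.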
On the other side of the country, California, the "Golden State," has a total of 13137 railroad employees across 55 counties, roughly three times Alabama's total. This is not surprising, since California is the most populous U.S. state. The mean number of employees per county is approximately 238.85, and the median is 61, so the typical California county in the data has more than twice as many railroad employees as the typical Alabama county. Both states have counties with a single employee, however, so the smallest counties look alike in the two states. The standard deviation of 549.47 and the IQR of 188 (compared with 130.17 and 47 for Alabama) show that county-level employment varies much more in California.
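The two-state comparison could also be produced in one dplyr pipeline, which puts the statistics side by side. The sketch below runs on made-up numbers: `toy_railroad` and its counts are hypothetical and only show the shape of the pipeline, which would be applied to `data_railroad` in practice. The coefficient of variation (`cv`, standard deviation divided by mean) is an extra statistic I added to compare spread relative to the mean.

```r
library(dplyr)

# Hypothetical toy data; the counts are made up for illustration
toy_railroad <- data.frame(
  state = c("AL", "AL", "AL", "CA", "CA", "CA", "TX"),
  total_employees = c(10, 20, 30, 10, 50, 300, 5),
  stringsAsFactors = FALSE
)

# Keep the two states of interest and summarise each one
state_comparison <- toy_railroad %>%
  filter(state %in% c("AL", "CA")) %>%
  group_by(state) %>%
  summarise(
    n_counties = n(),
    mean_employees = mean(total_employees),
    median_employees = median(total_employees),
    sd_employees = sd(total_employees),
    cv = sd_employees / mean_employees
  )

print(state_comparison)
```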
The comparison between these two states brings to light how geographical and demographic factors may impact the distribution and concentration of employment in the railroad industry. Understanding these patterns could be a stepping stone for further research into factors influencing the job market and economic conditions in these states.