Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Guanhua Tan
September 19, 2022
Today’s challenge is to
Read in one (or more) of the following data sets, available in the posts/_data
folder, using the correct R package and command.
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
Conduct some exploratory data analysis, using dplyr commands such as group_by()
, select()
, filter()
, and summarise()
. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
# A tibble: 2,930 × 3
state county total_employees
<chr> <chr> <dbl>
1 AE APO 2
2 AK ANCHORAGE 7
3 AK FAIRBANKS NORTH STAR 2
4 AK JUNEAU 3
5 AK MATANUSKA-SUSITNA 2
6 AK SITKA 1
7 AK SKAGWAY MUNICIPALITY 88
8 AL AUTAUGA 102
9 AL BALDWIN 143
10 AL BARBOUR 1
# … with 2,920 more rows
state county total_employees
Length:2930 Length:2930 Min. : 1.00
Class :character Class :character 1st Qu.: 7.00
Mode :character Mode :character Median : 21.00
Mean : 87.18
3rd Qu.: 65.00
Max. :8207.00
The dataset of railroad across the United States includes three columns–state, county and total employees in each county – and 2930 counties.Each county at least maintains one employees while the maximum number is 8207. The average of total employees each county hires is 81.18. However, the median of it is 27.00, which reflects that most counties maintain a small station. The numbers of 1st Qu. and 3rd Qu. further confirms the finding.
# A tibble: 20 × 3
state county total_employees
<chr> <chr> <dbl>
1 IL COOK 8207
2 TX TARRANT 4235
3 NE DOUGLAS 3797
4 NY SUFFOLK 3685
5 VA INDEPENDENT CITY 3249
6 FL DUVAL 3073
7 CA SAN BERNARDINO 2888
8 CA LOS ANGELES 2545
9 TX HARRIS 2535
10 NE LINCOLN 2289
11 NY NASSAU 2076
12 MO JACKSON 2055
13 IN LAKE 1999
14 IL WILL 1784
15 PA PHILADELPHIA 1649
16 NE LANCASTER 1619
17 CA RIVERSIDE 1567
18 CT NEW HAVEN 1561
19 NY QUEENS 1470
20 KS JOHNSON 1286
Cook county, IL, hires the largest size of employees and reaches to 8207 while Tarrant, TX, is second to it with 4235 employees. The city of Chicago, the third largest city of the United State in terms of population, is located in Cook county, which interprets why Cook hires a large group of employees to manage the operation of railroad. I also propose that residents in IL may be more willing to take train for commuting. Tarrant county hires a large number of employees because it serves the Dallas-Fort Worth-Arlington metropolitan area. Interestingly, New York City’s railroad is not in the top ten. I don’t think it means the city relies less on railroad. In fact, New York City is heavy reliance on railroad for public transportation. But several different authorities run railroads connecting between the city and the suburban area such as New Jersey and Connecticut. I have noted that New Haven, CT hires 1561 employees while Suffolk in Long Island, NY hires 3685. New Haven was the terminal of the Metro-North Railroad connecting between New York City and CT. Long Island Rail Road connects with New York City. (source: Wikipedia)
# A tibble: 10 × 4
state total total_national ratio
<chr> <dbl> <dbl> <dbl>
1 TX 19839 255432 0.0777
2 IL 19131 255432 0.0749
3 NY 17050 255432 0.0667
4 NE 13176 255432 0.0516
5 CA 13137 255432 0.0514
6 PA 12769 255432 0.0500
7 OH 9056 255432 0.0355
8 GA 8605 255432 0.0337
9 IN 8537 255432 0.0334
10 MO 8419 255432 0.0330
Although Cook county hires the largest size of employees, Illinois is not the NO.1 in terms of total employees. Texas outnumbers it by several hundreds employees.However, no state reaches 10 % of national total employees. Both Texas and Illinois are over 7% of national total employees. New York is even less 7% of them.
# A tibble: 1 × 3
temployee_mean templyee_sd tempyee_median
<dbl> <dbl> <dbl>
1 4819. 4782. 3379
The mean, median and standard deviation of states reflect that total_employees of some big states are overwhelming and some small states only maintain a tiny number of total employees.
# A tibble: 53 × 4
state temployee_mean templyee_sd tempyee_median
<chr> <dbl> <dbl> <dbl>
1 AE 2 NA 2
2 AK 17.2 34.8 2.5
3 AL 63.5 130. 26
4 AP 1 NA 1
5 AR 53.8 131. 16.5
6 AZ 210. 228. 94
7 CA 239. 549. 61
8 CO 64.0 128. 10
9 CT 324 520. 125
10 DC 279 NA 279
# … with 43 more rows
The big differences between mean, median and sd of each state reflect population distribution in he United State.
# A tibble: 34 × 5
state county total_employees total_TX ratio_TX
<chr> <chr> <dbl> <dbl> <dbl>
1 TX TARRANT 4235 19839 0.213
2 TX HARRIS 2535 19839 0.128
3 TX BEXAR 950 19839 0.0479
4 TX POTTER 883 19839 0.0445
5 TX EL PASO 863 19839 0.0435
6 TX MONTGOMERY 474 19839 0.0239
7 TX JOHNSON 429 19839 0.0216
8 TX BELL 413 19839 0.0208
9 TX DALLAS 406 19839 0.0205
10 TX DENTON 394 19839 0.0199
# … with 24 more rows
# A tibble: 1 × 2
total_mean total_madian
<dbl> <dbl>
1 89.8 17
Cast study 1 TX Taking TX as case study, I try to looking the distribution in a state. The combination of TARRANT and HARRIS reaches 34 % of total employees in TX, which reflects high population density in these two areas.
# A tibble: 25 × 5
state county total_employees total_IL ratio_IL
<chr> <chr> <dbl> <dbl> <dbl>
1 IL COOK 8207 19131 0.429
2 IL WILL 1784 19131 0.0933
3 IL KNOX 885 19131 0.0463
4 IL DUPAGE 837 19131 0.0438
5 IL KANE 577 19131 0.0302
6 IL ST CLAIR 495 19131 0.0259
7 IL MADISON 427 19131 0.0223
8 IL MACON 425 19131 0.0222
9 IL LAKE 340 19131 0.0178
10 IL KANKAKEE 289 19131 0.0151
# … with 15 more rows
Case study of IL Cook county hire 42% employees of railroad in IL demonstrates the overwhelming population density in the city of Chicago as well as reflects the population distribution in IL.
---
title: "Challenge 2 Guanua Tan"
author: "Guanhua Tan"
desription: "Data wrangling: using group() and summarise()"
date: "09/19/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- railroads
- faostat
- hotel_bookings
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.
- railroad\*.csv or StateCounty2012.xls ⭐
- FAOstat\*.csv or birds.csv ⭐⭐⭐
- hotel_bookings.csv ⭐⭐⭐⭐
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
## Describe the data
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
## Provide Grouped Summary Statistics
Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
### Explain and Interpret
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
```{r}
library(dplyr)
railroad<-read_csv("_data/railroad_2012_clean_county.csv")
railroad
```
```{r}
#| label: summary
railroad %>%
summary(railroad)
```
The dataset of railroad across the United States includes three columns--state, county and total employees in each county -- and 2930 counties.Each county at least maintains one employees while the maximum number is 8207. The average of total employees each county hires is 81.18. However, the median of it is 27.00, which reflects that most counties maintain a small station. The numbers of 1st Qu. and 3rd Qu. further confirms the finding.
```{r}
# top-ranking by county
max_railroad = railroad %>%
arrange(desc(`total_employees`)) %>%
slice(1:20)
max_railroad
```
Cook county, IL, hires the largest size of employees and reaches to 8207 while Tarrant, TX, is second to it with 4235 employees. The city of Chicago, the third largest city of the United State in terms of population, is located in Cook county, which interprets why Cook hires a large group of employees to manage the operation of railroad. I also propose that residents in IL may be more willing to take train for commuting. Tarrant county hires a large number of employees because it serves the Dallas-Fort Worth-Arlington metropolitan area. Interestingly, New York City's railroad is not in the top ten. I don't think it means the city relies less on railroad. In fact, New York City is heavy reliance on railroad for public transportation. But several different authorities run railroads connecting between the city and the suburban area such as New Jersey and Connecticut. I have noted that New Haven, CT hires 1561 employees while Suffolk in Long Island, NY hires 3685. New Haven was the terminal of the Metro-North Railroad connecting between New York City and CT. Long Island Rail Road connects with New York City. (source: Wikipedia)
```{r}
# each state's percentage in the US
railroad %>% group_by(state) %>% summarise(total=sum(total_employees)) %>% arrange(desc(total)) %>% mutate(total_national=sum(total) , ratio=total/total_national) %>% slice(1:10)
```
Although Cook county hires the largest size of employees, Illinois is not the NO.1 in terms of total employees. Texas outnumbers it by several hundreds employees.However, no state reaches 10 % of national total employees. Both Texas and Illinois are over 7% of national total employees. New York is even less 7% of them.
```{r}
# mean, median, and states of states
railroad %>% group_by(state) %>% summarise(total=sum(total_employees)) %>% summarise(temployee_mean=mean(total),
templyee_sd=sd(total, na.rm=TRUE),
tempyee_median=median(total, na.rm=TRUE))
```
The mean, median and standard deviation of states reflect that total_employees of some big states are overwhelming and some small states only maintain a tiny number of total employees.
```{r}
# state difference
railroad %>% group_by(state) %>%
summarise(temployee_mean=mean(total_employees),
templyee_sd=sd(total_employees, na.rm=TRUE),
tempyee_median=median(total_employees, na.rm=TRUE))
```
The big differences between mean, median and sd of each state reflect population distribution in he United State.
```{r}
#TX as a case study
railroad %>% filter(state=='TX') %>% arrange(desc(total_employees)) %>% mutate(total_TX=sum(total_employees), ratio_TX=total_employees/total_TX) %>% filter(total_employees >100)
```
```{r}
# TX as a case study
railroad %>% filter(state=='TX') %>% summarise(total_mean=mean(total_employees), total_madian=median(total_employees))
```
Cast study 1 TX
Taking TX as case study, I try to looking the distribution in a state. The combination of TARRANT and HARRIS reaches 34 % of total employees in TX, which reflects high population density in these two areas.
```{r}
# case study IL
railroad %>% filter(state=='IL') %>% arrange(desc(total_employees)) %>% mutate(total_IL=sum(total_employees), ratio_IL=total_employees/total_IL) %>% filter(total_employees >100)
```
Case study of IL
Cook county hire 42% employees of railroad in IL demonstrates the overwhelming population density in the city of Chicago as well as reflects the population distribution in IL.