---
title: "PoChunYang_Challenge 2"
author: "PoChun Yang"
description: "StateCounty2012.xls data wrangling"
date: "02/27/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- railroads
- PoChunYang
- tidyverse
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(readxl)
library(summarytools)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.
- railroad\*.csv or StateCounty2012.xls ⭐
- FAOstat\*.csv or birds.csv ⭐⭐⭐
- hotel_bookings.csv ⭐⭐⭐⭐
```{r}
statecounty <- read_xls("_data/StateCounty2012.xls", skip = 3)
dim(statecounty)
colnames(statecounty)
```
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
## Describe the data
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
When I ran dfSummary() on StateCounty2012.xls, the summary showed some strange values. For example, the maximum of TOTAL (railroad employment) is 255,432; checking the file shows that this row is the grand total, not a single county. I therefore clean the data with a few commands in the next section.
```{r}
#| label: summary
dfSummary(statecounty)
```
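One way to confirm that the suspicious maximum is a grand total: its TOTAL should equal the sum of every genuine county row. A minimal base-R sketch on an invented toy frame (the values are illustrative, not taken from the file):

```r
# Synthetic stand-in for statecounty: county rows plus a grand-total row,
# mimicking StateCounty2012.xls, where total rows have no COUNTY entry.
toy <- data.frame(
  STATE  = c("AK", "AK", "AL", NA),
  COUNTY = c("ANCHORAGE", "BETHEL", "AUTAUGA", NA),
  TOTAL  = c(88, 2, 102, 192)
)

# The maximum equals the sum of the genuine county rows, confirming it is
# a grand total rather than a single county's employment.
county_sum <- sum(toy$TOTAL[!is.na(toy$COUNTY)])
max(toy$TOTAL) == county_sum
```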
## Provide Grouped Summary Statistics
Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.
First, I used select() to keep only the STATE, COUNTY, and TOTAL columns (the unnamed `...2` and `...4` columns are empty). Then I used na.omit() to drop rows with missing values, which removes the state-total rows. Next, I used group_by() and summarise() to compute per-state statistics. I am interested in New York's railroad information, so I filtered for it. As the output below shows, New York has 61 counties, and the maximum county railroad employment there is 3,685.
California has 55 counties; its minimum, maximum, mean, and median county railroad employment are 1, 2,888, 238.9, and 61, respectively.
```{r}
cleancolumns <- select(statecounty, "STATE", "COUNTY", "TOTAL")
fix_data <- na.omit(cleancolumns)  # drops rows with NAs, including the total rows (no COUNTY entry)

# Base R's mode() returns the storage type ("numeric"), not the most frequent
# value, so we define a statistical-mode helper.
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

fix_data %>%
  group_by(STATE) %>%
  summarise(total = sum(TOTAL), meantotal = mean(TOTAL),
            mediantotal = median(TOTAL), modetotal = stat_mode(TOTAL),
            deviationtotal = sd(TOTAL), maxtotal = max(TOTAL),
            mintotal = min(TOTAL))

NY <- filter(fix_data, STATE == "NY")
summary(NY)
CA <- filter(fix_data, STATE == "CA")
summary(CA)
```
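Beyond the fixed output of summary(), base R's quantile() returns any requested cut points, which is handy for the IQR-style dispersion measures the challenge asks for. A minimal sketch on illustrative values (hypothetical, not the real county totals):

```r
# Illustrative employment totals (hypothetical, not taken from the data)
totals <- c(5, 27, 71, 196, 3685)

# Quartiles match what summary() would report; the IQR is Q3 - Q1
q   <- quantile(totals, probs = c(0.25, 0.5, 0.75))
iqr <- unname(q[3] - q[1])
```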
### Explain and Interpret
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.