Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Abhinav Reddy Yadatha
May 3, 2023
Today’s challenge is to
Read in one (or more) of the following data sets, available in the posts/_data
folder, using the correct R package and command.
# A tibble: 2,930 × 3
state counties total_employees
<chr> <chr> <dbl>
1 AE APO 2
2 AK ANCHORAGE 7
3 AK FAIRBANKS NORTH STAR 2
4 AK JUNEAU 3
5 AK MATANUSKA-SUSITNA 2
6 AK SITKA 1
7 AK SKAGWAY MUNICIPALITY 88
8 AL AUTAUGA 102
9 AL BALDWIN 143
10 AL BARBOUR 1
# ℹ 2,920 more rows
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
Description : The dataset involved gathering information on the total number of employees in each county by taking trains from specific states. The data revealed that the average number of employees per county was 87.2, with a standard deviation of 283.6. The lowest number of employees was 1, the middle value was 21, and the highest was 8207. Here’s how I obtained and analyzed this data
# A tibble: 6 × 3
state counties total_employees
<chr> <chr> <dbl>
1 AE APO 2
2 AK ANCHORAGE 7
3 AK FAIRBANKS NORTH STAR 2
4 AK JUNEAU 3
5 AK MATANUSKA-SUSITNA 2
6 AK SITKA 1
Data Frame Summary
railroads
Dimensions: 2930 x 3
Duplicates: 0
-----------------------------------------------------------------------------------------------------------------
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
---- ----------------- -------------------------- --------------------- -------------------- ---------- ---------
1 state 1. TX 221 ( 7.5%) I 2930 0
[character] 2. GA 152 ( 5.2%) I (100.0%) (0.0%)
3. KY 119 ( 4.1%)
4. MO 115 ( 3.9%)
5. IL 103 ( 3.5%)
6. IA 99 ( 3.4%)
7. KS 95 ( 3.2%)
8. NC 94 ( 3.2%)
9. IN 92 ( 3.1%)
10. VA 92 ( 3.1%)
[ 43 others ] 1748 (59.7%) IIIIIIIIIII
2 counties 1. WASHINGTON 31 ( 1.1%) 2930 0
[character] 2. JEFFERSON 26 ( 0.9%) (100.0%) (0.0%)
3. FRANKLIN 24 ( 0.8%)
4. LINCOLN 24 ( 0.8%)
5. JACKSON 22 ( 0.8%)
6. MADISON 19 ( 0.6%)
7. MONTGOMERY 18 ( 0.6%)
8. CLAY 17 ( 0.6%)
9. MARION 17 ( 0.6%)
10. MONROE 17 ( 0.6%)
[ 1699 others ] 2715 (92.7%) IIIIIIIIIIIIIIIIII
3 total_employees Mean (sd) : 87.2 (283.6) 404 distinct values : 2930 0
[numeric] min < med < max: : (100.0%) (0.0%)
1 < 21 < 8207 :
IQR (CV) : 58 (3.3) :
:
-----------------------------------------------------------------------------------------------------------------
It can be observed that there are 2930 rows and 3 columns
Printing the column names of the dataset :
Conduct some exploratory data analysis, using dplyr commands such as group_by()
, select()
, filter()
, and summarise()
. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
# A tibble: 1 × 3
state counties total_employees
<dbl> <dbl> <dbl>
1 NA NA 58
# A tibble: 1 × 3
state counties total_employees
<dbl> <dbl> <dbl>
1 NA NA 87.2
# A tibble: 1 × 3
state counties total_employees
<dbl> <dbl> <dbl>
1 NA NA 21
Total count of employees for each state:
# A tibble: 53 × 2
state totalEmployees
<chr> <dbl>
1 AE 2
2 AK 103
3 AL 4257
4 AP 1
5 AR 3871
6 AZ 3153
7 CA 13137
8 CO 3650
9 CT 2592
10 DC 279
# ℹ 43 more rows
# A tibble: 2,930 × 2
state total_employees
<chr> <dbl>
1 AE 2
2 AK 7
3 AK 2
4 AK 3
5 AK 2
6 AK 1
7 AK 88
8 AL 102
9 AL 143
10 AL 1
# ℹ 2,920 more rows
Median, Mean and standard deviation of employee counts in every state:
# A tibble: 53 × 4
state meanEmployees medianEmployees standardDeviation
<chr> <dbl> <dbl> <dbl>
1 AE 2 2 NA
2 AK 17.2 2.5 34.8
3 AL 63.5 26 130.
4 AP 1 1 NA
5 AR 53.8 16.5 131.
6 AZ 210. 94 228.
7 CA 239. 61 549.
8 CO 64.0 10 128.
9 CT 324 125 520.
10 DC 279 279 NA
# ℹ 43 more rows
State wise employees count arranged and displayed in descending order :
# A tibble: 53 × 2
state Sum
<chr> <dbl>
1 TX 19839
2 IL 19131
3 NY 17050
4 NE 13176
5 CA 13137
6 PA 12769
7 OH 9056
8 GA 8605
9 IN 8537
10 MO 8419
# ℹ 43 more rows
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
We have categorized the total number of employees according to the states, which means that we have calculated the average, middle value, and sum of all the counties for various states.
There are some states that have only one county, which makes the standard deviation undefined or null.
After sorting the data based on the overall number of employees, we observed that TX and IL have the greatest number of employees while AP has only one employee.
---
title: "Challenge 2 Instructions"
author: "Abhinav Reddy Yadatha"
description: "Data wrangling: using group() and summarise()"
date: "05/03/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- railroads
- Abhinav Reddy Yadatha
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.
- railroad\*.csv or StateCounty2012.xls ⭐
- FAOstat\*.csv or birds.csv ⭐⭐⭐
- hotel_bookings.csv ⭐⭐⭐⭐
```{r}
library(tidyverse)
railroads <- read_csv('_data/railroad_2012_clean_county.csv', show_col_types = FALSE)
railroads <- rename(railroads, counties = county)
railroads
```
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
## Describe the data
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
Description : The dataset involved gathering information on the total number of employees in each county by taking trains from specific states. The data revealed that the average number of employees per county was 87.2, with a standard deviation of 283.6. The lowest number of employees was 1, the middle value was 21, and the highest was 8207. Here's how I obtained and analyzed this data
```{r}
# Printing the first few rows of the dataset.
head(railroads)
```
```{r}
#| label: summary
library(summarytools)
dfSummary(railroads)
```
## Check dimensions of the dataset
```{r}
dim(railroads)
```
It can be observed that there are 2930 rows and 3 columns
Printing the column names of the dataset :
```{r}
colnames(railroads)
```
## Provide Grouped Summary Statistics
Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
```{r}
#IQR for railroads
railroads %>%
summarize_all(IQR, na.rm = TRUE)
#Mean for railroads
railroads %>%
summarize_all(mean, na.rm = TRUE)
#Median for railroads
railroads %>%
summarize_all(median, na.rm = TRUE)
```
Total count of employees for each state:
```{r}
state_wise_ct = select(railroads, state, total_employees)
state_wise_ct %>%
group_by(state) %>%
summarize(totalEmployees=sum(total_employees))
state_wise_ct
```
Median, Mean and standard deviation of employee counts in every state:
```{r}
state_wise_ct = select(railroads, state, total_employees)
state_wise_ct %>%
group_by(state) %>%
summarize(meanEmployees=mean(total_employees),medianEmployees=median(total_employees),standardDeviation = sd(total_employees))
```
State wise employees count arranged and displayed in descending order :
```{r}
state_wise_grouped_cts <- state_wise_ct %>%
group_by(state) %>%
summarize(Sum = sum(total_employees))
sorted_counts <- state_wise_grouped_cts %>%
arrange(desc(Sum))
sorted_counts
```
### Explain and Interpret
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
We have categorized the total number of employees according to the states, which means that we have calculated the average, middle value, and sum of all the counties for various states.
There are some states that have only one county, which makes the standard deviation undefined or null.
After sorting the data based on the overall number of employees, we observed that TX and IL have the greatest number of employees while AP has only one employee.