#
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
Aditya Salveru
April 22, 2023
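Today’s challenge is to read in a data set and describe it using both words and supporting information (e.g., tables), then provide and interpret summary statistics for interesting groups within the data.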
Read in one (or more) of the following data sets, available in the posts/_data
folder, using the correct R package and command.
We read in the railroad data:
state county total_employees
1 AE APO 2
2 AK ANCHORAGE 7
3 AK FAIRBANKS NORTH STAR 2
4 AK JUNEAU 3
5 AK MATANUSKA-SUSITNA 2
6 AK SITKA 1
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
[1] 2930 3
'data.frame': 2930 obs. of 3 variables:
$ state : chr "AE" "AK" "AK" "AK" ...
$ county : chr "APO" "ANCHORAGE" "FAIRBANKS NORTH STAR" "JUNEAU" ...
$ total_employees: int 2 7 2 3 2 1 88 102 143 1 ...
The dataset contains 2930 rows and three columns: ‘state’, ‘county’, and ‘total_employees’. Each row is one state/county pair with its total employee count.
Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.
Total employees per state:
# A tibble: 53 × 2
state totalEmployees
<chr> <int>
1 AE 2
2 AK 103
3 AL 4257
4 AP 1
5 AR 3871
6 AZ 3153
7 CA 13137
8 CO 3650
9 CT 2592
10 DC 279
# ℹ 43 more rows
Mean, median, and standard deviation of employees per county in each state:
# A tibble: 53 × 4
state meanEmployees medianEmployees standardDeviation
<chr> <dbl> <dbl> <dbl>
1 AE 2 2 NA
2 AK 17.2 2.5 34.8
3 AL 63.5 26 130.
4 AP 1 1 NA
5 AR 53.8 16.5 131.
6 AZ 210. 94 228.
7 CA 239. 61 549.
8 CO 64.0 10 128.
9 CT 324 125 520.
10 DC 279 279 NA
# ℹ 43 more rows
Total employees by state, in descending order:
# A tibble: 53 × 2
state Sum
<chr> <int>
1 TX 19839
2 IL 19131
3 NY 17050
4 NE 13176
5 CA 13137
6 PA 12769
7 OH 9056
8 GA 8605
9 IN 8537
10 MO 8419
# ℹ 43 more rows
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
We grouped total_employees by state and computed the total, mean, median, and standard deviation across the counties of each state.
A few state codes have only one county (AE, AP, and DC in the rows shown), so their standard deviation is NA: it is undefined for a single observation, not 0. Sorting by total employees shows that TX and IL have the most employees, whereas AP has only 1.
---
title: "Challenge 2 Instructions"
author: "Aditya Salveru"
description: "Data wrangling: using group_by() and summarise()"
date: "04/22/2023"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
  - railroads
  - faostat
  - hotel_bookings
---
```{r}
#| label: setup
#| warning: false
#| message: false
#
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.
- railroad\*.csv or StateCounty2012.xls ⭐
- FAOstat\*.csv or birds.csv ⭐⭐⭐
- hotel_bookings.csv ⭐⭐⭐⭐
We read in the railroad data from `railroad_2012_clean_county.csv`:
```{r}
railroad <- read.csv("_data/railroad_2012_clean_county.csv")
head(railroad)
```
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
## Describe the data
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
```{r}
#| label: summary
dim(railroad)
str(railroad)
```
The dataset contains 2930 rows and three columns: 'state', 'county', and 'total_employees'. Each row is one state/county pair with its total employee count.
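
A few extra checks can support this description, such as counting distinct state codes and looking for missing values. The chunk below is a minimal sketch, not part of the original analysis: `railroad_sample` is a hypothetical stand-in built from the six rows printed by `head(railroad)`, so that the chunk runs without the CSV file. The same calls should work on the full `railroad` data frame.

```{r}
library(dplyr)

# Hypothetical stand-in: the six rows shown by head(railroad).
railroad_sample <- data.frame(
  state = c("AE", "AK", "AK", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR",
             "JUNEAU", "MATANUSKA-SUSITNA", "SITKA"),
  total_employees = c(2L, 7L, 2L, 3L, 2L, 1L),
  stringsAsFactors = FALSE
)

# Number of distinct state codes in the sample
n_states <- n_distinct(railroad_sample$state)

# Number of distinct state/county pairs (should equal the row count
# if every row is a unique county within a state)
n_pairs <- nrow(distinct(railroad_sample, state, county))

# Count of missing values in each column
missing_per_column <- colSums(is.na(railroad_sample))

n_states
n_pairs
missing_per_column
```

If `n_pairs` equals `nrow(railroad)` on the full data, each row can be read as one unique county within a state.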
## Provide Grouped Summary Statistics
Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.
Total employees per state:
```{r}
state_wise <- select(railroad, state, total_employees)
state_wise %>%
  group_by(state) %>%
  summarize(totalEmployees = sum(total_employees))
```
Mean, median, and standard deviation of employees per county in each state:
```{r}
state_wise <- select(railroad, state, total_employees)
state_wise %>%
  group_by(state) %>%
  summarize(
    meanEmployees = mean(total_employees),
    medianEmployees = median(total_employees),
    standardDeviation = sd(total_employees)
  )
```
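
The prompt also asks for the mode and for min/max/quantile, which the summary above does not cover. The chunk below is a hedged sketch of one way to add them. It makes two assumptions: `railroad_sample` is a hypothetical stand-in holding only the six rows printed by `head(railroad)`, and `stat_mode()` is a hypothetical helper, since base R has no function for the statistical mode.

```{r}
library(dplyr)

# Hypothetical stand-in: the six rows shown by head(railroad).
# AK is incomplete here, so these numbers differ from the full data.
railroad_sample <- data.frame(
  state = c("AE", "AK", "AK", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR",
             "JUNEAU", "MATANUSKA-SUSITNA", "SITKA"),
  total_employees = c(2L, 7L, 2L, 3L, 2L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent value in x.
# Ties resolve to the smallest value, because table() sorts its
# levels and which.max() returns the first maximum.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

dispersion_by_state <- railroad_sample %>%
  group_by(state) %>%
  summarize(
    counties = n(),
    modeEmployees = stat_mode(total_employees),
    minEmployees = min(total_employees),
    q25 = quantile(total_employees, 0.25, names = FALSE),
    q75 = quantile(total_employees, 0.75, names = FALSE),
    maxEmployees = max(total_employees)
  )

dispersion_by_state
```

Replacing `railroad_sample` with `railroad` should give the same columns for every state code in the full data.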
Total employees by state, in descending order:
```{r}
# Total employees per state
state_wise_grouped <- state_wise %>%
  group_by(state) %>%
  summarize(Sum = sum(total_employees))
# Sort states from most to fewest employees
sorted <- state_wise_grouped %>%
  arrange(desc(Sum))
sorted
```
### Explain and Interpret
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
We grouped total_employees by state and computed the total, mean, median, and standard deviation across the counties of each state.
A few state codes have only one county (AE, AP, and DC in the rows shown), so their standard deviation is NA: it is undefined for a single observation, not 0.
Sorting by total employees shows that TX and IL have the most employees, whereas AP has only 1.
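
The NA standard deviations can be checked directly: `sd()` returns NA for a single value, so any state code with one county gets NA. The chunk below is a minimal sketch under the assumption that `railroad_sample`, a hypothetical stand-in holding the six rows printed by `head(railroad)`, is representative enough to show the pattern.

```{r}
library(dplyr)

# Hypothetical stand-in: the six rows shown by head(railroad).
railroad_sample <- data.frame(
  state = c("AE", "AK", "AK", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR",
             "JUNEAU", "MATANUSKA-SUSITNA", "SITKA"),
  total_employees = c(2L, 7L, 2L, 3L, 2L, 1L),
  stringsAsFactors = FALSE
)

# Standard deviation of a single observation is undefined (NA), not 0
single_value_sd <- sd(2)

# Keep only the state codes that have exactly one county
single_county_states <- railroad_sample %>%
  group_by(state) %>%
  summarize(
    counties = n(),
    standardDeviation = sd(total_employees)
  ) %>%
  filter(counties == 1)

single_county_states
```

In the sample only AE qualifies. Run on the full `railroad` data, the same filter would list every single-county state code.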