library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
Abhinav Reddy Yadatha
May 3, 2023
Today’s challenge is to
Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.
# A tibble: 2,930 × 3
   state counties             total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# ℹ 2,920 more rows
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
Description: The dataset records the total number of railroad employees in each county, by state or territory code; judging by the file name, the counts are for 2012. Each row is one state-county pair, and the three variables are state, counties, and total_employees. Across the 2,930 rows, the mean is 87.2 employees per county with a standard deviation of 283.6. The minimum is 1, the median is 21, and the maximum is 8,207, so the distribution is strongly right-skewed. Here’s how I obtained and analyzed the data:
# A tibble: 6 × 3
  state counties             total_employees
  <chr> <chr>                          <dbl>
1 AE    APO                                2
2 AK    ANCHORAGE                          7
3 AK    FAIRBANKS NORTH STAR               2
4 AK    JUNEAU                             3
5 AK    MATANUSKA-SUSITNA                  2
6 AK    SITKA                              1
Data Frame Summary  
railroads  
Dimensions: 2930 x 3  
Duplicates: 0  
-----------------------------------------------------------------------------------------------------------------
No   Variable          Stats / Values             Freqs (% of Valid)    Graph                Valid      Missing  
---- ----------------- -------------------------- --------------------- -------------------- ---------- ---------
1    state             1. TX                       221 ( 7.5%)          I                    2930       0        
     [character]       2. GA                       152 ( 5.2%)          I                    (100.0%)   (0.0%)   
                       3. KY                       119 ( 4.1%)                                                   
                       4. MO                       115 ( 3.9%)                                                   
                       5. IL                       103 ( 3.5%)                                                   
                       6. IA                        99 ( 3.4%)                                                   
                       7. KS                        95 ( 3.2%)                                                   
                       8. NC                        94 ( 3.2%)                                                   
                       9. IN                        92 ( 3.1%)                                                   
                       10. VA                       92 ( 3.1%)                                                   
                       [ 43 others ]              1748 (59.7%)          IIIIIIIIIII                              
2    counties          1. WASHINGTON                31 ( 1.1%)                               2930       0        
     [character]       2. JEFFERSON                 26 ( 0.9%)                               (100.0%)   (0.0%)   
                       3. FRANKLIN                  24 ( 0.8%)                                                   
                       4. LINCOLN                   24 ( 0.8%)                                                   
                       5. JACKSON                   22 ( 0.8%)                                                   
                       6. MADISON                   19 ( 0.6%)                                                   
                       7. MONTGOMERY                18 ( 0.6%)                                                   
                       8. CLAY                      17 ( 0.6%)                                                   
                       9. MARION                    17 ( 0.6%)                                                   
                       10. MONROE                   17 ( 0.6%)                                                   
                       [ 1699 others ]            2715 (92.7%)          IIIIIIIIIIIIIIIIII                       
3    total_employees   Mean (sd) : 87.2 (283.6)   404 distinct values   :                    2930       0        
     [numeric]         min < med < max:                                 :                    (100.0%)   (0.0%)   
                       1 < 21 < 8207                                    :                                        
                       IQR (CV) : 58 (3.3)                              :                                        
                                                                        :                                        
-----------------------------------------------------------------------------------------------------------------
It can be observed that there are 2930 rows and 3 columns.
Printing the column names of the dataset :
Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.
# A tibble: 1 × 3
  state counties total_employees
  <dbl>    <dbl>           <dbl>
1    NA       NA              58
# A tibble: 1 × 3
  state counties total_employees
  <dbl>    <dbl>           <dbl>
1    NA       NA            87.2
# A tibble: 1 × 3
  state counties total_employees
  <dbl>    <dbl>           <dbl>
1    NA       NA              21
Total count of employees for each state:
# A tibble: 53 × 2
   state totalEmployees
   <chr>          <dbl>
 1 AE                 2
 2 AK               103
 3 AL              4257
 4 AP                 1
 5 AR              3871
 6 AZ              3153
 7 CA             13137
 8 CO              3650
 9 CT              2592
10 DC               279
# ℹ 43 more rows
# A tibble: 2,930 × 2
   state total_employees
   <chr>           <dbl>
 1 AE                  2
 2 AK                  7
 3 AK                  2
 4 AK                  3
 5 AK                  2
 6 AK                  1
 7 AK                 88
 8 AL                102
 9 AL                143
10 AL                  1
# ℹ 2,920 more rows
Mean, median, and standard deviation of employee counts in each state:
# A tibble: 53 × 4
   state meanEmployees medianEmployees standardDeviation
   <chr>         <dbl>           <dbl>             <dbl>
 1 AE              2               2                NA  
 2 AK             17.2             2.5              34.8
 3 AL             63.5            26               130. 
 4 AP              1               1                NA  
 5 AR             53.8            16.5             131. 
 6 AZ            210.             94               228. 
 7 CA            239.             61               549. 
 8 CO             64.0            10               128. 
 9 CT            324             125               520. 
10 DC            279             279                NA  
# ℹ 43 more rows
State-wise employee counts, sorted in descending order:
# A tibble: 53 × 2
   state   Sum
   <chr> <dbl>
 1 TX    19839
 2 IL    19131
 3 NY    17050
 4 NE    13176
 5 CA    13137
 6 PA    12769
 7 OH     9056
 8 GA     8605
 9 IN     8537
10 MO     8419
# ℹ 43 more rows
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
We grouped the county-level employee counts by state and computed the mean, median, standard deviation, and sum for each state.
Some state codes, such as AE, AP, and DC, have only one county record, so the sample standard deviation is undefined and R returns NA.
In the states shown, the mean is well above the median (for example, CA: 239 vs. 61), which suggests that a few counties with large workforces pull the mean up.
After sorting the states by total employees, TX (19,839) and IL (19,131) have the most employees, while AP has only one.
---
title: "Challenge 2 Instructions"
author: "Abhinav Reddy Yadatha"
description: "Data wrangling: using group_by() and summarise()"
date: "05/03/2023"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
  - railroads
  - Abhinav Reddy Yadatha
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1)  read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2)  provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.
-   railroad\*.csv or StateCounty2012.xls ⭐
-   FAOstat\*.csv or birds.csv ⭐⭐⭐
-   hotel_bookings.csv ⭐⭐⭐⭐
```{r}
library(tidyverse)
railroads <- read_csv('_data/railroad_2012_clean_county.csv', show_col_types = FALSE)
railroads <- rename(railroads, counties = county)
railroads
```
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
## Describe the data
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
Description: The dataset records the total number of railroad employees in each county, by state or territory code; judging by the file name, the counts are for 2012. Each row is one state-county pair, and the three variables are `state`, `counties`, and `total_employees`. Across the 2,930 rows, the mean is 87.2 employees per county with a standard deviation of 283.6. The minimum is 1, the median is 21, and the maximum is 8,207, so the distribution is strongly right-skewed. Here's how I obtained and analyzed the data:
```{r}
# Printing the first few rows of the dataset.
head(railroads)
```
```{r}
#| label: summary
library(summarytools)
dfSummary(railroads)
```
## Check dimensions of the dataset
```{r}
dim(railroads)
```
It can be observed that there are 2930 rows and 3 columns.
Printing the column names of the dataset:
```{r}
colnames(railroads)
```
## Provide Grouped Summary Statistics
Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.
```{r}
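# summarize_all() applies the function to every column. The character
# columns (state, counties) come back as NA, so only total_employees is meaningful.
# IQR for railroads
railroads %>%
  summarize_all(IQR, na.rm = TRUE)
# Mean for railroads
railroads %>%
  summarize_all(mean, na.rm = TRUE)
# Median for railroads
railroads %>%
  summarize_all(median, na.rm = TRUE)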
```
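The overall statistics can also be restricted to numeric columns, which avoids `NA` results for `state` and `counties`. The following is a minimal sketch, not the code used for the results in this post. It assumes dplyr 1.0 or later for `across()`, and `sample_rows` is a hypothetical name for seven rows copied from the first table printed in this post.

```{r}
library(dplyr)

# Seven rows copied from the first table in this post (a sample, not the full data).
sample_rows <- tibble(
  state = c("AE", rep("AK", 6)),
  counties = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
               "MATANUSKA-SUSITNA", "SITKA", "SKAGWAY MUNICIPALITY"),
  total_employees = c(2, 7, 2, 3, 2, 1, 88)
)

# Summarise only the numeric columns; character columns are skipped entirely.
overall <- sample_rows %>%
  summarise(across(
    where(is.numeric),
    list(
      mean = ~ mean(.x, na.rm = TRUE),
      median = ~ median(.x, na.rm = TRUE),
      iqr = ~ IQR(.x, na.rm = TRUE)
    )
  ))

overall
```

The result has one column per statistic, named `total_employees_mean`, `total_employees_median`, and `total_employees_iqr`, so all three values appear in a single row.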
Total count of employees for each state:
```{r}
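# Keep only the columns needed for state-level totals
state_wise_ct <- select(railroads, state, total_employees)
# Sum the county-level counts within each state
state_wise_ct %>% 
  group_by(state) %>%
  summarize(totalEmployees = sum(total_employees))
# The selection itself still has one row per county
state_wise_ct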
```
Mean, median, and standard deviation of employee counts in each state:
```{r}
# Per-state mean, median, and standard deviation of county-level counts
state_wise_ct %>% 
  group_by(state) %>%
  summarize(
    meanEmployees = mean(total_employees),
    medianEmployees = median(total_employees),
    standardDeviation = sd(total_employees)
  )
```
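The prompt also asks for the mode and for min/max/quantile, which the summary above does not include. Here is a hedged sketch of how those could be added per state. `stat_mode` is a hypothetical helper (base R has no built-in statistical mode), and `sample_rows` again holds seven rows copied from the first table in this post rather than the full dataset.

```{r}
library(dplyr)

# Seven rows copied from the first table in this post (a sample, not the full data).
sample_rows <- tibble(
  state = c("AE", rep("AK", 6)),
  total_employees = c(2, 7, 2, 3, 2, 1, 88)
)

# Hypothetical helper: most frequent value; ties resolve to the smallest value.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Range, quartiles, and mode of county-level counts within each state
by_state <- sample_rows %>%
  group_by(state) %>%
  summarise(
    n_counties = n(),
    min = min(total_employees),
    q25 = quantile(total_employees, 0.25, names = FALSE),
    q75 = quantile(total_employees, 0.75, names = FALSE),
    max = max(total_employees),
    mode = stat_mode(total_employees)
  )

by_state
```

With a skewed count variable like this one, the quartiles describe a typical county better than the mean does, and `n_counties` shows how many records each statistic rests on.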
State-wise employee counts, sorted in descending order:
```{r}
state_wise_grouped_cts <- state_wise_ct %>% 
  group_by(state) %>%
  summarize(Sum = sum(total_employees))
sorted_counts <- state_wise_grouped_cts %>%
  arrange(desc(Sum))
sorted_counts
```
### Explain and Interpret
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
We grouped the county-level employee counts by state and computed the mean, median, standard deviation, and sum for each state.
Some state codes, such as AE, AP, and DC, have only one county record, so the sample standard deviation is undefined and R returns `NA`.
In the states shown, the mean is well above the median (for example, CA: 239 vs. 61), which suggests that a few counties with large workforces pull the mean up.
After sorting the states by total employees, TX (19,839) and IL (19,131) have the most employees, while AP has only one.
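The missing standard deviations can be checked by counting the records behind each group. The sketch below is illustrative only: `sample_rows` holds seven rows copied from the first table in this post, and `single_record_states` is a hypothetical name. It relies on the fact that `sd()` returns `NA` for a single observation, because the sample variance divides by n - 1.

```{r}
library(dplyr)

# Seven rows copied from the first table in this post (a sample, not the full data).
sample_rows <- tibble(
  state = c("AE", rep("AK", 6)),
  total_employees = c(2, 7, 2, 3, 2, 1, 88)
)

# Report the number of county records next to each standard deviation
spread <- sample_rows %>%
  group_by(state) %>%
  summarise(
    n_counties = n(),
    sd_employees = sd(total_employees)
  )

# States whose standard deviation is NA because they have only one record
single_record_states <- spread %>%
  filter(n_counties == 1)

single_record_states
```

Reporting `n_counties` alongside the standard deviation makes it clear that an `NA` here reflects group size rather than missing data.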