DACSS 601: Data Science Fundamentals - FALL 2022

Challenge-2

challenge_2
railroads
Author

Said Arslan

Published

September 20, 2022

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
Warning: package 'ggplot2' was built under R version 4.2.2
Warning: package 'stringr' was built under R version 4.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in the Data

I picked the railroad data for this challenge. It reports the number of railroad employees in each county in 2012.

Code
railroad <- read.csv("_data/railroad_2012_clean_county.csv")
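Since the tidyverse is already loaded, `readr::read_csv()` is one alternative to base `read.csv()` that lets the column types be declared up front. The sketch below is illustrative only: it writes the first six rows of the dataset (the ones printed by `head()`) to a temporary file, which stands in for the real file under `_data`, and reads them back. The name `railroad_demo` is made up for this example.

```r
library(readr)

# Stand-in for the real CSV: the first six rows of the dataset, written to a temp file.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "state,county,total_employees",
  "AE,APO,2",
  "AK,ANCHORAGE,7",
  "AK,FAIRBANKS NORTH STAR,2",
  "AK,JUNEAU,3",
  "AK,MATANUSKA-SUSITNA,2",
  "AK,SITKA,1"
), tmp)

# "cci" declares two character columns followed by one integer column.
railroad_demo <- read_csv(tmp, col_types = "cci")
print(railroad_demo)
```

Declaring the types explicitly means a malformed count would surface as a parsing problem rather than silently turning the column into text.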

Describe the data

Code
dim(railroad)
[1] 2930    3
Code
head(railroad)
  state               county total_employees
1    AE                  APO               2
2    AK            ANCHORAGE               7
3    AK FAIRBANKS NORTH STAR               2
4    AK               JUNEAU               3
5    AK    MATANUSKA-SUSITNA               2
6    AK                SITKA               1
Code
summary(railroad)
    state              county          total_employees  
 Length:2930        Length:2930        Min.   :   1.00  
 Class :character   Class :character   1st Qu.:   7.00  
 Mode  :character   Mode  :character   Median :  21.00  
                                       Mean   :  87.18  
                                       3rd Qu.:  65.00  
                                       Max.   :8207.00  

The dataset has 2,930 rows (observations) and 3 columns (variables). Each row gives the number of railroad employees in one county of one state.

Code
sum(is.na(railroad$state))
[1] 0
Code
sum(is.na(railroad$county))
[1] 0
Code
sum(is.na(railroad$total_employees))
[1] 0

There are no missing values.
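The three per-column checks can also be collapsed into a single call. This is a minimal sketch in base R, run on the first six rows of the dataset (copied from the `head()` output) rather than on the full file; `railroad_demo` is a name made up for the example.

```r
# First six rows of the dataset, as printed by head(railroad)
railroad_demo <- data.frame(
  state = c("AE", "AK", "AK", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR",
             "JUNEAU", "MATANUSKA-SUSITNA", "SITKA"),
  total_employees = c(2L, 7L, 2L, 3L, 2L, 1L)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
na_counts <- colSums(is.na(railroad_demo))
print(na_counts)
```

Applied to the full `railroad` data frame, the same call would report the missing-value count for every column at once.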

Code
n_distinct(railroad$state)
[1] 53
Code
n_distinct(railroad$county)
[1] 1709

I would expect 51 distinct values in the state column (50 states plus DC), but there are 53.

Code
unique(railroad$state)
 [1] "AE" "AK" "AL" "AP" "AR" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "GA" "HI" "IA"
[16] "ID" "IL" "IN" "KS" "KY" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC"
[31] "ND" "NE" "NH" "NJ" "NM" "NV" "NY" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN"
[46] "TX" "UT" "VA" "VT" "WA" "WI" "WV" "WY"

All are state abbreviations (plus DC) except “AE” and “AP”, which are U.S. military postal codes (Armed Forces Europe and Armed Forces Pacific).
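Rather than scanning the list by eye, the unexpected codes can be picked out by comparing against `state.abb`, the vector of 50 state abbreviations that ships with base R. A small sketch, using the 53 codes exactly as printed above; DC is added by hand because `state.abb` does not include it.

```r
# The 53 codes returned by unique(railroad$state)
codes <- c(
  "AE", "AK", "AL", "AP", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL", "GA",
  "HI", "IA", "ID", "IL", "IN", "KS", "KY", "LA", "MA", "MD", "ME", "MI", "MN",
  "MO", "MS", "MT", "NC", "ND", "NE", "NH", "NJ", "NM", "NV", "NY", "OH", "OK",
  "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VA", "VT", "WA", "WI", "WV",
  "WY"
)

# Anything that is neither one of the 50 states nor DC
non_states <- setdiff(codes, c(state.abb, "DC"))
print(non_states)
```

The same comparison could be used inside `filter()` to restrict the analysis to the 50 states and DC, if that were desired.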

On the other hand, there are 2,930 observations but only 1,709 distinct county names, which implies that many county names are shared by counties in different states.
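One way to confirm this would be to count how many states each county name appears in, and to check that each state-county pair occurs only once. The sketch below uses a small hypothetical data frame (the state codes, county names, and counts are made up purely for illustration and do not come from the real file).

```r
library(dplyr)

# Hypothetical toy rows (NOT from the real data): "ALPHA" appears in two states
toy <- data.frame(
  state = c("S1", "S1", "S2", "S3"),
  county = c("ALPHA", "BETA", "ALPHA", "GAMMA"),
  total_employees = c(5L, 3L, 8L, 1L)
)

# County names that occur in more than one row (here, more than one state)
shared_names <- toy %>%
  count(county, name = "n_states") %>%
  filter(n_states > 1)

# Number of distinct state-county pairs; equal to nrow(toy) if pairs are unique
n_pairs <- nrow(distinct(toy, state, county))

print(shared_names)
print(n_pairs)
```

If the pair count equals the row count, `state` and `county` together identify each observation, even though `county` alone does not.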

Provide Grouped Summary Statistics

Code
sum(railroad$total_employees)
[1] 255432

There were 255,432 railroad employees in total in the U.S. in 2012.

Code
railroad %>% 
  group_by(state) %>% 
  summarise(total_employees= sum(total_employees),  
            proportion= round(total_employees/sum(railroad$total_employees)*100,1)) %>% 
  arrange(desc(total_employees))
# A tibble: 53 × 3
   state total_employees proportion
   <chr>           <int>      <dbl>
 1 TX              19839        7.8
 2 IL              19131        7.5
 3 NY              17050        6.7
 4 NE              13176        5.2
 5 CA              13137        5.1
 6 PA              12769        5  
 7 OH               9056        3.5
 8 GA               8605        3.4
 9 IN               8537        3.3
10 MO               8419        3.3
# … with 43 more rows

The three states with the most railroad employees are Texas, Illinois, and New York. Texas alone accounts for 7.8% of railroad employees in the country.
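The state totals above could be extended with measures of central tendency and dispersion across each state's counties. The following is a sketch only, run on the first six rows of the dataset (from the `head()` output), so the numbers it produces describe those six rows and not the real state-level figures. The column names `n_counties`, `mean_employees`, `median_employees`, and `sd_employees` are made up for the example.

```r
library(dplyr)

# First six rows of the dataset, as printed by head(railroad)
railroad_demo <- data.frame(
  state = c("AE", "AK", "AK", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR",
             "JUNEAU", "MATANUSKA-SUSITNA", "SITKA"),
  total_employees = c(2L, 7L, 2L, 3L, 2L, 1L)
)

state_summary <- railroad_demo %>%
  group_by(state) %>%
  summarise(
    n_counties = n(),
    mean_employees = mean(total_employees),
    median_employees = median(total_employees),
    sd_employees = sd(total_employees),      # NA when a state has a single row
    total_employees = sum(total_employees),  # last, so it does not mask the raw column above
    .groups = "drop"
  ) %>%
  # Share of the overall total, computed from the summarised column itself
  mutate(proportion = round(total_employees / sum(total_employees) * 100, 1))

print(state_summary)
```

Computing `proportion` in a separate `mutate()` step avoids reaching back to the original data frame inside `summarise()`, which keeps the pipeline self-contained.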

Code
railroad %>% 
  group_by(state, county) %>% 
  summarise(total_employees= sum(total_employees)) %>% 
  arrange(desc(total_employees)) %>% 
  head()
# A tibble: 6 × 3
# Groups:   state [6]
  state county           total_employees
  <chr> <chr>                      <int>
1 IL    COOK                        8207
2 TX    TARRANT                     4235
3 NE    DOUGLAS                     3797
4 NY    SUFFOLK                     3685
5 VA    INDEPENDENT CITY            3249
6 FL    DUVAL                       3073

Cook County, Illinois, has the highest number of employees, with 8,207.
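The `# Groups: state [6]` line in the output shows that the result is still grouped by state, because `summarise()` drops only the last grouping variable by default. A possible variant, sketched here on the six top rows shown above (deliberately shuffled), drops the grouping explicitly and uses `slice_max()` to pick the largest counties; the name `top_rows` is made up for the example.

```r
library(dplyr)

# The six rows from the output above, in shuffled order
top_rows <- data.frame(
  state = c("FL", "NE", "IL", "VA", "TX", "NY"),
  county = c("DUVAL", "DOUGLAS", "COOK", "INDEPENDENT CITY", "TARRANT", "SUFFOLK"),
  total_employees = c(3073L, 3797L, 8207L, 3249L, 4235L, 3685L)
)

top3 <- top_rows %>%
  group_by(state, county) %>%
  summarise(total_employees = sum(total_employees), .groups = "drop") %>%
  # Keep the three rows with the largest counts, sorted in descending order
  slice_max(total_employees, n = 3)

print(top3)
```

Because the result is ungrouped, `slice_max()` ranks counties across the whole table rather than within each state.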

Explain and Interpret

Geographically large and populous states such as Texas and Illinois have more railroad employment, which makes sense. Merging this dataset with others that include state-level information, such as geographic characteristics, population, and length of railroad track, would allow more interesting further analysis.

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

  • railroad*.csv or StateCounty2012.xls ⭐
  • FAOstat*.csv or birds.csv ⭐⭐⭐
  • hotel_bookings.csv ⭐⭐⭐⭐

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.

Explain and Interpret

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.

Source Code
---
title: "Challenge-2"
author: "Said Arslan"
description: "Data wrangling: using group_by() and summarise()"
date: "09/20/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
  - railroads

---

```{r}
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

```



## Read in the Data


I picked the railroad data for this challenge. It reports the number of railroad employees in each county in 2012.


```{r}
railroad <- read.csv("_data/railroad_2012_clean_county.csv")

```




## Describe the data

```{r}
dim(railroad)
head(railroad)
summary(railroad)

```


The dataset has 2,930 rows (observations) and 3 columns (variables). Each row gives the number of railroad employees in one county of one state.


```{r}
sum(is.na(railroad$state))
sum(is.na(railroad$county))
sum(is.na(railroad$total_employees))

```
There are no missing values.


```{r}
n_distinct(railroad$state)
n_distinct(railroad$county)

```

I would expect 51 distinct values in the `state` column (50 states plus DC), but there are 53.


```{r}
unique(railroad$state)

```

All are state abbreviations (plus DC) except "AE" and "AP", which are U.S. military postal codes (Armed Forces Europe and Armed Forces Pacific).


On the other hand, there are 2,930 observations but only 1,709 distinct county names, which implies that many county names are shared by counties in different states.




## Provide Grouped Summary Statistics


```{r}
sum(railroad$total_employees)

```

There were 255,432 railroad employees in total in the U.S. in 2012.


```{r}
railroad %>% 
  group_by(state) %>% 
  summarise(total_employees= sum(total_employees),  
            proportion= round(total_employees/sum(railroad$total_employees)*100,1)) %>% 
  arrange(desc(total_employees))

```

The three states with the most railroad employees are Texas, Illinois, and New York. Texas alone accounts for 7.8% of railroad employees in the country.


```{r}
railroad %>% 
  group_by(state, county) %>% 
  summarise(total_employees= sum(total_employees)) %>% 
  arrange(desc(total_employees)) %>% 
  head()


```

Cook County, Illinois, has the highest number of employees, with 8,207.


## Explain and Interpret

Geographically large and populous states such as Texas and Illinois have more railroad employment, which makes sense. Merging this dataset with others that include state-level information, such as geographic characteristics, population, and length of railroad track, would allow more interesting further analysis.

## Challenge Overview

Today's challenge is to

1)  read in a data set, and describe the data using both words and any supporting information (e.g., tables)
2)  provide summary statistics for different interesting groups within the data, and interpret those statistics

## Read in the Data

Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.

-   railroad\*.csv or StateCounty2012.xls ⭐
-   FAOstat\*.csv or birds.csv ⭐⭐⭐
-   hotel_bookings.csv ⭐⭐⭐⭐

```{r}
```

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

## Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

```{r}
#| label: summary

```

## Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.

```{r}
```

### Explain and Interpret

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.