DACSS 601: Data Science Fundamentals - FALL 2022

Challenge 2

challenge_2
railroads
faostat
hotel_bookings
Author

Hezzie Phillips

Published

October 1, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

  • railroad*.csv or StateCounty2012.xls ⭐
Code
library(tidyverse)
# Read the cleaned 2012 county-level railroad employment file
railroad <- read_csv("_data/railroad_2012_clean_county.csv")
# Preview the first six rows
head(railroad)
# A tibble: 6 × 3
  state county               total_employees
  <chr> <chr>                          <dbl>
1 AE    APO                                2
2 AK    ANCHORAGE                          7
3 AK    FAIRBANKS NORTH STAR               2
4 AK    JUNEAU                             3
5 AK    MATANUSKA-SUSITNA                  2
6 AK    SITKA                              1

Describe the data

As noted in Challenge 1, this table has three variables: state, county, and total_employees.

The data appear to tabulate the number of railroad employees in each county, with one row per state and county pair.
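
One way to back up this description is to check the table's dimensions and count the distinct states and counties it holds. The sketch below is illustrative only: `railroad_sample` is a hypothetical stand-in built from the six rows printed above so that the snippet runs without the CSV. On the full data, substitute `railroad`.

```r
library(tidyverse)

# Hypothetical stand-in: the six rows shown by head(railroad) above.
railroad_sample <- tribble(
  ~state, ~county,                ~total_employees,
  "AE",   "APO",                  2,
  "AK",   "ANCHORAGE",            7,
  "AK",   "FAIRBANKS NORTH STAR", 2,
  "AK",   "JUNEAU",               3,
  "AK",   "MATANUSKA-SUSITNA",    2,
  "AK",   "SITKA",                1
)

# Rows and columns: one row per state-county pair, three variables.
dim(railroad_sample)

# Number of distinct state codes, and number of county rows per state.
n_states <- n_distinct(railroad_sample$state)
counties_per_state <- count(railroad_sample, state, name = "n_counties")

n_states
counties_per_state
```

Note that the first row's code, `AE`, is not a US state abbreviation, so the `state` column should not be assumed to contain only the 50 states.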

Provide Grouped Summary Statistics

The total number of employees across all counties is: 255,432

Code
summarise(railroad, total_employees = sum(total_employees))
# A tibble: 1 × 1
  total_employees
            <dbl>
1          255432

We can find the largest number of employees in any single county: 8,207, in Cook County, Illinois.

Code
summarise(railroad, most_employees=max(total_employees))
# A tibble: 1 × 1
  most_employees
           <dbl>
1           8207
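
`max()` returns only the count, not the county it belongs to. A minimal sketch of one way to retrieve the whole row uses `slice_max()` from dplyr. The small table here is a hypothetical sample: the `"IL"` and `"COOK"` labels are an assumption about how Cook County is coded in the file, while 8207 is the maximum reported above.

```r
library(tidyverse)

# Hypothetical sample: three rows from head(railroad) plus the maximum
# reported above. The "IL"/"COOK" spelling is assumed, not checked.
railroad_sample <- tribble(
  ~state, ~county,     ~total_employees,
  "AK",   "ANCHORAGE", 7,
  "AK",   "JUNEAU",    3,
  "AK",   "SITKA",     1,
  "IL",   "COOK",      8207
)

# Keep the row(s) with the largest total_employees, with state and county.
top_county <- slice_max(railroad_sample, total_employees, n = 1)
top_county
```

On the full data, `slice_max(railroad, total_employees, n = 1)` would return the state and county alongside the count, which makes the claim in the text checkable.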

Next, we can check whether the states with the largest land area also have the most railroad employees. The four largest states by area are:

1st: Alaska, 665,400 square miles
2nd: Texas, 268,597 square miles
3rd: California, 163,696 square miles
4th: Montana, 147,040 square miles

Code
filter(railroad, state=="AK")%>%
  summarise(AK_total_employees=sum(total_employees))
# A tibble: 1 × 1
  AK_total_employees
               <dbl>
1                103
Code
filter(railroad, state=="TX")%>%
  summarise(TX_total_employees=sum(total_employees))
# A tibble: 1 × 1
  TX_total_employees
               <dbl>
1              19839
Code
filter(railroad, state=="CA")%>%
  summarise(CA_total_employees=sum(total_employees))
# A tibble: 1 × 1
  CA_total_employees
               <dbl>
1              13137
Code
filter(railroad, state=="MT")%>%
  summarise(MT_total_employees=sum(total_employees))
# A tibble: 1 × 1
  MT_total_employees
               <dbl>
1               3327

We can make the same comparison for the four largest states by population (the figures below match the 2010 Census counts):

1st: California, 37,253,956
2nd: Texas, 25,145,561
3rd: New York, 19,378,102
4th: Florida, 18,801,310

This time let’s combine in one table:

Code
# Total employees per state
By_population <- railroad %>%
  group_by(state) %>%
  summarise(total_employees = sum(total_employees))
# Keep the four most populous states
filter(By_population, state %in% c("CA", "TX", "NY", "FL"))
# A tibble: 4 × 2
  state total_employees
  <chr>           <dbl>
1 CA              13137
2 FL               7419
3 NY              17050
4 TX              19839
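
Raw totals tend to favour populous states, so a per-capita rate can make the comparison fairer. The following sketch joins the population figures listed above to the state totals in this table and computes employees per 100,000 residents. The `state_totals` and `populations` tables and the `per_100k` column are names introduced here for illustration.

```r
library(tidyverse)

# State totals as printed in the table above.
state_totals <- tribble(
  ~state, ~total_employees,
  "CA",   13137,
  "FL",   7419,
  "NY",   17050,
  "TX",   19839
)

# Population figures as listed in the text above.
populations <- tribble(
  ~state, ~population,
  "CA",   37253956,
  "TX",   25145561,
  "NY",   19378102,
  "FL",   18801310
)

# Join on state, then compute railroad employees per 100,000 residents.
per_capita <- state_totals %>%
  left_join(populations, by = "state") %>%
  mutate(per_100k = total_employees / population * 100000) %>%
  arrange(desc(per_100k))

per_capita
```

On these figures New York ranks first and California last, even though California is the most populous of the four.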

Interpretation

I compared total railroad employment in the four largest states by area and the four largest by population to see whether employment tracks either measure. Based on this cursory look, it does not seem to track either one closely: Alaska is the largest state by area but has only 103 employees, and New York has more employees (17,050) than California (13,137) despite having roughly half the population. A comparison across all states would be needed to draw a firmer conclusion.
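
To put a rough number on this, one option is a rank correlation over the figures quoted in this post. This is only a sketch: the `by_area` and `by_population` tables are typed in by hand from the values above, and Spearman's rho is my choice of measure here, not something the challenge prescribes.

```r
library(tidyverse)

# Areas and employee totals as quoted earlier in this post.
by_area <- tribble(
  ~state, ~sq_miles, ~total_employees,
  "AK",   665400,    103,
  "TX",   268597,    19839,
  "CA",   163696,    13137,
  "MT",   147040,    3327
)

# Populations and employee totals as quoted earlier in this post.
by_population <- tribble(
  ~state, ~population, ~total_employees,
  "CA",   37253956,    13137,
  "TX",   25145561,    19839,
  "NY",   19378102,    17050,
  "FL",   18801310,    7419
)

# Spearman's rho compares rankings rather than raw values.
rho_area <- cor(by_area$sq_miles, by_area$total_employees,
                method = "spearman")
rho_population <- cor(by_population$population,
                      by_population$total_employees,
                      method = "spearman")

rho_area
rho_population
```

With only four states in each comparison, either coefficient is far too noisy to support a conclusion. Running the same calculation on all states, with area and population joined in from an external source, would be a more meaningful test.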
