DACSS 601: Data Science Fundamentals - FALL 2022

Challenge 2 Instructions

challenge_2
railroads
faostat
hotel_bookings
Author

Meredith Rolfe

Published

August 16, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

  • railroad*.csv or StateCounty2012.xls ⭐
  • FAOstat*.csv or birds.csv ⭐⭐⭐
  • hotel_bookings.csv ⭐⭐⭐⭐
Code
data <- read_csv('_data/railroad_2012_clean_county.csv')
data
# A tibble: 2,930 × 3
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# … with 2,920 more rows

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

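I chose the railroad dataset, which gives the total number of railroad employees for each combination of state and county; the file name suggests the figures are from 2012. It has 2,930 rows, one per state-county pair, and three variables: state, county and total_employees. The state column holds 53 distinct codes, which include DC and the military postal codes AE and AP.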

Code
data %>% group_by(county,state) %>%
  summarize_at(c('total_employees'),mean)
# A tibble: 2,930 × 3
# Groups:   county [1,709]
   county    state total_employees
   <chr>     <chr>           <dbl>
 1 ABBEVILLE SC                124
 2 ACADIA    LA                 13
 3 ACCOMACK  VA                  4
 4 ADA       ID                 81
 5 ADAIR     IA                  5
 6 ADAIR     KY                  1
 7 ADAIR     MO                 21
 8 ADAIR     OK                  2
 9 ADAMS     CO                553
10 ADAMS     IA                  7
# … with 2,920 more rows

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
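One way to back a high-level description with R output is sketched below. It is only an illustration: `railroad_sample` is a hypothetical stand-in built by hand from the first 10 rows printed earlier, and the summary column names are my own. With the full file, the same calls would run on `data`.

```r
library(dplyr)

# Hypothetical stand-in: the first 10 rows of the printed tibble, typed in by hand
railroad_sample <- tibble(
  state = c("AE", rep("AK", 6), rep("AL", 3)),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
             "MATANUSKA-SUSITNA", "SITKA", "SKAGWAY MUNICIPALITY",
             "AUTAUGA", "BALDWIN", "BARBOUR"),
  total_employees = c(2, 7, 2, 3, 2, 1, 88, 102, 143, 1)
)

# Number of cases (rows) and variables (columns), with their types
dim(railroad_sample)
glimpse(railroad_sample)

# Basic counts that support a written description of the data
overview <- railroad_sample %>%
  summarise(
    n_rows     = n(),
    n_states   = n_distinct(state),
    n_counties = n_distinct(county),
    n_missing  = sum(is.na(total_employees))
  )
overview

# Check that each state-county pair appears only once (the unit of observation)
n_duplicates <- railroad_sample %>%
  count(state, county) %>%
  filter(n > 1) %>%
  nrow()
n_duplicates
```

Counting distinct values and duplicates makes it easier to state what one row represents before moving on to grouped statistics.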

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.
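As a minimal sketch of the idea, several of these statistics can be computed per group in a single `summarise()` call instead of one chunk per statistic. The `railroad_sample` tibble is a hypothetical stand-in typed from the first 10 printed rows, and the output column names are assumptions chosen for readability.

```r
library(dplyr)

# Hypothetical stand-in: the first 10 rows of the printed tibble, typed in by hand
railroad_sample <- tibble(
  state = c("AE", rep("AK", 6), rep("AL", 3)),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
             "MATANUSKA-SUSITNA", "SITKA", "SKAGWAY MUNICIPALITY",
             "AUTAUGA", "BALDWIN", "BARBOUR"),
  total_employees = c(2, 7, 2, 3, 2, 1, 88, 102, 143, 1)
)

# Central tendency and dispersion for each state in one pass
state_stats <- railroad_sample %>%
  group_by(state) %>%
  summarise(
    n_counties = n(),
    mean_emp   = mean(total_employees),
    median_emp = median(total_employees),
    sd_emp     = sd(total_employees),
    min_emp    = min(total_employees),
    max_emp    = max(total_employees),
    q25        = quantile(total_employees, 0.25, names = FALSE),
    q75        = quantile(total_employees, 0.75, names = FALSE)
  )
state_stats
```

Keeping the group size (`n_counties`) next to the other statistics helps when reading the results, because a mean or standard deviation based on one or two counties says very little.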

Code
data %>% group_by(state) %>%
  summarize_at(c('total_employees'),mean)
# A tibble: 53 × 2
   state total_employees
   <chr>           <dbl>
 1 AE                2  
 2 AK               17.2
 3 AL               63.5
 4 AP                1  
 5 AR               53.8
 6 AZ              210. 
 7 CA              239. 
 8 CO               64.0
 9 CT              324  
10 DC              279  
# … with 43 more rows
Code
data %>% group_by(county) %>%
  summarize_at(c('total_employees'),mean)
# A tibble: 1,709 × 2
   county    total_employees
   <chr>               <dbl>
 1 ABBEVILLE          124   
 2 ACADIA              13   
 3 ACCOMACK             4   
 4 ADA                 81   
 5 ADAIR                7.25
 6 ADAMS               73.2 
 7 ADDISON              8   
 8 AIKEN              193   
 9 AITKIN              19   
10 ALACHUA             22   
# … with 1,699 more rows
Code
data %>% group_by(county) %>%
  summarize_at(c('total_employees'),median)
# A tibble: 1,709 × 2
   county    total_employees
   <chr>               <dbl>
 1 ABBEVILLE           124  
 2 ACADIA               13  
 3 ACCOMACK              4  
 4 ADA                  81  
 5 ADAIR                 3.5
 6 ADAMS                19.5
 7 ADDISON               8  
 8 AIKEN               193  
 9 AITKIN               19  
10 ALACHUA              22  
# … with 1,699 more rows
Code
data %>% group_by(state) %>%
  summarize_at(c('total_employees'),median)
# A tibble: 53 × 2
   state total_employees
   <chr>           <dbl>
 1 AE                2  
 2 AK                2.5
 3 AL               26  
 4 AP                1  
 5 AR               16.5
 6 AZ               94  
 7 CA               61  
 8 CO               10  
 9 CT              125  
10 DC              279  
# … with 43 more rows
Code
library(dplyr)
target <- c("AR")
ar <- filter(data, state %in% target) 
ar %>% summarize_at(c('total_employees'),mean)
# A tibble: 1 × 1
  total_employees
            <dbl>
1            53.8

Explain and Interpret

Code
library(dplyr)
target <- c("AK")
ak <- filter(data, state %in% target) 
ak  %>% summarize_at(c('total_employees'),mean)
# A tibble: 1 × 1
  total_employees
            <dbl>
1            17.2
Code
data %>% group_by(state) %>%
  summarize_at(c('total_employees'),sd)
# A tibble: 53 × 2
   state total_employees
   <chr>           <dbl>
 1 AE               NA  
 2 AK               34.8
 3 AL              130. 
 4 AP               NA  
 5 AR              131. 
 6 AZ              228. 
 7 CA              549. 
 8 CO              128. 
 9 CT              520. 
10 DC               NA  
# … with 43 more rows
Code
data %>% group_by(county) %>%
  summarize_at(c('total_employees'),sd)
# A tibble: 1,709 × 2
   county    total_employees
   <chr>               <dbl>
 1 ABBEVILLE           NA   
 2 ACADIA              NA   
 3 ACCOMACK            NA   
 4 ADA                 NA   
 5 ADAIR                9.32
 6 ADAMS              155.  
 7 ADDISON             NA   
 8 AIKEN               NA   
 9 AITKIN              NA   
10 ALACHUA             NA   
# … with 1,699 more rows
Code
# Mean employees per county in California
target <- c("CA")
ca <- filter(data, state %in% target)
ca %>% summarize_at(c('total_employees'),mean)
# A tibble: 1 × 1
  total_employees
            <dbl>
1            239.
Code
# Mean employees per county in Connecticut
target <- c("CT")
ct <- filter(data, state %in% target)
ct %>% summarize_at(c('total_employees'),mean)
# A tibble: 1 × 1
  total_employees
            <dbl>
1             324
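The per-state chunks above all follow the same filter-then-summarise pattern. A possible alternative, sketched here under assumptions, is to filter for several states at once and group by state so the means land side by side. The `railroad_sample` tibble is a hypothetical stand-in typed from the first 10 printed rows, so it only contains AE, AK and AL; with the full `data`, any set of state codes could be passed to `%in%`.

```r
library(dplyr)

# Hypothetical stand-in: the first 10 rows of the printed tibble, typed in by hand
railroad_sample <- tibble(
  state = c("AE", rep("AK", 6), rep("AL", 3)),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
             "MATANUSKA-SUSITNA", "SITKA", "SKAGWAY MUNICIPALITY",
             "AUTAUGA", "BALDWIN", "BARBOUR"),
  total_employees = c(2, 7, 2, 3, 2, 1, 88, 102, 143, 1)
)

# Compare the states of interest in one pipeline
state_means <- railroad_sample %>%
  filter(state %in% c("AK", "AL")) %>%
  group_by(state) %>%
  summarise(mean_emp = mean(total_employees))
state_means

# Difference in means between the two states (AL minus AK)
diff_al_ak <- state_means$mean_emp[state_means$state == "AL"] -
  state_means$mean_emp[state_means$state == "AK"]
diff_al_ak
```

Computing the difference from the summary table avoids rounding by eye when reporting the gap between groups.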

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.

I looked at two measures of central tendency for the dataset, the mean and the median, and at the standard deviation by state. For AE, AP and DC the standard deviation is NA. This does not mean the dispersion is 0: sd() returns NA for a single observation, and each of these codes has only one row (their mean and median are identical), so the sample standard deviation is undefined. I compared states AK and AR to see the difference in mean total_employees. The gap is about 37 employees per county, as AR has a mean of 53.8 while AK stands at 17.2. In both states the median is far below the mean (16.5 for AR and 2.5 for AK), which suggests that a few large counties pull the mean up. It is interesting to see how much employment varies by state. I also looked at the states with the highest SD in the displayed output, to check where their means stand. CA (549) and CT (520) have the largest standard deviations among the states shown, but CT has a mean of 324 while CA stands at 239, a difference of about 85.
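The NA values in the standard deviation output can be checked with a short sketch. It shows how `sd()` behaves on a single value and reports the group size next to each standard deviation. The `railroad_sample` tibble is a hypothetical stand-in typed from the first 10 printed rows, and the column names in the summary are assumptions.

```r
library(dplyr)

# The sample standard deviation divides by n - 1, so one value gives NA
sd(2)

# Hypothetical stand-in: the first 10 rows of the printed tibble, typed in by hand
railroad_sample <- tibble(
  state = c("AE", rep("AK", 6), rep("AL", 3)),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
             "MATANUSKA-SUSITNA", "SITKA", "SKAGWAY MUNICIPALITY",
             "AUTAUGA", "BALDWIN", "BARBOUR"),
  total_employees = c(2, 7, 2, 3, 2, 1, 88, 102, 143, 1)
)

# Group size alongside the standard deviation, largest dispersion first;
# arrange() places NA values at the end
dispersion <- railroad_sample %>%
  group_by(state) %>%
  summarise(
    n_counties = n(),
    sd_emp     = sd(total_employees)
  ) %>%
  arrange(desc(sd_emp))
dispersion
```

Sorting by the standard deviation across all groups, rather than reading the first ten printed rows, is a more reliable way to identify which states have the largest dispersion.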
