DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 4 Adam Macaiszek

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Challenge Overview
  • Read in data
    • Briefly describing the data
  • Identify variables that need to be mutated
  • Tidy Data (as needed)
  • Tidy Data (as needed)

Challenge 4 Adam Macaiszek

  • Show All Code
  • Hide All Code

  • View Source
challenge_4
debt
Author

Adam Maciaszek

Published

October 8, 2022

Code
library(tidyverse)
library(ggplot2)
library(dplyr)
library(readxl)
library(summarytools)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • debt_in_trillions.xlsx ⭐⭐⭐⭐⭐
Code
dataset <- data.frame()
file_name <- "_data/debt_in_trillions.xlsx"
dataset <- read_excel(file_name, sheet = 1)

Briefly describing the data

This data set is of different categories of data and thier amounts each quarter from 2003 to the second quarter of 2021. We do not know the country that this is for. The categories of debt are mortgage, HE revolving, auto, credit card student loan loan and other. Each of these categories has its own colmn and the values in is the amount of debt in trillions, there is also a column for the total debt for that quarter that year.

Code
print(head(dataset))
# A tibble: 6 × 8
  `Year and Quarter` Mortgage `HE Revolving` Auto …¹ Credi…² Stude…³ Other Total
  <chr>                 <dbl>          <dbl>   <dbl>   <dbl>   <dbl> <dbl> <dbl>
1 03:Q1                  4.94          0.242   0.641   0.688   0.241 0.478  7.23
2 03:Q2                  5.08          0.26    0.622   0.693   0.243 0.486  7.38
3 03:Q3                  5.18          0.269   0.684   0.693   0.249 0.477  7.56
4 03:Q4                  5.66          0.302   0.704   0.698   0.253 0.449  8.07
5 04:Q1                  5.84          0.328   0.72    0.695   0.260 0.446  8.29
6 04:Q2                  5.97          0.367   0.743   0.697   0.263 0.423  8.46
# … with abbreviated variable names ¹​`Auto Loan`, ²​`Credit Card`,
#   ³​`Student Loan`

Identify variables that need to be mutated

Each data point is labeled by the year and the quarter. Using that information as well as when quarters end at least the values in America the start and end date of each quarter were calculated and put into their own column of data.

Code
quarter_set <- function(input) {
    thing <- str_split(input, ":", n = Inf, simplify = FALSE)
    year <- paste("20",thing[[1]][1], sep="")
    quarter <- thing[[1]][2]
    if (quarter=="Q1") {
        start <- paste(year, "/1/1", sep="")
        stop <- paste(year, "/3/31", sep="")
        return(c(start,stop)) }
    else if (quarter=="Q2") {
        start <- paste(year, "/4/1", sep="")
        stop <- paste(year, "/6/30", sep="")
        return(c(start,stop)) }
    else if (quarter=="Q3") {
        start <- paste(year, "/7/1", sep="")
        stop <- paste(year, "/9/30", sep="")
        return(c(start,stop)) }
    else {
        start <- paste(year, "/10/1", sep="")
        stop <- paste(year, "/12/31", sep="")
        return(c(start,stop)) }
}
Start_Date <- c()
End_Date <- c()
for (x in dataset$"Year and Quarter"){
    Start_Date <- c(Start_Date,quarter_set(x)[1])
    End_Date <- c(End_Date,quarter_set(x)[2])
}
dataset$start_base <- Start_Date
dataset$end_base <- End_Date
dataset$Year <- format(as.Date(dataset$start_base, format="%Y/%m/%d"), "%Y")
dataset$Start_Date <- as.Date(dataset$start_base, format="%Y/%m/%d")
dataset$End_Date <- as.Date(dataset$end_base, format="%Y/%m/%d")    
dataset <- dataset %>% select(-start_base,-end_base,-Total,-"Year and Quarter")

Tidy Data (as needed)

Each of the different columns of debt need to be pivoted to one variable and Type Debt.

Code
dataset <- dataset %>% 
  pivot_longer(
    cols= Mortgage:Other, 
    names_to = "Type_Debt", 
    values_to = "Amount")

Tidy Data (as needed)

Code
ggplot(dataset, aes(x = End_Date, y = Amount, group=Type_Debt, color=Type_Debt)) + geom_line() +ggtitle("Mean Income Levels From (1967-2019)")

The mortgage accounts for such a high percent of the debt that it makes any of the other datapoints hard to see so an additional graph excluding it is below

Code
ggplot(subset(dataset, Type_Debt!="Mortgage"), aes(x = End_Date, y = Amount, group=Type_Debt, color=Type_Debt)) + geom_line() +ggtitle("Mean Income Levels From (1967-2019)")

Source Code
---
title: "Challenge 4 Adam Macaiszek"
author: "Adam Maciaszek"
desription: "More data wrangling: pivoting"
date: "10/08/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_4
  - debt
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)
library(ggplot2)
library(dplyr)
library(readxl)
library(summarytools)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

Today's challenge is to:

1)  read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2)  tidy data (as needed, including sanity checks)
3)  identify variables that need to be mutated
4)  mutate variables and sanity check all mutations

## Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

-   debt_in_trillions.xlsx ⭐⭐⭐⭐⭐

```{r}
dataset <- data.frame()
file_name <- "_data/debt_in_trillions.xlsx"
dataset <- read_excel(file_name, sheet = 1)

```

### Briefly describing the data
This data set is of different categories of data and thier amounts each quarter from 2003 to the second quarter of 2021. We do not know the country that this is for. The categories of debt are mortgage, HE revolving, auto, credit card student loan loan and other. Each of these categories has its own colmn and the values in is the amount of debt in trillions, there is also a column for the total debt for that quarter that year.
```{r}
print(head(dataset))
```

## Identify variables that need to be mutated

Each data point is labeled by the year and the quarter. Using that information as well as when quarters end at least the values in America the start and end date of each quarter were calculated and put into their own column of data. 

```{r}
quarter_set <- function(input) {
    thing <- str_split(input, ":", n = Inf, simplify = FALSE)
    year <- paste("20",thing[[1]][1], sep="")
    quarter <- thing[[1]][2]
    if (quarter=="Q1") {
        start <- paste(year, "/1/1", sep="")
        stop <- paste(year, "/3/31", sep="")
        return(c(start,stop)) }
    else if (quarter=="Q2") {
        start <- paste(year, "/4/1", sep="")
        stop <- paste(year, "/6/30", sep="")
        return(c(start,stop)) }
    else if (quarter=="Q3") {
        start <- paste(year, "/7/1", sep="")
        stop <- paste(year, "/9/30", sep="")
        return(c(start,stop)) }
    else {
        start <- paste(year, "/10/1", sep="")
        stop <- paste(year, "/12/31", sep="")
        return(c(start,stop)) }
}
Start_Date <- c()
End_Date <- c()
for (x in dataset$"Year and Quarter"){
    Start_Date <- c(Start_Date,quarter_set(x)[1])
    End_Date <- c(End_Date,quarter_set(x)[2])
}
dataset$start_base <- Start_Date
dataset$end_base <- End_Date
dataset$Year <- format(as.Date(dataset$start_base, format="%Y/%m/%d"), "%Y")
dataset$Start_Date <- as.Date(dataset$start_base, format="%Y/%m/%d")
dataset$End_Date <- as.Date(dataset$end_base, format="%Y/%m/%d")    
dataset <- dataset %>% select(-start_base,-end_base,-Total,-"Year and Quarter")
```
## Tidy Data (as needed)
Each of the different columns of debt need to be pivoted to one variable and Type Debt. 

```{r}
dataset <- dataset %>% 
  pivot_longer(
    cols= Mortgage:Other, 
    names_to = "Type_Debt", 
    values_to = "Amount")

```
## Tidy Data (as needed)
```{r}
ggplot(dataset, aes(x = End_Date, y = Amount, group=Type_Debt, color=Type_Debt)) + geom_line() +ggtitle("Mean Income Levels From (1967-2019)")
```
The mortgage accounts for such a high percent of the debt that it makes any of the other datapoints hard to see so an additional graph excluding it is below

```{r}
ggplot(subset(dataset, Type_Debt!="Mortgage"), aes(x = End_Date, y = Amount, group=Type_Debt, color=Type_Debt)) + geom_line() +ggtitle("Mean Income Levels From (1967-2019)")
```