Code
library(tidyverse)
library(here)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Linus Jen
June 9, 2023
For this challenge, I will be working with the debt_in_trillions.xlsx
dataset.
[1] "There are 74 rows and 8 columns in this dataset."
Rows: 74
Columns: 8
$ `Year and Quarter` <chr> "03:Q1", "03:Q2", "03:Q3", "03:Q4", "04:Q1", "04:Q2…
$ Mortgage <dbl> 4.942, 5.080, 5.183, 5.660, 5.840, 5.967, 6.210, 6.…
$ `HE Revolving` <dbl> 0.242, 0.260, 0.269, 0.302, 0.328, 0.367, 0.426, 0.…
$ `Auto Loan` <dbl> 0.641, 0.622, 0.684, 0.704, 0.720, 0.743, 0.751, 0.…
$ `Credit Card` <dbl> 0.688, 0.693, 0.693, 0.698, 0.695, 0.697, 0.706, 0.…
$ `Student Loan` <dbl> 0.2407000, 0.2429000, 0.2488000, 0.2529000, 0.25980…
$ Other <dbl> 0.4776, 0.4860, 0.4773, 0.4486, 0.4465, 0.4231, 0.4…
$ Total <dbl> 7.2313, 7.3839, 7.5551, 8.0655, 8.2893, 8.4600, 8.8…
Based on the title of the .csv
file and what is shown within the document (I opened the Excel file to see what the data looked like before importing it), this file shows debt for various sectors, in trillions of dollars. I found information about this dataset from this site. Per this report’s data dictionary, data includes the year and quarter, the mortgage debt, home equity revolving debt, auto loan debt, credit card debt, student loan debt, other debt, and the total debt, all in trillions of dollars. This was collected by the FRBNY Consumer Credit Panel, and comes from a 5% random sample of all individuals with a social security number and credit report.
From the Year and Quarter
column, we see that we get a 2 digit year and a quarter, separated by a :
. All other columns are floats, and should be dollar amounts in the trillions.
This data is clearly not tidy. All the numerical columns contain one type (debt in ($)), while the column headers represent a category of debt. Thus, we need to pivot_longer()
this data so that we get a column for the types of debt. Additionally, need to convert the Year and Quarter
column to be a lubridate
quarter value
data <- data %>%
# First year and quarter first
mutate(`year_quarter` = quarter(yq(`Year and Quarter`), with_year=TRUE)) %>%
# make tidy by pivot_longer()
pivot_longer(!c(`Year and Quarter`, year_quarter), names_to="debt_type", values_to="debt_amount") %>%
mutate(debt_type = factor(debt_type, levels=unique(debt_type))) %>%
select(-`Year and Quarter`)
Rows: 518
Columns: 3
$ year_quarter <dbl> 2003.1, 2003.1, 2003.1, 2003.1, 2003.1, 2003.1, 2003.1, 2…
$ debt_type <fct> Mortgage, HE Revolving, Auto Loan, Credit Card, Student L…
$ debt_amount <dbl> 4.9420, 0.2420, 0.6410, 0.6880, 0.2407, 0.4776, 7.2313, 5…
[1] Mortgage HE Revolving Auto Loan Credit Card Student Loan
[6] Other
7 Levels: Mortgage HE Revolving Auto Loan Credit Card Student Loan ... Total
From the output above, we now see our data in a tidy format. Firstly, our year_quarter
column is the correct date type, using lubridate
to convert the YY:Q#
format to YYYY.Q
format. The debt_type
column contains the category of household debt - note that we kept the total
column in case someone wanted to compare totals (though this category can be seen as redundant). We also converted the column to be a set of factors, and ordered it based on the order of the columns presented in the dataset (and specifically keeping the “Other” and “Total” categories together). Lastly, we have the debt_amount
column that contains the debt (in trillions of US dollars).
---
title: "Challenge 4 Solution - Debt Dataset"
author: "Linus Jen"
description: "More data wrangling: pivoting"
date: "6/9/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
df-print: paged
categories:
- challenge_4
- debt
- Linus Jen
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(here)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Read in data
For this challenge, I will be working with the `debt_in_trillions.xlsx` dataset.
```{r}
data <- readxl::read_excel(here("posts", "_data", "debt_in_trillions.xlsx"))
# Display data
print(paste0("There are ", dim(data)[1], " rows and ", dim(data)[2], " columns in this dataset."))
head(data)
glimpse(data)
# Check NAs
data %>%
summarise(across(everything(), ~sum(is.na(.))))
```
### Briefly describe the data
Based on the title of the `.csv` file and what is shown within the document (I opened the Excel file to see what the data looked like before importing it), this file shows debt for various sectors, in trillions of dollars. I found information about this dataset from [this site](https://www.newyorkfed.org/medialibrary/interactives/householdcredit/data/pdf/HHDC_2019Q4.pdf). Per this report's data dictionary, data includes the year and quarter, the mortgage debt, home equity revolving debt, auto loan debt, credit card debt, student loan debt, other debt, and the total debt, all in trillions of dollars. This was collected by the FRBNY Consumer Credit Panel, and comes from a 5% random sample of all individuals with a social security number and credit report.
From the `Year and Quarter` column, we see that we get a 2 digit year and a quarter, separated by a `:`. All other columns are floats, and should be dollar amounts in the trillions.
## Tidy Data (as needed)
This data is clearly not tidy. All the numerical columns contain one type (debt in (\$)), while the column headers represent a category of debt. Thus, we need to `pivot_longer()` this data so that we get a column for the types of debt. Additionally, need to convert the `Year and Quarter` column to be a `lubridate` quarter value
```{r}
data <- data %>%
# First year and quarter first
mutate(`year_quarter` = quarter(yq(`Year and Quarter`), with_year=TRUE)) %>%
# make tidy by pivot_longer()
pivot_longer(!c(`Year and Quarter`, year_quarter), names_to="debt_type", values_to="debt_amount") %>%
mutate(debt_type = factor(debt_type, levels=unique(debt_type))) %>%
select(-`Year and Quarter`)
```
## Results
```{r}
# View data
head(data)
glimpse(data)
head(data$debt_type)
```
From the output above, we now see our data in a tidy format. Firstly, our `year_quarter` column is the correct date type, using `lubridate` to convert the `YY:Q#` format to `YYYY.Q` format. The `debt_type` column contains the category of household debt - note that we kept the `total` column in case someone wanted to compare totals (though this category can be seen as redundant). We also converted the column to be a set of factors, and ordered it based on the order of the columns presented in the dataset (and specifically keeping the "Other" and "Total" categories together). Lastly, we have the `debt_amount` column that contains the debt (in trillions of US dollars).