Challenge 4 Debt In Trillions

challenge_4
debt
More data wrangling: pivoting
Author

Kekai Liu

Published

March 21, 2023

Code
library(tidyverse)
library(stringr)
library(lubridate)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • abc_poll.csv ⭐
  • poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
  • FedFundsRate.csv⭐⭐⭐
  • hotel_bookings.csv⭐⭐⭐⭐
  • debt_in_trillions.xlsx ⭐⭐⭐⭐⭐
Code
debt_in_trillions <- readxl::read_excel("_data/debt_in_trillions.xlsx", sheet="Sheet1")

print(summarytools::dfSummary(debt_in_trillions, varnumbers = FALSE, plain.ascii = FALSE, style = "grid", graph.magnif = 0.70, valid.col = FALSE), method = 'render', table.classes = 'table-condensed')

Data Frame Summary

debt_in_trillions

Dimensions: 74 x 8
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
Year and Quarter [character]
1. 03:Q1
2. 03:Q2
3. 03:Q3
4. 03:Q4
5. 04:Q1
6. 04:Q2
7. 04:Q3
8. 04:Q4
9. 05:Q1
10. 05:Q2
[ 64 others ]
1 ( 1.4% )
1 ( 1.4% )
1 ( 1.4% )
1 ( 1.4% )
1 ( 1.4% )
1 ( 1.4% )
1 ( 1.4% )
1 ( 1.4% )
1 ( 1.4% )
1 ( 1.4% )
64 ( 86.5% )
0 (0.0%)
Mortgage [numeric]
Mean (sd) : 8.3 (1.2)
min ≤ med ≤ max:
4.9 ≤ 8.4 ≤ 10.4
IQR (CV) : 1 (0.1)
74 distinct values 0 (0.0%)
HE Revolving [numeric]
Mean (sd) : 0.5 (0.1)
min ≤ med ≤ max:
0.2 ≤ 0.5 ≤ 0.7
IQR (CV) : 0.2 (0.2)
73 distinct values 0 (0.0%)
Auto Loan [numeric]
Mean (sd) : 0.9 (0.2)
min ≤ med ≤ max:
0.6 ≤ 0.8 ≤ 1.4
IQR (CV) : 0.4 (0.3)
71 distinct values 0 (0.0%)
Credit Card [numeric]
Mean (sd) : 0.8 (0.1)
min ≤ med ≤ max:
0.7 ≤ 0.7 ≤ 0.9
IQR (CV) : 0.1 (0.1)
69 distinct values 0 (0.0%)
Student Loan [numeric]
Mean (sd) : 0.9 (0.4)
min ≤ med ≤ max:
0.2 ≤ 0.9 ≤ 1.6
IQR (CV) : 0.8 (0.5)
73 distinct values 0 (0.0%)
Other [numeric]
Mean (sd) : 0.4 (0)
min ≤ med ≤ max:
0.3 ≤ 0.4 ≤ 0.5
IQR (CV) : 0.1 (0.1)
70 distinct values 0 (0.0%)
Total [numeric]
Mean (sd) : 11.8 (1.7)
min ≤ med ≤ max:
7.2 ≤ 11.9 ≤ 15
IQR (CV) : 1.4 (0.1)
74 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-03-29

Code
table(select(debt_in_trillions, `Year and Quarter`))
Year and Quarter
03:Q1 03:Q2 03:Q3 03:Q4 04:Q1 04:Q2 04:Q3 04:Q4 05:Q1 05:Q2 05:Q3 05:Q4 06:Q1 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
06:Q2 06:Q3 06:Q4 07:Q1 07:Q2 07:Q3 07:Q4 08:Q1 08:Q2 08:Q3 08:Q4 09:Q1 09:Q2 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
09:Q3 09:Q4 10:Q1 10:Q2 10:Q3 10:Q4 11:Q1 11:Q2 11:Q3 11:Q4 12:Q1 12:Q2 12:Q3 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
12:Q4 13:Q1 13:Q2 13:Q3 13:Q4 14:Q1 14:Q2 14:Q3 14:Q4 15:Q1 15:Q2 15:Q3 15:Q4 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
16:Q1 16:Q2 16:Q3 16:Q4 17:Q1 17:Q2 17:Q3 17:Q4 18:Q1 18:Q2 18:Q3 18:Q4 19:Q1 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
19:Q2 19:Q3 19:Q4 20:Q1 20:Q2 20:Q3 20:Q4 21:Q1 21:Q2 
    1     1     1     1     1     1     1     1     1 

Briefly describe the data

The debt_in_trillions dataset contains 74 rows and 8 columns.Each row contains information on amounts for seven categories of debt (Mortgage, HE Revolving, Auto Loan, Credit Card, Student Loan, Other, Total) in a year and quarter from 2003Q1 to 2021Q2. There are no missing values in the data.

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

The data is not tidy, and the data is wide. Aside from the “Year and Quarter” column, the other columns represent amounts of different debt types. These columns can be pivoted into two columns: one denoting the debt type and another denoting the amount. The pivoted data would be tidy, as each row would then correspond to a single type of debt amount in a particular year-quarter. There are seven variables (Mortgage, HE Revolving, Auto Loan, Credit Card, Student Loan, Other, Total) to be pivoted. The resulting data set should have 518 rows and three columns (Year and Date, Debt Type, and Debt Amount).

The summary of the data after the pivot_longer provides a sanity check. There are 518 rows and three columns (Year and Quarter, debt_type, debt_value). The debt_type column contains seven unique values, corresponding to the seven pivoted columns. This summary confirms that the pivot_longer successfully performed what was intended.

Code
#existing rows/cases
nrow(debt_in_trillions)
[1] 74
Code
#existing columns/cases
ncol(debt_in_trillions)
[1] 8
Code
#expected rows/cases
nrow(debt_in_trillions) * (ncol(debt_in_trillions)-1)
[1] 518
Code
# expected columns 
8 - 7 + 2
[1] 3
Code
# pivot_longer
debt_in_trillions2 <- debt_in_trillions %>%
  pivot_longer(col = -c(`Year and Quarter`),
                 names_to="debt_type",
                 values_to = "debt_value")


# summary of data after pivot_longer
print(summarytools::dfSummary(debt_in_trillions2, varnumbers = FALSE, plain.ascii = FALSE, style = "grid", graph.magnif = 0.70, valid.col = FALSE), method = 'render', table.classes = 'table-condensed')

Data Frame Summary

debt_in_trillions2

Dimensions: 518 x 3
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
Year and Quarter [character]
1. 03:Q1
2. 03:Q2
3. 03:Q3
4. 03:Q4
5. 04:Q1
6. 04:Q2
7. 04:Q3
8. 04:Q4
9. 05:Q1
10. 05:Q2
[ 64 others ]
7 ( 1.4% )
7 ( 1.4% )
7 ( 1.4% )
7 ( 1.4% )
7 ( 1.4% )
7 ( 1.4% )
7 ( 1.4% )
7 ( 1.4% )
7 ( 1.4% )
7 ( 1.4% )
448 ( 86.5% )
0 (0.0%)
debt_type [character]
1. Auto Loan
2. Credit Card
3. HE Revolving
4. Mortgage
5. Other
6. Student Loan
7. Total
74 ( 14.3% )
74 ( 14.3% )
74 ( 14.3% )
74 ( 14.3% )
74 ( 14.3% )
74 ( 14.3% )
74 ( 14.3% )
0 (0.0%)
debt_value [numeric]
Mean (sd) : 3.4 (4.4)
min ≤ med ≤ max:
0.2 ≤ 0.8 ≤ 15
IQR (CV) : 7.4 (1.3)
483 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-03-29

Any additional comments?

Identify variables that need to be mutated

Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

Document your work here.

To facilitate more efficient analysis, the “Year and Quarter” variable should be renamed and mutated. This variable is a character type, as shown in the previous summary output. This is not ideal, because “Year and Quarter” represents a date or time. It should converted into a date object. In addition, the variable name contains spaces which makes it easy for errors to arise when coding. The other two variables are fine; debt_type is in snake case and is a character type because it represents string categories, and debt_value is in snake case and is a numeric type because it represents an amount.

Code
debt_in_trillions3 <- debt_in_trillions2 %>%
  mutate(year = str_c("20", str_sub(`Year and Quarter`, 1, 2)), 
         quarter = str_sub(`Year and Quarter`, 5, 5),
         year_quarter = quarter(as_date(str_c(year, quarter), format="%Y%q"),  with_year=TRUE)) %>% #generate year_quarter as date object
  select(-c(`Year and Quarter`, `year`, `quarter`)) %>%
  relocate(year_quarter, debt_type, debt_value) %>%
  mutate(debt_value = str_remove(as.character(debt_value), ".0+$")) #remove excess trailing zeros in decimal

print(summarytools::dfSummary(debt_in_trillions3, varnumbers = FALSE, plain.ascii = FALSE, style = "grid", graph.magnif = 0.70, valid.col = FALSE), method = 'render', table.classes = 'table-condensed')

Data Frame Summary

debt_in_trillions3

Dimensions: 518 x 3
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
year_quarter [numeric]
Mean (sd) : 2012 (5.3)
min ≤ med ≤ max:
2003.1 ≤ 2012.2 ≤ 2021.2
IQR (CV) : 9.1 (0)
74 distinct values 0 (0.0%)
debt_type [character]
1. Auto Loan
2. Credit Card
3. HE Revolving
4. Mortgage
5. Other
6. Student Loan
7. Total
74 ( 14.3% )
74 ( 14.3% )
74 ( 14.3% )
74 ( 14.3% )
74 ( 14.3% )
74 ( 14.3% )
74 ( 14.3% )
0 (0.0%)
debt_value [character]
1. 0.367
2. 0.317
3. 0.33
4. 0.335
5. 0.351
6. 0.386
7. 0.39
8. 0.399
9. 0.412
10. 0.413
[ 473 others ]
3 ( 0.6% )
2 ( 0.4% )
2 ( 0.4% )
2 ( 0.4% )
2 ( 0.4% )
2 ( 0.4% )
2 ( 0.4% )
2 ( 0.4% )
2 ( 0.4% )
2 ( 0.4% )
497 ( 95.9% )
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-03-29

Any additional comments?

The “Year and Quarter” variable has been transformed into a numeric date object. The first four digits correspond to the year and the last digit correspond to the quarter, with a period as the delimiter.

A further consideration depending on the type of research may be to recode the debt_value amounts into bins, as there is too great of variation. From the previous summary, the plot does not give much insight about the distribution of debt.