challenge_4
Aritra Basu
fed_rates
More data wrangling: pivoting
Author

Aritra Basu

Published

April 1, 2023

Code
library(tidyverse)
library(lubridate)
library(descr)
library(readr)
library(summarytools)
library(naniar)
library(ggplot2)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Reading in data

Code
mydata <- read.csv("_data/FedFundsRate.csv")
glimpse(mydata)
Rows: 904
Columns: 10
$ Year                         <int> 1954, 1954, 1954, 1954, 1954, 1954, 1955,…
$ Month                        <int> 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7,…
$ Day                          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ Federal.Funds.Target.Rate    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Federal.Funds.Upper.Target   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Federal.Funds.Lower.Target   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Effective.Federal.Funds.Rate <dbl> 0.80, 1.22, 1.06, 0.85, 0.83, 1.28, 1.39,…
$ Real.GDP..Percent.Change.    <dbl> 4.6, NA, NA, 8.0, NA, NA, 11.9, NA, NA, 6…
$ Unemployment.Rate            <dbl> 5.8, 6.0, 6.1, 5.7, 5.3, 5.0, 4.9, 4.7, 4…
$ Inflation.Rate               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

Describing the data

Code
print(summarytools::dfSummary(mydata,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

mydata

Dimensions: 904 x 10
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
Year [integer]
Mean (sd) : 1986.7 (17.2)
min ≤ med ≤ max:
1954 ≤ 1987.5 ≤ 2017
IQR (CV) : 28 (0)
64 distinct values 0 (0.0%)
Month [integer]
Mean (sd) : 6.6 (3.5)
min ≤ med ≤ max:
1 ≤ 7 ≤ 12
IQR (CV) : 6 (0.5)
12 distinct values 0 (0.0%)
Day [integer]
Mean (sd) : 3.6 (6.8)
min ≤ med ≤ max:
1 ≤ 1 ≤ 31
IQR (CV) : 0 (1.9)
29 distinct values 0 (0.0%)
Federal.Funds.Target.Rate [numeric]
Mean (sd) : 5.7 (2.6)
min ≤ med ≤ max:
1 ≤ 5.5 ≤ 11.5
IQR (CV) : 4 (0.5)
63 distinct values 442 (48.9%)
Federal.Funds.Upper.Target [numeric]
Mean (sd) : 0.3 (0.1)
min ≤ med ≤ max:
0.2 ≤ 0.2 ≤ 1
IQR (CV) : 0 (0.5)
4 distinct values 801 (88.6%)
Federal.Funds.Lower.Target [numeric]
Mean (sd) : 0.1 (0.1)
min ≤ med ≤ max:
0 ≤ 0 ≤ 0.8
IQR (CV) : 0 (2.4)
4 distinct values 801 (88.6%)
Effective.Federal.Funds.Rate [numeric]
Mean (sd) : 4.9 (3.6)
min ≤ med ≤ max:
0.1 ≤ 4.7 ≤ 19.1
IQR (CV) : 4.2 (0.7)
466 distinct values 152 (16.8%)
Real.GDP..Percent.Change. [numeric]
Mean (sd) : 3.1 (3.6)
min ≤ med ≤ max:
-10 ≤ 3.1 ≤ 16.5
IQR (CV) : 3.5 (1.1)
113 distinct values 654 (72.3%)
Unemployment.Rate [numeric]
Mean (sd) : 6 (1.6)
min ≤ med ≤ max:
3.4 ≤ 5.7 ≤ 10.8
IQR (CV) : 2.1 (0.3)
71 distinct values 152 (16.8%)
Inflation.Rate [numeric]
Mean (sd) : 3.7 (2.6)
min ≤ med ≤ max:
0.6 ≤ 2.8 ≤ 13.6
IQR (CV) : 2.7 (0.7)
106 distinct values 194 (21.5%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-04-02

We have data on federal funds rates between 1954 and 2017. The variables include a target rate, as well as upper and lower targets. Along with that, it also contains effective rates which seem to be the actual prevailing rates at that point of time. We also have percentage change in GDP, unemployment and inflation.

After viewing the data, it can be seen that we need to combine the date into a single variable (from the year, month and day). I also select away the previous three variables of month year and day, and rename the other variables.

Code
mydata2 <- mydata %>% 
mutate (date=ymd(paste(Year, Month, Day, sep = "-")))%>%
select(date, everything())%>%
select(-Year,-Month,-Day)%>% 
  rename (Target_Rate = "Federal.Funds.Target.Rate", Upper_Target="Federal.Funds.Upper.Target", Lower_Target="Federal.Funds.Lower.Target", Effective_Rate="Effective.Federal.Funds.Rate", GDP_Percent_Change="Real.GDP..Percent.Change.", Unemployment_Rate="Unemployment.Rate", Inflation_Rate="Inflation.Rate")   
glimpse(mydata2)
Rows: 904
Columns: 8
$ date               <date> 1954-07-01, 1954-08-01, 1954-09-01, 1954-10-01, 19…
$ Target_Rate        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Upper_Target       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Lower_Target       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Effective_Rate     <dbl> 0.80, 1.22, 1.06, 0.85, 0.83, 1.28, 1.39, 1.29, 1.3…
$ GDP_Percent_Change <dbl> 4.6, NA, NA, 8.0, NA, NA, 11.9, NA, NA, 6.7, NA, NA…
$ Unemployment_Rate  <dbl> 5.8, 6.0, 6.1, 5.7, 5.3, 5.0, 4.9, 4.7, 4.6, 4.7, 4…
$ Inflation_Rate     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
Code
View(mydata2)

It can be seen that the date variable has been appropriately created, and other variables are modified.

Tidy Data (as needed)

The data is already tidy, with ymd observations from 1954 to 2017.

Missing values

The next task is to look at the missing values systematically.

Code
df_missing <- mydata2 %>% 
  summarise_all(funs(n_miss = sum(is.na(.)), prop_miss = mean(is.na(.))))
glimpse(df_missing)
Rows: 1
Columns: 16
$ date_n_miss                  <int> 0
$ Target_Rate_n_miss           <int> 442
$ Upper_Target_n_miss          <int> 801
$ Lower_Target_n_miss          <int> 801
$ Effective_Rate_n_miss        <int> 152
$ GDP_Percent_Change_n_miss    <int> 654
$ Unemployment_Rate_n_miss     <int> 152
$ Inflation_Rate_n_miss        <int> 194
$ date_prop_miss               <dbl> 0
$ Target_Rate_prop_miss        <dbl> 0.4889381
$ Upper_Target_prop_miss       <dbl> 0.8860619
$ Lower_Target_prop_miss       <dbl> 0.8860619
$ Effective_Rate_prop_miss     <dbl> 0.1681416
$ GDP_Percent_Change_prop_miss <dbl> 0.7234513
$ Unemployment_Rate_prop_miss  <dbl> 0.1681416
$ Inflation_Rate_prop_miss     <dbl> 0.2146018

Here, we can see the proportion of missing values. I suspect that the missing values could be date specific.

Code
ggplot(mydata2, aes(x = date, y = Target_Rate)) +
  geom_miss_point()

Code
ggplot(mydata2, aes(x = date, y = Upper_Target)) +
  geom_miss_point()

Code
ggplot(mydata2, aes(x = date, y = Lower_Target)) +
  geom_miss_point()

We can see that the variable Target_Rate is completely missing before 1980’s and is also missing after late 2000s. It seems that upper and lower targets replace the Target_Rate variable.

Now, I first filter data from 1980, and also impute the missing values from the later part of the data where Target_Rate is missing, but lower and upper target rates are present.

Identifying variables that need to be mutated

Code
mydata3 <- mydata2 %>% 
     filter(date> ymd("1982-09-01"))%>%
     mutate (Imputed_Target_Rate = ifelse(is.na(Target_Rate), (Lower_Target + Upper_Target) / 2, Target_Rate))
glimpse(mydata3)
Rows: 565
Columns: 9
$ date                <date> 1982-09-27, 1982-10-01, 1982-10-07, 1982-11-01, 1…
$ Target_Rate         <dbl> 10.2500, 10.0000, 9.5000, 9.5000, 9.0000, 9.0000, …
$ Upper_Target        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Lower_Target        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Effective_Rate      <dbl> NA, 9.71, NA, 9.20, NA, 8.95, NA, 8.68, 8.51, 8.77…
$ GDP_Percent_Change  <dbl> NA, 0.4, NA, NA, NA, NA, NA, 5.3, NA, NA, NA, 9.4,…
$ Unemployment_Rate   <dbl> NA, 10.4, NA, 10.8, NA, 10.8, NA, 10.4, 10.4, 10.3…
$ Inflation_Rate      <dbl> NA, 5.9, NA, 5.3, NA, 4.5, NA, 4.7, 4.7, 4.7, NA, …
$ Imputed_Target_Rate <dbl> 10.2500, 10.0000, 9.5000, 9.5000, 9.0000, 9.0000, …

We can now see that the effective rate and the target rate usually move together, and that pattern is retained even after imputation. I this imputation, I have imputed the missing target rates, with the averages of lower and upper targets.

Code
mydata4 <- mydata3 %>%
gather(variable, value, Imputed_Target_Rate, Effective_Rate) 

ggplot(mydata4, aes(x = date, y = value, color = variable)) +
geom_line()

We could moreover, impute the missing values of GDP_percent_change, Unemployment_rate, and Inflation_rate by the annual averages:

Code
mydata5 <- mydata3 %>%
  group_by(year = lubridate::year(date)) %>%
  mutate(GDP_Percent_Change_imputed = ifelse(is.na(GDP_Percent_Change), mean(GDP_Percent_Change, na.rm = TRUE), GDP_Percent_Change)) %>%
  mutate(Unemployment_Rate_imputed = ifelse(is.na(Unemployment_Rate), mean(Unemployment_Rate, na.rm = TRUE), Unemployment_Rate)) %>%
 mutate(Inflation_Rate_imputed = ifelse(is.na(Inflation_Rate), mean(Inflation_Rate, na.rm = TRUE), Inflation_Rate)) %>% 
  ungroup()

mydata6 <- mydata5 %>%
  select(date, Imputed_Target_Rate, Effective_Rate, GDP_Percent_Change_imputed, Unemployment_Rate_imputed, Inflation_Rate_imputed)
glimpse(mydata6)
Rows: 565
Columns: 6
$ date                       <date> 1982-09-27, 1982-10-01, 1982-10-07, 1982-1…
$ Imputed_Target_Rate        <dbl> 10.2500, 10.0000, 9.5000, 9.5000, 9.0000, 9…
$ Effective_Rate             <dbl> NA, 9.71, NA, 9.20, NA, 8.95, NA, 8.68, 8.5…
$ GDP_Percent_Change_imputed <dbl> 0.400, 0.400, 0.400, 0.400, 0.400, 0.400, 0…
$ Unemployment_Rate_imputed  <dbl> 10.66667, 10.40000, 10.66667, 10.80000, 10.…
$ Inflation_Rate_imputed     <dbl> 5.233333, 5.900000, 5.233333, 5.300000, 5.2…

This is how the final dataset looks like after imputations.