challenge_4
Ishan Bhardwaj
fed_rates
Data cleaning and mutations
Author

Ishan Bhardwaj

Published

May 20, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • abc_poll.csv ⭐
  • poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
  • FedFundsRate.csv⭐⭐⭐
  • hotel_bookings.csv⭐⭐⭐⭐
  • debt_in_trillions.xlsx ⭐⭐⭐⭐⭐
Code
fed_funds <- read_csv("_data/FedFundsRate.csv")
fed_funds
# A tibble: 904 × 10
    Year Month   Day `Federal Funds Target Rate` `Federal Funds Upper Target`
   <dbl> <dbl> <dbl>                       <dbl>                        <dbl>
 1  1954     7     1                          NA                           NA
 2  1954     8     1                          NA                           NA
 3  1954     9     1                          NA                           NA
 4  1954    10     1                          NA                           NA
 5  1954    11     1                          NA                           NA
 6  1954    12     1                          NA                           NA
 7  1955     1     1                          NA                           NA
 8  1955     2     1                          NA                           NA
 9  1955     3     1                          NA                           NA
10  1955     4     1                          NA                           NA
# ℹ 894 more rows
# ℹ 5 more variables: `Federal Funds Lower Target` <dbl>,
#   `Effective Federal Funds Rate` <dbl>, `Real GDP (Percent Change)` <dbl>,
#   `Unemployment Rate` <dbl>, `Inflation Rate` <dbl>

Briefly describe the data

Code
table(select(fed_funds, "Year"))
Year
1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 
   6   12   12   12   12   12   12   12   12   12   12   12   12   12   12   12 
1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 
  12   12   12   12   12   12   12   12   12   12   12   12   16   20   25   21 
1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 
  19   21   24   23   17   21   15   12   18   14   13   13   15   15   15   23 
2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 
  13   13   17   19   16   15   19   12   12   12   12   12   12   13   13    4 
Code
normal_obsv <- filter(fed_funds, `Year` == 1955)
normal_obsv
# A tibble: 12 × 10
    Year Month   Day `Federal Funds Target Rate` `Federal Funds Upper Target`
   <dbl> <dbl> <dbl>                       <dbl>                        <dbl>
 1  1955     1     1                          NA                           NA
 2  1955     2     1                          NA                           NA
 3  1955     3     1                          NA                           NA
 4  1955     4     1                          NA                           NA
 5  1955     5     1                          NA                           NA
 6  1955     6     1                          NA                           NA
 7  1955     7     1                          NA                           NA
 8  1955     8     1                          NA                           NA
 9  1955     9     1                          NA                           NA
10  1955    10     1                          NA                           NA
11  1955    11     1                          NA                           NA
12  1955    12     1                          NA                           NA
# ℹ 5 more variables: `Federal Funds Lower Target` <dbl>,
#   `Effective Federal Funds Rate` <dbl>, `Real GDP (Percent Change)` <dbl>,
#   `Unemployment Rate` <dbl>, `Inflation Rate` <dbl>
Code
diff_obsv <- filter(fed_funds, `Year` == 1988)
diff_obsv
# A tibble: 24 × 10
    Year Month   Day `Federal Funds Target Rate` `Federal Funds Upper Target`
   <dbl> <dbl> <dbl>                       <dbl>                        <dbl>
 1  1988     1     1                        6.81                           NA
 2  1988     1    28                        6.62                           NA
 3  1988     2     1                        6.62                           NA
 4  1988     2    11                        6.5                            NA
 5  1988     3     1                        6.5                            NA
 6  1988     3    30                        6.75                           NA
 7  1988     4     1                        6.75                           NA
 8  1988     5     1                        6.75                           NA
 9  1988     5     9                        7                              NA
10  1988     5    25                        7.25                           NA
# ℹ 14 more rows
# ℹ 5 more variables: `Federal Funds Lower Target` <dbl>,
#   `Effective Federal Funds Rate` <dbl>, `Real GDP (Percent Change)` <dbl>,
#   `Unemployment Rate` <dbl>, `Inflation Rate` <dbl>

This dataset provides federal funds rate targets with an upper and lower boundary, the actual federal funds rate for a specified year + month + day, and general economic info from 1954 to 2017. As seen from the tables, these values are usually entered on the first day of every month. In certain cases, these values are also entered either halfway through the month or at random dates. These irregular entries may hint at periods of economic fluctuation during those months.

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

Code
dim(fed_funds)
[1] 904  10
Code
new_rows <- nrow(fed_funds) * 4
new_cols <- ncol(fed_funds) - 4 + 2
# Expected number of rows
new_rows
[1] 3616
Code
# Expected number of columns
new_cols
[1] 8

This dataset is not tidy because the federal funds target rate, upper and lower targets, and effective values are all observances of one variable: the federal funds rate type. We can compress these four columns into one, whose values are in a new column titled “Rate”. Hence, as a sanity check, I have calculated the expected dimensions of the transformed dataset.

Code
fed_funds_tidy <- pivot_longer(fed_funds, 4:7, names_to = "Federal Funds Rate Type", values_to = "Rate")
fed_funds_tidy
# A tibble: 3,616 × 8
    Year Month   Day Real GDP (Percent Ch…¹ `Unemployment Rate` `Inflation Rate`
   <dbl> <dbl> <dbl>                  <dbl>               <dbl>            <dbl>
 1  1954     7     1                    4.6                 5.8               NA
 2  1954     7     1                    4.6                 5.8               NA
 3  1954     7     1                    4.6                 5.8               NA
 4  1954     7     1                    4.6                 5.8               NA
 5  1954     8     1                   NA                   6                 NA
 6  1954     8     1                   NA                   6                 NA
 7  1954     8     1                   NA                   6                 NA
 8  1954     8     1                   NA                   6                 NA
 9  1954     9     1                   NA                   6.1               NA
10  1954     9     1                   NA                   6.1               NA
# ℹ 3,606 more rows
# ℹ abbreviated name: ¹​`Real GDP (Percent Change)`
# ℹ 2 more variables: `Federal Funds Rate Type` <chr>, Rate <dbl>
Code
dim(fed_funds_tidy)
[1] 3616    8

As expected, we end up with a dataset with 3616 rows and 8 columns. Each date will now be observed across four types of federal funds rates.

Identify variables that need to be mutated

Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

Document your work here.

Code
fed_funds_tidy <- fed_funds_tidy %>%
  mutate(Date = ymd(str_c(Year, Month, Day, sep="-"))) %>%
  select(Date, 7:8, 4:6)
fed_funds_tidy
# A tibble: 3,616 × 6
   Date       `Federal Funds Rate Type`     Rate `Real GDP (Percent Change)`
   <date>     <chr>                        <dbl>                       <dbl>
 1 1954-07-01 Federal Funds Target Rate    NA                            4.6
 2 1954-07-01 Federal Funds Upper Target   NA                            4.6
 3 1954-07-01 Federal Funds Lower Target   NA                            4.6
 4 1954-07-01 Effective Federal Funds Rate  0.8                          4.6
 5 1954-08-01 Federal Funds Target Rate    NA                           NA  
 6 1954-08-01 Federal Funds Upper Target   NA                           NA  
 7 1954-08-01 Federal Funds Lower Target   NA                           NA  
 8 1954-08-01 Effective Federal Funds Rate  1.22                        NA  
 9 1954-09-01 Federal Funds Target Rate    NA                           NA  
10 1954-09-01 Federal Funds Upper Target   NA                           NA  
# ℹ 3,606 more rows
# ℹ 2 more variables: `Unemployment Rate` <dbl>, `Inflation Rate` <dbl>

In the original dataset, the year, month, and day were in separate columns, which is incorrect. These should all be collapsed into one variable “Date”, which is what I have done. I first concatenated each year, month, and day into a string delimited by - and then used ymd() from lubridate to convert it to proper date format. These new dates were added into a new column “Date” and finally, the relevant columns were selected from the tidy dataset.