library(tidyverse)
library(here)
library(lubridate)
library(readxl)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Challenge 4 Solutions
Challenge Overview
Today’s challenge is to:
- read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data (as needed, including sanity checks)
- identify variables that need to be mutated
- mutate variables and sanity check all mutations
Two tidyverse packages will be used heavily today: lubridate (which is not automatically loaded) and stringr (which is part of core tidyverse).
I use the here package to handle directories whenever I work in an R project. This helps simplify the file path issues that can arise when collaborating with others (for example, working on a shared course blog!).
The package (mainly) consists of one function, here(), which helps build a file path. When run with no arguments, the function always returns a string of the home directory for the R project the user is working in.
here()
[1] "/Users/seanconway/Github/601_Winter_2022-2023"
The user can then specify additional “sub-directories” as strings, separated by a comma.
here("posts","_data")
[1] "/Users/seanconway/Github/601_Winter_2022-2023/posts/_data"
This file path will be different for any user who checks out this R
project, but any operations using the file path will still work (providing the appropriate file exists).
test_file_name <- here("posts", "_data", "cereal.csv")
test_file_name
[1] "/Users/seanconway/Github/601_Winter_2022-2023/posts/_data/cereal.csv"
file.exists(test_file_name)
[1] TRUE
While here is not the only solution for file paths, it's one of several very suitable options.
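For comparison, a base-R relative path built with file.path() can do the same job, but only if the working directory happens to be the project root - exactly the fragility that here() avoids. A minimal sketch (alt_file_name is just an illustrative name):
# base-R alternative (illustration only): a relative path that works
# only when the working directory is the project root
alt_file_name <- file.path("posts", "_data", "cereal.csv")
file.exists(alt_file_name)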
abc_poll_orig <- here("posts", "_data", "abc_poll_2021.csv") %>%
  read_csv()
# political questions
abc_poll_orig %>%
  select(starts_with("Q")) %>%
  colnames()
[1] "Q1_a" "Q1_b" "Q1_c" "Q1_d" "Q1_e" "Q1_f" "Q2" "Q3" "Q4" "Q5"
[11] "QPID"
# all but one demographic variable
abc_poll_orig %>%
  select(starts_with("pp")) %>%
  colnames()
[1] "ppage" "ppeduc5" "ppeducat" "ppgender" "ppethm" "pphhsize"
[7] "ppinc7" "ppmarit5" "ppmsacat" "ppreg4" "pprent" "ppstaten"
[13] "PPWORKA" "ppemploy"
# national poll
n_distinct(abc_poll_orig$ppstaten)
[1] 49
The ABC Poll appears to be a national sample survey (presumably from 2019) with 527 respondents. There are 10 political attitudes questions, plus party identification, in addition to 15 demographic variables (some with re-coded information) and 5 survey administration variables.
print(summarytools::dfSummary(abc_poll_orig,
varnumbers = FALSE,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.70,
valid.col = FALSE),
method = 'render',
table.classes = 'table-condensed')
Data Frame Summary: abc_poll_orig
Dimensions: 527 x 31 | Duplicates: 0 | Missing values: none
Numeric variables: id (527 distinct values), ppage (72 distinct values), weights_pid (453 distinct values)
Character variables: xspanish, complete_status, ppeduc5, ppeducat, ppgender, ppethm, pphhsize, ppinc7, ppmarit5, ppmsacat, ppreg4, pprent, ppstaten, PPWORKA, ppemploy, Q1_a through Q1_f, Q2 through Q5, QPID, ABCAGE, Contact
(Full frequency tables and graphs omitted. Generated by summarytools 1.0.1, R version 4.2.1, 2023-01-04.)
There are lots of string variables that might need to be modified for analysis or visualization. For example, the party ID variable uses "A Democrat" rather than the more standard "Democrat", and there is a "Skipped" response that should be treated as missing data. Let's see if we can fix it.
#starting point
table(abc_poll_orig$QPID)
A Democrat A Republican An Independent Skipped Something else
176 152 168 3 28
# mutate
abc_poll <- abc_poll_orig %>%
  mutate(partyid = str_remove(QPID, "A[n]* "),
         partyid = case_when(
           str_detect(QPID, "Skipped") ~ NA_character_,
           TRUE ~ partyid
         )) %>%
  select(-QPID)
#sanity check
table(abc_poll$partyid)
Democrat Independent Republican Something else
176 168 152 28
unique(abc_poll$partyid)
[1] "Democrat" "Independent" "Something else" "Republican"
[5] NA
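An equivalent recode (a sketch, not the approach used above; abc_poll_alt is just a throwaway name) handles the missing values first with na_if() and then strips the leading article:
# alternative sketch: convert "Skipped" to NA first, then remove "A "/"An "
abc_poll_alt <- abc_poll_orig %>%
  mutate(partyid = na_if(QPID, "Skipped"),
         partyid = str_remove(partyid, "A[n]* ")) %>%
  select(-QPID)
table(abc_poll_alt$partyid)
Because table() drops NA values by default, the counts should match the sanity check above.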
Ethnic Identity
The ethnic identity variable is long and could be tough to include in graphs. Let's see if we can modify it - but we would need to include a table note to explain what the shortened labels mean (e.g., that the racial labels mean non-Hispanic, and that Hispanic responses don't indicate race).
#starting point
table(abc_poll$ppethm)
2+ Races, Non-Hispanic Black, Non-Hispanic Hispanic
21 27 51
Other, Non-Hispanic White, Non-Hispanic
24 404
# mutate
abc_poll <- abc_poll %>%
  mutate(ethnic = str_remove(ppethm, ", Non-Hispanic")) %>%
  select(-ppethm)
#sanity check
table(abc_poll$ethnic)
2+ Races Black Hispanic Other White
21 27 51 24 404
Removing “Skipped”
What about the political variables that all have "Skipped" - a value that should probably be replaced with NA for analysis? Let's use the across() function to make this easier.
abc_poll <- abc_poll %>%
  mutate(across(starts_with("Q"), ~ na_if(.x, "Skipped")))
# purrr - a bit advanced for this particular challenge
map(select(abc_poll, starts_with("Q1")), table)
$Q1_a
Approve Disapprove
329 193
$Q1_b
Approve Disapprove
192 322
$Q1_c
Approve Disapprove
272 248
$Q1_d
Approve Disapprove
192 321
$Q1_e
Approve Disapprove
212 301
$Q1_f
Approve Disapprove
281 230
Factor order
Finally, what if you would like the categories of your variable to appear in a specific order, like the education variable that is currently in alphabetical order?
The factor variable type links variable labels to an underlying numeric order, and allows you to maintain the specified order for tables and graphics. Character strings always appear in alphabetical order.
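Here is a tiny illustration with toy data (not from the poll): table() on a character vector sorts alphabetically, while a factor follows the levels we specify.
# toy example: character ordering vs. factor-level ordering
x <- c("Medium", "Low", "High", "Low")
table(x)                                               # alphabetical: High, Low, Medium
table(factor(x, levels = c("Low", "Medium", "High")))  # follows the specified levels
With that in mind, back to the education variable.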
table(abc_poll$ppeducat)
Bachelors degree or higher High school
207 133
Less than high school Some college
29 158
edulabs <- unique(abc_poll$ppeducat)
edulabs
[1] "High school" "Bachelors degree or higher"
[3] "Some college" "Less than high school"
levs <- c("Less than high school",
          "High school",
          "Some college",
          "Bachelors degree or higher")
abc_poll <- abc_poll %>%
  mutate(educ = factor(ppeducat,
                       levels = levs)) %>%  # equivalently: levels = edulabs[c(4,1,3,2)]
  select(-ppeducat)
rm(edulabs)
rm(edulabs)
table(abc_poll$educ)
Less than high school High school
29 133
Some college Bachelors degree or higher
158 207
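For what it's worth, forcats (loaded with core tidyverse) offers another route. A sketch using fct_relevel(), shown on the original data since ppeducat has already been dropped from abc_poll:
# forcats sketch: fct_relevel() moves the named levels to the front,
# leaving the remaining level ("Bachelors degree or higher") last
abc_poll_orig %>%
  mutate(educ = fct_relevel(ppeducat,
                            "Less than high school",
                            "High school",
                            "Some college")) %>%
  pull(educ) %>%
  table()
Either way, the resulting level order is what table() and ggplot2 will respect.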
This section builds on the code available in the solution to Challenge 3, where we pivoted the organic eggs pricing data. The data report the average price per carton paid to the farmer or producer for organic eggs (and organic chicken), reported monthly from 2004 to 2013. Average price is reported by carton type, which can vary in both size (x-large or large) and quantity (half-dozen or dozen).
Read Data
We are reading in half of the data from this workbook - the other half contains information about the price of organic chicken.
eggs_orig <- here("posts", "_data", "organiceggpoultry.xls") %>%
  read_excel(sheet = "Data",
             range = "B6:F125",
             col_names = c("date",
                           "xlarge_dozen",
                           "xlarge_halfdozen",
                           "large_dozen",
                           "large_halfdozen")
             )
Clean and Mutate
We are going to remove the note from the first column of the data, split the month and year, fill in the missing years, and pivot into long format before transforming the month and year columns into a date.
eggs <- eggs_orig %>%
  mutate(date = str_remove(date, " /1")) %>%
  separate(date, into = c("month", "year"), sep = " ") %>%
  fill(year) %>%
  pivot_longer(cols = contains("large"),
               names_to = c("size", "quantity"),
               names_sep = "_",
               values_to = "price")
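A quick sanity check on the reshaped data (a sketch, not part of the original solution): each carton type should appear once per month, which works out to 120 rows apiece if the range covers all ten years in full.
# each size/quantity combination should appear once per month
eggs %>%
  count(size, quantity)
# and the number of distinct year-month combinations should match that count
eggs %>%
  distinct(year, month) %>%
  nrow()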
Now, we need to create a date from a month and year. I can see that the months are a mix of full month names and a three-character abbreviation (January appears as "Jan"), and the years are four-digit years. Do I need to adjust the month strings manually, or can lubridate fix things for me?
I'm going to combine the month with the now-complete year column, and then parse the resulting "month year" strings using my().
eggs <- eggs %>%
  mutate(date = str_c(month, year, sep = " "),
         date = my(date))
select(eggs, month, year, date)
month <chr> | year <chr> | date <date>
---|---|---
Jan | 2004 | 2004-01-01
Jan | 2004 | 2004-01-01
Jan | 2004 | 2004-01-01
Jan | 2004 | 2004-01-01
February | 2004 | 2004-02-01
February | 2004 | 2004-02-01
February | 2004 | 2004-02-01
February | 2004 | 2004-02-01
March | 2004 | 2004-03-01
March | 2004 | 2004-03-01
Interesting - lubridate automatically fills in the first day of the month. Maybe we would prefer the last day, or even the middle of the month?
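If we did want the last day of the month, one option (a sketch built on the my() result above; date_eom is just an illustrative name) is lubridate's ceiling_date(), which rolls a Date up to the start of the next month, so stepping back one day lands on the month end:
# month-end sketch: roll to the first of the next month, then step back a day
eggs %>%
  mutate(date_eom = ceiling_date(date, unit = "month") - days(1)) %>%
  select(month, year, date, date_eom) %>%
  head()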
Note that we can't easily use make_datetime() for this example, as we would then need to transform the irregular month names into numeric values.
# month is still a character month name here, so make_datetime() returns NA (shown below)
eggs <- eggs %>%
  mutate(date = make_datetime(month, "15", year))
select(eggs, month, year, date)
month <chr> | year <chr> | date <dttm>
---|---|---
Jan | 2004 | <NA>
Jan | 2004 | <NA>
Jan | 2004 | <NA>
Jan | 2004 | <NA>
February | 2004 | <NA>
February | 2004 | <NA>
February | 2004 | <NA>
February | 2004 | <NA>
March | 2004 | <NA>
March | 2004 | <NA>
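As the all-NA column above shows, make_datetime() can't handle the month names directly. One workaround (a sketch, not the route taken below; month_num is just an illustrative helper column) is to translate the names to month numbers first, using R's built-in month.name and month.abb vectors:
# sketch: map full or abbreviated month names to numbers, then build the date
eggs %>%
  mutate(month_num = coalesce(match(month, month.name),
                              match(month, month.abb)),
         date = make_datetime(as.integer(year), month_num, 15L)) %>%
  select(month, year, date) %>%
  head()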
The mdy() function works just fine if we put all the information in the proper format, though.
eggs <- eggs %>%
  mutate(date = mdy(str_c(month, "15", year, sep = "/")))
select(eggs, month, year, date)
month <chr> | year <chr> | date <date>
---|---|---
Jan | 2004 | 2004-01-15
Jan | 2004 | 2004-01-15
Jan | 2004 | 2004-01-15
Jan | 2004 | 2004-01-15
February | 2004 | 2004-02-15
February | 2004 | 2004-02-15
February | 2004 | 2004-02-15
February | 2004 | 2004-02-15
March | 2004 | 2004-03-15
March | 2004 | 2004-03-15
This data set runs from July 1954 to March 2017, and includes daily macroeconomic indicators related to the effective federal funds rate - or the interest rate at which banks lend money to each other in order to meet mandated reserve requirements.
A single case is a year-month-day, and there are 7 values that can be pivoted or not depending on the needs of the analyst. Four values are related to the federal funds rate (target, upper target, lower target, and effective), while the other three are related macroeconomic indicators (inflation among them).
For now, let's just focus on mutating the date.
fed_rates_orig <- here("posts", "_data", "FedFundsRate.csv") %>%
  read_csv()
fed_rates_orig
Year <dbl> | Month <dbl> | Day <dbl> | Federal Funds Target Rate <dbl> | Federal Funds Upper Target <dbl> | Federal Funds Lower Target <dbl> | Effective Federal Funds Rate <dbl>
---|---|---|---|---|---|---
1954 | 7 | 1 | NA | NA | NA | 0.80
1954 | 8 | 1 | NA | NA | NA | 1.22
1954 | 9 | 1 | NA | NA | NA | 1.06
1954 | 10 | 1 | NA | NA | NA | 0.85
1954 | 11 | 1 | NA | NA | NA | 0.83
1954 | 12 | 1 | NA | NA | NA | 1.28
1955 | 1 | 1 | NA | NA | NA | 1.39
1955 | 2 | 1 | NA | NA | NA | 1.29
1955 | 3 | 1 | NA | NA | NA | 1.35
1955 | 4 | 1 | NA | NA | NA | 1.43
Once again, it looks like we will need to combine the year, month, and day using stringr::str_c(), then use lubridate to transform the result into a date. Alternatively, because both month and day are numeric variables, we can use make_datetime().
fed_rates <- fed_rates_orig %>%
  mutate(date = str_c(Year, Month, Day, sep = "-"),
         date = ymd(date))
summary(fed_rates$date)
Min. 1st Qu. Median Mean 3rd Qu. Max.
"1954-07-01" "1973-04-23" "1987-12-16" "1987-02-25" "2001-06-07" "2017-03-16"
You can now go through and figure out whether there are patterns in the missingness of specific indicators by date (maybe the values are only measured once a month or once a quarter, and we need to use fill(), or maybe there is something else going on?).
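Before moving on, two quick sketches (neither is part of the original solution). First, the make_datetime() route mentioned above; it returns a date-time, so as_date() converts back to a plain date to match the summary we just produced:
# sketch: build the date directly from the numeric components
fed_rates_orig %>%
  mutate(date = as_date(make_datetime(Year, Month, Day))) %>%
  pull(date) %>%
  summary()
Second, one way to start checking the missingness patterns by date - here, counting missing target-rate values per year (the backticks are needed because the original column names contain spaces):
# sketch: count missing target-rate values by year to look for patterns
fed_rates %>%
  group_by(Year) %>%
  summarise(n_missing = sum(is.na(`Federal Funds Target Rate`))) %>%
  head()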
This data set contains 119,390 hotel bookings from two hotels (“City Hotel” and “Resort Hotel”) with an arrival date between July 2015 and August 2017 (more detail needed), including bookings that were later cancelled. See Solution Set 2 for additional details. The data are a de-identified extract of real hotel demand data, made available by the authors.
bookings_orig <- here("posts", "_data", "hotel_bookings.csv") %>%
  read_csv()
select(bookings_orig, starts_with("arrival"))
arrival_date_year <dbl> | arrival_date_month <chr> | arrival_date_week_number <dbl> | arrival_date_day_of_month <dbl>
---|---|---|---
2015 | July | 27 | 1
2015 | July | 27 | 1
2015 | July | 27 | 1
2015 | July | 27 | 1
2015 | July | 27 | 1
2015 | July | 27 | 1
2015 | July | 27 | 1
2015 | July | 27 | 1
2015 | July | 27 | 1
2015 | July | 27 | 1
Last time we looked at these data, I went to pretty extraordinary lengths to confirm the dates covered by the data. Let's see how much easier that is if we store the arrival date as a date-type variable instead! Those are long variable names - thank goodness we can get rid of them. Note that we only need three of the four pieces of information provided.
Look how I can mess around with the format, and lubridate still recovers the date!
bookings <- bookings_orig %>%
  mutate(date_arrival = str_c(arrival_date_day_of_month,
                              arrival_date_month,
                              arrival_date_year, sep = "/"),
         date_arrival = dmy(date_arrival)) %>%
  select(-starts_with("arrival"))
summary(bookings$date_arrival)
Min. 1st Qu. Median Mean 3rd Qu. Max.
"2015-07-01" "2016-03-13" "2016-09-06" "2016-08-28" "2017-03-18" "2017-08-31"
There are other relevant time variables in the data set that may be worth exploring. For example, we are given a lead time measure in days (integer), but we could recover a booking date with lubridate. This would allow us to more easily explore visually whether, for example, some people were more likely to make bookings over the winter for summer trips, but in the fall for winter trips - or some other seasonal pattern.
bookings <- bookings %>%
  mutate(date_booking = date_arrival - days(lead_time))
summary(bookings$date_booking)
Min. 1st Qu. Median Mean 3rd Qu. Max.
"2013-06-24" "2015-11-28" "2016-05-04" "2016-05-16" "2016-12-09" "2017-08-31"
We can also go in the reverse direction. For example, if we wanted to know how many days before arrival the reservation status last changed, we can generate this by comparing the arrival date to the reservation status date.
summary(bookings$reservation_status_date)
Min. 1st Qu. Median Mean 3rd Qu. Max.
"2014-10-17" "2016-02-01" "2016-08-07" "2016-07-30" "2017-02-08" "2017-09-14"
bookings <- bookings %>%
  mutate(change_days = interval(reservation_status_date, date_arrival),
         change_days = change_days %/% days(1))
summary(bookings$change_days)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-69.00 -3.00 -1.00 29.68 26.00 526.00
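Because both columns are already Date variables, plain subtraction gives the same whole-day difference (a sketch of an equivalent, slightly terser approach; change_days2 is just an illustrative name):
# sketch: date subtraction yields a difftime in days
bookings %>%
  mutate(change_days2 = as.numeric(date_arrival - reservation_status_date)) %>%
  pull(change_days2) %>%
  summary()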
This data set runs from the first quarter of 2003 to the second quarter of 2021, and includes quarterly measures of the total amount of household debt associated with 6 different types of loans - mortgage, HE revolving, auto, credit card, student, and other - plus a total household debt figure covering all 6 loan types. This is another fantastic macroeconomic data product from the New York Federal Reserve. Detailed notes on the website reveal that the data are sourced from Equifax, and explain why data prior to 2003 are no longer part of the primary data publication.
debt_orig <- here("posts", "_data", "debt_in_trillions.xlsx") %>%
  read_excel()
debt_orig
Year and Quarter <chr> | Mortgage <dbl> | HE Revolving <dbl> | Auto Loan <dbl> | Credit Card <dbl> | Student Loan <dbl> | Other <dbl> | Total <dbl>
---|---|---|---|---|---|---|---
03:Q1 | 4.942 | 0.2420 | 0.6410 | 0.6880 | 0.2407000 | 0.4776 | 7.23130 |
03:Q2 | 5.080 | 0.2600 | 0.6220 | 0.6930 | 0.2429000 | 0.4860 | 7.38390 |
03:Q3 | 5.183 | 0.2690 | 0.6840 | 0.6930 | 0.2488000 | 0.4773 | 7.55510 |
03:Q4 | 5.660 | 0.3020 | 0.7040 | 0.6980 | 0.2529000 | 0.4486 | 8.06550 |
04:Q1 | 5.840 | 0.3280 | 0.7200 | 0.6950 | 0.2598000 | 0.4465 | 8.28930 |
04:Q2 | 5.967 | 0.3670 | 0.7430 | 0.6970 | 0.2629000 | 0.4231 | 8.46000 |
04:Q3 | 6.210 | 0.4260 | 0.7510 | 0.7060 | 0.3300000 | 0.4100 | 8.83300 |
04:Q4 | 6.360 | 0.4680 | 0.7280 | 0.7170 | 0.3457000 | 0.4229 | 9.04160 |
05:Q1 | 6.512 | 0.5020 | 0.7250 | 0.7100 | 0.3636000 | 0.3941 | 9.20670 |
05:Q2 | 6.696 | 0.5280 | 0.7740 | 0.7170 | 0.3744000 | 0.4024 | 9.49180 |
A single case is a year-quarter, and there are 6 (or 7) values that can be pivoted or not depending on the needs of the analyst. The tricky part is figuring out how to tell R to treat the quarters as a date! We could take the long road and separate the year and quarter information, then fix the year to be numeric, recombine, etc. But let's use the flexible orders argument of parse_date_time(), plus a little regular-expression-style knowledge, and read the information directly.
debt <- debt_orig %>%
  mutate(date = parse_date_time(`Year and Quarter`,
                                orders = "yq"))
summary(debt$date)
Min. 1st Qu.
"2003-01-01 00:00:00.0000" "2007-07-24 00:00:00.0000"
Median Mean
"2012-02-15 12:00:00.0000" "2012-02-15 06:09:43.7837"
3rd Qu. Max.
"2016-09-08 00:00:00.0000" "2021-04-01 00:00:00.0000"
Wow, isn’t that super simple!
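One small follow-up worth noting (a sketch; qtr is just an illustrative column name): parse_date_time() returns a date-time (POSIXct), so if a plain date or an explicit quarter number is handier for later grouping and plotting, lubridate can finish the job.
# sketch: convert to a plain Date and extract the quarter
debt %>%
  mutate(date = as_date(date),
         qtr = quarter(date)) %>%
  select(`Year and Quarter`, date, qtr) %>%
  head()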