Challenge 4 Solutions

challenge_4
solution
More data wrangling: mutate
Author

Sean Conway

Published

June 13, 2023

library(tidyverse)
library(here)
library(lubridate)
library(readxl)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations

Two tidyverse packages will be used heavily today: lubridate (which is not automatically loaded) and stringr (which is part of core tidyverse).

Tip

I use the here package to handle directories whenever I work in an R project. This helps simplify the file path issues that can arise when collaborating with others (for example, working on a course blog with others!).

The package (mainly) consists of one function: here(), which helps build a file path. When run with no arguments, the function always returns a string of the home directory for the R project the user is working in.

here()
[1] "/Users/seanconway/Github/DACSS_601_Summer2023_Sec1"

The user can then specify additional “sub-directories” as strings, separated by a comma.

here("posts","_data")
[1] "/Users/seanconway/Github/DACSS_601_Summer2023_Sec1/posts/_data"

This file path will be different for any user who checks out this R project, but any operations using the file path will still work (providing the appropriate file exists).

test_file_name <- here("posts","_data","cereal.csv")
test_file_name
[1] "/Users/seanconway/Github/DACSS_601_Summer2023_Sec1/posts/_data/cereal.csv"
file.exists(test_file_name)
[1] TRUE

While here is not the only solution for file paths, it’s one of several very suitable options.

abc_poll_orig<-here("posts","_data","abc_poll_2021.csv") %>% 
  read_csv()

# political questions
abc_poll_orig%>%
  select(starts_with("Q"))%>%
  colnames()
 [1] "Q1_a" "Q1_b" "Q1_c" "Q1_d" "Q1_e" "Q1_f" "Q2"   "Q3"   "Q4"   "Q5"  
[11] "QPID"
# all but one demographer
abc_poll_orig%>%
  select(starts_with("pp"))%>%
  colnames()
 [1] "ppage"    "ppeduc5"  "ppeducat" "ppgender" "ppethm"   "pphhsize"
 [7] "ppinc7"   "ppmarit5" "ppmsacat" "ppreg4"   "pprent"   "ppstaten"
[13] "PPWORKA"  "ppemploy"
# national poll
n_distinct(abc_poll_orig$ppstaten)
[1] 49

The ABC Poll appears to be a national sample survey (presumably from 2019) with 527 respondents. There are 10 political attitudes questions, plus party identification, in addition to 15 demographic variables (some with re-coded information) and 5 survey administration variables.

print(summarytools::dfSummary(abc_poll_orig,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

abc_poll_orig

Dimensions: 527 x 31
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
id [numeric]
Mean (sd) : 7230264 (152.3)
min ≤ med ≤ max:
7230001 ≤ 7230264 ≤ 7230527
IQR (CV) : 263 (0)
527 distinct values 0 (0.0%)
xspanish [character]
1. English
2. Spanish
514 ( 97.5% )
13 ( 2.5% )
0 (0.0%)
complete_status [character] 1. qualified
527 ( 100.0% )
0 (0.0%)
ppage [numeric]
Mean (sd) : 53.4 (17.1)
min ≤ med ≤ max:
18 ≤ 55 ≤ 91
IQR (CV) : 27 (0.3)
72 distinct values 0 (0.0%)
ppeduc5 [character]
1. NA
2. High school graduate (hig
3. NA
4. No high school diploma or
5. Some college or Associate
108 ( 20.5% )
133 ( 25.2% )
99 ( 18.8% )
29 ( 5.5% )
158 ( 30.0% )
0 (0.0%)
ppeducat [character]
1. Bachelors degree or highe
2. High school
3. Less than high school
4. Some college
207 ( 39.3% )
133 ( 25.2% )
29 ( 5.5% )
158 ( 30.0% )
0 (0.0%)
ppgender [character]
1. Female
2. Male
254 ( 48.2% )
273 ( 51.8% )
0 (0.0%)
ppethm [character]
1. 2+ Races, Non-Hispanic
2. Black, Non-Hispanic
3. Hispanic
4. Other, Non-Hispanic
5. White, Non-Hispanic
21 ( 4.0% )
27 ( 5.1% )
51 ( 9.7% )
24 ( 4.6% )
404 ( 76.7% )
0 (0.0%)
pphhsize [character]
1. 1
2. 2
3. 3
4. 4
5. 5
6. 6 or more
80 ( 15.2% )
219 ( 41.6% )
102 ( 19.4% )
76 ( 14.4% )
35 ( 6.6% )
15 ( 2.8% )
0 (0.0%)
ppinc7 [character]
1. $10,000 to $24,999
2. $100,000 to $149,999
3. $150,000 or more
4. $25,000 to $49,999
5. $50,000 to $74,999
6. $75,000 to $99,999
7. Less than $10,000
32 ( 6.1% )
105 ( 19.9% )
137 ( 26.0% )
82 ( 15.6% )
85 ( 16.1% )
69 ( 13.1% )
17 ( 3.2% )
0 (0.0%)
ppmarit5 [character]
1. Divorced
2. Never married
3. Now Married
4. Separated
5. Widowed
43 ( 8.2% )
111 ( 21.1% )
337 ( 63.9% )
8 ( 1.5% )
28 ( 5.3% )
0 (0.0%)
ppmsacat [character]
1. Metro area
2. Non-metro area
448 ( 85.0% )
79 ( 15.0% )
0 (0.0%)
ppreg4 [character]
1. MidWest
2. NorthEast
3. South
4. West
118 ( 22.4% )
93 ( 17.6% )
190 ( 36.1% )
126 ( 23.9% )
0 (0.0%)
pprent [character]
1. Occupied without payment
2. Owned or being bought by
3. Rented for cash
10 ( 1.9% )
406 ( 77.0% )
111 ( 21.1% )
0 (0.0%)
ppstaten [character]
1. California
2. Texas
3. Florida
4. Pennsylvania
5. Illinois
6. New Jersey
7. Ohio
8. Michigan
9. New York
10. Washington
[ 39 others ]
51 ( 9.7% )
42 ( 8.0% )
34 ( 6.5% )
28 ( 5.3% )
23 ( 4.4% )
21 ( 4.0% )
21 ( 4.0% )
18 ( 3.4% )
18 ( 3.4% )
18 ( 3.4% )
253 ( 48.0% )
0 (0.0%)
PPWORKA [character]
1. Currently laid off
2. Employed full-time (by so
3. Employed part-time (by so
4. Full Time Student
5. Homemaker
6. On furlough
7. Other
8. Retired
9. Self-employed
13 ( 2.5% )
220 ( 41.7% )
31 ( 5.9% )
8 ( 1.5% )
37 ( 7.0% )
1 ( 0.2% )
20 ( 3.8% )
165 ( 31.3% )
32 ( 6.1% )
0 (0.0%)
ppemploy [character]
1. Not working
2. Working full-time
3. Working part-time
221 ( 41.9% )
245 ( 46.5% )
61 ( 11.6% )
0 (0.0%)
Q1_a [character]
1. Approve
2. Disapprove
3. Skipped
329 ( 62.4% )
193 ( 36.6% )
5 ( 0.9% )
0 (0.0%)
Q1_b [character]
1. Approve
2. Disapprove
3. Skipped
192 ( 36.4% )
322 ( 61.1% )
13 ( 2.5% )
0 (0.0%)
Q1_c [character]
1. Approve
2. Disapprove
3. Skipped
272 ( 51.6% )
248 ( 47.1% )
7 ( 1.3% )
0 (0.0%)
Q1_d [character]
1. Approve
2. Disapprove
3. Skipped
192 ( 36.4% )
321 ( 60.9% )
14 ( 2.7% )
0 (0.0%)
Q1_e [character]
1. Approve
2. Disapprove
3. Skipped
212 ( 40.2% )
301 ( 57.1% )
14 ( 2.7% )
0 (0.0%)
Q1_f [character]
1. Approve
2. Disapprove
3. Skipped
281 ( 53.3% )
230 ( 43.6% )
16 ( 3.0% )
0 (0.0%)
Q2 [character]
1. Not concerned at all
2. Not so concerned
3. Somewhat concerned
4. Very concerned
65 ( 12.3% )
147 ( 27.9% )
221 ( 41.9% )
94 ( 17.8% )
0 (0.0%)
Q3 [character]
1. No
2. Skipped
3. Yes
107 ( 20.3% )
5 ( 0.9% )
415 ( 78.7% )
0 (0.0%)
Q4 [character]
1. Excellent
2. Good
3. Not so good
4. Poor
5. Skipped
60 ( 11.4% )
215 ( 40.8% )
97 ( 18.4% )
149 ( 28.3% )
6 ( 1.1% )
0 (0.0%)
Q5 [character]
1. Optimistic
2. Pessimistic
3. Skipped
229 ( 43.5% )
295 ( 56.0% )
3 ( 0.6% )
0 (0.0%)
QPID [character]
1. A Democrat
2. A Republican
3. An Independent
4. Skipped
5. Something else
176 ( 33.4% )
152 ( 28.8% )
168 ( 31.9% )
3 ( 0.6% )
28 ( 5.3% )
0 (0.0%)
ABCAGE [character]
1. 18-29
2. 30-49
3. 50-64
4. 65+
60 ( 11.4% )
148 ( 28.1% )
157 ( 29.8% )
162 ( 30.7% )
0 (0.0%)
Contact [character]
1. No, I am not willing to b
2. Yes, I am willing to be i
355 ( 67.4% )
172 ( 32.6% )
0 (0.0%)
weights_pid [numeric]
Mean (sd) : 1 (0.6)
min ≤ med ≤ max:
0.3 ≤ 0.8 ≤ 6.3
IQR (CV) : 0.5 (0.6)
453 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2023-06-14

There are lots of string variables that might need to be modified for analysis or visualization. For example, the party id variable has “A Democrat” not the more standard language. Plus, there is a response “skipped” that should be treated as missing data. Lets see if we can fix it.

#starting point
table(abc_poll_orig$QPID)

    A Democrat   A Republican An Independent        Skipped Something else 
           176            152            168              3             28 
#mutate
abc_poll<-abc_poll_orig%>%
  mutate(partyid = str_remove(QPID, "A[n]* "),
         partyid = case_when(
           str_detect(QPID, "Skipped")~NA_character_,
           TRUE~partyid
         )) %>%
  select(-QPID)

#sanity check
table(abc_poll$partyid, useNA = "ifany")

      Democrat    Independent     Republican Something else           <NA> 
           176            168            152             28              3 
unique(abc_poll$partyid)
[1] "Democrat"       "Independent"    "Something else" "Republican"    
[5] NA              

Ethnic Identity

The ethnic identity variable is long and could be tough to include in graphs, lets see if we can modify it - but we would need to include a table note to explain what the data labels mean (e.g., that racial labels mean non-hispanic, and that hispanic responses don’t indicate race.)

#starting point
table(abc_poll$ppethm)

2+ Races, Non-Hispanic    Black, Non-Hispanic               Hispanic 
                    21                     27                     51 
   Other, Non-Hispanic    White, Non-Hispanic 
                    24                    404 
#mutate
abc_poll<-abc_poll%>%
  mutate(ethnic = str_remove(ppethm, ", Non-Hispanic"))%>%
  select(-ppethm)

#sanity check
table(abc_poll$ethnic)

2+ Races    Black Hispanic    Other    White 
      21       27       51       24      404 

Removing “Skipped”

What about the political variables that all have “Skipped” - a value that should probably be replaced with NA for analysis. Lets use the across function to make this easier.

abc_poll_1 <- abc_poll%>%
  mutate(across(starts_with("Q"), ~ na_if(.x, "Skipped")))

# purrr - a bit advanced for this particular challenge
map(select(abc_poll, starts_with("Q1")), table, useNA="ifany")
$Q1_a

   Approve Disapprove    Skipped 
       329        193          5 

$Q1_b

   Approve Disapprove    Skipped 
       192        322         13 

$Q1_c

   Approve Disapprove    Skipped 
       272        248          7 

$Q1_d

   Approve Disapprove    Skipped 
       192        321         14 

$Q1_e

   Approve Disapprove    Skipped 
       212        301         14 

$Q1_f

   Approve Disapprove    Skipped 
       281        230         16 

Factor order

Finally, what if you would like the categories of your variable to appear in a specific order, like the education variable that is currently in alphabetical order?

factor()

The factor variable type links variable labels to an underlying numeric order, and allows you to maintain the specified order for tables and graphics. Character strings always appear in alphabetical order.

table(abc_poll$ppeducat)

Bachelors degree or higher                High school 
                       207                        133 
     Less than high school               Some college 
                        29                        158 
edulabs <- unique(abc_poll$ppeducat)
edulabs
[1] "High school"                "Bachelors degree or higher"
[3] "Some college"               "Less than high school"     
levs <- c("Less than high school",
          "High school",
          "Some college",
          "Bachelors degree or higher")

abc_poll_1<-abc_poll_1%>%
  mutate(educ = factor(ppeducat, 
                       levels=levs)) %>% #edulabs[c(4,1,3,2)]))%>%
  select(-ppeducat)
rm(edulabs)

table(abc_poll_1$educ)

     Less than high school                High school 
                        29                        133 
              Some college Bachelors degree or higher 
                       158                        207 

This section builds on the code available in the solution to Challenge 3, where we pivoted the organic eggs pricing data. The data reports the average price per carton paid to the farmer or producer for organic eggs (and organic chicken), reported monthly from 2004 to 2013. Average price is reported by carton type, which can vary in both size (x-large or large) and quantity (half-dozen or dozen.)

Read Data

We are reading in half of the data from this workbook - the other half contains information about the price of organic chicken.

eggs_orig<- here("posts","_data","organiceggpoultry.xls") %>%
  read_excel(sheet="Data",
             range = "B6:F125",
             col_names = c("date", 
                           "xlarge_dozen",
                           "xlarge_halfdozen",
                           "large_dozen",
                           "large_halfdozen")
  )

Clean and Mutate

We are going to be removing the note from the first column of the data, and splitting the year and month, and pivoting into long format prior to transforming the year and month columns into a date.

eggs<-eggs_orig%>%
  mutate(date = str_remove(date, " /1"))%>%
  separate(date, into=c("month", "year"), sep=" ")%>%
  fill(year) %>%
  pivot_longer(cols=contains("large"),
               names_to = c("size", "quantity"),
               names_sep="_",
               values_to = "price")

Now, we need to create a date from a month and year. I can see that the months are a mix of long month name and 3 character month (for January), and the years are four digit years. Do I need to adjust the string for month manually, or can lubridate fix things for me?

I’m going to combine the month with the now complete year column, and the parse the “month-year” format using my().

eggs<-eggs%>%
  mutate(date = str_c(month, year, sep=" "),
         date = my(date))

select(eggs, month, year, date)

Interesting - lubridate automatically fills in the first day of the month. Maybe we would prefer the last day, or even the middle of the month?

Note that we can’t easily use make_datetime() for this example, as we would then need to transform the irregular month names into numeric values.

eggs<-eggs%>%
  mutate(date = make_datetime(month, "15", year))

select(eggs, month, year, date)

The mdy() function works just fine, if we put all the information in the proper format, though.

eggs<-eggs%>%
  mutate(date = mdy(str_c(month,"15",year,sep="/")))

select(eggs, month, year, date)

This data set runs from July 1954 to March 2017, and includes daily macroeconomic indicators related to the effective federal funds rate - or the interest rate at which banks lend money to each other in order to meet mandated reserve requirements.

A single case is a year-month-day, and there are 7 values that can be pivoted or not depending on the needs of the analyst. 4 values are related to the federal funds rate: target, upper target, lower target, and effective), while 3 are related macroeconomic indicators (inflation, \(\bigtriangleup\) GDP, and unemployment rate.)

For now, lets just focus on mutating the date.

fed_rates_orig<-here("posts","_data","FedFundsRate.csv") %>%
  read_csv()

fed_rates_orig

Once again, it looks like we will need to combine the year, month and date using stringr::str_c(), then we can use lubridate to transform into a date. Alternatively, because both month and day are numeric variables, we can use make_datetime().

fed_rates<-fed_rates_orig%>%
  mutate(date = str_c(Year, Month, Day, sep="-"),
         date = ymd(date))

summary(fed_rates$date)
        Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
"1954-07-01" "1973-04-23" "1987-12-16" "1987-02-25" "2001-06-07" "2017-03-16" 
Going Further

You can now go through and figure out whether there are patterns in the missing-ness of specific indicators by date (maybe the values are only measured once a month or once a quarter, and we need to use fill(), or maybe there is something else going on?)

This data set contains 119,390 hotel bookings from two hotels (“City Hotel” and “Resort Hotel”) with an arrival date between July 2015 and August 2017 (more detail needed), including bookings that were later cancelled. See Solution Set 2 for additional details. The data are a de-identified extract of real hotel demand data, made available by the authors.

bookings_orig<- here("posts","_data","hotel_bookings.csv") %>%
  read_csv()

select(bookings_orig, starts_with("arrival"))

Last time we looked at these data, I went to pretty extraordinary lengths to confirm the dates covered by the data. Lets see how much easier that is if we set the date to a date type variable instead! Those are long variable names, thank goodness we can get rid of them. Note that we only need three pieces of information out of the four provided.

Look how I can mess around with the format, and lubridate still recovers the date!

bookings<-bookings_orig%>%
  mutate(date_arrival = str_c(arrival_date_day_of_month,
                              arrival_date_month,
                              arrival_date_year, sep="/"),
         date_arrival = dmy(date_arrival))%>%
  select(-starts_with("arrival"))

summary(bookings$date_arrival)
        Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
"2015-07-01" "2016-03-13" "2016-09-06" "2016-08-28" "2017-03-18" "2017-08-31" 

There are other relevant time variables in the data set that may be worth exploring. For example, we are given a lead time measure in days (integer), but we could recover a date with lubridate. This would allow us to more easily visually explore, for example, if some people were more likely to make bookings over the winter for summer trips, but in fall for winter trips - or some other seasonal pattern.

bookings<-bookings%>%
  mutate(date_booking = date_arrival-days(lead_time))

summary(bookings$date_booking)
        Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
"2013-06-24" "2015-11-28" "2016-05-04" "2016-05-16" "2016-12-09" "2017-08-31" 
bookings %>%
  select(date_arrival, date_booking, lead_time)

We can also go in the reverse order. So if we wanted to know how many days before a booking there was last a change in the reservation status, we can generate this by comparing arrival date to reservation status date.

summary(bookings$reservation_status_date)
        Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
"2014-10-17" "2016-02-01" "2016-08-07" "2016-07-30" "2017-02-08" "2017-09-14" 
bookings<-bookings%>%
  mutate(change_days = interval(reservation_status_date,
                                date_arrival),
         change_days = change_days %/% days(1))

summary(bookings$change_days)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -69.00   -3.00   -1.00   29.68   26.00  526.00 

This data set runs from the first quarter of 2003 to the second quarter of 2021, and includes quarterly measures of the total amount of household debt associated with 6 different types of loans - mortgage,HE revolving, auto, credit card, student, and other - plus a total household debt including all 6 loan types. This is another fantastic macroeconomic data product from the New York Federal Reserve. Detailed notes on the website reveal that the data are from an Equifax, and explain why data prior to 2003 is no longer part of the primary data publication.

debt_orig<-here("posts","_data","debt_in_trillions.xlsx") %>%
  read_excel()

debt_orig

A single case is a year-quarter, and there are 6 (or 7) values that can be pivoted or not depending on the needs of the analyst. The tricky part is figuring out how to tell R to treat the quarters as a date! We could take the long road and separate the year and quarter information, then fix the year to be numeric, recombine, etc. But lets use the more complex formats option of parse_date_time() plus a little regular expression style knowledge and read the information directly.

debt <- debt_orig%>%
  mutate(date = parse_date_time(`Year and Quarter`, 
                          orders="yq"))

summary(debt$date)
                      Min.                    1st Qu. 
"2003-01-01 00:00:00.0000" "2007-07-24 00:00:00.0000" 
                    Median                       Mean 
"2012-02-15 12:00:00.0000" "2012-02-15 06:09:43.7837" 
                   3rd Qu.                       Max. 
"2016-09-08 00:00:00.0000" "2021-04-01 00:00:00.0000" 

Wow, isn’t that super simple!