---
title: "Sarah McAlpine - Challenge 4"
author: "Sarah McAlpine"
description: "Mutating Variables"
date: "09/30/2022"
format:
html:
df-print: paged
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_4
- sarahmcalpine
- abc_poll
- mutate
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(summarytools)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
## Challenge Overview
Read in and describe a data set, tidy it, identify variables to be mutated, and mutate them.
::: panel-tabset
## ABC Poll
I will begin with the ABC poll.
### ABC Poll Data Overview
As a new R user, I will begin with the abc_poll.csv data and may return to do another. I discovered the source of this data is very likely to be the July 23-24, 2021 ABC News/Ipsos Poll given the total of 527 respondents and the alignment of answer types with the [annotated questionnaire](https://www.ipsos.com/sites/default/files/ct/news/documents/2021-07/Topline%20ABC_Ipsos%20Poll%20July%2024%202021%20V2.pdf). I will rename the numbered questions to clarify each topic.
I see the data is 527 rows by 31 columns, so I already anticipate needing to reduce columns if possible. Since this is poll data, one respondent defines a case, and I will consider the weighting methodology when reading the case notes. The data includes 5 administrative questions, 12 demographic/political groupings, 11 political questions, and 3 translated variables. Some of the administrative variables are not of analytical interest, so I will remove those upon read-in, along with all 3 of the translated variables that Ipsos used to group categories of responses, such as age groups.
On a philosophical note, this poll was conducted with an online tool, and even with the accommodations for respondents without their own devices and internet connections noted in the annotated questionnaire, responses may be biased toward people with greater resources. Additionally, while weighting attempts to better represent the national population, I will want to ensure there are enough responses within each group so that the effects of any outliers are minimized rather than magnified.
```{r}
#read in abc_poll.csv
poll_orig <- read_csv("_data/abc_poll_2021.csv")
#print(dfSummary(poll_orig,
# varnumbers = FALSE,
# plain.ascii = FALSE,
# style = "grid",
# graph.magnif = 0.70,
# valid.col = FALSE),
# method = 'render',
# table.classes = 'table-condensed')
# pp - seems to mean demog data, vs questions
# 1 admin 527 distinct IDs - should be character vector
# 2 admin Language: Polled in English or Spanish
# 3 admin DELETE 100% qualified - delete
# 4 demog age 18-91 (more specific than ABCAGE)
# 5 demog MUTATE ppeduc5 (keep over ppeducat) to edu0, edu1, edu2, edu3, edu4?
# 6 trans DELETE ppeducat because code reveals the more granular data: distinct(poll_orig, ppeduc5,ppeducat)
# 7 demog gender
# 8 demog ethnic mix - 5 options
# 9 demog household size 1-6+
# 10 demog MUTATE income levels 1-7
# 11 demog marriage status - 5 options
# 12 demog metro or non-metro
# 13 trans DELETE BC REDUNDANT? 4 regions, not the whole country
# 14 demog rent: unpaid, own, rent
# 15 demog state, nearly all 50 states repped, plus DC, and unevenly
# 16-17 demog COMPARE work status in ppemply and ppWORKA - possible mutate, wonder about any conflicting responses
# Q1 a-f: approve/disapprove/skip
# I wonder if skipped answers are better as NA or as skipped
# Q1 a rename: covid
# Q1 b rename: mexborder
# Q1 c rename: econrecovery
# Q1 d rename: guns
# Q1 e rename: crime
# Q1 f rename: troops
# Q2 rename: infection
# Q3 rename: vaxed
# Q4 rename: promises
# Q5 rename: nextyear
# 28 demog political ID: 5 options
# 29 trans DELETE ABC Age groupings - delete and keep granular age
# 30 admin Contact: willing to be interviewed yes or no
# 31 admin weights_pid - weighting explained in link
```
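The notes above flag a few columns as redundant. Below is a small sketch of the spot checks behind those notes, assuming the raw column names referenced above (`ppeduc5`, `ppeducat`, `ppemply`, `ppWORKA`) match the file:

```{r}
# spot-check redundancy: ppeducat should be a coarser grouping of ppeduc5
distinct(poll_orig, ppeduc5, ppeducat)
# compare the two work-status variables to look for conflicting answers
count(poll_orig, ppemply, ppWORKA)
```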
### Identify Variables to Mutate
I want to clean the data in the read-in chunk to prevent any later confusion. When I read in the data at first, I called it `poll_orig`, and below I assign it simply `poll`. In this step, I remove unnecessary columns (administrative or redundant), change the "Skipped" values to `NA`, change the wording of columns and values, and reorder the factors for more sensible visualization. I decided to add "q_" to the beginning of each poll question in case I later want to treat those columns in a specific way. See the code itself for detailed notes on each change.
```{r}
#read in, while removing header column and renaming columns
poll <- read_csv("_data/abc_poll_2021.csv",
skip = 1,
col_names = c("ID","delete", "delete", "age", "edu_lvl", "delete", "gender", "ethnicity", "hh_size", "income_lvl","marriage","metro","delete", "residence", "state", "work", "delete", "q_covid", "q_mexborder", "q_econrecovery","q_guns","q_crime", "q_troops", "q_infection", "q_vaxed","q_promises", "q_nextyear", "polparty", "delete", "delete", "weights"))%>%
# remove the "delete" columns
select(!starts_with("delete")) %>%
# change ID to character vector
mutate(across(("ID"), as.character)) %>%
# change "Skipped" responses to NA across the character columns
mutate(across(where(is.character), ~ na_if(.x, "Skipped"))) %>%
# clean up text of values for polparty and ethnicity
mutate(polparty = str_remove(polparty, "A[n]* ")) %>%
mutate(ethnicity = str_remove(ethnicity, ", Non-Hispanic")) %>%
# turn income levels into factors
mutate(income_lvl = fct_recode(income_lvl,
"income1" = "Less than $10,000",
"income2" = "$10,000 to $24,999",
"income3" = "$25,000 to $49,999",
"income4" = "$50,000 to $74,999",
"income5" = "$75,000 to $99,999",
"income6" = "$100,000 to $149,999",
"income7" = "$150,000 or more"
)) %>%
# reorder the income levels lowest to highest
mutate(income_lvl = fct_relevel(income_lvl, "income1", "income2", "income3", "income4", "income5", "income6", "income7")) %>%
# remove apostrophe from edu_lvl values
mutate(edu_lvl = str_remove(edu_lvl, "\x92")) %>%
# rename edu levels
mutate(edu_lvl = fct_recode(edu_lvl,
"no diploma" = "No high school diploma or GED",
"diploma or ged" = "High school graduate (high school diploma or the equivalent GED)",
"some college" = "Some college or Associate degree",
"bachelors" = "Bachelors degree",
"masters plus" = "Masters degree or above")) %>%
mutate(edu_lvl = fct_relevel(edu_lvl, "no diploma", "diploma or ged", "some college", "bachelors", "masters plus")) %>%
mutate(residence = fct_recode(residence,
"bought" = "Owned or being bought by you or someone in your household",
"rented" = "Rented for cash",
"unpaid" = "Occupied without payment of cash rent"))%>%
mutate(marriage = fct_infreq(marriage), work = fct_infreq(work), residence = fct_infreq(residence))
```
### Resulting Data
Below is the data frame summary of my new data, which is 527 cases by 24 columns. I also set the levels of sequential answers as logic would dictate (for example, education levels begin with the least schooling and build from there), and set the purely categorical variables to display in order of frequency. I was not sure whether to retain the IDs and weights, so I left them in for now.
```{r}
# Look at the tidied data
print(dfSummary(poll,
varnumbers = FALSE,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.70,
valid.col = FALSE),
method = 'render',
table.classes = 'table-condensed')
```
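As a quick sanity check on the releveling and the `q_` prefix, here is a sketch of the kind of checks I ran (not part of the summary above):

```{r}
# the ordered factors should run from lowest to highest level
levels(poll$income_lvl)
levels(poll$edu_lvl)
# the q_ prefix makes it easy to select every poll question at once
poll %>%
  select(starts_with("q_")) %>%
  colnames()
```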
## Fed Data
### Learning Lubridate
I wanted to try my hand at lubridate, so I am returning to read in the FedFundsRate.csv data. For this post, I will focus on the date columns.
```{r}
#read in FedFundsRate.csv
fed_orig <- read_csv("_data/FedFundsRate.csv")
#quick view of column names
colnames(fed_orig)
# find start and end years of data
max(fed_orig$Year)
min(fed_orig$Year)
# previewed the data frame summary, but no need to include in post.
# dfSummary(fed_orig)
```
### Summarize Data, Identify Variables to Mutate
The FedFundsRate.csv file has 10 columns and 904 rows and includes dates from 1954-2017. Year, month, and day are the first three columns--I will want to combine those into a single date field. The next columns are Federal Funds (FF) Target Rate, FF Upper Target, FF Lower Target, Effective FF Rate, Real GDP (Percent Change), Unemployment Rate, and Inflation Rate. These remaining columns already look fairly tidy, since the variables appear to be independent of one another, but I will rename them for ease of analysis, as sketched below.
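Here is a rough sketch of the renaming I have in mind; the shorter snake_case names are placeholders of my own, not final choices:

```{r}
# possible shorter names for the rate columns (placeholder names, not final)
fed_renamed <- fed_orig %>%
  rename(ff_target    = `Federal Funds Target Rate`,
         ff_upper     = `Federal Funds Upper Target`,
         ff_lower     = `Federal Funds Lower Target`,
         ff_effective = `Effective Federal Funds Rate`,
         gdp_change   = `Real GDP (Percent Change)`,
         unemployment = `Unemployment Rate`,
         inflation    = `Inflation Rate`)
colnames(fed_renamed)
```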
I will also need to handle the high occurrence of NA values. Taking a look at the data file, I can see that the Target, Upper Target, and Lower Target rates only appear for some years. I can also see that the first 28 years (1954-1981) include only the first days of the month, and that thereafter additional dates appear in a pattern that is neither monthly nor easy to discern. Inflation rates only appear from 1958 onward. A quick count of missing values per column, sketched below, makes the gaps easier to see.
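```{r}
# count missing values in each column of the original data
fed_orig %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
```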
### Defining an Observation
The unevenness of the data throughout the set makes defining a case somewhat complicated. Would I be better off with the tidiest, most complete data set, and therefore use only the first days of each month? Or should I retain as much as possible and expect to do some smoothing around the missing data? On one hand, I prefer to retain as much information as possible; on the other, without a research question in mind, I can take a broader view of the data, since it already covers 63 years and mid-month changes are perhaps not as meaningful as the broader trends. **For the purposes of this exercise, I will keep things simple and retain only the first days of each month.**
### Come back to this
Don't toss the fed rates data.
time series set up the dates carefully
select v filter
add table notes and table captions, table numbers in yaml header
### Cleaning Dates
In order for this to work, I should confirm that the first of each month is included in the data. As seen below, I expected 753 "firsts," and the data does in fact contain that many.
```{r}
# confirm first days of every month are in the data
# 12 months * 64 years - (6 earliest months in 1954 and 9 latest months in 2017) = 753
12*64 - 15
# so I will expect to have 753 distinct dates once I remove the non-first-days
firsts <- count(fed_orig, (Day==1))
firsts
```
Next, I will keep only the first-of-month rows, combine the year, month, and day columns into a single date with lubridate, and assign the result to `fed_clean`.
```{r}
# clean-up chunk with transformations
# lubridate is not attached by library(tidyverse) in tidyverse 1.x, so load it here
library(lubridate)
fed_clean <- fed_orig %>%
# filter keeps rows (select keeps columns), so filter is the right verb here
filter(Day == 1) %>%
# lubridate: combine Year, Month, and Day into a single date column
mutate(date = make_date(Year, Month, Day)) %>%
# drop the now-redundant component columns and put the date first
select(-Year, -Month, -Day) %>%
relocate(date)
# renaming the rate columns is left for a later step
fed_clean
```
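A quick check that the cleaned data has the 753 first-of-month rows expected above:

```{r}
# expect 753 rows, matching the first-of-month count above
nrow(fed_clean)
```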
:::