Challenge 4

Theresa_Szczepanski

challenge_4

abc_poll

Author

Theresa Szczepanski

Published

September 30, 2022

Code

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Approach

To address today’s challenge I tried to:

read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
Used summary on my data frame and examined the spreadsheet itself to describe the data
tidied the data (as needed, including sanity checks)
Identified variables to be mutated
Mutate all ordinal variables using factor
Create a Codebook key for all of my variables
Make note of my questions about my process and inefficienct coding practices.

abc_poll.csv ⭐

To read in the data, I used the following process:

Examine the summary
Identify information to filter out on the read in
Identify variables to rename on the read in
Identify variables to mutate on the read in to simplify values
Identify variable values to mutate on the read in to fix data type issues.

Examine the Summary
Filter, Rename, and Mutate on Read in

Code

#Read in the abc_poll data and use the summary to decide how to best set up
# the our data frame

 abc_poll<-read_csv("_data/abc_poll_2021.csv")
 print(summarytools::dfSummary(abc_poll,
                         varnumbers = FALSE,
                         plain.ascii  = FALSE,
                         style        = "grid",
                         graph.magnif = 0.70,
                        valid.col    = FALSE),
       method = 'render',
       table.classes = 'table-condensed')

Data Frame Summary

abc_poll

Dimensions: 527 x 31
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

id [numeric]

Mean (sd) : 7230264 (152.3)

min ≤ med ≤ max:

7230001 ≤ 7230264 ≤ 7230527

IQR (CV) : 263 (0)

527 distinct values

0 (0.0%)

xspanish [character]

1. English

2. Spanish

514	(	97.5%	)
13	(	2.5%	)

0 (0.0%)

complete_status [character]

1. qualified

527

(

100.0%

)

0 (0.0%)

ppage [numeric]

Mean (sd) : 53.4 (17.1)

min ≤ med ≤ max:

18 ≤ 55 ≤ 91

IQR (CV) : 27 (0.3)

72 distinct values

0 (0.0%)

ppeduc5 [character]

1. NA

2. NA

3. High school graduate (hig

4. No high school diploma or

5. Some college or Associate

99	(	18.8%	)
108	(	20.5%	)
133	(	25.2%	)
29	(	5.5%	)
158	(	30.0%	)

0 (0.0%)

ppeducat [character]

1. Bachelors degree or highe

2. High school

3. Less than high school

4. Some college

207	(	39.3%	)
133	(	25.2%	)
29	(	5.5%	)
158	(	30.0%	)

0 (0.0%)

ppgender [character]

1. Female

2. Male

254	(	48.2%	)
273	(	51.8%	)

0 (0.0%)

ppethm [character]

1. 2+ Races, Non-Hispanic

2. Black, Non-Hispanic

3. Hispanic

4. Other, Non-Hispanic

5. White, Non-Hispanic

21	(	4.0%	)
27	(	5.1%	)
51	(	9.7%	)
24	(	4.6%	)
404	(	76.7%	)

0 (0.0%)

pphhsize [character]

1. 1

2. 2

3. 3

4. 4

5. 5

6. 6 or more

80	(	15.2%	)
219	(	41.6%	)
102	(	19.4%	)
76	(	14.4%	)
35	(	6.6%	)
15	(	2.8%	)

0 (0.0%)

ppinc7 [character]

1. $10,000 to $24,999

2. $100,000 to $149,999

3. $150,000 or more

4. $25,000 to $49,999

5. $50,000 to $74,999

6. $75,000 to $99,999

7. Less than $10,000

32	(	6.1%	)
105	(	19.9%	)
137	(	26.0%	)
82	(	15.6%	)
85	(	16.1%	)
69	(	13.1%	)
17	(	3.2%	)

0 (0.0%)

ppmarit5 [character]

1. Divorced

2. Never married

3. Now Married

4. Separated

5. Widowed

43	(	8.2%	)
111	(	21.1%	)
337	(	63.9%	)
8	(	1.5%	)
28	(	5.3%	)

0 (0.0%)

ppmsacat [character]

1. Metro area

2. Non-metro area

448	(	85.0%	)
79	(	15.0%	)

0 (0.0%)

ppreg4 [character]

1. MidWest

2. NorthEast

3. South

4. West

118	(	22.4%	)
93	(	17.6%	)
190	(	36.1%	)
126	(	23.9%	)

0 (0.0%)

pprent [character]

1. Occupied without payment

2. Owned or being bought by

3. Rented for cash

10	(	1.9%	)
406	(	77.0%	)
111	(	21.1%	)

0 (0.0%)

ppstaten [character]

1. California

2. Texas

3. Florida

4. Pennsylvania

5. Illinois

6. New Jersey

7. Ohio

8. Michigan

9. New York

10. Washington

[ 39 others ]

51	(	9.7%	)
42	(	8.0%	)
34	(	6.5%	)
28	(	5.3%	)
23	(	4.4%	)
21	(	4.0%	)
21	(	4.0%	)
18	(	3.4%	)
18	(	3.4%	)
18	(	3.4%	)
253	(	48.0%	)

0 (0.0%)

PPWORKA [character]

1. Currently laid off

2. Employed full-time (by so

3. Employed part-time (by so

4. Full Time Student

5. Homemaker

6. On furlough

7. Other

8. Retired

9. Self-employed

13	(	2.5%	)
220	(	41.7%	)
31	(	5.9%	)
8	(	1.5%	)
37	(	7.0%	)
1	(	0.2%	)
20	(	3.8%	)
165	(	31.3%	)
32	(	6.1%	)

0 (0.0%)

ppemploy [character]

1. Not working

2. Working full-time

3. Working part-time

221	(	41.9%	)
245	(	46.5%	)
61	(	11.6%	)

0 (0.0%)

Q1_a [character]

1. Approve

2. Disapprove

3. Skipped

329	(	62.4%	)
193	(	36.6%	)
5	(	0.9%	)

0 (0.0%)

Q1_b [character]

1. Approve

2. Disapprove

3. Skipped

192	(	36.4%	)
322	(	61.1%	)
13	(	2.5%	)

0 (0.0%)

Q1_c [character]

1. Approve

2. Disapprove

3. Skipped

272	(	51.6%	)
248	(	47.1%	)
7	(	1.3%	)

0 (0.0%)

Q1_d [character]

1. Approve

2. Disapprove

3. Skipped

192	(	36.4%	)
321	(	60.9%	)
14	(	2.7%	)

0 (0.0%)

Q1_e [character]

1. Approve

2. Disapprove

3. Skipped

212	(	40.2%	)
301	(	57.1%	)
14	(	2.7%	)

0 (0.0%)

Q1_f [character]

1. Approve

2. Disapprove

3. Skipped

281	(	53.3%	)
230	(	43.6%	)
16	(	3.0%	)

0 (0.0%)

Q2 [character]

1. Not concerned at all

2. Not so concerned

3. Somewhat concerned

4. Very concerned

65	(	12.3%	)
147	(	27.9%	)
221	(	41.9%	)
94	(	17.8%	)

0 (0.0%)

Q3 [character]

1. No

2. Skipped

3. Yes

107	(	20.3%	)
5	(	0.9%	)
415	(	78.7%	)

0 (0.0%)

Q4 [character]

1. Excellent

2. Good

3. Not so good

4. Poor

5. Skipped

60	(	11.4%	)
215	(	40.8%	)
97	(	18.4%	)
149	(	28.3%	)
6	(	1.1%	)

0 (0.0%)

Q5 [character]

1. Optimistic

2. Pessimistic

3. Skipped

229	(	43.5%	)
295	(	56.0%	)
3	(	0.6%	)

0 (0.0%)

QPID [character]

1. A Democrat

2. A Republican

3. An Independent

4. Skipped

5. Something else

176	(	33.4%	)
152	(	28.8%	)
168	(	31.9%	)
3	(	0.6%	)
28	(	5.3%	)

0 (0.0%)

ABCAGE [character]

1. 18-29

2. 30-49

3. 50-64

4. 65+

60	(	11.4%	)
148	(	28.1%	)
157	(	29.8%	)
162	(	30.7%	)

0 (0.0%)

Contact [character]

1. No, I am not willing to b

2. Yes, I am willing to be i

355	(	67.4%	)
172	(	32.6%	)

0 (0.0%)

weights_pid [numeric]

Mean (sd) : 1 (0.6)

min ≤ med ≤ max:

0.3 ≤ 0.8 ≤ 6.3

IQR (CV) : 0.5 (0.6)

453 distinct values

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-21

After examining the summary, I chose to

Filter:

id: this info won’t be used in an analysis
complete_status: everyone was qualified
ppeducat: this categorizing of ppeduc5 can be done in the data frame using a case_when()
ABCAGE: this qualitative age range variable can be replicated by using the data in the ppage variable and a case_when; one might want to examine different ranges of ages.
weights_pid variable, since this is calculated using percentages of respondents relative to their representation to the general population and can be calculated using data within our data frame.

Rename

I renamed all of the variables corresponding to demographic characteristics of the poll participant to begin with pp_.
I renamed all of the variables corresponding to survey question responses from the participants to begin with Q
If a variable had a fixed number of possible responses (which I could see from the summary), e.g., pp_marital had 5 possible responses, I included the number of “categories” or possible responses in the variable name preceded by an underscore, pp_marital_5

Mutate

I replaced the pp_hhsize_6 value of “6 or more” with 6, so that it could be of double data type
I mutated the pp_educ5 column to remove the apostrophes from “Bachelor’s” and “Master’s” that were producing the “\x92”’s in the values on read in.
If a nominal variable had lengthy values, I reduced them to the key info using mutate, str_sub, and case_when

Code

#Filter, rename variables, and mutate values of variables on read-in

abc_poll<-read_csv("_data/abc_poll_2021.csv", skip = 1,  
                   col_names= c("delete",  "pp_Language_2",  "delete","pp_age", 
                                "pp_educ_5", "delete", "pp_gender_2", 
                                "pp_ethnicity_5", "pp_hhsize_6", "pp_inc_7", 
                                "pp_marital_5", "pp_metro_cat_2", "pp_region_4",
                                "pp_housing_3", "pp_state", 
                                "pp_working_arrangement_9", 
                                "pp_employment_status_3", "Q1a_3", "Q1b_3", 
                                "Q1c_3",  "Q1d_3","Q1e_3", "Q1f_3","Q2ConcernLevel_4",
                                "Q3_3", "Q4_5",  "Q5Optimism_3", 
                                "pp_political_id_5", "delete", "pp_contact_2",  
                                  "delete"))%>%
  select(!contains("delete"))%>%
  
  #replace "6 or more" in pp_hhsize_6 to the value of 6 so that the column can be
  # of double data type.
     mutate(pp_hhsize_6 = ifelse(pp_hhsize_6 == "6 or more", "6", pp_hhsize_6)) %>%
    transform( pp_hhsize_6 = as.numeric(pp_hhsize_6))%>%
  
  #fix the issue with apostrophes in pp_educ_5 values on read in
    mutate(pp_educ_5 = ifelse(str_starts(pp_educ_5,"Bachelor"), 
                           "Bachelor", pp_educ_5))%>%
    mutate(pp_educ_5 = ifelse(str_starts(pp_educ_5, "Master"), "Master", pp_educ_5))

  # reduce lengthy responses to necessary info in nominal variables

  abc_poll$pp_Language_2 = substr(abc_poll$pp_Language_2,1,2)
  #mutate(pp_Language_2 = (str_sub(abc_poll,pp_Language_2, 1, 2)))
  abc_poll$pp_gender_2 = substr(abc_poll$pp_gender_2,1,1)
  abc_poll$pp_contact_2 = substr(abc_poll$pp_contact_2,1,1)
  
  #reduce lengthy responses of nominal variables using Case When
  #pp_ethnicity_5
  abc_poll <-mutate(abc_poll, pp_ethnicity_5 = case_when(
    pp_ethnicity_5 == "2+ Races, Non-Hispanic" ~ "2+NH",
    pp_ethnicity_5 == "Black, Non-Hispanic" ~ "BlNH",
    pp_ethnicity_5 == "Hispanic" ~ "H",
    pp_ethnicity_5 == "Other, Non-Hispanic" ~ "OtNH",
    pp_ethnicity_5 == "White, Non-Hispanic" ~ "WhNH"

))
  
  #pp_metro_cat_2
  abc_poll <-mutate(abc_poll, pp_metro_cat_2 = case_when(
    pp_metro_cat_2 == "Metro area" ~ "M",
    pp_metro_cat_2 == "Non-metro area" ~ "NM"
))
 
  #pp_political_id_5 
 abc_poll <-mutate(abc_poll, pp_political_id_5  = case_when(
    pp_political_id_5 == "A Democrat" ~ "Dem",
    pp_political_id_5 == "A Republican" ~ "Rep",
    pp_political_id_5 == "An Independent" ~ "Ind",
    pp_political_id_5 == "Something else" ~ "Other",
    pp_political_id_5 == "Skipped" ~ "DNR"
))
 
 #pp_housing_3
abc_poll <-mutate(abc_poll, pp_housing_3 = case_when(
    pp_housing_3 == "Occupied without payment of cash rent" ~ "NonP_Occupied",
    pp_housing_3 == "Rented for cash"~ "P_Rent",
    pp_housing_3 == "Owned or being bought by you or someone in your household" ~ "P_Own"))

  #pp_region_4 
 abc_poll <-mutate(abc_poll, pp_region_4  = case_when(
    pp_region_4 == "MidWest" ~ "MW",
    pp_region_4 == "NorthEast" ~ "NE",
    pp_region_4 == "South" ~ "S",
    pp_region_4 == "West" ~ "W",
    
))
  #pp_marital_5 
 abc_poll <-mutate(abc_poll, pp_marital_5 = case_when(
   pp_marital_5 == "Never married" ~ "NM",
   pp_marital_5 == "Now Married" ~ "M",
   pp_marital_5 == "Separated" ~ "S",
   pp_marital_5 == "Divorced" ~ "D",
   pp_marital_5 == "Widowed" ~ "W"))
 
# pp_working_arrangement_9
 abc_poll <-mutate(abc_poll, pp_working_arrangement_9 = case_when(
          pp_working_arrangement_9 == "Other" ~ "Other",
          pp_working_arrangement_9 =="Retired" ~ "Retired",
          pp_working_arrangement_9 == "Homemaker" ~ "Homemaker",
          pp_working_arrangement_9 == "Student" ~ "Student",
          pp_working_arrangement_9 == "Currently laid off" ~ "Laid Off",
          pp_working_arrangement_9 == "On furlough"~ "Furlough",
          pp_working_arrangement_9 == "Employed part-time (by someone else)" ~ "Emp_PT",
          pp_working_arrangement_9 =="Self-employed" ~ "Emp_Self",
          pp_working_arrangement_9 == "Employed full-time (by someone else)"~ "Emp_FT"))
 

  #Q3_3 What is the best "coding for variables that are like "Booleans"?
 # abc_poll <-mutate(abc_poll, Q3_3 = case_when(
 #                                   Q3_3 ==  "Yes" ~ 1,
 #                                   Q3_3 ==  "No" ~ 0,
 #                                     Q3_3 ==  "Skipped" ~ 1))
 
 #Q5Optimism_3
 # abc_poll <-mutate(abc_poll, Q5Optimism_3 = case_when(
 #                                    Q5Optimism_3 == "Pessimistic"~ 0,
 #                                    Q5Optimism_3 == "Optimistic" ~ 1,
 #                                    Q5Optimism_3 == "Skipped" ~ -1))
 
 
  
  abc_poll

Question

Is there a way to not be writing a mutate line for each variable the way I did on the read in?

Broad Summary
Post Read in Variable Summary

From our abc_poll data frame summary, we can see that this data set contains polling results from 527 respondents to an ABC news political poll. The results consist of information for two broad categories

Demographic characteristics of the respondents themselves (e.g., language of the poll given to the respondent (Spanish or English), age, educational attainment, ethnicity, household size, ethnic make up, gender, income range, Marital status, Metro category, Geographic region, Rental status, State, Employment status, Working characteristics, Willingness to have a follow up interview)
The responses that the individuals gave to 10 questions (there are 5 broad questions Q1-Q5, but Q1 consists of 6 sub questions, a-f).

Now when we examine our summary, we can see that

each categorical variable is of character data type with the number of distinct categories included in the variable name
some of these categorical variables are ordinal and will need the ordering of their values coded in as factors.
each variable has 527 observations partitioned among the possible values
each discrete numerical variable is of double data type.

Code

print(summarytools::dfSummary(abc_poll,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE,
                        style        = "grid",
                        graph.magnif = 0.70,
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

abc_poll

Dimensions: 527 x 26
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

pp_Language_2 [character]

1. En

2. Sp

514	(	97.5%	)
13	(	2.5%	)

0 (0.0%)

pp_age [numeric]

Mean (sd) : 53.4 (17.1)

min ≤ med ≤ max:

18 ≤ 55 ≤ 91

IQR (CV) : 27 (0.3)

72 distinct values

0 (0.0%)

pp_educ_5 [character]

1. Bachelor

2. High school graduate (hig

3. Master

4. No high school diploma or

5. Some college or Associate

108	(	20.5%	)
133	(	25.2%	)
99	(	18.8%	)
29	(	5.5%	)
158	(	30.0%	)

0 (0.0%)

pp_gender_2 [character]

1. F

2. M

254	(	48.2%	)
273	(	51.8%	)

0 (0.0%)

pp_ethnicity_5 [character]

1. 2+NH

2. BlNH

3. H

4. OtNH

5. WhNH

21	(	4.0%	)
27	(	5.1%	)
51	(	9.7%	)
24	(	4.6%	)
404	(	76.7%	)

0 (0.0%)

pp_hhsize_6 [numeric]

Mean (sd) : 2.6 (1.3)

min ≤ med ≤ max:

1 ≤ 2 ≤ 6

IQR (CV) : 1 (0.5)

1	:	80	(	15.2%	)
2	:	219	(	41.6%	)
3	:	102	(	19.4%	)
4	:	76	(	14.4%	)
5	:	35	(	6.6%	)
6	:	15	(	2.8%	)

0 (0.0%)

pp_inc_7 [character]

1. $10,000 to $24,999

2. $100,000 to $149,999

3. $150,000 or more

4. $25,000 to $49,999

5. $50,000 to $74,999

6. $75,000 to $99,999

7. Less than $10,000

32	(	6.1%	)
105	(	19.9%	)
137	(	26.0%	)
82	(	15.6%	)
85	(	16.1%	)
69	(	13.1%	)
17	(	3.2%	)

0 (0.0%)

pp_marital_5 [character]

1. D

2. M

3. NM

4. S

5. W

43	(	8.2%	)
337	(	63.9%	)
111	(	21.1%	)
8	(	1.5%	)
28	(	5.3%	)

0 (0.0%)

pp_metro_cat_2 [character]

1. M

2. NM

448	(	85.0%	)
79	(	15.0%	)

0 (0.0%)

pp_region_4 [character]

1. MW

2. NE

3. S

4. W

118	(	22.4%	)
93	(	17.6%	)
190	(	36.1%	)
126	(	23.9%	)

0 (0.0%)

pp_housing_3 [character]

1. NonP_Occupied

2. P_Own

3. P_Rent

10	(	1.9%	)
406	(	77.0%	)
111	(	21.1%	)

0 (0.0%)

pp_state [character]

1. California

2. Texas

3. Florida

4. Pennsylvania

5. Illinois

6. New Jersey

7. Ohio

8. Michigan

9. New York

10. Washington

[ 39 others ]

51	(	9.7%	)
42	(	8.0%	)
34	(	6.5%	)
28	(	5.3%	)
23	(	4.4%	)
21	(	4.0%	)
21	(	4.0%	)
18	(	3.4%	)
18	(	3.4%	)
18	(	3.4%	)
253	(	48.0%	)

0 (0.0%)

pp_working_arrangement_9 [character]

1. Emp_FT

2. Emp_PT

3. Emp_Self

4. Furlough

5. Homemaker

6. Laid Off

7. Other

8. Retired

220	(	42.4%	)
31	(	6.0%	)
32	(	6.2%	)
1	(	0.2%	)
37	(	7.1%	)
13	(	2.5%	)
20	(	3.9%	)
165	(	31.8%	)

8 (1.5%)

pp_employment_status_3 [character]

1. Not working

2. Working full-time

3. Working part-time

221	(	41.9%	)
245	(	46.5%	)
61	(	11.6%	)

0 (0.0%)

Q1a_3 [character]

1. Approve

2. Disapprove

3. Skipped

329	(	62.4%	)
193	(	36.6%	)
5	(	0.9%	)

0 (0.0%)

Q1b_3 [character]

1. Approve

2. Disapprove

3. Skipped

192	(	36.4%	)
322	(	61.1%	)
13	(	2.5%	)

0 (0.0%)

Q1c_3 [character]

1. Approve

2. Disapprove

3. Skipped

272	(	51.6%	)
248	(	47.1%	)
7	(	1.3%	)

0 (0.0%)

Q1d_3 [character]

1. Approve

2. Disapprove

3. Skipped

192	(	36.4%	)
321	(	60.9%	)
14	(	2.7%	)

0 (0.0%)

Q1e_3 [character]

1. Approve

2. Disapprove

3. Skipped

212	(	40.2%	)
301	(	57.1%	)
14	(	2.7%	)

0 (0.0%)

Q1f_3 [character]

1. Approve

2. Disapprove

3. Skipped

281	(	53.3%	)
230	(	43.6%	)
16	(	3.0%	)

0 (0.0%)

Q2ConcernLevel_4 [character]

1. Not concerned at all

2. Not so concerned

3. Somewhat concerned

4. Very concerned

65	(	12.3%	)
147	(	27.9%	)
221	(	41.9%	)
94	(	17.8%	)

0 (0.0%)

Q3_3 [character]

1. No

2. Skipped

3. Yes

107	(	20.3%	)
5	(	0.9%	)
415	(	78.7%	)

0 (0.0%)

Q4_5 [character]

1. Excellent

2. Good

3. Not so good

4. Poor

5. Skipped

60	(	11.4%	)
215	(	40.8%	)
97	(	18.4%	)
149	(	28.3%	)
6	(	1.1%	)

0 (0.0%)

Q5Optimism_3 [character]

1. Optimistic

2. Pessimistic

3. Skipped

229	(	43.5%	)
295	(	56.0%	)
3	(	0.6%	)

0 (0.0%)

pp_political_id_5 [character]

1. Dem

2. DNR

3. Ind

4. Other

5. Rep

176	(	33.4%	)
3	(	0.6%	)
168	(	31.9%	)
28	(	5.3%	)
152	(	28.8%	)

0 (0.0%)

pp_contact_2 [character]

1. N

2. Y

355	(	67.4%	)
172	(	32.6%	)

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-21

In order to have tidy data, each row should be a unique observation. A unique case therefore should consist of all of the demographic information about the polled person and their response for one of the questions.

The demographic characteristics (our pp_ variables) and the specific Question variable define a case.
The value for each case is the poll participants Response.

Code

abc_poll<-abc_poll %>%
    pivot_longer(c(starts_with("Q1")), names_to = "Question 1 part", values_to = "Q1 Response")

abc_poll

Questions

When examining the data, I noticed nominal, ordinal, discrete, and continuous variables.
Does arrange sort your ordinal data by the factors or use something else?
Should you ever pivot variables of different levels of measurement into the same column? In the abc_poll case, I could imagine pivoting all of the questions names into a column and all of the responses into a response column. But some of the questions had response values that were clearly ordinal and others that were not. Is there a “best practice” for this?

When examining our variables there are two issues to address

Some of the variable are ordinal and will need the ordering coded in
Some of the ordinal variable values are very long and fill up too much space in our table making it unpleasant to read through

Factoring of Ordinal Variables
Codebook for ABC Variables

There are several ordinal variables where we need to code in the ordering of the categories.

Code

# use factoring to put an ordering to all ordinal variables
                         
             
# Why couldn't I use a levels vector!                 
 # pp_inc_7_levels <- c(
 #    "Less than $10,000", "$10,000 to $24,999",  "$25,000 to $49,999",  
 #    "$50,000 to $74,999", "$75,000 to $99,999", "$100,000 to $149,999",  
 #    "150,000 or more")

  
 abc_poll <-mutate(abc_poll, pp_inc_7 = recode_factor(pp_inc_7, 
                                   "Less than $10,000" = "I1", 
                                   "$10,000 to $24,999" = "I2",  
                                   "$25,000 to $49,999" = "I3", 
                                   "$50,000 to $74,999"= "I4", 
                                   "$75,000 to $99,999"= "I5", 
                                   "$100,000 to $149,999" = "I6",
                                   "$150,000 or more" = "I7",
                                  .ordered = TRUE))
 #pp_educ_5
 
 abc_poll <-mutate(abc_poll, pp_educ_5 = recode_factor(pp_educ_5,
        "No high school diploma or GED" = "E1",
        "High school graduate (high school diploma or the equivalent GED)"
                                     = "E2",
  "Some college or Associate degree" = "E3",
                          "Bachelor"= "E4",
                            "Master"= "E5",
                          .ordered = TRUE))
 

#pp_employment_status_3
 abc_poll <-mutate(abc_poll, pp_employment_status_3 = recode_factor(pp_employment_status_3,
                                     "Not working" = "ES1",
                                 "Working part-time"= "ES2",
                               "Working full-time" = "ES3",
                                     .ordered = TRUE))
 
 
 

 
 
 ######## I know that Q2 concern level Q4 have "ordinal" responses,
 ###Can I order a subset of the "Response" column?
 ### Should I order these variables before the pivot? (can a column be of mixed type?)
 ###Would it be better to have Nominal Questions Separated from Ordinal Questions?
 #Q2ConcernLevel_4
 ##I used the code below when I only pivoted the parts of Q1 under a Q1 column and
 # left the other questions as individual columns
 
 # abc_poll <-mutate(abc_poll, Q2ConcernLevel_4 = 
 #                     recode_factor(Q2ConcernLevel_4 ,
 #                                     "Not concerned at all" = "C0",
 #                                     "Not so concerned" = "C1",
 #                                     "Somewhat concerned" = "C2",
 #                                      "Very Concerned" = "C3",
 #                                     .ordered = TRUE))
 # 


#Q4_5
# abc_poll <-mutate(abc_poll, Q4_5 =
#                     recode_factor(Q4_5 ,
#                                     "Poor" = 0,
#                                     "Not so good" = 1,
#                                    "Good" = 2,
#                                   "Excellent" = 3,
#                                     "Skipped" = -1,
#                                     .ordered = TRUE))
#  

 abc_poll

Code

 ##Is the data frame arranged "alphabetically" or "ordinally?"
 abc_poll%>%
  arrange(desc(pp_educ_5))

We can see that all of our ordinal variables now are of type ord, our nominal variables are of type char, and our discrete variables are of type double.

Questions

What other tips are there for making smart names for variables based on their level of measurement?
What mutations should be done on the read in and what should be saved for post read in?

Here I would complete a key for all of the variables that are included in my table. Is there a better template for this?

pp_educ_5, ordinal: The reported educational attainment of the respondent.

value	Key
No high school diploma or GED	E1
high school diploma or the equivalent GED	E2
Some college or Associate degree	E3
Bachelor	E4
Masteror Higher	E5

pp_inc_7, ordinal: the reported annual income level of the respondent.

value	Key
Less than $10,000	I1
$10,000 to $24,999	I2
$25,000 to $49,999	I3
$50,000 to $74,999	I4
$75,000 to $99,999	I5
$100,000 to $149,999	I6
$150,000 or more	I7

Questions

Is this an ok template for a codebook?