Challenge 4 Solutions

challenge_4

abc_poll

Author

Caitlin Rowley

Published

October 18, 2022

Code

install.packages("summarytools")

Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror

Code

library(tidyverse)
library(lubridate)
library(readxl)
library(summarytools)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
tidy data (as needed, including sanity checks)
identify variables that need to be mutated
mutate variables and sanity check all mutations

Code

# read in data:

ABC <- read_csv("_data//abc_poll_2021.csv")

Briefly describe the data

Code

# id column names
colnames(ABC)

 [1] "id"              "xspanish"        "complete_status" "ppage"          
 [5] "ppeduc5"         "ppeducat"        "ppgender"        "ppethm"         
 [9] "pphhsize"        "ppinc7"          "ppmarit5"        "ppmsacat"       
[13] "ppreg4"          "pprent"          "ppstaten"        "PPWORKA"        
[17] "ppemploy"        "Q1_a"            "Q1_b"            "Q1_c"           
[21] "Q1_d"            "Q1_e"            "Q1_f"            "Q2"             
[25] "Q3"              "Q4"              "Q5"              "QPID"           
[29] "ABCAGE"          "Contact"         "weights_pid"

Code

# break down question/variable types:

ABC %>%
select(starts_with("pp"))%>%
colnames(.)

 [1] "ppage"    "ppeduc5"  "ppeducat" "ppgender" "ppethm"   "pphhsize"
 [7] "ppinc7"   "ppmarit5" "ppmsacat" "ppreg4"   "pprent"   "ppstaten"
[13] "PPWORKA"  "ppemploy"

Code

# add 'ABCage' and 'xspanish' to demo questions


ABC %>%
select(starts_with("Q"))%>%
  colnames(.)

 [1] "Q1_a" "Q1_b" "Q1_c" "Q1_d" "Q1_e" "Q1_f" "Q2"   "Q3"   "Q4"   "Q5"  
[11] "QPID"

Code

# add 'contact' to survey questions

# find duplicates:
unique(ABC)

# A tibble: 527 × 31
        id xspanish comple…¹ ppage ppeduc5 ppedu…² ppgen…³ ppethm pphhs…⁴ ppinc7
     <dbl> <chr>    <chr>    <dbl> <chr>   <chr>   <chr>   <chr>  <chr>   <chr> 
 1 7230001 English  qualifi…    68 "High … High s… Female  White… 2       $25,0…
 2 7230002 English  qualifi…    85 "Bache… Bachel… Male    White… 2       $150,…
 3 7230003 English  qualifi…    69 "High … High s… Male    White… 2       $100,…
 4 7230004 English  qualifi…    74 "Bache… Bachel… Female  White… 1       $25,0…
 5 7230005 English  qualifi…    77 "High … High s… Male    White… 3       $10,0…
 6 7230006 English  qualifi…    70 "Bache… Bachel… Male    White… 2       $75,0…
 7 7230007 English  qualifi…    26 "Maste… Bachel… Male    Other… 3       $150,…
 8 7230008 English  qualifi…    76 "Bache… Bachel… Male    Black… 2       $50,0…
 9 7230009 English  qualifi…    78 "Bache… Bachel… Female  White… 2       $150,…
10 7230010 English  qualifi…    47 "Maste… Bachel… Male    Other… 4       $150,…
# … with 517 more rows, 21 more variables: ppmarit5 <chr>, ppmsacat <chr>,
#   ppreg4 <chr>, pprent <chr>, ppstaten <chr>, PPWORKA <chr>, ppemploy <chr>,
#   Q1_a <chr>, Q1_b <chr>, Q1_c <chr>, Q1_d <chr>, Q1_e <chr>, Q1_f <chr>,
#   Q2 <chr>, Q3 <chr>, Q4 <chr>, Q5 <chr>, QPID <chr>, ABCAGE <chr>,
#   Contact <chr>, weights_pid <dbl>, and abbreviated variable names
#   ¹complete_status, ²ppeducat, ³ppgender, ⁴pphhsize

Code

# find missing values:
missing <- is.na(ABC)
missing["TRUE"]

[1] NA

I am using data from the abc_poll_2021 data set. This data set contains information related to a national survey conducted in 2021. There are 31 variables and 527 rows of data, with each row representing a unique observation. Of these 31 variables, it appears that 16 capture participant-level information, 12 capture information related to survey questions, and 3 capture information related to survey administration.

Code

#continue cleaning:

# remove 'ABCAGE', 'ppemploy' 'ppeducat'
ABC_clean <- subset(ABC, select = -c(ABCAGE, ppemploy, ppeducat))
print(ABC_clean)

# A tibble: 527 × 28
        id xspanish comple…¹ ppage ppeduc5 ppgen…² ppethm pphhs…³ ppinc7 ppmar…⁴
     <dbl> <chr>    <chr>    <dbl> <chr>   <chr>   <chr>  <chr>   <chr>  <chr>  
 1 7230001 English  qualifi…    68 "High … Female  White… 2       $25,0… Now Ma…
 2 7230002 English  qualifi…    85 "Bache… Male    White… 2       $150,… Now Ma…
 3 7230003 English  qualifi…    69 "High … Male    White… 2       $100,… Now Ma…
 4 7230004 English  qualifi…    74 "Bache… Female  White… 1       $25,0… Divorc…
 5 7230005 English  qualifi…    77 "High … Male    White… 3       $10,0… Now Ma…
 6 7230006 English  qualifi…    70 "Bache… Male    White… 2       $75,0… Now Ma…
 7 7230007 English  qualifi…    26 "Maste… Male    Other… 3       $150,… Never …
 8 7230008 English  qualifi…    76 "Bache… Male    Black… 2       $50,0… Now Ma…
 9 7230009 English  qualifi…    78 "Bache… Female  White… 2       $150,… Now Ma…
10 7230010 English  qualifi…    47 "Maste… Male    Other… 4       $150,… Now Ma…
# … with 517 more rows, 18 more variables: ppmsacat <chr>, ppreg4 <chr>,
#   pprent <chr>, ppstaten <chr>, PPWORKA <chr>, Q1_a <chr>, Q1_b <chr>,
#   Q1_c <chr>, Q1_d <chr>, Q1_e <chr>, Q1_f <chr>, Q2 <chr>, Q3 <chr>,
#   Q4 <chr>, Q5 <chr>, QPID <chr>, Contact <chr>, weights_pid <dbl>, and
#   abbreviated variable names ¹complete_status, ²ppgender, ³pphhsize,
#   ⁴ppmarit5

Code

# print data frame summary:
print(summarytools::dfSummary(ABC_clean,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.75, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

ABC_clean

Dimensions: 527 x 28
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

id [numeric]

Mean (sd) : 7230264 (152.3)

min ≤ med ≤ max:

7230001 ≤ 7230264 ≤ 7230527

IQR (CV) : 263 (0)

527 distinct values

0 (0.0%)

xspanish [character]

1. English

2. Spanish

514	(	97.5%	)
13	(	2.5%	)

0 (0.0%)

complete_status [character]

1. qualified

527

(

100.0%

)

0 (0.0%)

ppage [numeric]

Mean (sd) : 53.4 (17.1)

min ≤ med ≤ max:

18 ≤ 55 ≤ 91

IQR (CV) : 27 (0.3)

72 distinct values

0 (0.0%)

ppeduc5 [character]

1. NA

2. NA

3. High school graduate (hig

4. No high school diploma or

5. Some college or Associate

99	(	18.8%	)
108	(	20.5%	)
133	(	25.2%	)
29	(	5.5%	)
158	(	30.0%	)

0 (0.0%)

ppgender [character]

1. Female

2. Male

254	(	48.2%	)
273	(	51.8%	)

0 (0.0%)

ppethm [character]

1. 2+ Races, Non-Hispanic

2. Black, Non-Hispanic

3. Hispanic

4. Other, Non-Hispanic

5. White, Non-Hispanic

21	(	4.0%	)
27	(	5.1%	)
51	(	9.7%	)
24	(	4.6%	)
404	(	76.7%	)

0 (0.0%)

pphhsize [character]

1. 1

2. 2

3. 3

4. 4

5. 5

6. 6 or more

80	(	15.2%	)
219	(	41.6%	)
102	(	19.4%	)
76	(	14.4%	)
35	(	6.6%	)
15	(	2.8%	)

0 (0.0%)

ppinc7 [character]

1. $10,000 to $24,999

2. $100,000 to $149,999

3. $150,000 or more

4. $25,000 to $49,999

5. $50,000 to $74,999

6. $75,000 to $99,999

7. Less than $10,000

32	(	6.1%	)
105	(	19.9%	)
137	(	26.0%	)
82	(	15.6%	)
85	(	16.1%	)
69	(	13.1%	)
17	(	3.2%	)

0 (0.0%)

ppmarit5 [character]

1. Divorced

2. Never married

3. Now Married

4. Separated

5. Widowed

43	(	8.2%	)
111	(	21.1%	)
337	(	63.9%	)
8	(	1.5%	)
28	(	5.3%	)

0 (0.0%)

ppmsacat [character]

1. Metro area

2. Non-metro area

448	(	85.0%	)
79	(	15.0%	)

0 (0.0%)

ppreg4 [character]

1. MidWest

2. NorthEast

3. South

4. West

118	(	22.4%	)
93	(	17.6%	)
190	(	36.1%	)
126	(	23.9%	)

0 (0.0%)

pprent [character]

1. Occupied without payment

2. Owned or being bought by

3. Rented for cash

10	(	1.9%	)
406	(	77.0%	)
111	(	21.1%	)

0 (0.0%)

ppstaten [character]

1. California

2. Texas

3. Florida

4. Pennsylvania

5. Illinois

6. New Jersey

7. Ohio

8. Michigan

9. New York

10. Washington

[ 39 others ]

51	(	9.7%	)
42	(	8.0%	)
34	(	6.5%	)
28	(	5.3%	)
23	(	4.4%	)
21	(	4.0%	)
21	(	4.0%	)
18	(	3.4%	)
18	(	3.4%	)
18	(	3.4%	)
253	(	48.0%	)

0 (0.0%)

PPWORKA [character]

1. Currently laid off

2. Employed full-time (by so

3. Employed part-time (by so

4. Full Time Student

5. Homemaker

6. On furlough

7. Other

8. Retired

9. Self-employed

13	(	2.5%	)
220	(	41.7%	)
31	(	5.9%	)
8	(	1.5%	)
37	(	7.0%	)
1	(	0.2%	)
20	(	3.8%	)
165	(	31.3%	)
32	(	6.1%	)

0 (0.0%)

Q1_a [character]

1. Approve

2. Disapprove

3. Skipped

329	(	62.4%	)
193	(	36.6%	)
5	(	0.9%	)

0 (0.0%)

Q1_b [character]

1. Approve

2. Disapprove

3. Skipped

192	(	36.4%	)
322	(	61.1%	)
13	(	2.5%	)

0 (0.0%)

Q1_c [character]

1. Approve

2. Disapprove

3. Skipped

272	(	51.6%	)
248	(	47.1%	)
7	(	1.3%	)

0 (0.0%)

Q1_d [character]

1. Approve

2. Disapprove

3. Skipped

192	(	36.4%	)
321	(	60.9%	)
14	(	2.7%	)

0 (0.0%)

Q1_e [character]

1. Approve

2. Disapprove

3. Skipped

212	(	40.2%	)
301	(	57.1%	)
14	(	2.7%	)

0 (0.0%)

Q1_f [character]

1. Approve

2. Disapprove

3. Skipped

281	(	53.3%	)
230	(	43.6%	)
16	(	3.0%	)

0 (0.0%)

Q2 [character]

1. Not concerned at all

2. Not so concerned

3. Somewhat concerned

4. Very concerned

65	(	12.3%	)
147	(	27.9%	)
221	(	41.9%	)
94	(	17.8%	)

0 (0.0%)

Q3 [character]

1. No

2. Skipped

3. Yes

107	(	20.3%	)
5	(	0.9%	)
415	(	78.7%	)

0 (0.0%)

Q4 [character]

1. Excellent

2. Good

3. Not so good

4. Poor

5. Skipped

60	(	11.4%	)
215	(	40.8%	)
97	(	18.4%	)
149	(	28.3%	)
6	(	1.1%	)

0 (0.0%)

Q5 [character]

1. Optimistic

2. Pessimistic

3. Skipped

229	(	43.5%	)
295	(	56.0%	)
3	(	0.6%	)

0 (0.0%)

QPID [character]

1. A Democrat

2. A Republican

3. An Independent

4. Skipped

5. Something else

176	(	33.4%	)
152	(	28.8%	)
168	(	31.9%	)
3	(	0.6%	)
28	(	5.3%	)

0 (0.0%)

Contact [character]

1. No, I am not willing to b

2. Yes, I am willing to be i

355	(	67.4%	)
172	(	32.6%	)

0 (0.0%)

weights_pid [numeric]

Mean (sd) : 1 (0.6)

min ≤ med ≤ max:

0.3 ≤ 0.8 ≤ 6.3

IQR (CV) : 0.5 (0.6)

453 distinct values

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-20

There were several columns whose values were captured more precisely in re-coded variables. For example, the variable ‘ABCAGE’ captured respondents’ age ranges, while ‘ppage’ captured respondents’ exact ages. So, I removed columns (‘ABCAGE’, ‘ppemploy’, and ‘ppeducat’) that had been re-coded to more accurately capture respondent demographic data. My clean data set now has 28 columns (variables) and 527 rows of data.

We can see from the summary data frame that there are some variables whose text can be simplified. We can also see that the content questions offer a ‘skip’ option, so we should see if we can mutate the values in those columns so that ‘skipped’ questions equate to missing values.

Identify variables that need to be mutated

Remove articles from variable ‘QPID’ (respondent party ID):

Code

# remove articles from QPID (party ID):

ABC_mutate <- ABC_clean%>%
  mutate(partyid = str_remove(QPID, "A[n]*"))%>%
  select(-QPID)

# check to ensure articles have been removed:
table(ABC_mutate$partyid)


      Democrat    Independent     Republican        Skipped Something else 
           176            168            152              3             28

Replace ‘skipped’ values with missing values:

Code

# mutate so 'skipped' = missing value

ABC_mutate <- ABC_clean%>%
  mutate(across(starts_with("Q"), ~ na_if(.x, "Skipped")))

# apply mutation across variables starting with 'Q' and check to ensure 'skipped' has been replaced with a missing value:
map(select(ABC_mutate, starts_with("Q")), table)

$Q1_a

   Approve Disapprove 
       329        193 

$Q1_b

   Approve Disapprove 
       192        322 

$Q1_c

   Approve Disapprove 
       272        248 

$Q1_d

   Approve Disapprove 
       192        321 

$Q1_e

   Approve Disapprove 
       212        301 

$Q1_f

   Approve Disapprove 
       281        230 

$Q2

Not concerned at all     Not so concerned   Somewhat concerned 
                  65                  147                  221 
      Very concerned 
                  94 

$Q3

 No Yes 
107 415 

$Q4

  Excellent        Good Not so good        Poor 
         60         215          97         149 

$Q5

 Optimistic Pessimistic 
        229         295 

$QPID

    A Democrat   A Republican An Independent Something else 
           176            152            168             28

Clean up “ppethm” column values:

Code

# find unique values in "ppethm" column:

unique(ABC_mutate["ppethm"])

# A tibble: 5 × 1
  ppethm                
  <chr>                 
1 White, Non-Hispanic   
2 Other, Non-Hispanic   
3 Black, Non-Hispanic   
4 Hispanic              
5 2+ Races, Non-Hispanic

Code

# mutate so that values are more succinct:

ABC_mutate <- ABC_mutate%>%
  mutate(ethnicity = str_remove(ppethm, ", Non-Hispanic"))%>%
  select(-ppethm)

# check new values:

select(ABC_mutate, "ethnicity")

# A tibble: 527 × 1
   ethnicity
   <chr>    
 1 White    
 2 White    
 3 White    
 4 White    
 5 White    
 6 White    
 7 Other    
 8 Black    
 9 White    
10 Other    
# … with 517 more rows

Rename remaining variables:

Code

ABC_mutate_clean <- rename(ABC_mutate,
                           "resp_id" = "id",
                           "resp_language" = "xspanish", 
                           "complete_status" = "complete_status", 
                           "resp_age" = "ppage", 
                           "resp_education" = "ppeduc5", 
                           "resp_gender" = "ppgender", 
                           "resp_household_size" = "pphhsize", 
                           "resp_income" = "ppinc7", 
                           "resp_marital_status" = "ppmarit5", 
                           "resp_metro" = "ppmsacat", 
                           "resp_region" = "ppreg4", 
                           "resp_rent/own" = "pprent", 
                           "resp_state" = "ppstaten", 
                           "resp_employment" = "PPWORKA", 
                           "resp_party_ID" = "QPID", 
                           "resp_interview" = "Contact", 
                           "resp_party_ID_weight" = "weights_pid",)

colnames(ABC_mutate_clean)

 [1] "resp_id"              "resp_language"        "complete_status"     
 [4] "resp_age"             "resp_education"       "resp_gender"         
 [7] "resp_household_size"  "resp_income"          "resp_marital_status" 
[10] "resp_metro"           "resp_region"          "resp_rent/own"       
[13] "resp_state"           "resp_employment"      "Q1_a"                
[16] "Q1_b"                 "Q1_c"                 "Q1_d"                
[19] "Q1_e"                 "Q1_f"                 "Q2"                  
[22] "Q3"                   "Q4"                   "Q5"                  
[25] "resp_party_ID"        "resp_interview"       "resp_party_ID_weight"
[28] "ethnicity"

Try factor order:

Code

# identify values in "resp_education" column:

table(ABC_mutate_clean$resp_education)


                                     Master\x92s degree or above 
                                                              99 
                                            Bachelor\x92s degree 
                                                             108 
High school graduate (high school diploma or the equivalent GED) 
                                                             133 
                                   No high school diploma or GED 
                                                              29 
                                Some college or Associate degree 
                                                             158

Code

# identify unique values:

resp_edu_order <- unique(ABC_mutate_clean$resp_education)
resp_edu_order

[1] "High school graduate (high school diploma or the equivalent GED)"
[2] "Bachelor\x92s degree"                                            
[3] "Master\x92s degree or above"                                     
[4] "Some college or Associate degree"                                
[5] "No high school diploma or GED"

Code

# mutate and factor so that respondent education becomes ordinal (no high school, high school, some college, etc.):

ABC_mutate_clean <- ABC_mutate_clean%>%
  mutate(resp_edu = factor(resp_education, 
                       levels=resp_edu_order[c(5,1,4,2,3)]))%>%
  select(-resp_education)
rm(resp_edu_order)

table(ABC_mutate_clean$resp_edu)


                                   No high school diploma or GED 
                                                              29 
High school graduate (high school diploma or the equivalent GED) 
                                                             133 
                                Some college or Associate degree 
                                                             158 
                                            Bachelor\x92s degree 
                                                             108 
                                     Master\x92s degree or above 
                                                              99