DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 4 Solutions

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Challenge Overview
    • Briefly describe the data
  • Identify variables that need to be mutated

Challenge 4 Solutions

  • Show All Code
  • Hide All Code

  • View Source
challenge_4
abc_poll
Author

Caitlin Rowley

Published

October 18, 2022

Code
install.packages("summarytools")
Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
Code
library(tidyverse)
library(lubridate)
library(readxl)
library(summarytools)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations
Code
# read in data:

ABC <- read_csv("_data//abc_poll_2021.csv")

Briefly describe the data

Code
# id column names
colnames(ABC)
 [1] "id"              "xspanish"        "complete_status" "ppage"          
 [5] "ppeduc5"         "ppeducat"        "ppgender"        "ppethm"         
 [9] "pphhsize"        "ppinc7"          "ppmarit5"        "ppmsacat"       
[13] "ppreg4"          "pprent"          "ppstaten"        "PPWORKA"        
[17] "ppemploy"        "Q1_a"            "Q1_b"            "Q1_c"           
[21] "Q1_d"            "Q1_e"            "Q1_f"            "Q2"             
[25] "Q3"              "Q4"              "Q5"              "QPID"           
[29] "ABCAGE"          "Contact"         "weights_pid"    
Code
# break down question/variable types:

ABC %>%
select(starts_with("pp"))%>%
colnames(.)
 [1] "ppage"    "ppeduc5"  "ppeducat" "ppgender" "ppethm"   "pphhsize"
 [7] "ppinc7"   "ppmarit5" "ppmsacat" "ppreg4"   "pprent"   "ppstaten"
[13] "PPWORKA"  "ppemploy"
Code
# add 'ABCage' and 'xspanish' to demo questions


ABC %>%
select(starts_with("Q"))%>%
  colnames(.)
 [1] "Q1_a" "Q1_b" "Q1_c" "Q1_d" "Q1_e" "Q1_f" "Q2"   "Q3"   "Q4"   "Q5"  
[11] "QPID"
Code
# add 'contact' to survey questions

# find duplicates:
unique(ABC)
# A tibble: 527 × 31
        id xspanish comple…¹ ppage ppeduc5 ppedu…² ppgen…³ ppethm pphhs…⁴ ppinc7
     <dbl> <chr>    <chr>    <dbl> <chr>   <chr>   <chr>   <chr>  <chr>   <chr> 
 1 7230001 English  qualifi…    68 "High … High s… Female  White… 2       $25,0…
 2 7230002 English  qualifi…    85 "Bache… Bachel… Male    White… 2       $150,…
 3 7230003 English  qualifi…    69 "High … High s… Male    White… 2       $100,…
 4 7230004 English  qualifi…    74 "Bache… Bachel… Female  White… 1       $25,0…
 5 7230005 English  qualifi…    77 "High … High s… Male    White… 3       $10,0…
 6 7230006 English  qualifi…    70 "Bache… Bachel… Male    White… 2       $75,0…
 7 7230007 English  qualifi…    26 "Maste… Bachel… Male    Other… 3       $150,…
 8 7230008 English  qualifi…    76 "Bache… Bachel… Male    Black… 2       $50,0…
 9 7230009 English  qualifi…    78 "Bache… Bachel… Female  White… 2       $150,…
10 7230010 English  qualifi…    47 "Maste… Bachel… Male    Other… 4       $150,…
# … with 517 more rows, 21 more variables: ppmarit5 <chr>, ppmsacat <chr>,
#   ppreg4 <chr>, pprent <chr>, ppstaten <chr>, PPWORKA <chr>, ppemploy <chr>,
#   Q1_a <chr>, Q1_b <chr>, Q1_c <chr>, Q1_d <chr>, Q1_e <chr>, Q1_f <chr>,
#   Q2 <chr>, Q3 <chr>, Q4 <chr>, Q5 <chr>, QPID <chr>, ABCAGE <chr>,
#   Contact <chr>, weights_pid <dbl>, and abbreviated variable names
#   ¹​complete_status, ²​ppeducat, ³​ppgender, ⁴​pphhsize
Code
# find missing values:
missing <- is.na(ABC)
missing["TRUE"]
[1] NA

I am using data from the abc_poll_2021 data set. This data set contains information related to a national survey conducted in 2021. There are 31 variables and 527 rows of data, with each row representing a unique observation. Of these 31 variables, it appears that 16 capture participant-level information, 12 capture information related to survey questions, and 3 capture information related to survey administration.

Code
#continue cleaning:

# remove 'ABCAGE', 'ppemploy' 'ppeducat'
ABC_clean <- subset(ABC, select = -c(ABCAGE, ppemploy, ppeducat))
print(ABC_clean)
# A tibble: 527 × 28
        id xspanish comple…¹ ppage ppeduc5 ppgen…² ppethm pphhs…³ ppinc7 ppmar…⁴
     <dbl> <chr>    <chr>    <dbl> <chr>   <chr>   <chr>  <chr>   <chr>  <chr>  
 1 7230001 English  qualifi…    68 "High … Female  White… 2       $25,0… Now Ma…
 2 7230002 English  qualifi…    85 "Bache… Male    White… 2       $150,… Now Ma…
 3 7230003 English  qualifi…    69 "High … Male    White… 2       $100,… Now Ma…
 4 7230004 English  qualifi…    74 "Bache… Female  White… 1       $25,0… Divorc…
 5 7230005 English  qualifi…    77 "High … Male    White… 3       $10,0… Now Ma…
 6 7230006 English  qualifi…    70 "Bache… Male    White… 2       $75,0… Now Ma…
 7 7230007 English  qualifi…    26 "Maste… Male    Other… 3       $150,… Never …
 8 7230008 English  qualifi…    76 "Bache… Male    Black… 2       $50,0… Now Ma…
 9 7230009 English  qualifi…    78 "Bache… Female  White… 2       $150,… Now Ma…
10 7230010 English  qualifi…    47 "Maste… Male    Other… 4       $150,… Now Ma…
# … with 517 more rows, 18 more variables: ppmsacat <chr>, ppreg4 <chr>,
#   pprent <chr>, ppstaten <chr>, PPWORKA <chr>, Q1_a <chr>, Q1_b <chr>,
#   Q1_c <chr>, Q1_d <chr>, Q1_e <chr>, Q1_f <chr>, Q2 <chr>, Q3 <chr>,
#   Q4 <chr>, Q5 <chr>, QPID <chr>, Contact <chr>, weights_pid <dbl>, and
#   abbreviated variable names ¹​complete_status, ²​ppgender, ³​pphhsize,
#   ⁴​ppmarit5
Code
# print data frame summary:
print(summarytools::dfSummary(ABC_clean,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.75, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

ABC_clean

Dimensions: 527 x 28
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
id [numeric]
Mean (sd) : 7230264 (152.3)
min ≤ med ≤ max:
7230001 ≤ 7230264 ≤ 7230527
IQR (CV) : 263 (0)
527 distinct values 0 (0.0%)
xspanish [character]
1. English
2. Spanish
514(97.5%)
13(2.5%)
0 (0.0%)
complete_status [character] 1. qualified
527(100.0%)
0 (0.0%)
ppage [numeric]
Mean (sd) : 53.4 (17.1)
min ≤ med ≤ max:
18 ≤ 55 ≤ 91
IQR (CV) : 27 (0.3)
72 distinct values 0 (0.0%)
ppeduc5 [character]
1. NA
2. NA
3. High school graduate (hig
4. No high school diploma or
5. Some college or Associate
99(18.8%)
108(20.5%)
133(25.2%)
29(5.5%)
158(30.0%)
0 (0.0%)
ppgender [character]
1. Female
2. Male
254(48.2%)
273(51.8%)
0 (0.0%)
ppethm [character]
1. 2+ Races, Non-Hispanic
2. Black, Non-Hispanic
3. Hispanic
4. Other, Non-Hispanic
5. White, Non-Hispanic
21(4.0%)
27(5.1%)
51(9.7%)
24(4.6%)
404(76.7%)
0 (0.0%)
pphhsize [character]
1. 1
2. 2
3. 3
4. 4
5. 5
6. 6 or more
80(15.2%)
219(41.6%)
102(19.4%)
76(14.4%)
35(6.6%)
15(2.8%)
0 (0.0%)
ppinc7 [character]
1. $10,000 to $24,999
2. $100,000 to $149,999
3. $150,000 or more
4. $25,000 to $49,999
5. $50,000 to $74,999
6. $75,000 to $99,999
7. Less than $10,000
32(6.1%)
105(19.9%)
137(26.0%)
82(15.6%)
85(16.1%)
69(13.1%)
17(3.2%)
0 (0.0%)
ppmarit5 [character]
1. Divorced
2. Never married
3. Now Married
4. Separated
5. Widowed
43(8.2%)
111(21.1%)
337(63.9%)
8(1.5%)
28(5.3%)
0 (0.0%)
ppmsacat [character]
1. Metro area
2. Non-metro area
448(85.0%)
79(15.0%)
0 (0.0%)
ppreg4 [character]
1. MidWest
2. NorthEast
3. South
4. West
118(22.4%)
93(17.6%)
190(36.1%)
126(23.9%)
0 (0.0%)
pprent [character]
1. Occupied without payment
2. Owned or being bought by
3. Rented for cash
10(1.9%)
406(77.0%)
111(21.1%)
0 (0.0%)
ppstaten [character]
1. California
2. Texas
3. Florida
4. Pennsylvania
5. Illinois
6. New Jersey
7. Ohio
8. Michigan
9. New York
10. Washington
[ 39 others ]
51(9.7%)
42(8.0%)
34(6.5%)
28(5.3%)
23(4.4%)
21(4.0%)
21(4.0%)
18(3.4%)
18(3.4%)
18(3.4%)
253(48.0%)
0 (0.0%)
PPWORKA [character]
1. Currently laid off
2. Employed full-time (by so
3. Employed part-time (by so
4. Full Time Student
5. Homemaker
6. On furlough
7. Other
8. Retired
9. Self-employed
13(2.5%)
220(41.7%)
31(5.9%)
8(1.5%)
37(7.0%)
1(0.2%)
20(3.8%)
165(31.3%)
32(6.1%)
0 (0.0%)
Q1_a [character]
1. Approve
2. Disapprove
3. Skipped
329(62.4%)
193(36.6%)
5(0.9%)
0 (0.0%)
Q1_b [character]
1. Approve
2. Disapprove
3. Skipped
192(36.4%)
322(61.1%)
13(2.5%)
0 (0.0%)
Q1_c [character]
1. Approve
2. Disapprove
3. Skipped
272(51.6%)
248(47.1%)
7(1.3%)
0 (0.0%)
Q1_d [character]
1. Approve
2. Disapprove
3. Skipped
192(36.4%)
321(60.9%)
14(2.7%)
0 (0.0%)
Q1_e [character]
1. Approve
2. Disapprove
3. Skipped
212(40.2%)
301(57.1%)
14(2.7%)
0 (0.0%)
Q1_f [character]
1. Approve
2. Disapprove
3. Skipped
281(53.3%)
230(43.6%)
16(3.0%)
0 (0.0%)
Q2 [character]
1. Not concerned at all
2. Not so concerned
3. Somewhat concerned
4. Very concerned
65(12.3%)
147(27.9%)
221(41.9%)
94(17.8%)
0 (0.0%)
Q3 [character]
1. No
2. Skipped
3. Yes
107(20.3%)
5(0.9%)
415(78.7%)
0 (0.0%)
Q4 [character]
1. Excellent
2. Good
3. Not so good
4. Poor
5. Skipped
60(11.4%)
215(40.8%)
97(18.4%)
149(28.3%)
6(1.1%)
0 (0.0%)
Q5 [character]
1. Optimistic
2. Pessimistic
3. Skipped
229(43.5%)
295(56.0%)
3(0.6%)
0 (0.0%)
QPID [character]
1. A Democrat
2. A Republican
3. An Independent
4. Skipped
5. Something else
176(33.4%)
152(28.8%)
168(31.9%)
3(0.6%)
28(5.3%)
0 (0.0%)
Contact [character]
1. No, I am not willing to b
2. Yes, I am willing to be i
355(67.4%)
172(32.6%)
0 (0.0%)
weights_pid [numeric]
Mean (sd) : 1 (0.6)
min ≤ med ≤ max:
0.3 ≤ 0.8 ≤ 6.3
IQR (CV) : 0.5 (0.6)
453 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-20

There were several columns whose values were captured more precisely in re-coded variables. For example, the variable ‘ABCAGE’ captured respondents’ age ranges, while ‘ppage’ captured respondents’ exact ages. So, I removed columns (‘ABCAGE’, ‘ppemploy’, and ‘ppeducat’) that had been re-coded to more accurately capture respondent demographic data. My clean data set now has 28 columns (variables) and 527 rows of data.

We can see from the summary data frame that there are some variables whose text can be simplified. We can also see that the content questions offer a ‘skip’ option, so we should see if we can mutate the values in those columns so that ‘skipped’ questions equate to missing values.

Identify variables that need to be mutated

Remove articles from variable ‘QPID’ (respondent party ID):

Code
# remove articles from QPID (party ID):

ABC_mutate <- ABC_clean%>%
  mutate(partyid = str_remove(QPID, "A[n]*"))%>%
  select(-QPID)

# check to ensure articles have been removed:
table(ABC_mutate$partyid)

      Democrat    Independent     Republican        Skipped Something else 
           176            168            152              3             28 

Replace ‘skipped’ values with missing values:

Code
# mutate so 'skipped' = missing value

ABC_mutate <- ABC_clean%>%
  mutate(across(starts_with("Q"), ~ na_if(.x, "Skipped")))

# apply mutation across variables starting with 'Q' and check to ensure 'skipped' has been replaced with a missing value:
map(select(ABC_mutate, starts_with("Q")), table)
$Q1_a

   Approve Disapprove 
       329        193 

$Q1_b

   Approve Disapprove 
       192        322 

$Q1_c

   Approve Disapprove 
       272        248 

$Q1_d

   Approve Disapprove 
       192        321 

$Q1_e

   Approve Disapprove 
       212        301 

$Q1_f

   Approve Disapprove 
       281        230 

$Q2

Not concerned at all     Not so concerned   Somewhat concerned 
                  65                  147                  221 
      Very concerned 
                  94 

$Q3

 No Yes 
107 415 

$Q4

  Excellent        Good Not so good        Poor 
         60         215          97         149 

$Q5

 Optimistic Pessimistic 
        229         295 

$QPID

    A Democrat   A Republican An Independent Something else 
           176            152            168             28 

Clean up “ppethm” column values:

Code
# find unique values in "ppethm" column:

unique(ABC_mutate["ppethm"])
# A tibble: 5 × 1
  ppethm                
  <chr>                 
1 White, Non-Hispanic   
2 Other, Non-Hispanic   
3 Black, Non-Hispanic   
4 Hispanic              
5 2+ Races, Non-Hispanic
Code
# mutate so that values are more succinct:

ABC_mutate <- ABC_mutate%>%
  mutate(ethnicity = str_remove(ppethm, ", Non-Hispanic"))%>%
  select(-ppethm)

# check new values:

select(ABC_mutate, "ethnicity")
# A tibble: 527 × 1
   ethnicity
   <chr>    
 1 White    
 2 White    
 3 White    
 4 White    
 5 White    
 6 White    
 7 Other    
 8 Black    
 9 White    
10 Other    
# … with 517 more rows

Rename remaining variables:

Code
ABC_mutate_clean <- rename(ABC_mutate,
                           "resp_id" = "id",
                           "resp_language" = "xspanish", 
                           "complete_status" = "complete_status", 
                           "resp_age" = "ppage", 
                           "resp_education" = "ppeduc5", 
                           "resp_gender" = "ppgender", 
                           "resp_household_size" = "pphhsize", 
                           "resp_income" = "ppinc7", 
                           "resp_marital_status" = "ppmarit5", 
                           "resp_metro" = "ppmsacat", 
                           "resp_region" = "ppreg4", 
                           "resp_rent/own" = "pprent", 
                           "resp_state" = "ppstaten", 
                           "resp_employment" = "PPWORKA", 
                           "resp_party_ID" = "QPID", 
                           "resp_interview" = "Contact", 
                           "resp_party_ID_weight" = "weights_pid",)

colnames(ABC_mutate_clean)
 [1] "resp_id"              "resp_language"        "complete_status"     
 [4] "resp_age"             "resp_education"       "resp_gender"         
 [7] "resp_household_size"  "resp_income"          "resp_marital_status" 
[10] "resp_metro"           "resp_region"          "resp_rent/own"       
[13] "resp_state"           "resp_employment"      "Q1_a"                
[16] "Q1_b"                 "Q1_c"                 "Q1_d"                
[19] "Q1_e"                 "Q1_f"                 "Q2"                  
[22] "Q3"                   "Q4"                   "Q5"                  
[25] "resp_party_ID"        "resp_interview"       "resp_party_ID_weight"
[28] "ethnicity"           

Try factor order:

Code
# identify values in "resp_education" column:

table(ABC_mutate_clean$resp_education)

                                     Master\x92s degree or above 
                                                              99 
                                            Bachelor\x92s degree 
                                                             108 
High school graduate (high school diploma or the equivalent GED) 
                                                             133 
                                   No high school diploma or GED 
                                                              29 
                                Some college or Associate degree 
                                                             158 
Code
# identify unique values:

resp_edu_order <- unique(ABC_mutate_clean$resp_education)
resp_edu_order
[1] "High school graduate (high school diploma or the equivalent GED)"
[2] "Bachelor\x92s degree"                                            
[3] "Master\x92s degree or above"                                     
[4] "Some college or Associate degree"                                
[5] "No high school diploma or GED"                                   
Code
# mutate and factor so that respondent education becomes ordinal (no high school, high school, some college, etc.):

ABC_mutate_clean <- ABC_mutate_clean%>%
  mutate(resp_edu = factor(resp_education, 
                       levels=resp_edu_order[c(5,1,4,2,3)]))%>%
  select(-resp_education)
rm(resp_edu_order)

table(ABC_mutate_clean$resp_edu)

                                   No high school diploma or GED 
                                                              29 
High school graduate (high school diploma or the equivalent GED) 
                                                             133 
                                Some college or Associate degree 
                                                             158 
                                            Bachelor\x92s degree 
                                                             108 
                                     Master\x92s degree or above 
                                                              99 
Source Code
---
title: "Challenge 4 Solutions"
author: "Caitlin Rowley"
desription: "More data wrangling: pivoting"
date: "10/18/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_4
  - abc_poll
---

```{r}
#| label: setup
#| warning: false
#| message: false

install.packages("summarytools")

library(tidyverse)
library(lubridate)
library(readxl)
library(summarytools)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

Today's challenge is to:

1)  read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2)  tidy data (as needed, including sanity checks)
3)  identify variables that need to be mutated
4)  mutate variables and sanity check all mutations

```{r}
# read in data:

ABC <- read_csv("_data//abc_poll_2021.csv")
```

### Briefly describe the data

```{r}
# id column names
colnames(ABC)

# break down question/variable types:

ABC %>%
select(starts_with("pp"))%>%
colnames(.)
# add 'ABCage' and 'xspanish' to demo questions


ABC %>%
select(starts_with("Q"))%>%
  colnames(.)
# add 'contact' to survey questions

# find duplicates:
unique(ABC)

# find missing values:
missing <- is.na(ABC)
missing["TRUE"]

```

I am using data from the abc_poll_2021 data set. This data set contains information related to a national survey conducted in 2021. There are 31 variables and 527 rows of data, with each row representing a unique observation. Of these 31 variables, it appears that 16 capture participant-level information, 12 capture information related to survey questions, and 3 capture information related to survey administration.

```{r}
#continue cleaning:

# remove 'ABCAGE', 'ppemploy' 'ppeducat'
ABC_clean <- subset(ABC, select = -c(ABCAGE, ppemploy, ppeducat))
print(ABC_clean)

# print data frame summary:
print(summarytools::dfSummary(ABC_clean,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.75, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

```

There were several columns whose values were captured more precisely in re-coded variables. For example, the variable 'ABCAGE' captured respondents' age ranges, while 'ppage' captured respondents' exact ages. So, I removed columns ('ABCAGE', 'ppemploy', and 'ppeducat') that had been re-coded to more accurately capture respondent demographic data. My clean data set now has 28 columns (variables) and 527 rows of data.

We can see from the summary data frame that there are some variables whose text can be simplified. We can also see that the content questions offer a 'skip' option, so we should see if we can mutate the values in those columns so that 'skipped' questions equate to missing values.

## Identify variables that need to be mutated

Remove articles from variable 'QPID' (respondent party ID):

```{r}
# remove articles from QPID (party ID):

ABC_mutate <- ABC_clean%>%
  mutate(partyid = str_remove(QPID, "A[n]*"))%>%
  select(-QPID)

# check to ensure articles have been removed:
table(ABC_mutate$partyid)

```

Replace 'skipped' values with missing values:

```{r}
# mutate so 'skipped' = missing value

ABC_mutate <- ABC_clean%>%
  mutate(across(starts_with("Q"), ~ na_if(.x, "Skipped")))

# apply mutation across variables starting with 'Q' and check to ensure 'skipped' has been replaced with a missing value:
map(select(ABC_mutate, starts_with("Q")), table)


```

Clean up "ppethm" column values:

```{r}
# find unique values in "ppethm" column:

unique(ABC_mutate["ppethm"])

# mutate so that values are more succinct:

ABC_mutate <- ABC_mutate%>%
  mutate(ethnicity = str_remove(ppethm, ", Non-Hispanic"))%>%
  select(-ppethm)

# check new values:

select(ABC_mutate, "ethnicity")
```

Rename remaining variables:

```{r}
ABC_mutate_clean <- rename(ABC_mutate,
                           "resp_id" = "id",
                           "resp_language" = "xspanish", 
                           "complete_status" = "complete_status", 
                           "resp_age" = "ppage", 
                           "resp_education" = "ppeduc5", 
                           "resp_gender" = "ppgender", 
                           "resp_household_size" = "pphhsize", 
                           "resp_income" = "ppinc7", 
                           "resp_marital_status" = "ppmarit5", 
                           "resp_metro" = "ppmsacat", 
                           "resp_region" = "ppreg4", 
                           "resp_rent/own" = "pprent", 
                           "resp_state" = "ppstaten", 
                           "resp_employment" = "PPWORKA", 
                           "resp_party_ID" = "QPID", 
                           "resp_interview" = "Contact", 
                           "resp_party_ID_weight" = "weights_pid",)

colnames(ABC_mutate_clean)
```

Try factor order:

```{r}
# identify values in "resp_education" column:

table(ABC_mutate_clean$resp_education)

# identify unique values:

resp_edu_order <- unique(ABC_mutate_clean$resp_education)
resp_edu_order

# mutate and factor so that respondent education becomes ordinal (no high school, high school, some college, etc.):

ABC_mutate_clean <- ABC_mutate_clean%>%
  mutate(resp_edu = factor(resp_education, 
                       levels=resp_edu_order[c(5,1,4,2,3)]))%>%
  select(-resp_education)
rm(resp_edu_order)

table(ABC_mutate_clean$resp_edu)

```