challenge_4
abc_poll
More data wrangling: pivoting
Author

Hunter Major

Published

June 14, 2023

Code
library(tidyverse)
library(dplyr)
library(readr)
library(readxl)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • For challenge 4, I will be reading in abc_poll_2021.csv ⭐
Code
abc_poll <- read.csv("_data/abc_poll_2021.csv")
abc_poll
Code
dim(abc_poll)
[1] 527  31
Code
summary(abc_poll)
       id            xspanish         complete_status        ppage      
 Min.   :7230001   Length:527         Length:527         Min.   :18.00  
 1st Qu.:7230132   Class :character   Class :character   1st Qu.:40.00  
 Median :7230264   Mode  :character   Mode  :character   Median :55.00  
 Mean   :7230264                                         Mean   :53.39  
 3rd Qu.:7230396                                         3rd Qu.:67.00  
 Max.   :7230527                                         Max.   :91.00  
   ppeduc5            ppeducat           ppgender            ppethm         
 Length:527         Length:527         Length:527         Length:527        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
   pphhsize            ppinc7            ppmarit5           ppmsacat        
 Length:527         Length:527         Length:527         Length:527        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
    ppreg4             pprent            ppstaten           PPWORKA         
 Length:527         Length:527         Length:527         Length:527        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
   ppemploy             Q1_a               Q1_b               Q1_c          
 Length:527         Length:527         Length:527         Length:527        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
     Q1_d               Q1_e               Q1_f                Q2           
 Length:527         Length:527         Length:527         Length:527        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
      Q3                 Q4                 Q5                QPID          
 Length:527         Length:527         Length:527         Length:527        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
    ABCAGE            Contact           weights_pid    
 Length:527         Length:527         Min.   :0.3240  
 Class :character   Class :character   1st Qu.:0.6332  
 Mode  :character   Mode  :character   Median :0.8451  
                                       Mean   :1.0000  
                                       3rd Qu.:1.1516  
                                       Max.   :6.2553  
Code
n_distinct(abc_poll$ppstaten)
[1] 49

Briefly describe the data

The ABC Poll dataset most likely captures responses from a nationwide US poll. Running n_distinct() on the state variable, ppstaten, returns 49 distinct values, so not every state is represented. The dim() function reveals that there are 527 rows (observations) and 31 columns (variables). Assuming one row per respondent, we can infer that 527 people responded to the poll. From the summary() output, we can see that most of the variables are stored as character values, such as education level, gender, marital status, region, and employment status, and are aimed at collecting demographic and political data. Only id, ppage, and weights_pid are numeric. The remaining variables help organize the information collected by the poll: the ID number identifies each respondent, the Q1-Q5 variables hold the answers to the poll questions, and the Contact variable records respondent preferences for further interviews.
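To back up the reading of the summary() output, one option is to count columns by storage type directly. The sketch below runs on a small toy tibble, toy_poll, whose values are made up for illustration; only the column names are borrowed from the real data.

```r
library(dplyr)

# Toy stand-in for abc_poll (hypothetical values; only the column names come from the post)
toy_poll <- tibble(
  id          = c(7230001, 7230002, 7230003),
  ppage       = c(34, 58, 71),
  ppstaten    = c("MA", "TX", "MA"),
  QPID        = c("A Democrat", "An Independent", "Skipped"),
  weights_pid = c(0.85, 1.10, 0.63)
)

# Count how many columns are stored as each type
type_counts <- table(sapply(toy_poll, class))
type_counts

# List only the numeric columns
numeric_cols <- toy_poll %>%
  select(where(is.numeric)) %>%
  names()
numeric_cols
```

Swapping toy_poll for abc_poll should give the same kind of breakdown for the full dataset.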

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

The dataset is already tidy: every value has its own cell, each variable has its own column, and each observation (one poll respondent) has its own row.
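A quick sanity check for the "one row per respondent" assumption is to confirm that no id repeats. This is a minimal sketch on made-up data; toy_poll is a hypothetical stand-in for abc_poll.

```r
library(dplyr)

# Toy stand-in for abc_poll (hypothetical values)
toy_poll <- tibble(
  id   = c(7230001, 7230002, 7230003),
  QPID = c("A Democrat", "An Independent", "Skipped")
)

# TRUE when every id appears exactly once
ids_unique <- n_distinct(toy_poll$id) == nrow(toy_poll)
ids_unique

# Any ids that appear more than once (an empty tibble if there are none)
dup_ids <- toy_poll %>%
  count(id) %>%
  filter(n > 1)
dup_ids
```

If ids_unique came back FALSE on the real data, dup_ids would show which respondents are duplicated.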

Identify variables that need to be mutated

Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

The main mutations this dataset needs are reducing and cleaning some of the string variables into more sensible categories.
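One way to find string columns that need cleaning is to count a placeholder response such as "Skipped" in every character column at once. The sketch below assumes "Skipped" is the placeholder of interest and uses a made-up toy tibble in place of abc_poll.

```r
library(dplyr)

# Toy stand-in for abc_poll (hypothetical values)
toy_poll <- tibble(
  Q1_a = c("Approve", "Skipped", "Disapprove"),
  Q2   = c("Skipped", "Skipped", "Yes"),
  QPID = c("A Democrat", "An Independent", "Skipped")
)

# Number of "Skipped" responses in each character column
skipped_counts <- toy_poll %>%
  summarise(across(where(is.character), ~ sum(.x == "Skipped", na.rm = TRUE)))
skipped_counts
```

Columns with a non-zero count are candidates for recoding to NA.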

Mutating the QPID variable

In the QPID (political party identification) variable, we can strip the leading “A” or “An” so that a response like “An Independent” becomes “Independent.”

Running the table() function shows that, in addition to the party responses “A Democrat,” “A Republican,” and “An Independent,” the QPID column contains two other types of responses: “Skipped” and “Something else.” The “Something else” response can remain as is, but we can recode “Skipped” as NA.

Code
# view the QPID column within the dataset
table(abc_poll$QPID)

    A Democrat   A Republican An Independent        Skipped Something else 
           176            152            168              3             28 
Code
# remove the leading "A " or "An " from party identification responses and recode "Skipped" as NA
abc_poll2 <- abc_poll %>%
  # "^An? " is anchored to the start and includes the trailing space, so no leading blank is left behind
  mutate(party_id = str_remove(QPID, "^An? "),
         party_id = case_when(
           str_detect(QPID, "Skipped") ~ NA_character_,
           TRUE ~ party_id
         )) %>%
  select(-QPID)

abc_poll2
Code
# check
table(abc_poll2$party_id)

      Democrat    Independent     Republican Something else 
           176            168            152             28 
Code
unique(abc_poll2$party_id)
[1] "Democrat"       "Independent"    "Something else" "Republican"    
[5] NA              

The QPID variable has been replaced by party_id. As expected, “Skipped” is now NA, and the “Democrat,” “Republican,” and “Independent” responses no longer have the word “A” or “An” before them!
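A stricter check is to keep QPID alongside the recoded column for a moment and count the pairs, so every original response can be seen next to what it became. This is a sketch on made-up data; the pattern "^An? " (a leading "A " or "An ") and the use of na_if() are one possible way to write the recode, not necessarily the only one.

```r
library(dplyr)
library(stringr)

# Toy stand-in for the QPID column (hypothetical rows)
toy_poll <- tibble(
  QPID = c("A Democrat", "A Republican", "An Independent",
           "Skipped", "Something else", "A Democrat")
)

# Recode without dropping QPID, then count each (original, recoded) pair
recode_check <- toy_poll %>%
  mutate(party_id = na_if(str_remove(QPID, "^An? "), "Skipped")) %>%
  count(QPID, party_id)
recode_check
```

Each QPID value should appear on exactly one row, paired with a single party_id value.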

Changing “Skipped” to NA in responses within the Q1-Q5 columns/variables

When poll participants didn’t answer one of the Q1-Q5 questions, a response of “Skipped” was recorded. As in the previous section, we can convert this to NA for analysis purposes. Instead of mutating one ‘Q’ column at a time, we can handle them all at once by using the across() function within mutate(). Because QPID has already been dropped from abc_poll2, starts_with("Q") matches only the Q1-Q5 columns.

Code
# mutating "Skipped" to NA in the Q1-Q5 columns/variables
abc_poll2<-abc_poll2%>%
  mutate(across(starts_with("Q"), ~ na_if(.x, "Skipped")))
abc_poll2
Code
#checking one of the 'Q' columns, Q1_b, to see if the NA value has taken the place of "Skipped"
table(abc_poll2$Q1_b)

   Approve Disapprove 
       192        322 
Code
unique(abc_poll2$Q1_b)
[1] "Approve"    "Disapprove" NA          

Looks like NA has taken the place of “Skipped” in Q1_b (and presumably in all the Q columns), as expected!
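Rather than spot-checking a single column, the same check can be run across every Q column by comparing the number of "Skipped" responses before the mutation with the number of NA values after it. The sketch below uses a made-up toy tibble and assumes the Q columns contain no NA values before recoding.

```r
library(dplyr)

# Toy stand-in for abc_poll2 (hypothetical values)
toy_poll <- tibble(
  Q1_a  = c("Approve", "Skipped", "Disapprove"),
  Q1_b  = c("Skipped", "Skipped", "Approve"),
  ppage = c(34, 58, 71)
)

# "Skipped" counts per Q column before recoding
skipped_before <- toy_poll %>%
  summarise(across(starts_with("Q"), ~ sum(.x == "Skipped")))

# Same recode as in the post
toy_clean <- toy_poll %>%
  mutate(across(starts_with("Q"), ~ na_if(.x, "Skipped")))

# NA counts per Q column after recoding
na_after <- toy_clean %>%
  summarise(across(starts_with("Q"), ~ sum(is.na(.x))))

# TRUE when every "Skipped" became an NA and nothing else did
all_match <- all(unlist(skipped_before) == unlist(na_after))
all_match
```

If all_match is TRUE, every Q column was recoded as intended.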