challenge_4
abc_poll
More data wrangling: pivoting
Author

Matthew Weiner

Published

March 29, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • abc_poll.csv ⭐
  • poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
  • FedFundsRate.csv⭐⭐⭐
  • hotel_bookings.csv⭐⭐⭐⭐
  • debt_in_trillions.xlsx ⭐⭐⭐⭐⭐

Introduction

For this challenge, I chose to investigate the abc_poll dataset.

Understanding the Data

The first step was to get a better understanding of what this dataset looks like and what it is about. The following code block reads in the data and examines the shape of the dataset as well as the names of its variables.

Code
library(readr)
abc <- read_csv("_data/abc_poll_2021.csv")

# num of rows of dataset
nrow(abc)
[1] 527
Code
#num of cols of dataset
ncol(abc)
[1] 31
Code
#name of all the columns
colnames(abc)
 [1] "id"              "xspanish"        "complete_status" "ppage"          
 [5] "ppeduc5"         "ppeducat"        "ppgender"        "ppethm"         
 [9] "pphhsize"        "ppinc7"          "ppmarit5"        "ppmsacat"       
[13] "ppreg4"          "pprent"          "ppstaten"        "PPWORKA"        
[17] "ppemploy"        "Q1_a"            "Q1_b"            "Q1_c"           
[21] "Q1_d"            "Q1_e"            "Q1_f"            "Q2"             
[25] "Q3"              "Q4"              "Q5"              "QPID"           
[29] "ABCAGE"          "Contact"         "weights_pid"    
Code
num_states <- n_distinct(abc$ppstaten)
print(num_states)
[1] 49

These initial investigations suggest that this dataset holds data collected about participants in a poll, likely political, associated with the news network ABC. The poll also appears to be nationwide: ppstaten takes 49 distinct values, so almost every state is represented.
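The shape checks above can also be done in fewer calls. The following is a minimal sketch on a small made-up tibble (the name toy and all of its values are invented for illustration, not taken from the poll data); dim() returns rows and columns together, glimpse() lists each column with its type, and count() shows how many respondents fall in each state.

```r
library(dplyr)

# Hypothetical stand-in for the poll data; every value here is invented.
toy <- tibble(
  id       = 1:4,
  ppstaten = c("MA", "NY", "MA", "CA"),
  Q1_a     = c("Approve", "Disapprove", "Skipped", "Approve")
)

# Number of rows and columns in one call
toy_dim <- dim(toy)
print(toy_dim)

# Column names, types, and the first few values of each column
glimpse(toy)

# Number of distinct states, then respondents per state (largest first)
toy_states <- n_distinct(toy$ppstaten)
print(toy_states)
count(toy, ppstaten, sort = TRUE)
```

On the real data, the same calls would be applied to abc instead of toy.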

The variables include personal information about the participants, such as their education, household size, and age. There are also variables related to the questions asked of the participants:

Code
abc_Q <- abc %>% select(starts_with("Q")) %>% colnames()
print(abc_Q)
 [1] "Q1_a" "Q1_b" "Q1_c" "Q1_d" "Q1_e" "Q1_f" "Q2"   "Q3"   "Q4"   "Q5"  
[11] "QPID"

Tidy Data

When tidying up this data, we first want to check if there are any missing entries in the dataset.

Code
#count number of missing entries 
sum(is.na(abc))
[1] 0

The above code block shows that no values in the dataset are coded as NA.

However, if we look at the responses to the questions asked of the participants, we can see that there is a value called Skipped:

Code
table(abc$Q1_a)

   Approve Disapprove    Skipped 
       329        193          5 
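Rather than tabulating one question at a time, the Skipped responses can be counted for every question column at once. This is a minimal sketch on a made-up tibble (toy, its values, and the Q2 answer labels are assumptions for illustration only); it uses summarise() with across() to count Skipped in each column whose name starts with "Q".

```r
library(dplyr)

# Hypothetical stand-in for the poll data; every value here is invented.
toy <- tibble(
  Q1_a  = c("Approve", "Skipped", "Disapprove", "Approve"),
  Q2    = c("Skipped", "Skipped", "Not concerned", "Very concerned"),
  ppage = c(34, 51, 27, 68)
)

# One row, one column per question, holding the number of "Skipped" responses
skipped_counts <- toy %>%
  summarise(across(starts_with("Q"), ~ sum(.x == "Skipped")))

print(skipped_counts)
```

Columns that do not start with "Q", such as ppage here, are left out of the summary.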

To clean this data up, we want to replace every Skipped response with NA, as this will make any future work with this dataset easier.

Code
# replace "Skipped" with NA in every question column (names starting with "Q")
abc <- abc %>% mutate(across(starts_with("Q"), ~ ifelse(. == "Skipped", NA, .)))

This code block changes every Skipped value in the question columns to NA. We can then view the results as part of our sanity check:

Code
table(abc$Q1_a)

   Approve Disapprove 
       329        193 

We can also confirm that this worked by checking the number of missing entries again:

Code
sum(is.na(abc))
[1] 86
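A stricter version of this sanity check compares the number of Skipped responses before the recode with the number of NA values after it; the two should match when the data had no NA values to begin with. The sketch below runs on a made-up tibble (toy and its values are invented for illustration) and uses dplyr's na_if() as an alternative way to write the same recode, not the approach used above.

```r
library(dplyr)

# Hypothetical stand-in for the poll data; every value here is invented.
toy <- tibble(
  Q1_a  = c("Approve", "Skipped", "Disapprove", "Approve"),
  QPID  = c("A Democrat", "Skipped", "An Independent", "Skipped"),
  ppage = c(34, 51, 27, 68)
)

# Count "Skipped" across all question columns before recoding
q_cols <- select(toy, starts_with("Q"))
n_skipped_before <- sum(unlist(q_cols) == "Skipped")

# na_if() turns values equal to "Skipped" into NA and leaves the rest alone
cleaned <- toy %>%
  mutate(across(starts_with("Q"), ~ na_if(.x, "Skipped")))

# Every NA after the recode should correspond to one "Skipped" before it
n_na_after <- sum(is.na(cleaned))
stopifnot(n_na_after == n_skipped_before)
```

If stopifnot() raises no error, the recode converted exactly the Skipped responses and nothing else.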

Another thing we could fix with this dataset is the format for some of the variables. For instance, the values of the variable QPID are:

Code
unique(abc$QPID)
[1] "A Democrat"     "An Independent" "Something else" "A Republican"  
[5] NA              

While this is not a big deal, the articles at the start of each value are unnecessary, so we can mutate the dataset to remove them:

Code
abc <- abc %>%
  mutate(QPID = gsub("^A\\s|^An\\s", "", QPID))

Now we can perform a sanity check on this dataset to make sure that the articles were removed properly:

Code
table(abc$QPID)

      Democrat    Independent     Republican Something else 
           176            168            152             28 
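The check above can also be made explicit by asserting that no value still starts with an article. This is a minimal sketch on a made-up tibble (toy is invented, though its QPID values mirror the ones listed earlier); it uses stringr's str_remove() as an alternative to gsub() and str_detect() for the assertion.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the poll data, reusing the QPID values shown above.
toy <- tibble(
  QPID = c("A Democrat", "An Independent", "Something else", "A Republican", NA)
)

# Remove a leading "A " or "An "; values without an article and NA are unchanged
cleaned <- toy %>%
  mutate(QPID = str_remove(QPID, "^(A|An)\\s"))

# No remaining value should begin with an article followed by whitespace
stopifnot(!any(str_detect(cleaned$QPID, "^(A|An)\\s"), na.rm = TRUE))

count(cleaned, QPID)
```

Unlike table(), count() keeps the NA group in its output, which makes the missing responses visible in the same summary.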

Conclusion

In this challenge, we saw how R commands can be used to tidy and mutate a dataset: Skipped responses were recoded as NA, and the QPID values were simplified, which makes the data easier to read and to work with.