Challenge 4

challenge_4

abc_poll

More data wrangling: pivoting

Author

Matthew Weiner

Published

March 29, 2023

Code

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
tidy data (as needed, including sanity checks)
identify variables that need to be mutated
mutate variables and sanity check all mutations

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

abc_poll.csv ⭐
poultry_tidy.xlsx or organiceggpoultry.xls ⭐⭐
FedFundsRate.csv ⭐⭐⭐
hotel_bookings.csv ⭐⭐⭐⭐
debt_in_trillions.xlsx ⭐⭐⭐⭐⭐

Introduction

For this challenge I chose to investigate the abc_poll dataset.

Understanding the Data

The first thing I needed to do was get a better understanding of what this dataset looks like and what it is about. The following code block does this by examining the shape of the dataset and the names of its variables.

Code

library(readr)
abc <- read_csv("_data/abc_poll_2021.csv")

# num of rows of dataset
nrow(abc)

[1] 527

Code

# num of cols of dataset
ncol(abc)

[1] 31

Code

# names of all the columns
colnames(abc)

 [1] "id"              "xspanish"        "complete_status" "ppage"          
 [5] "ppeduc5"         "ppeducat"        "ppgender"        "ppethm"         
 [9] "pphhsize"        "ppinc7"          "ppmarit5"        "ppmsacat"       
[13] "ppreg4"          "pprent"          "ppstaten"        "PPWORKA"        
[17] "ppemploy"        "Q1_a"            "Q1_b"            "Q1_c"           
[21] "Q1_d"            "Q1_e"            "Q1_f"            "Q2"             
[25] "Q3"              "Q4"              "Q5"              "QPID"           
[29] "ABCAGE"          "Contact"         "weights_pid"
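The three calls above can also be collapsed into a quick overview. The following is a minimal sketch that uses a small made-up tibble (`toy`, a hypothetical stand-in with a few column names borrowed from the poll) rather than the real file:

```r
library(dplyr)
library(tibble)

# Toy stand-in for the poll data; the values are invented for illustration.
toy <- tibble(
  id    = 1:3,
  ppage = c(34, 58, 41),
  Q1_a  = c("Approve", "Disapprove", "Skipped")
)

# dim() returns the number of rows and columns in one call.
toy_dim <- dim(toy)
print(toy_dim)

# glimpse() lists every column with its type and first few values.
glimpse(toy)
```

Running `glimpse(abc)` on the real data would show the type of each of the 31 columns alongside its name.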

Code

num_states <- n_distinct(abc$ppstaten)
print(num_states)

[1] 49

These initial checks suggest that the dataset holds responses from a poll, likely political, conducted for the news network ABC. The 49 distinct values of ppstaten also indicate that this is a nationwide poll, since almost every state is represented.
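To see which states are absent and how respondents are spread across the rest, a sketch along the following lines may help. It assumes a column of full state names (the hypothetical `state` column below); the actual coding of ppstaten may differ and would need to be checked first.

```r
library(dplyr)
library(tibble)

# Hypothetical respondents; `state` is an assumed column of full state names.
toy <- tibble(state = c("Ohio", "Texas", "Ohio", "Maine"))

# Respondents per state, largest first.
state_counts <- toy %>% count(state, sort = TRUE)
print(state_counts)

# States in R's built-in `state.name` vector with no respondents in the toy data.
missing_states <- setdiff(state.name, toy$state)
print(length(missing_states))
```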

The variables include personal information about the participants, such as their education, household size, and age. There are also variables related to the questions asked of the participants:

Code

abc_Q <- abc %>% select(starts_with("Q")) %>% colnames()
print(abc_Q)

 [1] "Q1_a" "Q1_b" "Q1_c" "Q1_d" "Q1_e" "Q1_f" "Q2"   "Q3"   "Q4"   "Q5"  
[11] "QPID"

Tidy Data

When tidying up this data, we first want to check if there are any missing entries in the dataset.

Code

#count number of missing entries 
sum(is.na(abc))

[1] 0

The code block above shows that no values in the dataset are coded as NA.

However, if we look at the responses to the individual questions, we can see a value called Skipped:

Code

table(abc$Q1_a)


   Approve Disapprove    Skipped 
       329        193          5
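The same check can be run over every question column at once instead of one `table()` call per column. This is a minimal sketch on invented data (the tibble `toy` and its values are placeholders, not the poll responses):

```r
library(dplyr)
library(tibble)

# Toy stand-in for the question columns; values are invented.
toy <- tibble(
  Q1_a = c("Approve", "Skipped", "Disapprove"),
  Q2   = c("Skipped", "Skipped", "Very concerned")
)

# Count "Skipped" responses in every column whose name starts with "Q".
skipped_counts <- toy %>%
  summarise(across(starts_with("Q"), ~ sum(.x == "Skipped")))
print(skipped_counts)
```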

To clean this up, we replace all Skipped responses with NA, which makes later operations on the dataset easier.

Code

# recode "Skipped" as NA in every column whose name starts with "Q" (this includes QPID)
abc <- abc %>% mutate(across(starts_with("Q"), ~ ifelse(. == "Skipped", NA, .)))

This code block changes every Skipped value in the question columns to NA. We can then view the results as part of our sanity check:

Code

table(abc$Q1_a)


   Approve Disapprove 
       329        193

We can also confirm that this worked by checking the number of missing entries again:

Code

sum(is.na(abc))

[1] 86
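A stricter version of this sanity check compares the number of NA values created with the number of Skipped values that existed beforehand. The sketch below does this on invented data and uses `dplyr::na_if()` as an alternative to the `ifelse()` call; the tibble `toy` and its values are assumptions for illustration only.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the question columns; values are invented.
toy <- tibble(
  Q1_a = c("Approve", "Skipped", "Disapprove"),
  Q2   = c("Skipped", "Skipped", "Yes")
)

# Total "Skipped" responses before recoding.
n_skipped <- sum(toy == "Skipped")

# na_if() turns every matching value into NA, column by column.
toy_clean <- toy %>%
  mutate(across(starts_with("Q"), ~ na_if(.x, "Skipped")))

# The number of NAs created should equal the number of "Skipped" values.
n_na <- sum(is.na(toy_clean))
stopifnot(n_na == n_skipped)

# NAs per column show where the skipped answers were.
print(colSums(is.na(toy_clean)))
```

Applied to the real data, the count taken before recoding would be expected to match the 86 reported above, since the dataset had no NA values to begin with.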

Another thing we could fix with this dataset is the format for some of the variables. For instance, the values of the variable QPID are:

Code

unique(abc$QPID)

[1] "A Democrat"     "An Independent" "Something else" "A Republican"  
[5] NA

While this is not a big deal, the articles at the start of each value are unnecessary, so we can mutate the dataset to remove them:

Code

abc <- abc %>%
  mutate(QPID = gsub("^A\\s|^An\\s", "", QPID))

Now we can perform a sanity check on this dataset to make sure that the articles were removed properly:

Code

table(abc$QPID)


      Democrat    Independent     Republican Something else 
           176            168            152             28
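The same cleanup can be written with a slightly shorter pattern. This sketch uses `stringr::str_remove()` on the five QPID values listed earlier; the vector name `party` is introduced here only for illustration.

```r
library(stringr)

# The distinct QPID values shown earlier, including the NA.
party <- c("A Democrat", "An Independent", "Something else", "A Republican", NA)

# Remove a leading "A " or "An "; the "?" makes the "n" optional.
party_clean <- str_remove(party, "^An? ")
print(party_clean)
```

Because the pattern is anchored with `^` and requires a trailing space, "Something else" is left unchanged and NA stays NA.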

Conclusion

In this challenge we used R to recode Skipped responses as NA and to strip the articles from the QPID values, which makes the dataset easier to read and to work with in later analysis.