library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
Matthew Weiner
March 29, 2023
Today’s challenge is to:
Read in one (or more) of the following datasets, using the correct R package and command.
For this challenge I chose to investigate the abc_poll dataset.
The first thing I needed to do was get a better understanding of what this dataset looks like and what it is about. The following code block does this by examining the shape of the dataset and the names of its variables.
[1] 527
[1] 31
[1] "id" "xspanish" "complete_status" "ppage"
[5] "ppeduc5" "ppeducat" "ppgender" "ppethm"
[9] "pphhsize" "ppinc7" "ppmarit5" "ppmsacat"
[13] "ppreg4" "pprent" "ppstaten" "PPWORKA"
[17] "ppemploy" "Q1_a" "Q1_b" "Q1_c"
[21] "Q1_d" "Q1_e" "Q1_f" "Q2"
[25] "Q3" "Q4" "Q5" "QPID"
[29] "ABCAGE" "Contact" "weights_pid"
These initial investigations lead me to believe that this dataset contains responses collected from participants in a poll, likely political, conducted for the news network ABC. Additionally, this appears to be a nationwide poll, as almost every state is included in the dataset.
The variables include personal information about the participants, such as their education, household size, and age. There are also variables for the questions asked of the participants:
When tidying up this data, we first want to check if there are any missing entries in the dataset.
The code block above shows that no entries in the dataset are coded as NA.
However, if we look at the responses to the questions asked of the participants, we can see that there is a value called Skipped:
To clean this up, we want to replace all Skipped responses with NA, as this will make any future work with this dataset easier.
This code block changes every Skipped value to NA. We can then view the results as part of our sanity check:
We can also confirm that this worked by checking the number of missing entries again:
Another thing we could fix in this dataset is the format of some of the variables. For instance, the values of the variable QPID are:
While this is not a big deal, the articles at the start of each value are unnecessary, so we can mutate the dataset to remove them:
Now we can perform a sanity check to make sure that the articles were removed properly:
In this challenge we saw how R commands can be used to mutate a dataset, here by recoding Skipped responses as NA and removing unnecessary articles, to make it more readable and easier to work with.
---
title: "Challenge 4"
author: "Matthew Weiner"
description: "More data wrangling: pivoting"
date: "03/29/2023"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_4
  - abc_poll
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1) read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2) tidy data (as needed, including sanity checks)
3) identify variables that need to be mutated
4) mutate variables and sanity check all mutations
## Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- abc_poll.csv ⭐
- poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
- FedFundsRate.csv⭐⭐⭐
- hotel_bookings.csv⭐⭐⭐⭐
- debt_in_trillions.xlsx ⭐⭐⭐⭐⭐
## Introduction
For this challenge I chose to investigate the abc_poll dataset.
## Understanding the Data
The first thing I needed to do was get a better understanding of what this dataset looks like and what it is about. The following code block does this by examining the shape of the dataset and the names of its variables.
```{r}
library(readr)
abc <- read_csv("_data/abc_poll_2021.csv")
# number of rows in the dataset
nrow(abc)
# number of columns in the dataset
ncol(abc)
# names of all the columns
colnames(abc)
```
```{r}
num_states <- n_distinct(abc$ppstaten)
print(num_states)
```
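A count of distinct values says how many states appear, but not which ones are absent. The sketch below shows one way to list them, using R's built-in `state.name` vector. It runs on a small made-up vector (`toy_states`, hypothetical) and assumes that `ppstaten` stores full state names, which should be checked against the real column first.

```{r}
# Toy stand-in for abc$ppstaten; these values are hypothetical.
toy_states <- c("California", "Texas", "New York", "Texas", "Ohio")

# state.name is a built-in character vector of the 50 US state names.
present <- intersect(state.name, unique(toy_states))
missing_states <- setdiff(state.name, unique(toy_states))

length(present)       # how many of the 50 states appear
head(missing_states)  # a few of the states with no respondents
```

On the real data, `toy_states` would be replaced by `abc$ppstaten`.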
These initial investigations lead me to believe that this dataset contains responses collected from participants in a poll, likely political, conducted for the news network ABC. Additionally, the count of distinct `ppstaten` values suggests that this is a nationwide poll, as almost every state is included in the dataset.
The variables include personal information about the participants, such as their education, household size, and age. There are also variables for the questions asked of the participants:
```{r}
abc_Q <- abc %>% select(starts_with("Q")) %>% colnames()
print(abc_Q)
```
## Tidy Data
When tidying up this data, we first want to check if there are any missing entries in the dataset.
```{r}
#count number of missing entries
sum(is.na(abc))
```
The code block above shows that no entries in the dataset are coded as `NA`.
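A single total does not say where missing values sit when there are some. As a minimal sketch, the same check can be broken down by column with `colSums()`. The data frame `toy_missing` is hypothetical: its column names mirror the real ones, but the values are made up.

```{r}
# Hypothetical miniature of the poll data, for illustration only.
toy_missing <- data.frame(
  id    = 1:4,
  ppage = c(34, NA, 51, 67),
  Q1_a  = c("Approve", "Skipped", "Disapprove", "Approve")
)

# One count of NA values per column, instead of a single overall total.
na_per_column <- colSums(is.na(toy_missing))
na_per_column
```

Note that the `Skipped` entry in `Q1_a` is not counted, because it is an ordinary string rather than `NA`.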
However, if we look at the responses to the questions asked of the participants, we can see that there is a value called `Skipped`:
```{r}
table(abc$Q1_a)
```
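The table above covers only `Q1_a`. To check every question column at once, one option is to count `Skipped` responses with `summarise()` and `across()`. This is a sketch on a hypothetical data frame (`toy_skipped`, with made-up response labels), not output from the real poll.

```{r}
library(dplyr)

# Hypothetical miniature of the question columns; values are made up.
toy_skipped <- data.frame(
  id   = 1:4,
  Q1_a = c("Approve", "Skipped", "Disapprove", "Approve"),
  Q2   = c("Skipped", "Skipped", "Not concerned at all", "Very concerned"),
  QPID = c("A Democrat", "An Independent", "Skipped", "A Republican")
)

# Count "Skipped" responses in each column whose name starts with "Q".
skipped_per_question <- toy_skipped %>%
  summarise(across(starts_with("Q"), ~ sum(.x == "Skipped")))
skipped_per_question
```

Because `QPID` also starts with "Q", `starts_with("Q")` selects it along with the numbered questions.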
To clean this up, we want to replace all `Skipped` responses with `NA`, as this will make any future work with this dataset easier.
```{r}
# recode "Skipped" as NA in every column whose name starts with "Q"
abc <- abc %>% mutate(across(starts_with("Q"), ~ ifelse(. == "Skipped", NA, .)))
```
This code block changes every `Skipped` value to `NA`. We can then view the results as part of our sanity check:
```{r}
table(abc$Q1_a)
```
We can also confirm that this worked by checking the number of missing entries again:
```{r}
sum(is.na(abc))
```
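A stricter version of this sanity check compares the number of new `NA` values with the number of `Skipped` responses counted beforehand. The sketch below does that on a hypothetical data frame (`toy_recode`), and uses dplyr's `na_if()` as an alternative to `ifelse()` for the recoding. It assumes the question columns are character vectors, as they are in this toy data.

```{r}
library(dplyr)

# Hypothetical miniature of the question columns; values are made up.
toy_recode <- data.frame(
  id   = 1:4,
  Q1_a = c("Approve", "Skipped", "Disapprove", "Approve"),
  Q2   = c("Skipped", "Skipped", "Not concerned at all", "Very concerned")
)

# Number of "Skipped" responses before recoding.
n_skipped <- sum(unlist(select(toy_recode, starts_with("Q"))) == "Skipped")

# na_if() replaces values equal to "Skipped" with NA, leaving the rest as is.
toy_recoded <- toy_recode %>%
  mutate(across(starts_with("Q"), ~ na_if(.x, "Skipped")))

# Every "Skipped" should now be an NA, with no extra NAs introduced.
sum(is.na(toy_recoded)) == n_skipped
```

If the two counts differ, either some `Skipped` values were missed or the data already contained `NA` values before recoding.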
Another thing we could fix in this dataset is the format of some of the variables. For instance, the values of the variable `QPID` are:
```{r}
unique(abc$QPID)
```
While this is not a big deal, the articles at the start of each value are unnecessary, so we can mutate the dataset to remove them:
```{r}
# strip a leading "A " or "An " from each QPID value
abc <- abc %>%
  mutate(QPID = gsub("^A\\s|^An\\s", "", QPID))
```
Now we can perform a sanity check to make sure that the articles were removed properly:
```{r}
table(abc$QPID)
```
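The same cleanup can also be written with stringr, and the check can be made explicit by testing whether any value still starts with an article. This is a sketch on a hypothetical vector (`toy_pid`) of QPID-style labels, not the actual values in the dataset.

```{r}
library(stringr)

# Hypothetical QPID-style values, for illustration only.
toy_pid <- c("A Democrat", "An Independent", "A Republican", "Something else")

# Remove one leading "A " or "An "; values without an article are untouched.
toy_pid_clean <- str_remove(toy_pid, "^(A|An)\\s")

# Sanity check: no cleaned value should still start with an article.
any(str_detect(toy_pid_clean, "^(A|An)\\s"))
```

Requiring whitespace after the article keeps the pattern from matching words that merely begin with "A".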
## Conclusion
In this challenge we saw how R commands can be used to mutate a dataset, here by recoding `Skipped` responses as `NA` and removing unnecessary articles, to make it more readable and easier to work with.