Code
library(tidyverse)
library(lubridate)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Michaela Bowen
October 6, 2022
In today’s Challenge I have attempted to:
Today I am reading in:
Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id [numeric] |
|
527 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
xspanish [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
complete_status [character] | 1. qualified |
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ppage [numeric] |
|
72 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ppeduc5 [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ppeducat [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ppgender [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ppethm [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
pphhsize [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ppinc7 [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ppmarit5 [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ppmsacat [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ppreg4 [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
pprent [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ppstaten [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
PPWORKA [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ppemploy [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Q1_a [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Q1_b [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Q1_c [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Q1_d [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Q1_e [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Q1_f [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Q2 [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Q3 [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Q4 [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Q5 [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
QPID [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ABCAGE [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Contact [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
weights_pid [numeric] |
|
453 distinct values | 0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.2.1)
2022-11-26
[1] "ppage" "ppeduc5" "ppeducat" "ppgender" "ppethm" "pphhsize"
[7] "ppinc7" "ppmarit5" "ppmsacat" "ppreg4" "pprent" "ppstaten"
[13] "PPWORKA" "ppemploy"
[1] "Q1_a" "Q1_b" "Q1_c" "Q1_d" "Q1_e" "Q1_f" "Q2" "Q3" "Q4" "Q5"
[11] "QPID"
The ABC Poll Data is survey data from ABC new political polls. This data looks to be from 2021. After investigating the data we can see a few things:
There are demographic variables, such as education level, household size, income level, among other identifying information. We can also see that each of the 527 respondents have a unique “id”, so there are no double respondents.
There are 10 questions that ask political sentiments (Q_1 (a-f) - Q_5) as well as one asking political affiliation.
abc_poll_tidy <- read_csv("_data/abc_poll_2021.csv",
skip = 1,
col_names = c("delete","language","delete","pp_age","pp_education","delete","pp_gender","pp_ethnicity", "pp_house_size","pp_incomelvl","pp_marital_status","pp_metrocat","pp_region","pp_rent","pp_state","pp_working_status","pp_employment","q_1_a","q_1_b","q_1_c","q_1_d","q_1_e","q_1_f","q_2","q_3","q_4","q_5","pp_partyid","pp_age_range","pp_can_contact","weights"))%>%
select(!contains("delete"))%>%
#removing redundant "Non-Hispanic" identification from ethnicity column
mutate(pp_ethnicity = str_remove(pp_ethnicity, ", Non-Hispanic"))%>%
#mutating political identity column to remove "An" and "A"
mutate(
pp_partyid = str_remove(pp_partyid, "A[n]*"),
pp_partyid = na_if(pp_partyid, "Skipped")
)%>%
mutate(
pp_education = str_remove(pp_education, "\x92")
)
#Sanity Checks
prop.table(table(abc_poll_tidy$pp_partyid))
Democrat Independent Republican Something else
0.33587786 0.32061069 0.29007634 0.05343511
2+ Races Black Hispanic Other White
0.03984820 0.05123340 0.09677419 0.04554080 0.76660342
Bachelors degree
0.20493359
High school graduate (high school diploma or the equivalent GED)
0.25237192
Masters degree or above
0.18785579
No high school diploma or GED
0.05502846
Some college or Associate degree
0.29981025
Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
For most streamline analysis there are several columns that must mutated. pp_incomelvl
: - Change from numeric data, to categorical data giving each income bracket a category
Document your work here.
abc_poll_tidy <- abc_poll_tidy%>%
#categorizing income level
mutate(pp_incomelvl = case_when(
pp_incomelvl == "Less than $10,000" ~ "Income Level 1",
pp_incomelvl == "$10,000 to $24,999" ~ "Income Level 2",
pp_incomelvl == "$25,000 to $49,999" ~ "Income Level 3",
pp_incomelvl == "$50,000 to $74,999" ~ "Income Level 4",
pp_incomelvl == "$75,000 to $99,999" ~ "Income Level 5",
pp_incomelvl == "$100,000 to $149,999" ~ "Income Level 6",
pp_incomelvl == "$150,000 or more" ~ "Income Level 7"
))
Any additional comments? Here is the final dataframe
Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id [numeric] |
|
527 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
xspanish [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
complete_status [character] | 1. qualified |
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ppage [numeric] |
|
72 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ppeduc5 [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ppeducat [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ppgender [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ppethm [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
pphhsize [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ppinc7 [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ppmarit5 [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ppmsacat [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ppreg4 [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
pprent [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ppstaten [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
PPWORKA [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ppemploy [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Q1_a [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Q1_b [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Q1_c [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Q1_d [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Q1_e [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Q1_f [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Q2 [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Q3 [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Q4 [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Q5 [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
QPID [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ABCAGE [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Contact [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
weights_pid [numeric] |
|
453 distinct values | 0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.2.1)
2022-11-26
---
title: "Challenge 4"
author: "Michaela Bowen"
desription: "More data wrangling: pivoting"
date: "10/06/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_4
- abc_poll
- eggs
- fed_rates
- hotel_bookings
- debt
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(lubridate)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
In today's Challenge I have attempted to:
1) read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2) tidy data (as needed, including sanity checks)
3) identify variables that need to be mutated
4) mutate variables and sanity check all mutations
## Read in data
Today I am reading in:
- abc_poll.csv ⭐
```{r}
library(readr)
abc_poll <- read_csv("_data/abc_poll_2021.csv")
```
### Briefly describe the data
```{r}
library(summarytools)
##dataframe summary
print(summarytools::dfSummary(abc_poll,
varnumbers = FALSE,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.70,
valid.col = FALSE),
method = 'render',
table.classes = 'table-condensed')
```
```{r}
##Identify Participant Demographic Variables
abc_poll%>%
select(starts_with("pp"))%>%
colnames()
```
```{r}
##Identify Political Question Variables
abc_poll%>%
select(starts_with("Q"))%>%
colnames()
```
The ABC Poll Data is survey data from ABC new political polls. This data looks to be from 2021. After investigating the data we can see a few things:
- There are demographic variables, such as education level, household size, income level, among other identifying information. We can also see that each of the 527 respondents have a unique "id", so there are no double respondents.
- There are 10 questions that ask political sentiments (Q_1 (a-f) - Q_5) as well as one asking political affiliation.
## Tidy Data
- There are a few ways I would like to tidy this data while reading it in
1. Rename columns: I will be renaming the "ID", "complete_status",and "pp_educat"(redundant information) columns to "delete" in order to remove those columns seamlessly. Other demographic columns will be renamed according to the information they contain.
2. Here is how I will mutate the following columns
- Ethnicity column responses all, but Hispanic, contain "non-Hispanic" so it is redundant and lengthy to include that in each response
- Political Identity column needs to be amended from including "An" and "A" to simply the party name. The "skipped" response also will need to be change to NA to signify missing data rather than classifying "skipped" as a response.
- Education column responses must be mutated to exclude the apostrophe in the Bachelor's and Master's columns as they are producing "NA" where we don't want that
```{r}
abc_poll_tidy <- read_csv("_data/abc_poll_2021.csv",
skip = 1,
col_names = c("delete","language","delete","pp_age","pp_education","delete","pp_gender","pp_ethnicity", "pp_house_size","pp_incomelvl","pp_marital_status","pp_metrocat","pp_region","pp_rent","pp_state","pp_working_status","pp_employment","q_1_a","q_1_b","q_1_c","q_1_d","q_1_e","q_1_f","q_2","q_3","q_4","q_5","pp_partyid","pp_age_range","pp_can_contact","weights"))%>%
select(!contains("delete"))%>%
#removing redundant "Non-Hispanic" identification from ethnicity column
mutate(pp_ethnicity = str_remove(pp_ethnicity, ", Non-Hispanic"))%>%
#mutating political identity column to remove "An" and "A"
mutate(
pp_partyid = str_remove(pp_partyid, "A[n]*"),
pp_partyid = na_if(pp_partyid, "Skipped")
)%>%
mutate(
pp_education = str_remove(pp_education, "\x92")
)
#Sanity Checks
prop.table(table(abc_poll_tidy$pp_partyid))
prop.table(table(abc_poll_tidy$pp_ethnicity))
prop.table(table(abc_poll_tidy$pp_education))
```
## Identify variables that need to be mutated
Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
For most streamline analysis there are several columns that must mutated.
`pp_incomelvl`:
- Change from numeric data, to categorical data giving each income bracket a category
Document your work here.
```{r}
abc_poll_tidy <- abc_poll_tidy%>%
#categorizing income level
mutate(pp_incomelvl = case_when(
pp_incomelvl == "Less than $10,000" ~ "Income Level 1",
pp_incomelvl == "$10,000 to $24,999" ~ "Income Level 2",
pp_incomelvl == "$25,000 to $49,999" ~ "Income Level 3",
pp_incomelvl == "$50,000 to $74,999" ~ "Income Level 4",
pp_incomelvl == "$75,000 to $99,999" ~ "Income Level 5",
pp_incomelvl == "$100,000 to $149,999" ~ "Income Level 6",
pp_incomelvl == "$150,000 or more" ~ "Income Level 7"
))
```
Any additional comments?
Here is the final dataframe
```{r}
print(summarytools::dfSummary(abc_poll,
varnumbers = FALSE,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.70,
valid.col = FALSE),
method = 'render',
table.classes = 'table-condensed')
```