Code
library(tidyverse)
library(stringr)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Prasann Desai
June 30, 2023
Today’s challenge is to:
Read in one (or more) of the following datasets, using the correct R package and command.
# A tibble: 527 × 31
id xspanish comple…¹ ppage ppeduc5 ppedu…² ppgen…³ ppethm pphhs…⁴ ppinc7
<dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 7230001 English qualifi… 68 "High … High s… Female White… 2 $25,0…
2 7230002 English qualifi… 85 "Bache… Bachel… Male White… 2 $150,…
3 7230003 English qualifi… 69 "High … High s… Male White… 2 $100,…
4 7230004 English qualifi… 74 "Bache… Bachel… Female White… 1 $25,0…
5 7230005 English qualifi… 77 "High … High s… Male White… 3 $10,0…
6 7230006 English qualifi… 70 "Bache… Bachel… Male White… 2 $75,0…
7 7230007 English qualifi… 26 "Maste… Bachel… Male Other… 3 $150,…
8 7230008 English qualifi… 76 "Bache… Bachel… Male Black… 2 $50,0…
9 7230009 English qualifi… 78 "Bache… Bachel… Female White… 2 $150,…
10 7230010 English qualifi… 47 "Maste… Bachel… Male Other… 4 $150,…
# … with 517 more rows, 21 more variables: ppmarit5 <chr>, ppmsacat <chr>,
# ppreg4 <chr>, pprent <chr>, ppstaten <chr>, PPWORKA <chr>, ppemploy <chr>,
# Q1_a <chr>, Q1_b <chr>, Q1_c <chr>, Q1_d <chr>, Q1_e <chr>, Q1_f <chr>,
# Q2 <chr>, Q3 <chr>, Q4 <chr>, Q5 <chr>, QPID <chr>, ABCAGE <chr>,
# Contact <chr>, weights_pid <dbl>, and abbreviated variable names
# ¹complete_status, ²ppeducat, ³ppgender, ⁴pphhsize
[1] "id" "xspanish" "complete_status" "ppage"
[5] "ppeduc5" "ppeducat" "ppgender" "ppethm"
[9] "pphhsize" "ppinc7" "ppmarit5" "ppmsacat"
[13] "ppreg4" "pprent" "ppstaten" "PPWORKA"
[17] "ppemploy" "Q1_a" "Q1_b" "Q1_c"
[21] "Q1_d" "Q1_e" "Q1_f" "Q2"
[25] "Q3" "Q4" "Q5" "QPID"
[29] "ABCAGE" "Contact" "weights_pid"
Response:
From the above output, we can see that the dataset contains certain demographic information about the 527 participants and their respective responses to all the questions of a poll conducted in 2021. Based on the column names, we can say that there were 10 different opinion/views based questions asked to each participant.
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
Response: It appears to me that the data is already tidy. (Each variable has its own column, Each observation has its own row and Each value is in its own cell)
Bachelor\x92s degree
108
High school graduate (high school diploma or the equivalent GED)
133
Master\x92s degree or above
99
No high school diploma or GED
29
Some college or Associate degree
158
2+ Races, Non-Hispanic Black, Non-Hispanic Hispanic
21 27 51
Other, Non-Hispanic White, Non-Hispanic
24 404
Any additional comments?
Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Document your work here.
# Removing the redundant term "Non-Hispanic" from 'ppethm' column. Add a separate "Hispanic/Non-Hispanic" flag for backward compatibility.
abc_poll_2021 <- mutate(abc_poll_2021, is_hispanic = !str_detect(ppethm, "Non-Hispanic"), pprace = str_split(ppethm, ",", simplify = TRUE)[,1])
abc_poll_2021 %>% select(ppethm, is_hispanic, pprace)
# A tibble: 527 × 3
ppethm is_hispanic pprace
<chr> <lgl> <chr>
1 White, Non-Hispanic FALSE White
2 White, Non-Hispanic FALSE White
3 White, Non-Hispanic FALSE White
4 White, Non-Hispanic FALSE White
5 White, Non-Hispanic FALSE White
6 White, Non-Hispanic FALSE White
7 Other, Non-Hispanic FALSE Other
8 Black, Non-Hispanic FALSE Black
9 White, Non-Hispanic FALSE White
10 Other, Non-Hispanic FALSE Other
# … with 517 more rows
2+ Races Black Hispanic Other White
21 27 51 24 404
# A tibble: 527 × 2
ppeduc5_clean ppeduc5
<chr> <chr>
1 High school graduate (high school diploma or the equivalent GED) "High schoo…
2 Bachelor's degree "Bachelor\x…
3 High school graduate (high school diploma or the equivalent GED) "High schoo…
4 Bachelor's degree "Bachelor\x…
5 High school graduate (high school diploma or the equivalent GED) "High schoo…
6 Bachelor's degree "Bachelor\x…
7 Master's degree or above "Master\x92…
8 Bachelor's degree "Bachelor\x…
9 Bachelor's degree "Bachelor\x…
10 Master's degree or above "Master\x92…
# … with 517 more rows
Bachelor's degree
108
High school graduate (high school diploma or the equivalent GED)
133
Master's degree or above
99
No high school diploma or GED
29
Some college or Associate degree
158
# Cleaning the 'contact' column to 'interview_consent'. Also, replacing the values to a simple "Yes" and "No" for better readability.
abc_poll_2021 <- rename(abc_poll_2021, interview_consent = Contact)
abc_poll_2021 <- mutate(abc_poll_2021, interview_consent_clean = case_when(str_detect(interview_consent, "Yes") ~ "Yes",
str_detect(interview_consent, "No") ~ "No"
))
abc_poll_2021 %>% select(interview_consent, interview_consent_clean)
# A tibble: 527 × 2
interview_consent interview_consent_clean
<chr> <chr>
1 No, I am not willing to be interviewed No
2 No, I am not willing to be interviewed No
3 No, I am not willing to be interviewed No
4 Yes, I am willing to be interviewed Yes
5 No, I am not willing to be interviewed No
6 No, I am not willing to be interviewed No
7 Yes, I am willing to be interviewed Yes
8 Yes, I am willing to be interviewed Yes
9 Yes, I am willing to be interviewed Yes
10 Yes, I am willing to be interviewed Yes
# … with 517 more rows
No Yes
355 172
Overall, I think the new columns added (is_hispanic, pprace, ppeduc5_clean, interview_consent_clean) will improve readability and understanding of the dataset as well as lead to cleaner visualizations that will be built on top of this dataset.
Any additional comments?
---
title: "Challenge 4"
author: "Prasann Desai"
description: "More data wrangling: pivoting"
date: "6/30/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_4
- abc_poll
- Prasann Desai
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(stringr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1) read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2) tidy data (as needed, including sanity checks)
3) identify variables that need to be mutated
4) mutate variables and sanity check all mutations
## Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- abc_poll.csv ⭐
- poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
- FedFundsRate.csv⭐⭐⭐
- hotel_bookings.csv⭐⭐⭐⭐
- debt_in_trillions.xlsx ⭐⭐⭐⭐⭐
```{r}
# Function call to read a csv file
abc_poll_2021 <- read_csv("_data/abc_poll_2021.csv")
```
```{r}
abc_poll_2021
```
```{r}
colnames(abc_poll_2021)
```
### Briefly describe the data
Response:
From the above output, we can see that the dataset contains certain demographic information about the 527 participants and their respective responses to all the questions of a poll conducted in 2021. Based on the column names, we can say that there were 10 different opinion/views based questions asked to each participant.
## Tidy Data (as needed)
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
Response: It appears to me that the data is already tidy. (Each variable has its own column, Each observation has its own row and Each value is in its own cell)
```{r}
# Sanity check for duplicate responses
n_distinct(abc_poll_2021$id)
```
```{r}
# Check to see if the category wise totals add up to 527
table(abc_poll_2021$ppeduc5)
```
```{r}
# Another Check to see if the category wise totals add up to 527
table(abc_poll_2021$ppethm)
```
Any additional comments?
## Identify variables that need to be mutated
Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Document your work here.
```{r}
# Removing the redundant term "Non-Hispanic" from 'ppethm' column. Add a separate "Hispanic/Non-Hispanic" flag for backward compatibility.
abc_poll_2021 <- mutate(abc_poll_2021, is_hispanic = !str_detect(ppethm, "Non-Hispanic"), pprace = str_split(ppethm, ",", simplify = TRUE)[,1])
abc_poll_2021 %>% select(ppethm, is_hispanic, pprace)
table(abc_poll_2021$pprace)
```
```{r}
# Cleaning the 'ppeduc5' column values
abc_poll_2021 <- mutate(abc_poll_2021, ppeduc5_clean = str_replace(ppeduc5, "\x92", "'"))
abc_poll_2021 %>% select(ppeduc5_clean, ppeduc5)
table(abc_poll_2021$ppeduc5_clean)
```
```{r}
# Cleaning the 'contact' column to 'interview_consent'. Also, replacing the values to a simple "Yes" and "No" for better readability.
abc_poll_2021 <- rename(abc_poll_2021, interview_consent = Contact)
abc_poll_2021 <- mutate(abc_poll_2021, interview_consent_clean = case_when(str_detect(interview_consent, "Yes") ~ "Yes",
str_detect(interview_consent, "No") ~ "No"
))
abc_poll_2021 %>% select(interview_consent, interview_consent_clean)
table(abc_poll_2021$interview_consent_clean)
```
Overall, I think the new columns added (is_hispanic, pprace, ppeduc5_clean, interview_consent_clean) will improve readability and understanding of the dataset as well as lead to cleaner visualizations that will be built on top of this dataset.
Any additional comments?