Generated by summarytools 1.0.1 (R version 4.2.1) 2022-12-20
The dataset appears to come from a political survey. It has 527 rows and 31 columns. Each row (observation) holds information about one survey respondent together with that respondent’s answers to the survey questions. Three of the 31 variables are ‘numeric’: id, ppage and weights_pid. The rest are ‘character’ variables. Most of these character variables would be better coded as ‘factor’ variables, because they are essentially categorical and take only a few possible values.
All variable names that start with “Q” are survey questions; there are 11 of them. All variable names that start with “pp” contain demographic information about respondents.
So, there are 14 variables identifying respondents’ demographic characteristics. The variable complete_status can be dropped from the dataset, because every observation has the same value, ‘qualified’.
Code
abc.poll <- abc.poll %>% select(-complete_status)
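Before dropping a column on the grounds that it is constant, it can be worth confirming that claim in code. The following is a minimal sketch on a small made-up data frame (`toy`) rather than the real abc.poll data; the helper name `constant_cols` is hypothetical and not part of any package.

```r
# Toy stand-in for abc.poll; the values are invented for illustration.
toy <- data.frame(
  id = c(1, 2, 3),
  complete_status = c("qualified", "qualified", "qualified"),
  ppage = c(35, 66, 42)
)

# Hypothetical helper: names of columns that hold a single distinct value.
constant_cols <- function(df) {
  is_constant <- vapply(df, function(x) length(unique(x)) == 1, logical(1))
  names(df)[is_constant]
}

# Identify the constant columns, then keep everything else.
dropped <- constant_cols(toy)
toy_clean <- toy[, setdiff(names(toy), dropped), drop = FALSE]
```

Applied to the real data, the same helper would be expected to return complete_status, assuming no other column is constant.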
Looking at the Data Frame Summary table above, we can see that there are no missing values in the dataset; however, two values of the ppeduc5 variable cannot be displayed and are reported as NA.
Code
table(abc.poll$ppeduc5)
Master\x92s degree or above
99
Bachelor\x92s degree
108
High school graduate (high school diploma or the equivalent GED)
133
No high school diploma or GED
29
Some college or Associate degree
158
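As can be seen above, these two string values were not read properly. The \x92 in each is most likely the Windows-1252 byte for a curly apostrophe (’) that was not decoded when the file was read. Their correct values should be “Bachelor’s degree” and “Master’s degree or above”. Let me fix it.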
Code
abc.poll$ppeduc5[startsWith(abc.poll$ppeduc5, "Bac")] <- "Bachelor's degree"
abc.poll$ppeduc5[startsWith(abc.poll$ppeduc5, "Mas")] <- "Master's degree or above"
table(abc.poll$ppeduc5)
Bachelor's degree
108
High school graduate (high school diploma or the equivalent GED)
133
Master's degree or above
99
No high school diploma or GED
29
Some college or Associate degree
158
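An alternative to overwriting the values by prefix is to re-encode the strings. This is a sketch under the assumption that the file was saved in Windows-1252, which the data source does not confirm. It uses two hard-coded example strings rather than the real column, and encoding names accepted by `iconv()` can vary by platform.

```r
# "\x92" is the Windows-1252 byte for a right single quotation mark.
raw_values <- c("Bachelor\x92s degree", "Master\x92s degree or above")

# Re-encode to UTF-8 (assumes the bytes really are Windows-1252).
utf8_values <- iconv(raw_values, from = "CP1252", to = "UTF-8")

# Swap the curly quote for a plain apostrophe to match the labels used here.
fixed_values <- gsub("\u2019", "'", utf8_values, fixed = TRUE)
```

Re-encoding repairs every affected value in one pass, whereas the prefix approach needs one line per damaged label.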
It would also be better to redefine the ppeduc5 variable so that its five values appear in ascending order, from “No high school diploma or GED” to “Master’s degree or above”. To do that, I will change the class of ppeduc5 from character to factor.
Code
abc.poll <- abc.poll %>%
  mutate(ppeduc5 = factor(ppeduc5, levels = c(
    "No high school diploma or GED",
    "High school graduate (high school diploma or the equivalent GED)",
    "Some college or Associate degree",
    "Bachelor's degree",
    "Master's degree or above"
  )))
class(abc.poll$ppeduc5)
[1] "factor"
Code
table(abc.poll$ppeduc5)
No high school diploma or GED
29
High school graduate (high school diploma or the equivalent GED)
133
Some college or Associate degree
158
Bachelor's degree
108
Master's degree or above
99
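To illustrate why the level order matters, here is a small sketch on invented values (not the survey data). It also sets `ordered = TRUE`, which goes one step beyond the conversion above and is shown only as an optional variation.

```r
# Invented sample of education labels, deliberately out of order.
educ <- c("Bachelor's degree", "No high school diploma or GED",
          "Some college or Associate degree", "Bachelor's degree")

educ_levels <- c("No high school diploma or GED",
                 "Some college or Associate degree",
                 "Bachelor's degree")

# Character vectors sort alphabetically; factors sort by level order.
first_alphabetical <- sort(educ)[1]
educ_fct <- factor(educ, levels = educ_levels, ordered = TRUE)
first_by_level <- as.character(sort(educ_fct))[1]

# Ordered factors also support comparisons against a level.
at_least_some_college <- educ_fct >= "Some college or Associate degree"
```

With a plain (unordered) factor, `table()` and plots still follow the level order; the `ordered = TRUE` flag only adds support for comparisons such as `>=`.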
We can make the same class change for other variables whose values have a natural order: ppeducat, ppinc7 and ppemploy.
Code
unique(abc.poll$ppeducat)
[1] "High school" "Bachelors degree or higher"
[3] "Some college" "Less than high school"
Code
unique(abc.poll$ppinc7)
[1] "$25,000 to $49,999" "$150,000 or more" "$100,000 to $149,999"
[4] "$10,000 to $24,999" "$75,000 to $99,999" "$50,000 to $74,999"
[7] "Less than $10,000"
Code
abc.poll <- abc.poll %>%
  mutate(ppeducat = factor(ppeducat, levels = c(
    "Less than high school", "High school",
    "Some college", "Bachelors degree or higher"
  )))
abc.poll <- abc.poll %>%
  mutate(ppinc7 = factor(ppinc7, levels = c(
    "Less than $10,000", "$10,000 to $24,999", "$25,000 to $49,999",
    "$50,000 to $74,999", "$75,000 to $99,999",
    "$100,000 to $149,999", "$150,000 or more"
  )))
abc.poll <- abc.poll %>%
  mutate(ppemploy = factor(ppemploy, levels = c(
    "Not working", "Working part-time", "Working full-time"
  )))
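One pitfall of `factor()` is that any value missing from `levels` silently becomes NA. A hedged sketch of a guard against that follows; the helper `to_factor_checked` is hypothetical and the sample values are invented.

```r
# Hypothetical helper: convert to factor, but fail loudly on unexpected values.
to_factor_checked <- function(x, levels) {
  unexpected <- setdiff(unique(x[!is.na(x)]), levels)
  if (length(unexpected) > 0) {
    stop("Values not covered by levels: ", paste(unexpected, collapse = ", "))
  }
  factor(x, levels = levels)
}

# Invented sample values for illustration only.
educat <- c("High school", "Some college", "Less than high school")
educat_levels <- c("Less than high school", "High school",
                   "Some college", "Bachelors degree or higher")

educat_fct <- to_factor_checked(educat, educat_levels)

# A misspelled value is caught instead of silently becoming NA.
result <- tryCatch(
  to_factor_checked(c("High school", "high school"), educat_levels),
  error = function(e) conditionMessage(e)
)
```

A simpler check after the fact is to compare `sum(is.na(x))` before and after the conversion.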
On the other hand, some values of the pprent and Contact variables are unnecessarily long strings. They can be shortened to keep further analysis tidy.
Code
unique(abc.poll$pprent)
[1] "Owned or being bought by you or someone in your household"
[2] "Occupied without payment of cash rent"
[3] "Rented for cash"
Code
unique(abc.poll$Contact)
[1] "No, I am not willing to be interviewed"
[2] "Yes, I am willing to be interviewed"
Code
abc.poll$pprent[startsWith(abc.poll$pprent, "Owned")] <- "Owned by one of the household"
abc.poll$Contact[startsWith(abc.poll$Contact, "Yes")] <- "Yes"
abc.poll$Contact[startsWith(abc.poll$Contact, "No")] <- "No"
abc.poll <- rename(abc.poll, willingness_to_contact = Contact)
sample_n(abc.poll, 10)
# A tibble: 10 × 30
id xspanish ppage ppeduc5 ppedu…¹ ppgen…² ppethm pphhs…³ ppinc7 ppmar…⁴
<dbl> <chr> <dbl> <fct> <fct> <chr> <chr> <chr> <fct> <chr>
1 7230420 English 66 Some co… Some c… Male White… 2 $25,0… Divorc…
2 7230428 English 35 Some co… Some c… Male 2+ Ra… 2 $50,0… Now Ma…
3 7230179 English 66 Some co… Some c… Male White… 2 $150,… Now Ma…
4 7230520 English 42 Some co… Some c… Female White… 4 $100,… Now Ma…
5 7230507 English 68 High sc… High s… Female Black… 2 Less … Never …
6 7230394 English 59 Bachelo… Bachel… Female White… 2 $75,0… Now Ma…
7 7230337 English 38 Master'… Bachel… Male White… 5 $150,… Now Ma…
8 7230274 English 74 Some co… Some c… Female White… 2 $100,… Now Ma…
9 7230242 English 58 Master'… Bachel… Female White… 3 $150,… Now Ma…
10 7230167 English 48 Some co… Some c… Female White… 4 $25,0… Now Ma…
# … with 20 more variables: ppmsacat <chr>, ppreg4 <chr>, pprent <chr>,
# ppstaten <chr>, PPWORKA <chr>, ppemploy <fct>, Q1_a <chr>, Q1_b <chr>,
# Q1_c <chr>, Q1_d <chr>, Q1_e <chr>, Q1_f <chr>, Q2 <chr>, Q3 <chr>,
# Q4 <chr>, Q5 <chr>, QPID <chr>, ABCAGE <chr>, willingness_to_contact <chr>,
# weights_pid <dbl>, and abbreviated variable names ¹ppeducat, ²ppgender,
# ³pphhsize, ⁴ppmarit5
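The prefix-based recoding above could also be written as a lookup table, which matches whole strings rather than prefixes. This is only a sketch on invented sample values; `contact_lookup` is a name introduced here for illustration.

```r
# Invented sample of the original Contact values.
contact <- c("Yes, I am willing to be interviewed",
             "No, I am not willing to be interviewed",
             "Yes, I am willing to be interviewed")

# Named vector: names are the original strings, values are the short labels.
contact_lookup <- c(
  "Yes, I am willing to be interviewed" = "Yes",
  "No, I am not willing to be interviewed" = "No"
)

# Index the lookup by the original values; unmatched values would become NA.
contact_short <- unname(contact_lookup[contact])
```

Because unmatched values turn into NA, a lookup makes unexpected categories visible, whereas `startsWith(x, "No")` would also relabel any other value that happens to begin with "No".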
Source Code
---
title: "Challenge-4"
author: "Said Arslan"
description: "More data wrangling: pivoting"
date: "10/12/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_4
  - abc_poll
---

```{r}
library(tidyverse)
library(summarytools)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```

## Read Data

```{r}
abc.poll <- read_csv("_data/abc_poll_2021.csv")
head(abc.poll)
sample_n(abc.poll, 10)
```

## Briefly describe the data

```{r}
print(dfSummary(abc.poll, varnumbers = FALSE, plain.ascii = FALSE, style = "grid", graph.magnif = 0.80, valid.col = FALSE),
      method = 'render', table.classes = 'table-condensed')
```

The dataset should be from a political survey. There are 527 rows and 31 columns. Each row (observation) contains information about a survey respondent as well as his/her answers to survey questions.

3 of 31 variables are 'numeric' variables which are `id`, `ppage` and `weights_pid`. All of the rest are 'character' variables.

Actually most of these character variables should be coded as 'factor' variable because they are essentially categorical variables that could have a few possible values.

All variable names that start with "Q" are survey questions.

```{r}
colnames(select(abc.poll, starts_with("Q")))
length(colnames(select(abc.poll, starts_with("Q"))))
```

So, there are 11 survey questions.

All variable names that start with "pp" contain demographic information about respondents.

```{r}
colnames(select(abc.poll, starts_with("pp")))
length(colnames(select(abc.poll, starts_with("pp"))))
```

So, there are 14 variables identifying respondents' demographic characteristics.

variable `complete_status` could be dropped from the dataset as all observations in the dataset have same value, 'qualified'.

```{r}
abc.poll <- abc.poll %>% select(-complete_status)
```

When we look at the Data Frame Summary table above, we can see that there are no missing values in the dataset; however, two values of `ppeduc5` variable cannot be displayed and identified as NA.

```{r}
table(abc.poll$ppeduc5)
```

As it can be seen above, there is an issue with properly reading these two string values. Their correct values should be "Bachelor's degree" and "Master's degree or above". Let me fix it.

```{r}
abc.poll$ppeduc5[startsWith(abc.poll$ppeduc5, "Bac")] <- "Bachelor's degree"
abc.poll$ppeduc5[startsWith(abc.poll$ppeduc5, "Mas")] <- "Master's degree or above"
table(abc.poll$ppeduc5)
```

Also it would be better if we redefine `ppeduc5` variable so that its 5 values show up in ascending order from "no high school diploma" to "master's degree". To do that, I will change class of `ppeduc5` variable from character to factor.

```{r}
abc.poll <- abc.poll %>%
  mutate(ppeduc5 = factor(ppeduc5, levels = c("No high school diploma or GED", "High school graduate (high school diploma or the equivalent GED)", "Some college or Associate degree", "Bachelor's degree", "Master's degree or above")))
class(abc.poll$ppeduc5)
table(abc.poll$ppeduc5)
```

We can do the same class change for many of the variables so that their values could be put in an order properly based on common sense.

These variables are `ppeducat`, `ppinc7` and `ppemploy`.

```{r}
unique(abc.poll$ppeducat)
unique(abc.poll$ppinc7)
unique(abc.poll$ppemploy)
```

```{r}
abc.poll <- abc.poll %>%
  mutate(ppeducat = factor(ppeducat, levels = c("Less than high school", "High school", "Some college", "Bachelors degree or higher")))
abc.poll <- abc.poll %>%
  mutate(ppinc7 = factor(ppinc7, levels = c("Less than $10,000", "$10,000 to $24,999", "$25,000 to $49,999", "$50,000 to $74,999", "$75,000 to $99,999", "$100,000 to $149,999", "$150,000 or more")))
abc.poll <- abc.poll %>%
  mutate(ppemploy = factor(ppemploy, levels = c("Not working", "Working part-time", "Working full-time")))
```

On the other hand, some values of `pprent` and `Contact` variables are unnecessarily very long strings. They could be shortened for neatness of further analysis on the data.

```{r}
unique(abc.poll$pprent)
unique(abc.poll$Contact)
```

```{r}
abc.poll$pprent[startsWith(abc.poll$pprent, "Owned")] <- "Owned by one of the household"
abc.poll$Contact[startsWith(abc.poll$Contact, "Yes")] <- "Yes"
abc.poll$Contact[startsWith(abc.poll$Contact, "No")] <- "No"
abc.poll <- rename(abc.poll, willingness_to_contact = Contact)
sample_n(abc.poll, 10)
```

```{r}
print(dfSummary(abc.poll, varnumbers = FALSE, plain.ascii = FALSE, style = "grid", graph.magnif = 0.80, valid.col = FALSE),
      method = 'render', table.classes = 'table-condensed')
```