---
title: "Challenge 4"
author: "Theresa Szczepanski"
description: "More data wrangling: pivoting"
date: "9/30/2022"
format:
  html:
    df-print: paged
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - Theresa_Szczepanski
  - challenge_4
  - abc_poll
 # - fed_rates
#  - hotel_bookings
#  - debt
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Approach
To address today's challenge I tried to:
1)  Read in a data set and describe it using both words and any 
supporting information (e.g., tables)
2)  Use a summary of my data frame and an examination of the spreadsheet itself to 
describe the data
3)  Tidy the data (as needed, including sanity checks)
4)  Identify variables to be mutated
5)  Mutate all ordinal variables using `factor`
6)  Create a _Codebook_ key for all of my variables
7)  Make note of my questions about my process and inefficient coding practices
##  abc_poll.csv ⭐
::: panel-tabset
### Read in data
To read in the data, I used the following process:
- Examine the summary
- Identify information to filter out on the read in
- Identify variables to rename on the read in
- Identify variables to mutate on the read in to simplify values
- Identify variable values to mutate on the read in to fix data type issues.
::: panel-tabset
### Examine the Summary
```{r}
# Read in the abc_poll data and use the summary to decide how best to set up
# our data frame
 abc_poll<-read_csv("_data/abc_poll_2021.csv")
 print(summarytools::dfSummary(abc_poll,
                         varnumbers = FALSE,
                         plain.ascii  = FALSE,
                         style        = "grid",
                         graph.magnif = 0.70,
                        valid.col    = FALSE),
       method = 'render',
       table.classes = 'table-condensed')
```
### Filter, Rename, and Mutate on Read in
After examining the summary, I chose to do the following.
**Filter** (drop these columns on read in):
- `id`: this identifier won't be used in an analysis
- `complete_status`: every respondent was qualified, so the column carries no information
- `ppeducat`: this coarser grouping of `ppeduc5` can be rebuilt in the data frame
using a `case_when()`
- `ABCAGE`: this qualitative age-range variable can be replicated from the
`ppage` variable with a `case_when()`, and one might want to examine 
different age ranges anyway
- `weights_pid`: this is calculated using percentages of
 respondents relative to their representation in the general population and can
 be calculated using data within our data frame.
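The `ABCAGE` bullet above says the age bands can be rebuilt from the numeric age. A minimal sketch of that idea on toy data follows; the cut points and the `pp_age_group` name are hypothetical choices for illustration, not ABC's actual bands.

```{r}
#| eval: false
library(dplyr)

# Toy ages; the cut points below are hypothetical, not ABC's own bands.
ages <- tibble(pp_age = c(19, 34, 52, 71))

# case_when() checks the conditions in order, so each age lands in the
# first band whose upper bound it is below.
ages <- ages %>%
  mutate(pp_age_group = case_when(
    pp_age < 30 ~ "18-29",
    pp_age < 50 ~ "30-49",
    pp_age < 65 ~ "50-64",
    TRUE        ~ "65+"
  ))
ages
```

Because the bands live in code rather than in the raw file, trying a different set of age ranges only means editing the conditions.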
 
 
 
__Rename__
- I renamed all of the variables corresponding to 
_demographic characteristics of the poll participant_ 
to begin with `pp_`.
- I renamed all of the variables corresponding to _survey question responses_
from the participants to begin with `Q`
- If a variable had a fixed number of possible responses (which I could see from
the summary), e.g., `pp_marital` had 5 possible responses, 
I included the number of "categories" or possible responses
in the variable name preceded by an underscore, `pp_marital_5`
__Mutate__
 
 - I replaced the `pp_hhsize_6` value of "6 or more" with 6, so that it could
 be of double data type
 
 - I mutated the `pp_educ_5` column to remove the
 apostrophes from "Bachelor's" and "Master's", which were showing up as "\\x92" 
 in the values on read in.
 
 - If a _nominal_ variable had lengthy values, I reduced them to the key info 
 using `mutate`, `str_sub`, and `case_when`
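A possible alternative to trimming the apostrophes: "\x92" is the byte that Windows-1252 uses for a curly apostrophe, so the file may simply be in that encoding. Assuming that is the case (I have not verified it against the real file), declaring the encoding through readr's `locale()` at read in might keep the apostrophes intact. The sketch below builds a tiny stand-in file rather than touching the poll data.

```{r}
#| eval: false
library(readr)

# Build a one-row stand-in file containing the raw byte 0x92
# (hypothetical data, not the real poll file).
tmp <- tempfile(fileext = ".csv")
writeBin(
  c(charToRaw("ppeduc5\nBachelor"), as.raw(0x92), charToRaw("s degree\n")),
  tmp
)

# Assumption: the source file is Windows-1252 encoded.
demo <- read_csv(tmp,
                 locale = locale(encoding = "windows-1252"),
                 show_col_types = FALSE)
demo$ppeduc5
```

If the encoding guess is wrong for the real file, the `str_starts()` workaround used in this post still applies.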
```{r}
#Filter, rename variables, and mutate values of variables on read-in
abc_poll<-read_csv("_data/abc_poll_2021.csv", skip = 1,  
                   col_names= c("delete",  "pp_Language_2",  "delete","pp_age", 
                                "pp_educ_5", "delete", "pp_gender_2", 
                                "pp_ethnicity_5", "pp_hhsize_6", "pp_inc_7", 
                                "pp_marital_5", "pp_metro_cat_2", "pp_region_4",
                                "pp_housing_3", "pp_state", 
                                "pp_working_arrangement_9", 
                                "pp_employment_status_3", "Q1a_3", "Q1b_3", 
                                "Q1c_3",  "Q1d_3","Q1e_3", "Q1f_3","Q2ConcernLevel_4",
                                "Q3_3", "Q4_5",  "Q5Optimism_3", 
                                "pp_political_id_5", "delete", "pp_contact_2",  
                                  "delete"))%>%
  select(!contains("delete"))%>%
  
  # Replace "6 or more" in pp_hhsize_6 with "6" so that the column can be
  # converted to double data type.
  mutate(pp_hhsize_6 = ifelse(pp_hhsize_6 == "6 or more", "6", pp_hhsize_6),
         pp_hhsize_6 = as.numeric(pp_hhsize_6)) %>%
  
  # Work around the apostrophe issue in the pp_educ_5 values on read in.
  mutate(pp_educ_5 = ifelse(str_starts(pp_educ_5, "Bachelor"), 
                            "Bachelor", pp_educ_5)) %>%
  mutate(pp_educ_5 = ifelse(str_starts(pp_educ_5, "Master"), "Master", pp_educ_5)) %>%
  
  # Reduce lengthy responses of nominal variables to their leading characters.
  mutate(pp_Language_2 = str_sub(pp_Language_2, 1, 2),
         pp_gender_2   = str_sub(pp_gender_2, 1, 1),
         pp_contact_2  = str_sub(pp_contact_2, 1, 1))
  
  #reduce lengthy responses of nominal variables using Case When
  #pp_ethnicity_5
  abc_poll <-mutate(abc_poll, pp_ethnicity_5 = case_when(
    pp_ethnicity_5 == "2+ Races, Non-Hispanic" ~ "2+NH",
    pp_ethnicity_5 == "Black, Non-Hispanic" ~ "BlNH",
    pp_ethnicity_5 == "Hispanic" ~ "H",
    pp_ethnicity_5 == "Other, Non-Hispanic" ~ "OtNH",
    pp_ethnicity_5 == "White, Non-Hispanic" ~ "WhNH"
))
  
  #pp_metro_cat_2
  abc_poll <-mutate(abc_poll, pp_metro_cat_2 = case_when(
    pp_metro_cat_2 == "Metro area" ~ "M",
    pp_metro_cat_2 == "Non-metro area" ~ "NM"
))
 
  #pp_political_id_5 
 abc_poll <-mutate(abc_poll, pp_political_id_5  = case_when(
    pp_political_id_5 == "A Democrat" ~ "Dem",
    pp_political_id_5 == "A Republican" ~ "Rep",
    pp_political_id_5 == "An Independent" ~ "Ind",
    pp_political_id_5 == "Something else" ~ "Other",
    pp_political_id_5 == "Skipped" ~ "DNR"
))
 
 #pp_housing_3
abc_poll <-mutate(abc_poll, pp_housing_3 = case_when(
    pp_housing_3 == "Occupied without payment of cash rent" ~ "NonP_Occupied",
    pp_housing_3 == "Rented for cash"~ "P_Rent",
    pp_housing_3 == "Owned or being bought by you or someone in your household" ~ "P_Own"))
  #pp_region_4 
 abc_poll <-mutate(abc_poll, pp_region_4  = case_when(
    pp_region_4 == "MidWest" ~ "MW",
    pp_region_4 == "NorthEast" ~ "NE",
    pp_region_4 == "South" ~ "S",
    pp_region_4 == "West" ~ "W"
))
  #pp_marital_5 
 abc_poll <-mutate(abc_poll, pp_marital_5 = case_when(
   pp_marital_5 == "Never married" ~ "NM",
   pp_marital_5 == "Now Married" ~ "M",
   pp_marital_5 == "Separated" ~ "S",
   pp_marital_5 == "Divorced" ~ "D",
   pp_marital_5 == "Widowed" ~ "W"))
 
# pp_working_arrangement_9
 abc_poll <-mutate(abc_poll, pp_working_arrangement_9 = case_when(
          pp_working_arrangement_9 == "Other" ~ "Other",
          pp_working_arrangement_9 =="Retired" ~ "Retired",
          pp_working_arrangement_9 == "Homemaker" ~ "Homemaker",
          pp_working_arrangement_9 == "Student" ~ "Student",
          pp_working_arrangement_9 == "Currently laid off" ~ "Laid Off",
          pp_working_arrangement_9 == "On furlough"~ "Furlough",
          pp_working_arrangement_9 == "Employed part-time (by someone else)" ~ "Emp_PT",
          pp_working_arrangement_9 =="Self-employed" ~ "Emp_Self",
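          pp_working_arrangement_9 == "Employed full-time (by someone else)"~ "Emp_FT",
          # Keep any label not listed above unchanged instead of turning it into NA.
          TRUE ~ pp_working_arrangement_9))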
 
  #Q3_3 What is the best "coding for variables that are like "Booleans"?
 # abc_poll <-mutate(abc_poll, Q3_3 = case_when(
 #                                   Q3_3 ==  "Yes" ~ 1,
 #                                   Q3_3 ==  "No" ~ 0,
 #                                     Q3_3 ==  "Skipped" ~ 1))
 
 #Q5Optimism_3
 # abc_poll <-mutate(abc_poll, Q5Optimism_3 = case_when(
 #                                    Q5Optimism_3 == "Pessimistic"~ 0,
 #                                    Q5Optimism_3 == "Optimistic" ~ 1,
 #                                    Q5Optimism_3 == "Skipped" ~ -1))
 
 
  
  abc_poll
  
```
## Question
- Is there a way to avoid writing a separate `mutate` line for each variable, as I did 
on the read in?
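One possible answer, sketched on toy data: keep each recoding in a named lookup vector and apply all of them inside a single `mutate()`. The `region_key` and `metro_key` names are hypothetical, and the rows are made up for illustration.

```{r}
#| eval: false
library(dplyr)

# Toy rows with a few of the original labels (not the full poll data).
toy <- tibble(
  pp_region_4    = c("MidWest", "South", "West"),
  pp_metro_cat_2 = c("Metro area", "Non-metro area", "Metro area")
)

# Named vectors act as lookup tables: names are old labels, values are new codes.
region_key <- c(MidWest = "MW", NorthEast = "NE", South = "S", West = "W")
metro_key  <- c("Metro area" = "M", "Non-metro area" = "NM")

# Indexing a named vector by the old labels returns the new codes.
toy <- toy %>%
  mutate(
    pp_region_4    = unname(region_key[pp_region_4]),
    pp_metro_cat_2 = unname(metro_key[pp_metro_cat_2])
  )
toy
```

A label that is missing from a lookup vector comes back as `NA`, so a quick count of missing values afterwards is a useful check.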
:::
### Briefly describe the data
:::panel-tabset
### Broad Summary
From our `abc_poll` data frame summary, we can see that this data set
contains polling results from 527 respondents to an ABC News political poll. 
The results fall into two broad categories:
- *Demographic characteristics* of 
the respondents themselves (e.g., language of the poll given to the respondent
(Spanish or English), age, educational attainment, ethnicity, household size,
gender, income range, marital status, metro category, 
geographic region, rental status, state, employment status, 
working characteristics, willingness to have a follow-up interview)
- *The responses that the individuals gave* to 10
questions (there are 5 broad questions, Q1-Q5, but Q1 consists of 6 
sub-questions, a-f).
 Now when we examine our summary, we can see that 
 
- each categorical variable is of character data type, with the number of 
distinct categories included in the variable name
- some of these categorical variables are _ordinal_ and will need the ordering 
of their values coded in as `factor`s
- each variable has 527 observations partitioned among the possible values
- each _discrete_ numerical variable is of `double` data type.
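The claim that every variable has all of its observations partitioned among the possible values can be checked directly. Here is a minimal sanity-check sketch on a toy data frame; `toy` and its values are hypothetical stand-ins for `abc_poll`.

```{r}
#| eval: false
library(dplyr)

# Toy stand-in for the poll data frame (hypothetical values).
toy <- tibble(
  pp_gender_2 = c("F", "M", "F", NA),
  pp_region_4 = c("MW", "S", "W", "NE")
)

# Number of missing values in each column.
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts

# Category counts for one variable; n should sum to the number of rows.
gender_counts <- count(toy, pp_gender_2)
sum(gender_counts$n) == nrow(toy)
```

A nonzero entry in `na_counts` after recoding usually means that one of the original labels did not match any recoding condition.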
### Post Read in Variable Summary
```{r}
print(summarytools::dfSummary(abc_poll,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE,
                        style        = "grid",
                        graph.magnif = 0.70,
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')
```
:::
### Tidy Data (as needed)
In order to have tidy data, each row should be a unique observation. A unique 
case should therefore consist of all of the demographic information about the 
polled person and their response to one of the questions.
- The demographic characteristics (our `pp_` variables) and the specific `Question`
variable define a _case_.
- The _value_ for each case is the poll participant's `Response`.
```{r}
abc_poll<-abc_poll %>%
    pivot_longer(c(starts_with("Q1")), names_to = "Question 1 part", values_to = "Q1 Response")
abc_poll
```
## Questions
- When examining the data, I noticed nominal, ordinal, discrete, and continuous
 variables. 
 
 - Does `arrange` sort `ordinal` data by the factor levels, or does it use something else?
- Should you ever pivot variables of different levels of measurement into the 
same column? In the `abc_poll` case, I could imagine pivoting all of the question 
names into one column and all of the responses into a response column. But some of the 
questions had response values that were clearly ordinal and others that were 
not. Is there a "best practice" for this?
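A small sketch that touches both questions, using toy data with hypothetical values. After pivoting every question, the shared response column is plain character; the ordering can then be applied to one ordinal question by filtering first and factoring second. `arrange()` sorts a factor by the order of its levels, not alphabetically.

```{r}
#| eval: false
library(dplyr)
library(tidyr)

# Toy data: one nominal question (Q3_3) and one ordinal question (Q4_5).
toy <- tibble(
  id   = 1:2,
  Q3_3 = c("Yes", "No"),
  Q4_5 = c("Good", "Poor")
)

# All responses share one character column after the pivot.
long <- toy %>%
  pivot_longer(starts_with("Q"), names_to = "Question", values_to = "Response")

# Order only the ordinal question: filter first, then factor.
# "Skipped" is left out of the levels here, so it would become NA.
q4 <- long %>%
  filter(Question == "Q4_5") %>%
  mutate(Response = factor(Response,
                           levels = c("Poor", "Not so good", "Good", "Excellent"),
                           ordered = TRUE)) %>%
  arrange(Response)
q4
```

In `q4`, "Poor" sorts before "Good" because that is the level order, whereas an alphabetical sort of the character column would put "Good" first.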
### Identify Desired Variable Mutations
:::panel-tabset
When examining our variables, there are two issues to address:
- Some of the variables are _ordinal_ and will need the ordering coded in
- Some of the ordinal variable values are very long and take up too much space 
in our table, making it unpleasant to read through
::: panel-tabset
### Factoring of Ordinal Variables
There are several _ordinal_ variables where we need to code in the ordering of
 the categories.
```{r}
# use factoring to put an ordering to all ordinal variables
                         
             
# Why couldn't I use a levels vector!                 
 # pp_inc_7_levels <- c(
 #    "Less than $10,000", "$10,000 to $24,999",  "$25,000 to $49,999",  
 #    "$50,000 to $74,999", "$75,000 to $99,999", "$100,000 to $149,999",  
 #    "150,000 or more")
  
 abc_poll <-mutate(abc_poll, pp_inc_7 = recode_factor(pp_inc_7, 
                                   "Less than $10,000" = "I1", 
                                   "$10,000 to $24,999" = "I2",  
                                   "$25,000 to $49,999" = "I3", 
                                   "$50,000 to $74,999"= "I4", 
                                   "$75,000 to $99,999"= "I5", 
                                   "$100,000 to $149,999" = "I6",
                                   "$150,000 or more" = "I7",
                                  .ordered = TRUE))
 # pp_educ_5
 abc_poll <- mutate(abc_poll, pp_educ_5 = recode_factor(pp_educ_5,
   "No high school diploma or GED" = "E1",
   "High school graduate (high school diploma or the equivalent GED)" = "E2",
   "Some college or Associate degree" = "E3",
   "Bachelor" = "E4",
   "Master" = "E5",
   .ordered = TRUE))
 
 # pp_employment_status_3
 abc_poll <- mutate(abc_poll, pp_employment_status_3 = recode_factor(pp_employment_status_3,
   "Not working" = "ES1",
   "Working part-time" = "ES2",
   "Working full-time" = "ES3",
   .ordered = TRUE))
 
 
 
 
 
 ######## I know that Q2 concern level Q4 have "ordinal" responses,
 ###Can I order a subset of the "Response" column?
 ### Should I order these variables before the pivot? (can a column be of mixed type?)
 ###Would it be better to have Nominal Questions Separated from Ordinal Questions?
 #Q2ConcernLevel_4
 ##I used the code below when I only pivoted the parts of Q1 under a Q1 column and
 # left the other questions as individual columns
 
 # abc_poll <-mutate(abc_poll, Q2ConcernLevel_4 = 
 #                     recode_factor(Q2ConcernLevel_4 ,
 #                                     "Not concerned at all" = "C0",
 #                                     "Not so concerned" = "C1",
 #                                     "Somewhat concerned" = "C2",
 #                                      "Very Concerned" = "C3",
 #                                     .ordered = TRUE))
 # 
#Q4_5
# abc_poll <-mutate(abc_poll, Q4_5 =
#                     recode_factor(Q4_5 ,
#                                     "Poor" = 0,
#                                     "Not so good" = 1,
#                                    "Good" = 2,
#                                   "Excellent" = 3,
#                                     "Skipped" = -1,
#                                     .ordered = TRUE))
#  
 abc_poll
 
 ##Is the data frame arranged "alphabetically" or "ordinally?"
 abc_poll%>%
  arrange(desc(pp_educ_5))
```
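On the commented-out levels vector: base `factor()` does accept one through its `levels` argument. One possible reason the attempt failed is that the last entry, "150,000 or more", lacks the leading "$", so rows holding "$150,000 or more" would match no level and become `NA`. A minimal sketch on toy rows (hypothetical data) with that entry corrected:

```{r}
#| eval: false
library(dplyr)

pp_inc_7_levels <- c(
  "Less than $10,000", "$10,000 to $24,999", "$25,000 to $49,999",
  "$50,000 to $74,999", "$75,000 to $99,999", "$100,000 to $149,999",
  "$150,000 or more"
)

# Toy incomes (hypothetical rows, not the poll data).
toy <- tibble(pp_inc_7 = c("$150,000 or more", "Less than $10,000",
                           "$50,000 to $74,999"))

# factor() keeps the original labels and stores the order in the levels;
# arrange() then sorts by that level order.
toy <- toy %>%
  mutate(pp_inc_7 = factor(pp_inc_7, levels = pp_inc_7_levels, ordered = TRUE)) %>%
  arrange(desc(pp_inc_7))
toy
```

Unlike `recode_factor()`, this keeps the long labels as they are, so shorter codes such as `I1` to `I7` would still need a recoding step.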
We can see that all of our _ordinal_ variables are now of type `ord`, our 
_nominal_ variables are of type `chr`, and our _discrete_ variables are of type
`dbl` (double).
## Questions
- What other tips are there for making smart names for variables based on their 
level of measurement?
- What mutations should be done on the read in and what should be saved for post
 read in?
### Codebook for ABC Variables
Here I would complete a key for all of the variables that are included in my table.
Is there a better template for this?
`pp_educ_5`, _ordinal_: The reported educational attainment of the respondent.
| Value | Key |
| ----------- |--------|
| No high school diploma or GED | E1 |
| High school graduate (high school diploma or the equivalent GED) | E2 |
| Some college or Associate degree | E3 |
| Bachelor | E4 |
| Master or higher | E5 |
`pp_inc_7`, _ordinal_: the reported annual income level of the respondent.
| Value | Key |
| ----------- |--------|
| Less than $10,000 | I1 |
| $10,000 to $24,999 | I2 |
| $25,000 to $49,999 | I3 |
| $50,000 to $74,999 | I4 |
| $75,000 to $99,999 | I5 |
| $100,000 to $149,999 | I6 |
| $150,000 or more | I7 |
## Questions
- Is this an ok template for a _codebook_?
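One way to avoid typing codebook tables by hand, offered as a sketch only: build the codebook as a data frame from the same label-to-code lookup used for the recoding. The `educ_key` and `codebook` names and the column layout are hypothetical choices.

```{r}
#| eval: false
library(dplyr)

# Lookup for one variable, copied from the pp_educ_5 recoding in this post.
educ_key <- c(
  "No high school diploma or GED" = "E1",
  "High school graduate (high school diploma or the equivalent GED)" = "E2",
  "Some college or Associate degree" = "E3",
  "Bachelor" = "E4",
  "Master" = "E5"
)

# One row per code; the column names here are hypothetical choices.
codebook <- tibble(
  variable = "pp_educ_5",
  level    = "ordinal",
  key      = unname(educ_key),
  value    = names(educ_key)
)
codebook
```

Passing `codebook` to `knitr::kable()` would render it as a table in the document, and rows for further variables could be appended with `bind_rows()`.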
:::
:::
:::