challenge_4
abc_poll
Prasann Desai
More data wrangling: pivoting
Author

Prasann Desai

Published

June 30, 2023

Code
library(tidyverse)
library(stringr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • abc_poll.csv ⭐
  • poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
  • FedFundsRate.csv⭐⭐⭐
  • hotel_bookings.csv⭐⭐⭐⭐
  • debt_in_trillions.xlsx ⭐⭐⭐⭐⭐
Code
# Function call to read a csv file
abc_poll_2021 <- read_csv("_data/abc_poll_2021.csv")
Code
abc_poll_2021
# A tibble: 527 × 31
        id xspanish comple…¹ ppage ppeduc5 ppedu…² ppgen…³ ppethm pphhs…⁴ ppinc7
     <dbl> <chr>    <chr>    <dbl> <chr>   <chr>   <chr>   <chr>  <chr>   <chr> 
 1 7230001 English  qualifi…    68 "High … High s… Female  White… 2       $25,0…
 2 7230002 English  qualifi…    85 "Bache… Bachel… Male    White… 2       $150,…
 3 7230003 English  qualifi…    69 "High … High s… Male    White… 2       $100,…
 4 7230004 English  qualifi…    74 "Bache… Bachel… Female  White… 1       $25,0…
 5 7230005 English  qualifi…    77 "High … High s… Male    White… 3       $10,0…
 6 7230006 English  qualifi…    70 "Bache… Bachel… Male    White… 2       $75,0…
 7 7230007 English  qualifi…    26 "Maste… Bachel… Male    Other… 3       $150,…
 8 7230008 English  qualifi…    76 "Bache… Bachel… Male    Black… 2       $50,0…
 9 7230009 English  qualifi…    78 "Bache… Bachel… Female  White… 2       $150,…
10 7230010 English  qualifi…    47 "Maste… Bachel… Male    Other… 4       $150,…
# … with 517 more rows, 21 more variables: ppmarit5 <chr>, ppmsacat <chr>,
#   ppreg4 <chr>, pprent <chr>, ppstaten <chr>, PPWORKA <chr>, ppemploy <chr>,
#   Q1_a <chr>, Q1_b <chr>, Q1_c <chr>, Q1_d <chr>, Q1_e <chr>, Q1_f <chr>,
#   Q2 <chr>, Q3 <chr>, Q4 <chr>, Q5 <chr>, QPID <chr>, ABCAGE <chr>,
#   Contact <chr>, weights_pid <dbl>, and abbreviated variable names
#   ¹​complete_status, ²​ppeducat, ³​ppgender, ⁴​pphhsize
Code
colnames(abc_poll_2021)
 [1] "id"              "xspanish"        "complete_status" "ppage"          
 [5] "ppeduc5"         "ppeducat"        "ppgender"        "ppethm"         
 [9] "pphhsize"        "ppinc7"          "ppmarit5"        "ppmsacat"       
[13] "ppreg4"          "pprent"          "ppstaten"        "PPWORKA"        
[17] "ppemploy"        "Q1_a"            "Q1_b"            "Q1_c"           
[21] "Q1_d"            "Q1_e"            "Q1_f"            "Q2"             
[25] "Q3"              "Q4"              "Q5"              "QPID"           
[29] "ABCAGE"          "Contact"         "weights_pid"    

Briefly describe the data

Response:

From the above output, we can see that the dataset contains certain demographic information about the 527 participants and their respective responses to all the questions of a poll conducted in 2021. Based on the column names, we can say that there were 10 different opinion/views based questions asked to each participant.

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

Response: It appears to me that the data is already tidy. (Each variable has its own column, Each observation has its own row and Each value is in its own cell)

Code
# Sanity check for duplicate responses
n_distinct(abc_poll_2021$id)
[1] 527
Code
# Check to see if the category wise totals add up to 527
table(abc_poll_2021$ppeduc5)

                                            Bachelor\x92s degree 
                                                             108 
High school graduate (high school diploma or the equivalent GED) 
                                                             133 
                                     Master\x92s degree or above 
                                                              99 
                                   No high school diploma or GED 
                                                              29 
                                Some college or Associate degree 
                                                             158 
Code
# Another Check to see if the category wise totals add up to 527
table(abc_poll_2021$ppethm)

2+ Races, Non-Hispanic    Black, Non-Hispanic               Hispanic 
                    21                     27                     51 
   Other, Non-Hispanic    White, Non-Hispanic 
                    24                    404 

Any additional comments?

Identify variables that need to be mutated

Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

Document your work here.

Code
# Removing the redundant term "Non-Hispanic" from 'ppethm' column. Add a separate "Hispanic/Non-Hispanic" flag for backward compatibility.
abc_poll_2021 <- mutate(abc_poll_2021, is_hispanic = !str_detect(ppethm, "Non-Hispanic"), pprace = str_split(ppethm, ",", simplify = TRUE)[,1])
abc_poll_2021 %>% select(ppethm, is_hispanic, pprace)
# A tibble: 527 × 3
   ppethm              is_hispanic pprace
   <chr>               <lgl>       <chr> 
 1 White, Non-Hispanic FALSE       White 
 2 White, Non-Hispanic FALSE       White 
 3 White, Non-Hispanic FALSE       White 
 4 White, Non-Hispanic FALSE       White 
 5 White, Non-Hispanic FALSE       White 
 6 White, Non-Hispanic FALSE       White 
 7 Other, Non-Hispanic FALSE       Other 
 8 Black, Non-Hispanic FALSE       Black 
 9 White, Non-Hispanic FALSE       White 
10 Other, Non-Hispanic FALSE       Other 
# … with 517 more rows
Code
table(abc_poll_2021$pprace)

2+ Races    Black Hispanic    Other    White 
      21       27       51       24      404 
Code
# Cleaning the 'ppeduc5' column values
abc_poll_2021 <- mutate(abc_poll_2021, ppeduc5_clean = str_replace(ppeduc5, "\x92", "'"))
abc_poll_2021 %>% select(ppeduc5_clean, ppeduc5)
# A tibble: 527 × 2
   ppeduc5_clean                                                    ppeduc5     
   <chr>                                                            <chr>       
 1 High school graduate (high school diploma or the equivalent GED) "High schoo…
 2 Bachelor's degree                                                "Bachelor\x…
 3 High school graduate (high school diploma or the equivalent GED) "High schoo…
 4 Bachelor's degree                                                "Bachelor\x…
 5 High school graduate (high school diploma or the equivalent GED) "High schoo…
 6 Bachelor's degree                                                "Bachelor\x…
 7 Master's degree or above                                         "Master\x92…
 8 Bachelor's degree                                                "Bachelor\x…
 9 Bachelor's degree                                                "Bachelor\x…
10 Master's degree or above                                         "Master\x92…
# … with 517 more rows
Code
table(abc_poll_2021$ppeduc5_clean)

                                               Bachelor's degree 
                                                             108 
High school graduate (high school diploma or the equivalent GED) 
                                                             133 
                                        Master's degree or above 
                                                              99 
                                   No high school diploma or GED 
                                                              29 
                                Some college or Associate degree 
                                                             158 
Code
# Cleaning the 'contact' column to 'interview_consent'. Also, replacing the values to a simple "Yes" and "No" for better readability.
abc_poll_2021 <- rename(abc_poll_2021, interview_consent = Contact)
abc_poll_2021 <- mutate(abc_poll_2021, interview_consent_clean = case_when(str_detect(interview_consent, "Yes") ~ "Yes",
                                                                     str_detect(interview_consent, "No") ~ "No"
                                                                     ))
abc_poll_2021 %>% select(interview_consent, interview_consent_clean)
# A tibble: 527 × 2
   interview_consent                      interview_consent_clean
   <chr>                                  <chr>                  
 1 No, I am not willing to be interviewed No                     
 2 No, I am not willing to be interviewed No                     
 3 No, I am not willing to be interviewed No                     
 4 Yes, I am willing to be interviewed    Yes                    
 5 No, I am not willing to be interviewed No                     
 6 No, I am not willing to be interviewed No                     
 7 Yes, I am willing to be interviewed    Yes                    
 8 Yes, I am willing to be interviewed    Yes                    
 9 Yes, I am willing to be interviewed    Yes                    
10 Yes, I am willing to be interviewed    Yes                    
# … with 517 more rows
Code
table(abc_poll_2021$interview_consent_clean)

 No Yes 
355 172 

Overall, I think the new columns added (is_hispanic, pprace, ppeduc5_clean, interview_consent_clean) will improve readability and understanding of the dataset as well as lead to cleaner visualizations that will be built on top of this dataset.

Any additional comments?