DACSS 601: Data Science Fundamentals - FALL 2022

Challenge 6

challenge_6
abc_poll
Theresa_Szczepanski
Visualizing Time and Relationships
Author

Theresa Szczepanski

Published

October 21, 2022

Code
library(tidyverse)
library(ggplot2)
library(readxl)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

R Graph Gallery is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code.

abc_poll ⭐⭐⭐

  • Read in the data
  • Tidy/Mutate Data (as needed)
  • Mutated Summary
  • Visualizing Part-Whole Relationships
  • Visualizing Flow Relationship
  • Failed Flow Visuals
  • Categorical Bivariate Relationships with Heat Map
  • Time Dependent Visualization
  • Briefly describe the data
  • Post Read in Data Summary

From our abc_poll data frame summary, we can see that this data set contains polling results from 527 respondents to an ABC News political poll. The results fall into two broad categories:

  • Demographic characteristics of the respondents themselves (e.g., the language of the poll given to the respondent (Spanish or English), age, educational attainment, ethnicity, household size, gender, income range, marital status, metro category, geographic region, rental status, state, employment status, working arrangement, and willingness to have a follow-up interview)

  • The responses that the individuals gave to 11 questions (there are 5 broad questions, Q1-Q5, but Q1 consists of 6 sub-questions, a-f).

Code
# Filter, rename variables, and mutate values of variables on read-in

abc_poll <- read_csv("_data/abc_poll_2021.csv", skip = 1,
                     col_names = c("pp_id", "pp_Language_2", "delete", "pp_age",
                                   "pp_educ_5", "delete", "pp_gender_2",
                                   "pp_ethnicity_5", "pp_hhsize_6", "pp_inc_7",
                                   "pp_marital_5", "pp_metro_cat_2", "pp_region_4",
                                   "pp_housing_3", "pp_state",
                                   "pp_working_arrangement_9",
                                   "pp_employment_status_3", "Q1a_3", "Q1b_3",
                                   "Q1c_3", "Q1d_3", "Q1e_3", "Q1f_3", "Q2ConcernLevel_4",
                                   "Q3_3", "Q4_5", "Q5Optimism_3",
                                   "pp_political_id_5", "delete", "pp_contact_2",
                                   "weights_pid")) %>%
  # drop the columns that were named "delete" above
  select(!contains("delete")) %>%
  # replace "6 or more" in pp_hhsize_6 with 6 so that the column can be
  # of double data type
  mutate(pp_hhsize_6 = ifelse(pp_hhsize_6 == "6 or more", "6", pp_hhsize_6)) %>%
  mutate(pp_hhsize_6 = as.numeric(pp_hhsize_6)) %>%
  # fix the issue with apostrophes in pp_educ_5 values on read-in
  mutate(pp_educ_5 = ifelse(str_starts(pp_educ_5, "Bachelor"), "Bachelor", pp_educ_5)) %>%
  mutate(pp_educ_5 = ifelse(str_starts(pp_educ_5, "Master"), "Master", pp_educ_5))

# reduce lengthy responses to necessary info in nominal variables
abc_poll$pp_Language_2 <- substr(abc_poll$pp_Language_2, 1, 2)
abc_poll$pp_gender_2 <- substr(abc_poll$pp_gender_2, 1, 1)
abc_poll$pp_contact_2 <- substr(abc_poll$pp_contact_2, 1, 1)

# reduce lengthy responses of nominal variables using case_when
abc_poll <- abc_poll %>%
  # pp_political_id_5
  mutate(pp_political_id_5 = case_when(
    pp_political_id_5 == "A Democrat" ~ "Dem",
    pp_political_id_5 == "A Republican" ~ "Rep",
    pp_political_id_5 == "An Independent" ~ "Ind",
    pp_political_id_5 == "Something else" ~ "Other",
    pp_political_id_5 == "Skipped" ~ "Skipped")) %>%
  # pp_housing_3
  mutate(pp_housing_3 = case_when(
    pp_housing_3 == "Occupied without payment of cash rent" ~ "NonPayment_Occupied",
    pp_housing_3 == "Rented for cash" ~ "Payment_Rent",
    pp_housing_3 == "Owned or being bought by you or someone in your household" ~ "Payment_Own")) %>%
  # pp_working_arrangement_9
  mutate(pp_working_arrangement_9 = case_when(
    pp_working_arrangement_9 == "Other" ~ "Other",
    pp_working_arrangement_9 == "Retired" ~ "Retired",
    pp_working_arrangement_9 == "Homemaker" ~ "Homemaker",
    pp_working_arrangement_9 == "Student" ~ "Student",
    pp_working_arrangement_9 == "Currently laid off" ~ "Laid Off",
    pp_working_arrangement_9 == "On furlough" ~ "Furlough",
    pp_working_arrangement_9 == "Employed part-time (by someone else)" ~ "Employed_PT",
    pp_working_arrangement_9 == "Self-employed" ~ "Emp_Self",
    pp_working_arrangement_9 == "Employed full-time (by someone else)" ~ "Employed_FT")) %>%
  # pp_ethnicity_5
  mutate(pp_ethnicity_5 = case_when(
    pp_ethnicity_5 == "2+ Races, Non-Hispanic" ~ "2+ \n NH",
    pp_ethnicity_5 == "Black, Non-Hispanic" ~ "Bl \n NH",
    pp_ethnicity_5 == "Hispanic" ~ "Hisp",
    pp_ethnicity_5 == "Other, Non-Hispanic" ~ "Ot \n NH",
    pp_ethnicity_5 == "White, Non-Hispanic" ~ "Wh \n NH"))

abc_poll
Code
View(abc_poll)
Code
print(summarytools::dfSummary(abc_poll,
                         varnumbers = FALSE,
                         plain.ascii  = FALSE,
                         style        = "grid",
                         graph.magnif = 0.70,
                        valid.col    = FALSE),
       method = 'render',
       table.classes = 'table-condensed')

Data Frame Summary

abc_poll

Dimensions: 527 x 28
Duplicates: 0
Variable [type] | Stats / Values with Freqs (% of Valid) | Missing
pp_id [numeric] | Mean (sd): 7230264 (152.3); min ≤ med ≤ max: 7230001 ≤ 7230264 ≤ 7230527; IQR (CV): 263 (0); 527 distinct values | 0 (0.0%)
pp_Language_2 [character] | En: 514 (97.5%); Sp: 13 (2.5%) | 0 (0.0%)
pp_age [numeric] | Mean (sd): 53.4 (17.1); min ≤ med ≤ max: 18 ≤ 55 ≤ 91; IQR (CV): 27 (0.3); 72 distinct values | 0 (0.0%)
pp_educ_5 [character] | Bachelor: 108 (20.5%); High school graduate (hig: 133 (25.2%); Master: 99 (18.8%); No high school diploma or: 29 (5.5%); Some college or Associate: 158 (30.0%) | 0 (0.0%)
pp_gender_2 [character] | F: 254 (48.2%); M: 273 (51.8%) | 0 (0.0%)
pp_ethnicity_5 [character] | 2+ NH: 21 (4.0%); Bl NH: 27 (5.1%); Hisp: 51 (9.7%); Ot NH: 24 (4.6%); Wh NH: 404 (76.7%) | 0 (0.0%)
pp_hhsize_6 [numeric] | Mean (sd): 2.6 (1.3); min ≤ med ≤ max: 1 ≤ 2 ≤ 6; IQR (CV): 1 (0.5); 1: 80 (15.2%); 2: 219 (41.6%); 3: 102 (19.4%); 4: 76 (14.4%); 5: 35 (6.6%); 6: 15 (2.8%) | 0 (0.0%)
pp_inc_7 [character] | $10,000 to $24,999: 32 (6.1%); $100,000 to $149,999: 105 (19.9%); $150,000 or more: 137 (26.0%); $25,000 to $49,999: 82 (15.6%); $50,000 to $74,999: 85 (16.1%); $75,000 to $99,999: 69 (13.1%); Less than $10,000: 17 (3.2%) | 0 (0.0%)
pp_marital_5 [character] | Divorced: 43 (8.2%); Never married: 111 (21.1%); Now Married: 337 (63.9%); Separated: 8 (1.5%); Widowed: 28 (5.3%) | 0 (0.0%)
pp_metro_cat_2 [character] | Metro area: 448 (85.0%); Non-metro area: 79 (15.0%) | 0 (0.0%)
pp_region_4 [character] | MidWest: 118 (22.4%); NorthEast: 93 (17.6%); South: 190 (36.1%); West: 126 (23.9%) | 0 (0.0%)
pp_housing_3 [character] | NonPayment_Occupied: 10 (1.9%); Payment_Own: 406 (77.0%); Payment_Rent: 111 (21.1%) | 0 (0.0%)
pp_state [character] | California: 51 (9.7%); Texas: 42 (8.0%); Florida: 34 (6.5%); Pennsylvania: 28 (5.3%); Illinois: 23 (4.4%); New Jersey: 21 (4.0%); Ohio: 21 (4.0%); Michigan: 18 (3.4%); New York: 18 (3.4%); Washington: 18 (3.4%); [39 others]: 253 (48.0%) | 0 (0.0%)
pp_working_arrangement_9 [character] | Emp_Self: 32 (6.2%); Employed_FT: 220 (42.4%); Employed_PT: 31 (6.0%); Furlough: 1 (0.2%); Homemaker: 37 (7.1%); Laid Off: 13 (2.5%); Other: 20 (3.9%); Retired: 165 (31.8%) | 8 (1.5%)
pp_employment_status_3 [character] | Not working: 221 (41.9%); Working full-time: 245 (46.5%); Working part-time: 61 (11.6%) | 0 (0.0%)
Q1a_3 [character] | Approve: 329 (62.4%); Disapprove: 193 (36.6%); Skipped: 5 (0.9%) | 0 (0.0%)
Q1b_3 [character] | Approve: 192 (36.4%); Disapprove: 322 (61.1%); Skipped: 13 (2.5%) | 0 (0.0%)
Q1c_3 [character] | Approve: 272 (51.6%); Disapprove: 248 (47.1%); Skipped: 7 (1.3%) | 0 (0.0%)
Q1d_3 [character] | Approve: 192 (36.4%); Disapprove: 321 (60.9%); Skipped: 14 (2.7%) | 0 (0.0%)
Q1e_3 [character] | Approve: 212 (40.2%); Disapprove: 301 (57.1%); Skipped: 14 (2.7%) | 0 (0.0%)
Q1f_3 [character] | Approve: 281 (53.3%); Disapprove: 230 (43.6%); Skipped: 16 (3.0%) | 0 (0.0%)
Q2ConcernLevel_4 [character] | Not concerned at all: 65 (12.3%); Not so concerned: 147 (27.9%); Somewhat concerned: 221 (41.9%); Very concerned: 94 (17.8%) | 0 (0.0%)
Q3_3 [character] | No: 107 (20.3%); Skipped: 5 (0.9%); Yes: 415 (78.7%) | 0 (0.0%)
Q4_5 [character] | Excellent: 60 (11.4%); Good: 215 (40.8%); Not so good: 97 (18.4%); Poor: 149 (28.3%); Skipped: 6 (1.1%) | 0 (0.0%)
Q5Optimism_3 [character] | Optimistic: 229 (43.5%); Pessimistic: 295 (56.0%); Skipped: 3 (0.6%) | 0 (0.0%)
pp_political_id_5 [character] | Dem: 176 (33.4%); Ind: 168 (31.9%); Other: 28 (5.3%); Rep: 152 (28.8%); Skipped: 3 (0.6%) | 0 (0.0%)
pp_contact_2 [character] | N: 355 (67.4%); Y: 172 (32.6%) | 0 (0.0%)
weights_pid [numeric] | Mean (sd): 1 (0.6); min ≤ med ≤ max: 0.3 ≤ 0.8 ≤ 6.3; IQR (CV): 0.5 (0.6); 453 distinct values | 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-21

Tidy/Mutate Data (as needed)

On read-in, I chose to:

Filter (columns dropped on read-in)

  • complete_status: everyone was qualified, so the column carries no information
  • ppeducat: this categorization of ppeduc5 can be reproduced in the data frame using case_when() and factoring
  • ABCAGE: this qualitative age-range variable can be replicated from the data in the ppage variable with case_when(); one might want to examine different age ranges.
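
As a sketch of that last point, an age-range column can be rebuilt from the numeric age with case_when(). The cut points and the toy ages below are assumptions chosen for illustration; they are not the ranges ABC used.

```r
library(dplyr)

# Toy ages standing in for the poll's age column (made-up values)
poll <- tibble(pp_age = c(18, 29, 30, 49, 50, 64, 65, 91))

age_levels <- c("18-29", "30-49", "50-64", "65+")

poll <- poll %>%
  # case_when() checks the conditions in order and uses the first match
  mutate(age_range = case_when(
    pp_age < 30 ~ "18-29",
    pp_age < 50 ~ "30-49",
    pp_age < 65 ~ "50-64",
    TRUE        ~ "65+"
  )) %>%
  # factor the result so the ranges sort by age rather than alphabetically
  mutate(age_range = factor(age_range, levels = age_levels, ordered = TRUE))

print(poll)
```

Because the ranges are built from the raw age, changing the cut points only means editing the conditions and the level vector.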

Rename

  • I renamed all of the variables corresponding to demographic characteristics of the poll participant to begin with pp_.

  • I renamed all of the variables corresponding to survey question responses from the participants to begin with Q

  • If a variable had a fixed number of possible responses (which I could see from the summary), e.g., pp_marital had 5 possible responses, I included the number of “categories” or possible responses in the variable name preceded by an underscore, pp_marital_5

Mutate

  • I replaced the pp_hhsize_6 value of “6 or more” with 6, so that it could be of double data type

  • I mutated the pp_educ_5 column to remove the apostrophes from “Bachelor’s” and “Master’s”, which were showing up as “\x92” in the values on read-in.

  • If a nominal variable had lengthy values, I reduced them to the key info using mutate, substr, and case_when
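
The “\x92” debris is what a Windows-1252 right single quotation mark looks like when its byte is not decoded as Windows-1252. As an alternative to trimming the values, the text can be re-encoded. The sketch below uses a made-up string rather than a row from the poll file, and it assumes the file really is Windows-1252 encoded.

```r
# A byte string as it might arrive from a Windows-1252 encoded file
# (hypothetical example value)
raw_value <- "Bachelor\x92s degree"

# Convert from Windows-1252 to UTF-8 so byte 0x92 becomes a right single quote
fixed_value <- iconv(raw_value, from = "CP1252", to = "UTF-8")

# Drop the apostrophe and everything after it, mirroring the shortened labels
short_value <- sub("\u2019.*$", "", fixed_value)

print(short_value)
```

readr can also be told the encoding at read-in through the locale argument, e.g. read_csv(..., locale = locale(encoding = "windows-1252")), which would avoid the repair step if the assumption about the encoding holds.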

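Because our data frame is poll data, it will stay relatively wide. Each polled person (pp_id) represents a unique case, and the values for the case are

  • the demographic characteristics of the polled person and
  • the individual’s responses to a given survey question

To tidy our data, I pivoted the six Q1 sub-question columns into a “Question 1 part” column and a “Q1 Response” column, so each respondent now spans six rows. I also factored the following variables so that their values sort in a meaningful order:

  • pp_inc_7: The income level of the polled person
  • pp_educ_5: The educational attainment level of the polled person
  • pp_employment_status_3: The employment status of the polled person (not working, working part time, working full time)
  • pp_political_id_5: The political identification of the polled person (nominal; factored only to fix the display order)
  • Q2ConcernLevel_4: The level of concern reported in question 2
  • Q4_5: The rating given in question 4 (Poor, Not so good, Good, Excellent)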
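
Factoring matters because a character column sorts alphabetically, while an ordered factor sorts by its declared levels. A small base-R sketch with toy values (the vector is made up; only the level labels come from the recoding used in this post):

```r
# Toy education values, deliberately out of order
educ <- c("Some College", "No HS", "Master+", "HS/GED", "Bachelor")

# A plain character vector sorts alphabetically
alphabetical <- sort(educ)

# An ordered factor sorts by its declared level order instead
educ_levels <- c("No HS", "HS/GED", "Some College", "Bachelor", "Master+")
educ_factor <- factor(educ, levels = educ_levels, ordered = TRUE)
ordinal <- as.character(sort(educ_factor))

print(alphabetical)
print(ordinal)

# Ordered factors also support comparisons between levels
print(educ_factor[2] < educ_factor[3])
```

The same ordering is what arrange() and the ggplot2 axes pick up once a column is an ordered factor.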
Code
# Pivot the six Q1 sub-question columns into a name column and a response column
abc_poll <- abc_poll %>%
  pivot_longer(c(starts_with("Q1")), names_to = "Question 1 part", values_to = "Q1 Response")

# pp_inc_7
abc_poll <- mutate(abc_poll, pp_inc_7 = recode_factor(
  pp_inc_7,
  "Less than $10,000" = "<10,000",
  "$10,000 to $24,999" = "10,000-\n 24,999",
  "$25,000 to $49,999" = "25,000- \n 49,999",
  "$50,000 to $74,999" = "50,000- \n 74,999",
  "$75,000 to $99,999" = "75,000- \n 99,999",
  "$100,000 to $149,999" = "100,000- \n 149,999",
  "$150,000 or more" = "$150,000 +",
  .ordered = TRUE))

# pp_educ_5
abc_poll <- mutate(abc_poll, pp_educ_5 = recode_factor(
  pp_educ_5,
  "No high school diploma or GED" = "No HS",
  "High school graduate (high school diploma or the equivalent GED)" = "HS/GED",
  "Some college or Associate degree" = "Some College",
  "Bachelor" = "Bachelor",
  "Master" = "Master+",
  .ordered = TRUE))

# pp_political_id_5 (nominal; the level order only controls display order)
abc_poll <- mutate(abc_poll, pp_political_id_5 = recode_factor(
  pp_political_id_5,
  "Dem" = "Dem",
  "Rep" = "Rep",
  "Ind" = "Ind",
  "Other" = "Other",
  "Skipped" = "Skipped",
  .ordered = TRUE))

# pp_employment_status_3
abc_poll <- mutate(abc_poll, pp_employment_status_3 = recode_factor(
  pp_employment_status_3,
  "Not working" = "Not working",
  "Working part-time" = "Working part-time",
  "Working full-time" = "Working full-time",
  .ordered = TRUE))

# Q2ConcernLevel_4
abc_poll <- mutate(abc_poll, Q2ConcernLevel_4 = recode_factor(
  Q2ConcernLevel_4,
  "Not concerned at all" = "Not at all",
  "Not so concerned" = "Not so concerned",
  "Somewhat concerned" = "Somewhat",
  "Very concerned" = "Very concerned",
  .ordered = TRUE))

# Q4_5
abc_poll <- mutate(abc_poll, Q4_5 = recode_factor(
  Q4_5,
  "Poor" = "Poor",
  "Not so good" = "Not so good",
  "Good" = "Good",
  "Excellent" = "Excellent",
  "Skipped" = "Skipped",
  .ordered = TRUE))

abc_poll
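
One consequence of the pivot_longer() step is worth spelling out: the frame grows from 527 rows to 527 × 6 = 3162 rows, so counting rows afterwards counts responses rather than people. A minimal sketch on made-up data (the `party` column and all values are hypothetical):

```r
library(dplyr)
library(tidyr)

# Toy wide poll: one row per respondent, two Q1 sub-questions
wide <- tibble(
  pp_id = c(1, 2, 3),
  party = c("Dem", "Rep", "Dem"),
  Q1a_3 = c("Approve", "Disapprove", "Approve"),
  Q1b_3 = c("Disapprove", "Disapprove", "Skipped")
)

long <- wide %>%
  pivot_longer(starts_with("Q1"), names_to = "Question 1 part", values_to = "Q1 Response")

# Counting rows of the long frame counts responses, not people
rows_per_party <- count(long, party)

# Keeping one row per respondent first recovers the head count
people_per_party <- long %>%
  distinct(pp_id, party) %>%
  count(party)

print(rows_per_party)
print(people_per_party)
```

Proportions are unaffected because every respondent is repeated the same number of times, but any axis labelled as a number of respondents needs the distinct() step.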
Code
 ##Is the data frame arranged "alphabetically" or "ordinally?"
 abc_poll%>%
  arrange(desc(pp_educ_5))
Code
  print(summarytools::dfSummary(abc_poll,
                         varnumbers = FALSE,
                         plain.ascii  = FALSE,
                         style        = "grid",
                         graph.magnif = 0.70,
                        valid.col    = FALSE),
       method = 'render',
       table.classes = 'table-condensed')

Data Frame Summary

abc_poll

Dimensions: 3162 x 24
Duplicates: 0
Variable [type] | Stats / Values with Freqs (% of Valid) | Missing
pp_id [numeric] | Mean (sd): 7230264 (152.2); min ≤ med ≤ max: 7230001 ≤ 7230264 ≤ 7230527; IQR (CV): 264 (0); 527 distinct values | 0 (0.0%)
pp_Language_2 [character] | En: 3084 (97.5%); Sp: 78 (2.5%) | 0 (0.0%)
pp_age [numeric] | Mean (sd): 53.4 (17.1); min ≤ med ≤ max: 18 ≤ 55 ≤ 91; IQR (CV): 27 (0.3); 72 distinct values | 0 (0.0%)
pp_educ_5 [ordered, factor] | No HS: 174 (5.5%); HS/GED: 798 (25.2%); Some College: 948 (30.0%); Bachelor: 648 (20.5%); Master+: 594 (18.8%) | 0 (0.0%)
pp_gender_2 [character] | F: 1524 (48.2%); M: 1638 (51.8%) | 0 (0.0%)
pp_ethnicity_5 [character] | 2+ NH: 126 (4.0%); Bl NH: 162 (5.1%); Hisp: 306 (9.7%); Ot NH: 144 (4.6%); Wh NH: 2424 (76.7%) | 0 (0.0%)
pp_hhsize_6 [numeric] | Mean (sd): 2.6 (1.2); min ≤ med ≤ max: 1 ≤ 2 ≤ 6; IQR (CV): 1 (0.5); 1: 480 (15.2%); 2: 1314 (41.6%); 3: 612 (19.4%); 4: 456 (14.4%); 5: 210 (6.6%); 6: 90 (2.8%) | 0 (0.0%)
pp_inc_7 [ordered, factor] | <10,000: 102 (3.2%); 10,000-24,999: 192 (6.1%); 25,000-49,999: 492 (15.6%); 50,000-74,999: 510 (16.1%); 75,000-99,999: 414 (13.1%); 100,000-149,999: 630 (19.9%); $150,000 +: 822 (26.0%) | 0 (0.0%)
pp_marital_5 [character] | Divorced: 258 (8.2%); Never married: 666 (21.1%); Now Married: 2022 (63.9%); Separated: 48 (1.5%); Widowed: 168 (5.3%) | 0 (0.0%)
pp_metro_cat_2 [character] | Metro area: 2688 (85.0%); Non-metro area: 474 (15.0%) | 0 (0.0%)
pp_region_4 [character] | MidWest: 708 (22.4%); NorthEast: 558 (17.6%); South: 1140 (36.1%); West: 756 (23.9%) | 0 (0.0%)
pp_housing_3 [character] | NonPayment_Occupied: 60 (1.9%); Payment_Own: 2436 (77.0%); Payment_Rent: 666 (21.1%) | 0 (0.0%)
pp_state [character] | California: 306 (9.7%); Texas: 252 (8.0%); Florida: 204 (6.5%); Pennsylvania: 168 (5.3%); Illinois: 138 (4.4%); New Jersey: 126 (4.0%); Ohio: 126 (4.0%); Michigan: 108 (3.4%); New York: 108 (3.4%); Washington: 108 (3.4%); [39 others]: 1518 (48.0%) | 0 (0.0%)
pp_working_arrangement_9 [character] | Emp_Self: 192 (6.2%); Employed_FT: 1320 (42.4%); Employed_PT: 186 (6.0%); Furlough: 6 (0.2%); Homemaker: 222 (7.1%); Laid Off: 78 (2.5%); Other: 120 (3.9%); Retired: 990 (31.8%) | 48 (1.5%)
pp_employment_status_3 [ordered, factor] | Not working: 1326 (41.9%); Working part-time: 366 (11.6%); Working full-time: 1470 (46.5%) | 0 (0.0%)
Q2ConcernLevel_4 [ordered, factor] | Not at all: 390 (12.3%); Not so concerned: 882 (27.9%); Somewhat: 1326 (41.9%); Very concerned: 564 (17.8%) | 0 (0.0%)
Q3_3 [character] | No: 642 (20.3%); Skipped: 30 (0.9%); Yes: 2490 (78.7%) | 0 (0.0%)
Q4_5 [ordered, factor] | Poor: 894 (28.3%); Not so good: 582 (18.4%); Good: 1290 (40.8%); Excellent: 360 (11.4%); Skipped: 36 (1.1%) | 0 (0.0%)
Q5Optimism_3 [character] | Optimistic: 1374 (43.5%); Pessimistic: 1770 (56.0%); Skipped: 18 (0.6%) | 0 (0.0%)
pp_political_id_5 [ordered, factor] | Dem: 1056 (33.4%); Rep: 912 (28.8%); Ind: 1008 (31.9%); Other: 168 (5.3%); Skipped: 18 (0.6%) | 0 (0.0%)
pp_contact_2 [character] | N: 2130 (67.4%); Y: 1032 (32.6%) | 0 (0.0%)
weights_pid [numeric] | Mean (sd): 1 (0.6); min ≤ med ≤ max: 0.3 ≤ 0.8 ≤ 6.3; IQR (CV): 0.5 (0.6); 453 distinct values | 0 (0.0%)
Question 1 part [character] | Q1a_3: 527 (16.7%); Q1b_3: 527 (16.7%); Q1c_3: 527 (16.7%); Q1d_3: 527 (16.7%); Q1e_3: 527 (16.7%); Q1f_3: 527 (16.7%) | 0 (0.0%)
Q1 Response [character] | Approve: 1478 (46.7%); Disapprove: 1615 (51.1%); Skipped: 69 (2.2%) | 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-21

Visualizing Part-Whole Relationships

There were many variables in abc_poll for which I could imagine visualizing proportional relationships, both overall and by group.

I explored multiple versions of bar charts to visualize the part-whole relationship between a respondent’s political identification and stated level of concern in poll question 2.

Code
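# Count respondents for each combination of the categorical variables
# pp_political_id_5 and Q2ConcernLevel_4

abc_poll_pp_id_q2 <- abc_poll %>%
  # abc_poll holds six rows per respondent after the Q1 pivot,
  # so keep one row per respondent before counting
  distinct(pp_id, pp_political_id_5, Q2ConcernLevel_4) %>%
  #mutate(pp_political_id_5 = na_if(pp_political_id_5, "Skipped")) %>%
  count(pp_political_id_5, Q2ConcernLevel_4, name = "count")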
  • The grouped bar chart shows each of the concern levels broken down by the respondent’s political ID. You can see that “somewhat concerned” is the most common response (41.9% of respondents).
Code
##Grouped Bar Chart political id and concern level

abc_poll_pp_id_q2%>%  
  ggplot(aes(fill=pp_political_id_5, y=count, x=Q2ConcernLevel_4)) + 
    geom_bar(position="dodge", stat="identity") +
  labs(subtitle ="Grouped Bar Chart" ,
       y = "Number of Respondents",
       x= "Concern Level",
       title = "Q2 Concern Level by Political Id",
      caption = "ABC News Political Poll")+
  coord_flip()

  • The stacked bar chart gives an easier to digest view of the comparative level of concern and the part of each concern level that comes from respondents from each political party.
Code
## Stacked bar 

abc_poll_pp_id_q2%>%  
  ggplot(aes(fill=pp_political_id_5, y = count, x=Q2ConcernLevel_4)) + 
    geom_bar(position="stack", stat="identity")+
  labs(subtitle = "Stacked Bar Chart",
       y = "Number of Respondents",
       x= "Concern Level",
       title = "Q2 Concern Level by Political Id",
      caption = "ABC News Political Poll") +
  coord_flip()

  • The percent stacked bar chart allows us to very quickly see the proportion of respondents from each political party that make up a given concern level. This allows us to see how strongly the level of concern seems to relate to political party.
Code
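# Percent Stacked bar

abc_poll_pp_id_q2 %>%
  ggplot(aes(fill = pp_political_id_5, y = count, x = Q2ConcernLevel_4)) +
  geom_bar(position = "fill", stat = "identity") +
  # show the 0-1 proportions on the y axis as percentages
  scale_y_continuous(labels = scales::percent) +
  labs(subtitle = "Percent Stacked Bar Chart",
       y = "Percentage of Respondents",
       x = "Concern Level",
       title = "Q2 Proportionate Concern Level by Political Id",
       caption = "ABC News Political Poll",
       # the legend comes from the fill aesthetic, so its title is set with fill
       fill = "Political ID")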
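
The heights that position = "fill" draws are within-group shares, which can also be computed directly when the numbers themselves are of interest. A sketch on made-up counts (the column names `concern` and `party` and all values are illustrative, not poll results):

```r
library(dplyr)

# Made-up counts for two concern levels and two parties
counts <- tibble(
  concern = c("Somewhat", "Somewhat", "Very concerned", "Very concerned"),
  party   = c("Dem", "Rep", "Dem", "Rep"),
  count   = c(30, 10, 5, 15)
)

# Share of each party within a concern level
shares <- counts %>%
  group_by(concern) %>%
  mutate(share = count / sum(count)) %>%
  ungroup()

print(shares)
```

Each concern level's shares sum to 1, which is why every bar in the percent stacked chart has the same height.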

  • The donut chart is a visual of the distribution of political identification of the poll respondents. I read that donut charts and pie charts are not recommended. With only three sizeable groups (Dem, Ind, Rep), I thought it could be OK, although it doesn’t let one see subtle differences between group sizes the way a “lollipop” or bar chart would.
Code
# Donut chart of political ID
# (faceting by Q2 concern level did not work and would still need fixing)

# Total count, fraction, and rectangle limits for each political ID
donut_data <- abc_poll_pp_id_q2 %>%
  group_by(pp_political_id_5) %>%
  summarise(count = sum(count), .groups = "drop") %>%
  mutate(fraction = count / sum(count),      # share of the whole
         ymax = cumsum(fraction),            # top of each rectangle
         ymin = lag(ymax, default = 0),      # bottom of each rectangle
         labelPosition = (ymax + ymin) / 2,  # label position
         label = paste0(pp_political_id_5, "\n value: ", count))

# Make the plot
ggplot(donut_data, aes(ymax = ymax, ymin = ymin, xmax = 4, xmin = 3, fill = pp_political_id_5)) +
  geom_rect() +
  # geom_label(x = 3.5, aes(y = labelPosition, label = label), size = 6) +
  coord_polar(theta = "y") + # remove this line to see how the chart is built from rectangles
  xlim(c(2, 4)) +
  theme_void() +
  theme(legend.position = "right") +
  labs(subtitle = "Political ID of Respondents",
       title = "Donut Chart",
       caption = "ABC News Political Poll")

Code
  #facet_wrap(vars(Q2ConcernLevel_4))

Questions

  • How do I change the label of the legend from the name of the “fill” variable?

  • In what situations, if any, is a pie/donut chart appropriate?
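
On the first question, one approach that should work: ggplot2 titles a legend after the aesthetic that creates it, so the title can be set by naming that aesthetic in labs(). A minimal sketch on made-up data (the data frame and its values are hypothetical):

```r
library(ggplot2)

# Made-up counts, only to have something to plot
toy <- data.frame(
  party = c("Dem", "Rep", "Ind"),
  count = c(5, 4, 3)
)

p <- ggplot(toy, aes(x = party, y = count, fill = party)) +
  geom_col() +
  # the legend is produced by `fill`, so its title is set with fill = ...
  labs(fill = "Political ID")

print(p)
```

Setting color = "Political ID" has no visible effect here because no color aesthetic is mapped. The name argument of the matching scale, e.g. scale_fill_discrete(name = "Political ID"), is an equivalent route.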

Visualizing Flow Relationship

I chose to visualize a “flow relationship” between a respondent’s level of optimism reported in question 5 and several demographic variables. I found the “Skipped” responses to question 5 harder to read in a flow chart than in stacked bar charts or pie charts, so I removed them from these visualizations.

Code
# Region and Q5 optimism, with "Skipped" responses set to NA
flow_region_optimism <- abc_poll %>% 
  select(pp_region_4, Q5Optimism_3)%>%
  mutate(Q5Optimism_3 = na_if(Q5Optimism_3, "Skipped"))

#flow_region_optimism

I represented the “flow relationship” using a chord diagram. I chose variables that had a limited number of values as my origin and destination variables.

Code
# Chord Diagrams 
# Load the circlize library
# (install it first with install.packages("circlize") if it is missing)
library(circlize)
Error in library(circlize): there is no package called 'circlize'
  • Political ID to Q5 Optimism Level showed a clear “flow” of Republican and Other party to pessimistic responses and a strong “flow” of Democratic party ID to optimistic responses.
Code
#Q5 Optimism Status vs Political ID
# Gather the "edges" for our flow: origin: Political ID, destination: Q5 Optimism level
flow_pol_id_optimism <- abc_poll %>% 
  select(pp_political_id_5, Q5Optimism_3)%>%
  mutate(Q5Optimism_3 = na_if(Q5Optimism_3, "Skipped"))%>%
  mutate(pp_political_id_5 = na_if(pp_political_id_5, "Skipped"))%>%
  with(table(pp_political_id_5, Q5Optimism_3))%>%
 
# Make the circular plot
 chordDiagram(transparency = 0.5)
Error in chordDiagram(., transparency = 0.5): could not find function "chordDiagram"
Code
title(main = "Political ID to Q5 Optimism Level", sub = "ABC News Political Poll")
Error in title(main = "Political ID to Q5 Optimism Level", sub = "ABC News Political Poll"): plot.new has not been called yet
  • Geographic Region to Q5 Optimism Level showed a simple “flow”; however, it was not easy to discern a difference in the proportion of optimistic and pessimistic responses by region.
Code
#Q5 Optimism Status vs Geographic Region
# Gather the "edges" for our flow: origin: Q5 Optimism, destination: Geographic Region
flow_optimism_region <- abc_poll %>% 
  select(pp_region_4, Q5Optimism_3)%>%
  mutate(Q5Optimism_3 = na_if(Q5Optimism_3, "Skipped"))%>%
  
  with(table(Q5Optimism_3, pp_region_4))%>%

# Make the circular plot
 chordDiagram(transparency = 0.5)
Error in chordDiagram(., transparency = 0.5): could not find function "chordDiagram"
Code
title(main = "Q5 Optimism Level to Geographic Region", sub = "ABC News Political Poll")
Error in title(main = "Q5 Optimism Level to Geographic Region", sub = "ABC News Political Poll"): plot.new has not been called yet

Failed Flow Visuals

I think there are some interesting things to see in these visuals, but reading them takes too much effort. I would imagine the story would be clearer with percent stacked bar charts.

  • Ethnicity to Geographic Region was overwhelmed visually by the size of the White, Non-Hispanic group. One could see the strong connection between the Hispanic and Other groups and the West and South.
Code
#Ethnicity
# Gather the "edges" for our flow: origin: Ethnicity, destination: Geographic Region
flow_ethnicity_region <- abc_poll %>% 
  select(pp_region_4, pp_ethnicity_5)%>%
  with(table(pp_ethnicity_5, pp_region_4))%>%
# Make the circular plot
 chordDiagram(transparency = 0.5)
Error in chordDiagram(., transparency = 0.5): could not find function "chordDiagram"
Code
title(main = "Ethnicity to Geographic Region", sub = "ABC News Political Poll")
Error in title(main = "Ethnicity to Geographic Region", sub = "ABC News Political Poll"): plot.new has not been called yet
  • Income Level to Q5 Optimism Status was overwhelmed visually by the number of different income levels and the formatting of their labels. Also, for income levels with relatively few respondents, it is difficult to see the distinction between the optimistic and pessimistic flows. Notably, there does not seem to be a strong relationship between income level and reported optimism in question 5.
Code
#Q5 Optimism Status vs Income Level
# Gather the "edges" for our flow: origin: Income Level, destination: Q5 Optimism level
flow_optimism_inc <- abc_poll %>% 
  select(pp_inc_7, Q5Optimism_3)%>%
  mutate(Q5Optimism_3 = na_if(Q5Optimism_3, "Skipped"))%>%
  with(table(pp_inc_7, Q5Optimism_3))%>%
# Make the circular plot
 chordDiagram(transparency = 0.5)
Error in chordDiagram(., transparency = 0.5): could not find function "chordDiagram"
Code
title(main = "Income Level to Q5 Optimism Level", sub = "ABC News Poll")
Error in title(main = "Income Level to Q5 Optimism Level", sub = "ABC News Poll"): plot.new has not been called yet
Code
#Marital Status
# Gather the "edges" for our flow: origin: Marital Status, destination: Geographic Region
# flow_region_educ <- abc_poll %>% 
#   select(pp_region_4, pp_marital_5)%>%
#   
#   with(table(pp_marital_5, pp_region_4))%>%
#  
# # Make the circular plot
#  chordDiagram(transparency = 0.5)
# title(main = "Chord Diagram", sub = "Marital Status to Geographic Region")
  • Education Level to Geographic Region was overwhelmingly busy visually. Although it is pretty, I think this information would be easier to parse in a percent stacked bar chart.
Code
#Education level to Geographic Region
# Gather the "edges" for our flow: origin: Education Level, destination: Geographic Region


flow_educ_region <- abc_poll %>% 
  select(pp_region_4, pp_educ_5)%>%
  
  with(table(pp_educ_5, pp_region_4))%>%
 

# Make the circular plot
 chordDiagram(transparency = 0.5)
Error in chordDiagram(., transparency = 0.5): could not find function "chordDiagram"
Code
title(main = "Education Level to Geographic Region", sub = "ABC News Political Poll")
Error in title(main = "Education Level to Geographic Region", sub = "ABC News Political Poll"): plot.new has not been called yet

Questions

  • Why do the colors of my chord diagram change each time I run the chunk?

  • How do I fix the labels around the circle (other than using “newline”)?

  • Other than traffic/shipping/migration patterns, what are examples of ideas that are well represented by chord charts?
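
On the first question, my understanding is that circlize picks random sector colors when none are supplied, so they change from run to run. Passing a named color vector through the grid.col argument of chordDiagram() (or calling set.seed() before plotting) should keep them stable. The sketch below uses a made-up flow table and arbitrary colors, and it only draws if circlize is installed:

```r
# Small made-up flow table: rows are origins, columns are destinations
flows <- matrix(
  c(20, 5,
    6, 18),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Dem", "Rep"), c("Optimistic", "Pessimistic"))
)

# One fixed color per sector name, so the plot looks the same on every run
sector_colors <- c(Dem = "steelblue", Rep = "firebrick",
                   Optimistic = "goldenrod", Pessimistic = "grey40")

# Only draw if circlize is available
if (requireNamespace("circlize", quietly = TRUE)) {
  circlize::chordDiagram(flows, grid.col = sector_colors, transparency = 0.5)
  title(main = "Toy chord diagram with fixed colors")
  circlize::circos.clear()
}
```

The names of the color vector have to match the row and column names of the table, since each name becomes one sector of the circle.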

Categorical Bivariate Relationships with Heat Map

I noticed heat maps in some of the earlier lesson materials and was just experimenting with them here. There were so many categorical variables in the abc_poll data that I thought some bivariate relationships might have a nice visual story with heat maps.

Code
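# Heat Map Geographic Region, Political ID
abc_poll %>%
  # one row per respondent (abc_poll has six rows per respondent after the Q1 pivot)
  distinct(pp_id, pp_region_4, pp_political_id_5) %>%
  count(pp_region_4, pp_political_id_5) %>%
  ggplot(mapping = aes(x = pp_political_id_5, y = pp_region_4)) +
  geom_tile(mapping = aes(fill = n))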

Code
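# Heat Map Income Level, Education Level
abc_poll %>%
  # one row per respondent (abc_poll has six rows per respondent after the Q1 pivot)
  distinct(pp_id, pp_inc_7, pp_educ_5) %>%
  count(pp_inc_7, pp_educ_5) %>%
  ggplot(mapping = aes(x = pp_educ_5, y = pp_inc_7)) +
  geom_tile(mapping = aes(fill = n))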

Time Dependent Visualization

Our data frame did not contain any variables measured over time. However, it does have each poll respondent’s age.

If each respondent were asked the same question every year, we could see the evolution of their responses over time.

Questions

  • Do analysts ever use “age” as a variable representing time?

Source Code
---
title: "Challenge 6"
author: "Theresa Szczepanski"
description: "Visualizing Time and Relationships"
date: "10/21/2022"
format:
   html:
    toc: true
    code-copy: true
    code-tools: true
    df-print: paged
    code-fold: true
categories:
  - challenge_6
 # - hotel_bookings
  #- air_bnb
  #- fed_rate
  #- debt
  #- usa_households
  - abc_poll
  - Theresa_Szczepanski
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)
library(ggplot2)
library(readxl)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview


[R Graph Gallery](https://r-graph-gallery.com/) is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code.




### abc_poll ⭐⭐⭐

::: panel-tabset

### Read in the data
::: panel-tabset

### Briefly describe the data

From our `abc_poll` data frame summary, we can see that this data set
contains polling results from 527 respondents to an ABC news political poll. 
The results consist of information for two broad categories


- *Demographic characteristics* of 
the respondents themselves (e.g., language of the poll given to the respondent
(Spanish or English), age, educational attainment, ethnicity, household size,
ethnic make up, gender, income range, Marital status, Metro category, 
Geographic region, Rental status, State, Employment status, 
Working characteristics, Willingness to have a follow up interview)

- *The responses that the individuals gave* to 11
questions (there are 5 broad questions Q1-Q5, but Q1 consists of 6 
sub questions, a-f).





  

```{r}
#Filter, rename variables, and mutate values of variables on read-in

abc_poll<-read_csv("_data/abc_poll_2021.csv", skip = 1,  
                   col_names= c("pp_id",  "pp_Language_2",  "delete","pp_age", 
                                "pp_educ_5", "delete", "pp_gender_2", 
                                "pp_ethnicity_5", "pp_hhsize_6", "pp_inc_7", 
                                "pp_marital_5", "pp_metro_cat_2", "pp_region_4",
                                "pp_housing_3", "pp_state", 
                                "pp_working_arrangement_9", 
                                "pp_employment_status_3", "Q1a_3", "Q1b_3", 
                                "Q1c_3",  "Q1d_3","Q1e_3", "Q1f_3","Q2ConcernLevel_4",
                                "Q3_3", "Q4_5",  "Q5Optimism_3", 
                                "pp_political_id_5", "delete", "pp_contact_2",  
                                  "weights_pid"))%>%
  select(!contains("delete"))%>%
  
  #replace "6 or more" in pp_hhsize_6 to the value of 6 so that the column can be
  # of double data type.
     mutate(pp_hhsize_6 = ifelse(pp_hhsize_6 == "6 or more", "6", pp_hhsize_6)) %>%
    transform( pp_hhsize_6 = as.numeric(pp_hhsize_6))%>%
  
  #fix the issue with apostrophes in pp_educ_5 values on read in
    mutate(pp_educ_5 = ifelse(str_starts(pp_educ_5,"Bachelor"), 
                           "Bachelor", pp_educ_5))%>%
    mutate(pp_educ_5 = ifelse(str_starts(pp_educ_5, "Master"), "Master", pp_educ_5))

  # reduce lengthy responses to necessary info in nominal variables

  abc_poll$pp_Language_2 = substr(abc_poll$pp_Language_2,1,2)
 
  abc_poll$pp_gender_2 = substr(abc_poll$pp_gender_2,1,1)
  abc_poll$pp_contact_2 = substr(abc_poll$pp_contact_2,1,1)
  
  #reduce lengthy responses of nominal variables using Case When
  
 #pp_political_id_5 
 abc_poll<-mutate(abc_poll, pp_political_id_5  = case_when(
    pp_political_id_5 == "A Democrat" ~ "Dem",
    pp_political_id_5 == "A Republican" ~ "Rep",
    pp_political_id_5 == "An Independent" ~ "Ind",
    pp_political_id_5 == "Something else" ~ "Other",
    pp_political_id_5 == "Skipped" ~ "Skipped"
))%>%
 
 #pp_housing_3
mutate(pp_housing_3 = case_when(
    pp_housing_3 == "Occupied without payment of cash rent" ~ "NonPayment_Occupied",
    pp_housing_3 == "Rented for cash"~ "Payment_Rent",
    pp_housing_3 == "Owned or being bought by you or someone in your household" ~ "Payment_Own"))%>%

 
 
# pp_working_arrangement_9
 mutate(pp_working_arrangement_9 = case_when(
          pp_working_arrangement_9 == "Other" ~ "Other",
          pp_working_arrangement_9 =="Retired" ~ "Retired",
          pp_working_arrangement_9 == "Homemaker" ~ "Homemaker",
          pp_working_arrangement_9 == "Student" ~ "Student",
          pp_working_arrangement_9 == "Currently laid off" ~ "Laid Off",
          pp_working_arrangement_9 == "On furlough"~ "Furlough",
          pp_working_arrangement_9 == "Employed part-time (by someone else)" ~ "Employed_PT",
          pp_working_arrangement_9 =="Self-employed" ~ "Emp_Self",
          pp_working_arrangement_9 == "Employed full-time (by someone else)"~ "Employed_FT"))%>%
   
    #pp_ethnicity_5
  mutate( pp_ethnicity_5 = case_when(
    pp_ethnicity_5 == "2+ Races, Non-Hispanic" ~ "2+ \n NH",
    pp_ethnicity_5 == "Black, Non-Hispanic" ~ "Bl \n NH",
    pp_ethnicity_5 == "Hispanic" ~ "Hisp",
    pp_ethnicity_5 == "Other, Non-Hispanic" ~ "Ot \n NH",
    pp_ethnicity_5 == "White, Non-Hispanic" ~ "Wh \n NH"

))
 


 
  
  abc_poll
  
# View() opens a viewer window, so only call it in an interactive session
if (interactive()) View(abc_poll)
```

### Post Read in Data Summary
```{r}
print(summarytools::dfSummary(abc_poll,
                         varnumbers = FALSE,
                         plain.ascii  = FALSE,
                         style        = "grid",
                         graph.magnif = 0.70,
                        valid.col    = FALSE),
       method = 'render',
       table.classes = 'table-condensed')
```
:::

On read-in, I chose to do the following.

**Filter** (drop columns):

- `complete_status`: everyone was qualified
- `ppeducat`: this categorization of `ppeduc5` can be reproduced in the data frame
using a `case_when()` and factoring
- `ABCAGE`: this qualitative age range variable can be replicated by using the
data in the `ppage` variable and a `case_when()`; one might want to examine 
different ranges of ages.
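
As a rough sketch of how an age-range variable could be rebuilt from the numeric age column, the chunk below uses `case_when()` on a small made-up tibble. The cut points and the names `toy_ages` and `age_range` are assumptions for illustration only; they are not the ranges used in `ABCAGE`.

```{r}
library(dplyr)

# Hypothetical ages standing in for the pp_age column
toy_ages <- tibble(pp_age = c(19, 34, 52, 71))

toy_ages <- toy_ages %>%
  mutate(age_range = case_when(
    pp_age < 30 ~ "18-29",
    pp_age < 50 ~ "30-49",
    pp_age < 65 ~ "50-64",
    TRUE        ~ "65+"
  )) %>%
  # Factor with explicit levels so the ranges sort by age, not alphabetically
  mutate(age_range = factor(age_range,
                            levels = c("18-29", "30-49", "50-64", "65+"),
                            ordered = TRUE))

toy_ages
```

Changing the cut points in the `case_when()` is all it would take to examine different age ranges.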

 
 

__Rename__

- I renamed all of the variables corresponding to 
_demographic characteristics of the poll participant_ 
to begin with `pp_`.

- I renamed all of the variables corresponding to _survey question responses_
from the participants to begin with `Q`

- If a variable had a fixed number of possible responses (which I could see from
the summary), e.g., `pp_marital` had 5 possible responses, 
I included the number of "categories" or possible responses
in the variable name preceded by an underscore, `pp_marital_5`

__Mutate__
 
 - I replaced the `pp_hhsize_6` value of "6 or more" with 6, so that it could
 be of double data type
 
 - I mutated the `pp_educ_5` column to remove the
 apostrophes from "Bachelor's" and "Master's" that were producing the "\\x92"'s 
 in the values on read in.
 
 - If a _nominal_ variable had lengthy values, I reduced them to the key info 
 using `mutate`, `substr`, and `case_when`







### Tidy/Mutate Data (as needed)

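Because our data frame is poll data, our frame will stay relatively wide. Each
polled person `pp_id` represents a unique case and the values for the case are

- the demographic characteristics of the polled person and
- the individual's responses to a given survey question

To tidy our data, I pivoted the six `Q1` sub-question columns into a longer format,
so each `pp_id` now spans six rows (one per sub-question), and I factored the 
following ordinal variables:

- `pp_inc_7`: The income level of the polled person
- `pp_educ_5`: The educational attainment level of the polled person
- `pp_employment_status_3`: The employment status of the polled person 
(not working, working part time, working full time)
- `Q2ConcernLevel_4`: The level of concern reported in question 2
- `Q4_5`: The rating given in question 4

I also recoded `pp_political_id_5` as a factor so that its values appear in a 
fixed order in the plots.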

```{r}
abc_poll<-abc_poll %>%
    pivot_longer(c(starts_with("Q1")), names_to = "Question 1 part", values_to = "Q1 Response")

 #pp_inc_7
abc_poll <-mutate(abc_poll, pp_inc_7 = recode_factor(pp_inc_7, 
                                   "Less than $10,000" = "<10,000", 
                                   "$10,000 to $24,999" =  "10,000- \n 24,999",  
                                   "$25,000 to $49,999" = "25,000- \n 49,999", 
                                   "$50,000 to $74,999"= "50,000- \n 74,999", 
                                   "$75,000 to $99,999"= "75,000- \n 99,999", 
                                   "$100,000 to $149,999" = "100,000- \n 149,999",
                                   "$150,000 or more" = "150,000 +",
                                  .ordered = TRUE))
 #pp_educ_5
 
 abc_poll <-mutate(abc_poll, pp_educ_5 = recode_factor(
   pp_educ_5,
   "No high school diploma or GED" = "No HS",
   "High school graduate (high school diploma or the equivalent GED)" = "HS/GED",
   "Some college or Associate degree" = "Some College",
   "Bachelor"= "Bachelor",
   "Master"= "Master+",
   .ordered = TRUE))
 
 ##pp_political_id_5
 abc_poll <- mutate(abc_poll, pp_political_id_5 = recode_factor(
   pp_political_id_5,
        "Dem" = "Dem",
        "Rep" = "Rep",
        "Ind" = "Ind",
        "Other" = "Other",
        "Skipped"="Skipped",
        .ordered = TRUE))
 

#pp_employment_status_3
 abc_poll <-mutate(abc_poll, pp_employment_status_3 =recode_factor(
   pp_employment_status_3,
   "Not working" = "Not working",
   "Working part-time"= "Working part-time",
   "Working full-time" = "Working full-time",
   .ordered = TRUE))
 
 abc_poll <-mutate(abc_poll, Q2ConcernLevel_4 = recode_factor(
   Q2ConcernLevel_4 ,
   "Not concerned at all" = "Not at all",
   "Not so concerned" = "Not so concerned",
   "Somewhat concerned" = "Somewhat",
   "Very concerned" = "Very concerned",
   .ordered = TRUE))



#Q4_5
abc_poll <-mutate(abc_poll, Q4_5 = recode_factor(
  Q4_5 ,
  "Poor" = "Poor",
  "Not so good" = "Not so good",
  "Good" = "Good",
  "Excellent" = "Excellent",
  "Skipped" = "Skipped",
  .ordered = TRUE))


 abc_poll
 
 ##Is the data frame arranged "alphabetically" or "ordinally?"
 abc_poll%>%
  arrange(desc(pp_educ_5))



```
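
The question in the last comment above (alphabetical or ordinal?) can be checked on a small example. This is a minimal sketch on made-up values, assuming the same education labels as above; `toy_educ`, `alphabetical`, and `ordinal` are hypothetical names.

```{r}
library(dplyr)

# Toy education values, deliberately out of order
toy_educ <- tibble(educ = c("Some College", "No HS", "Master+", "HS/GED", "Bachelor"))

# As plain character data, arrange() sorts alphabetically
alphabetical <- toy_educ %>% arrange(educ) %>% pull(educ)

# As an ordered factor, arrange() follows the level order instead
ordinal <- toy_educ %>%
  mutate(educ = factor(educ,
                       levels = c("No HS", "HS/GED", "Some College", "Bachelor", "Master+"),
                       ordered = TRUE)) %>%
  arrange(educ) %>%
  pull(educ) %>%
  as.character()

alphabetical
ordinal
```

The same ordering carries over to plot axes and legends, which is why the ordinal variables are factored before visualizing.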

### Mutated Summary

```{r}
  print(summarytools::dfSummary(abc_poll,
                         varnumbers = FALSE,
                         plain.ascii  = FALSE,
                         style        = "grid",
                         graph.magnif = 0.70,
                        valid.col    = FALSE),
       method = 'render',
       table.classes = 'table-condensed')

```




### Visualizing Part-Whole Relationships
There were many variables in `abc_poll` for which I could imagine visualizing 
proportional relationships, both overall and by group.

I explored multiple versions of bar charts to visualize the part-whole relationship
between a respondent's political identification and stated level of concern in poll 
question 2.

```{r}
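# Gather/Group the values of the Categorical Variables (pp_political_id_5 and 
# Q2ConcernLevel_4).
# After the pivot_longer() on Q1, each respondent spans six rows, so keep one
# row per pp_id before counting; otherwise every count is six times too large.

abc_poll_pp_id_q2 <- abc_poll %>% 
  distinct(pp_id, pp_political_id_5, Q2ConcernLevel_4) %>%
  group_by(pp_political_id_5, Q2ConcernLevel_4) %>%
  #mutate(pp_political_id_5 = na_if(pp_political_id_5, "Skipped"))%>%
  summarise(count = n(), .groups = "drop")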

```

- The __grouped bar chart__ shows each of the concern levels broken down by the 
respondent's political id. You can see that many respondents are `somewhat concerned`.

```{r}
##Grouped Bar Chart political id and concern level

abc_poll_pp_id_q2%>%  
  ggplot(aes(fill=pp_political_id_5, y=count, x=Q2ConcernLevel_4)) + 
    geom_bar(position="dodge", stat="identity") +
  labs(subtitle ="Grouped Bar Chart" ,
       y = "Number of Respondents",
       x= "Concern Level",
       title = "Q2 Concern Level by Political Id",
      caption = "ABC News Political Poll")+
  coord_flip()



```
- The __stacked bar chart__ gives an easier to digest view of the comparative level 
of concern and the part of each concern level that comes from respondents from each 
political party.

```{r}
## Stacked bar 

abc_poll_pp_id_q2%>%  
  ggplot(aes(fill=pp_political_id_5, y = count, x=Q2ConcernLevel_4)) + 
    geom_bar(position="stack", stat="identity")+
  labs(subtitle = "Stacked Bar Chart",
       y = "Number of Respondents",
       x= "Concern Level",
       title = "Q2 Concern Level by Political Id",
      caption = "ABC News Political Poll") +
  coord_flip()


```

- The __percent stacked bar chart__ allows us to very quickly see the proportion of 
respondents from each political party that make up a given concern level. This allows
us to see how strongly the level of concern seems to relate to political party.

```{r}
# Percent Stacked bar

abc_poll_pp_id_q2%>%  
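  ggplot(aes(fill=pp_political_id_5, y=count, x=Q2ConcernLevel_4)) + 
    geom_bar(position="fill", stat="identity")+
  # show the 0-1 proportions as percentages to match the axis label
  scale_y_continuous(labels = scales::percent) +
  labs(subtitle ="Percent Stacked Bar Chart" ,
       y = "Percentage of Respondents",
       x= "Concern Level",
       title = "Q2 Proportionate Concern Level by Political Id",
      caption = "ABC News Political Poll",
      # the legend comes from the fill aesthetic, so it is titled with fill =
      fill = "Political ID") 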
  


```
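
The shares that `position = "fill"` draws can also be computed by hand, which makes it easier to quote exact values. A minimal sketch on made-up counts follows; `toy_counts`, `toy_props`, and the numbers are hypothetical, not poll results.

```{r}
library(dplyr)

# Hypothetical counts, not taken from the poll
toy_counts <- tibble(
  concern = c("Somewhat", "Somewhat", "Very concerned", "Very concerned"),
  party   = c("Dem", "Rep", "Dem", "Rep"),
  count   = c(30, 10, 15, 45)
)

toy_props <- toy_counts %>%
  group_by(concern) %>%
  mutate(prop = count / sum(count)) %>%   # share of each party within a concern level
  ungroup()

toy_props
```

Grouping by the other variable instead would give the share of each concern level within a party, which answers a different question.
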
- The __donut chart__ is a visual of the distribution of political identification of the
poll respondents. I read that donut charts and pie charts are not recommended. With only 
a handful of groups, I thought it could be ok, although it doesn't allow one to 
see subtle differences between the sizes of groups the way a "lollipop" or
 a "bar chart" would.

```{r}
# Facet Wrap with Doughnut (Facet wrap didn't work...would have to fix this)
 
# Compute percentages
abc_poll_pp_id_q2$fraction = abc_poll_pp_id_q2$count / sum(abc_poll_pp_id_q2$count)

# Compute the cumulative percentages (top of each rectangle)
abc_poll_pp_id_q2$ymax = cumsum(abc_poll_pp_id_q2$fraction)

# Compute the bottom of each rectangle
abc_poll_pp_id_q2$ymin = c(0, head(abc_poll_pp_id_q2$ymax, n=-1))
 
# Compute label position
abc_poll_pp_id_q2$labelPosition <- (abc_poll_pp_id_q2$ymax + abc_poll_pp_id_q2$ymin) / 2

# Compute a good label
abc_poll_pp_id_q2$label <- paste0(abc_poll_pp_id_q2$pp_political_id_5, "\n value: ", abc_poll_pp_id_q2$count)
# Make the plot
ggplot(abc_poll_pp_id_q2, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=pp_political_id_5)) +
     geom_rect() +
 # geom_label( x=3.5, aes(y=labelPosition, label=label), size=6) +
     coord_polar(theta="y") + # Try to remove that to understand how the chart is built initially
     xlim(c(2, 4)) +
  theme_void() +
  theme(legend.position = "right") +
  
  labs(subtitle = "Political ID of Respondents",
       title = "Donut Chart",
      caption = "ABC News Political Poll",
      fill = "Political ID") 


  #facet_wrap(vars(Q2ConcernLevel_4))



  


```
## Questions

- How do I change the label of the legend from the name of the "fill" variable?

- In what situations, if any, is a pie/donut chart appropriate?
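
On the legend question, one approach that should work: the legend is titled after the aesthetic that produces it, so for a `fill` legend the title can be set with `labs(fill = ...)`. The sketch below uses a made-up data frame; `toy_legend` and `legend_plot` are hypothetical names.

```{r}
library(ggplot2)

# Hypothetical counts, only to have something to draw
toy_legend <- data.frame(
  party = c("Dem", "Rep", "Ind"),
  count = c(5, 4, 3)
)

legend_plot <- ggplot(toy_legend, aes(x = party, y = count, fill = party)) +
  geom_col() +
  labs(fill = "Political ID")   # legend title follows the aesthetic name, here `fill`

legend_plot
```

Using `color = ...` in `labs()` has no visible effect on a plot like this one, because no `color` aesthetic is mapped.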

### Visualizing Flow Relationship

I chose to visualize a "flow relationship" between a respondent's level of optimism reported in question 5 and several other demographic variables. I found the "skipped" responses to Question 5 to be difficult to read in a flow chart in a way
 that they weren't with stacked bar charts or pie charts, so I removed them from 
these visualizations.
```{r}
flow_region_educ <- abc_poll %>% 
  select(pp_region_4, Q5Optimism_3)%>%
  mutate(Q5Optimism_3 = na_if(Q5Optimism_3, "Skipped"))

#flow_region_educ

```

I represented the "flow relationship" using a __chord diagram__. I chose variables 
that had a limited number of values as my __origin__ and __destination__ variables.


```{r}
# Chord Diagrams 
# Load the circlize library
library(circlize)
```

- Political ID to Q5 Optimism Level showed a clear "flow" of Republican and Other 
party to pessimistic responses and a strong "flow" of Democratic party ID to optimistic
responses.
```{r}
#Q5 Optimism Status vs Political ID
# Gather the "edges" for our flow: origin: Political ID, destination: Q5 Optimism level
flow_pol_id_optimism <- abc_poll %>% 
  select(pp_political_id_5, Q5Optimism_3)%>%
  mutate(Q5Optimism_3 = na_if(Q5Optimism_3, "Skipped"))%>%
  mutate(pp_political_id_5 = na_if(pp_political_id_5, "Skipped"))%>%
  with(table(pp_political_id_5, Q5Optimism_3))%>%
 
# Make the circular plot
 chordDiagram(transparency = 0.5)
title(main = "Political ID to Q5 Optimism Level", sub = "ABC News Political Poll")



```

- Q5 Optimism Level to Geographic Region showed a simple "flow"; however, it was not 
easy to discern a difference in the proportion of optimistic and pessimistic responses by region.
```{r}
#Q5 Optimism Status vs Geographic Region
# Gather the "edges" for our flow: origin: Q5 Optimism, destination: Geographic Region
flow_region_educ <- abc_poll %>% 
  select(pp_region_4, Q5Optimism_3)%>%
  mutate(Q5Optimism_3 = na_if(Q5Optimism_3, "Skipped"))%>%
  
  with(table(Q5Optimism_3, pp_region_4))%>%

# Make the circular plot
 chordDiagram(transparency = 0.5)
title(main = "Q5 Optimism Level to Geographic Region", sub = "ABC News Political Poll")
```


### Failed Flow Visuals

I think there are some interesting things to see in these visuals, but it takes 
too much effort to read them. I would imagine the story would probably be clearer 
with percent stacked bar charts.

- Ethnicity to Geographic Region was overwhelmed visually by the size of the 
`White, Non-Hispanic` demographic. One could see the strong connection between the 
`Hispanic` and `Other, Non-Hispanic` demographics and the `West` and `South`.
```{r}
#Ethnicity
# Gather the "edges" for our flow: origin: Ethnicity, destination: Geographic Region
flow_region_educ <- abc_poll %>% 
  select(pp_region_4, pp_ethnicity_5)%>%
  with(table(pp_ethnicity_5, pp_region_4))%>%
# Make the circular plot
 chordDiagram(transparency = 0.5)
title(main = "Ethnicity to Geographic Region", sub = "ABC News Political Poll")
```

- Income Level to Q5 Optimism Status was overwhelmed visually by the number of 
different income levels and the formatting of their labels. Also, for income levels 
with relatively few respondents, it is difficult to see the distinction between the 
`optimistic` and `pessimistic` flows. Notably, there does not seem to be a strong 
relationship between income level and reported optimism in question 5.

```{r}
#Q5 Optimism Status vs Income Level
# Gather the "edges" for our flow: origin: Income Level, destination: Q5 Optimism level
flow_optimism_inc <- abc_poll %>% 
  select(pp_inc_7, Q5Optimism_3)%>%
  mutate(Q5Optimism_3 = na_if(Q5Optimism_3, "Skipped"))%>%
  with(table(pp_inc_7, Q5Optimism_3))%>%
# Make the circular plot
 chordDiagram(transparency = 0.5)
title(main = "Income Level to Q5 Optimism Level", sub = "ABC News Political Poll")


#Marital Status
# Gather the "edges" for our flow: origin: Marital Status, destination: Geographic Region
# flow_region_educ <- abc_poll %>% 
#   select(pp_region_4, pp_marital_5)%>%
#   
#   with(table(pp_marital_5, pp_region_4))%>%
#  
# # Make the circular plot
#  chordDiagram(transparency = 0.5)
# title(main = "Chord Diagram", sub = "Marital Status to Geographic Region")
```

- Education Level to Geographic Region was overwhelmingly busy visually. Although it is pretty, I think this information would be easier to parse in a percent stacked bar chart. 

```{r}
#Education level to Geographic Region
# Gather the "edges" for our flow: origin: Education Level, destination: Geographic Region


flow_region_educ <- abc_poll %>% 
  select(pp_region_4, pp_educ_5)%>%
  
  with(table(pp_educ_5, pp_region_4))%>%
 

# Make the circular plot
 chordDiagram(transparency = 0.5)
title(main = "Education Level to Geographic Region", sub = "ABC News Political Poll")

```
## Questions

- Why do the colors of my chord diagram change each time I run the chunk?

- How do I fix the labels around the circle (other than using "newline")?

- Other than traffic/shipping/migration patterns, what are examples of ideas that 
are well represented by chord diagrams?
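
On the color question, a likely explanation is that `chordDiagram()` picks sector colors at random when none are supplied. One way to keep them stable is to pass a named vector through its `grid.col` argument. The sketch below uses a made-up 2 x 2 matrix; `toy_flow`, `toy_grid_col`, and the color choices are hypothetical.

```{r}
library(circlize)

# Hypothetical 2 x 2 table of counts, not taken from the poll
toy_flow <- matrix(c(30, 10, 15, 45), nrow = 2,
                   dimnames = list(c("Dem", "Rep"), c("Optimistic", "Pessimistic")))

# One fixed color per sector, named after the row and column labels
toy_grid_col <- c(Dem = "steelblue", Rep = "firebrick",
                  Optimistic = "goldenrod", Pessimistic = "grey40")

chordDiagram(toy_flow, grid.col = toy_grid_col, transparency = 0.5)
circos.clear()   # reset the circular layout before the next plot
```

Calling `set.seed()` right before `chordDiagram()` may be a lighter alternative, assuming the default colors are drawn from R's random number generator.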



### Categorical Bivariate Relationships with Heat Map
I noticed these in some of the earlier lesson materials and was just experimenting with them here. There were so many categorical variables in the `abc_poll` data that 
I thought that some bivariate relationships might have a nice visual story with heat maps.

```{r}
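# Heat Map Geographic Region, Political ID
# Keep one row per respondent so that n counts people rather than Q1 rows
abc_poll%>%
  distinct(pp_id, pp_region_4, pp_political_id_5) %>%
  count(pp_region_4, pp_political_id_5) %>%
  ggplot(mapping = aes(x = pp_political_id_5, y = pp_region_4))+
  geom_tile(mapping = aes(fill = n))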

  

```
```{r}
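# Heat Map Income Level, Education Level
# Keep one row per respondent so that n counts people rather than Q1 rows
abc_poll%>%
  distinct(pp_id, pp_inc_7, pp_educ_5) %>%
  count(pp_inc_7, pp_educ_5) %>%
  ggplot(mapping = aes(x = pp_educ_5, y = pp_inc_7))+
  geom_tile(mapping = aes(fill = n))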
  

```

### Time Dependent Visualization
Our data frame does not contain any variables measured over time. However, it 
does have each poll respondent's age. 

If each respondent were asked the same questions every year, we could see the evolution 
of their responses over time.


## Questions
- Do analysts ever use "age" as a variable representing time?
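
As a tentative illustration of that idea, age can be placed on the x-axis as a stand-in for time by computing the share of a response within age bands. This is a sketch on made-up respondents; `toy_age`, `age_profile`, the band boundaries, and the values are all hypothetical. A profile like this comes from a single cross-section, so differences between bands mix the effect of aging with differences between generations and should not be read as change over time.

```{r}
library(dplyr)
library(ggplot2)

# Hypothetical respondents, not taken from the poll
toy_age <- tibble(
  pp_age   = c(22, 27, 35, 38, 51, 58, 66, 73),
  optimism = c("Optimistic", "Pessimistic", "Optimistic", "Optimistic",
               "Pessimistic", "Optimistic", "Pessimistic", "Pessimistic")
)

# Share of optimistic responses within each age band
age_profile <- toy_age %>%
  mutate(age_band = cut(pp_age, breaks = c(18, 30, 50, 65, Inf), right = FALSE)) %>%
  group_by(age_band) %>%
  summarise(share_optimistic = mean(optimism == "Optimistic"), .groups = "drop")

ggplot(age_profile, aes(x = age_band, y = share_optimistic, group = 1)) +
  geom_line() +
  geom_point() +
  labs(x = "Age band", y = "Share optimistic")
```
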

:::