Challenge 6

challenge_6

abc_poll

Theresa_Szczepanski

Visualizing Time and Relationships

Author

Theresa Szczepanski

Published

October 21, 2022

Code

library(tidyverse)
library(ggplot2)
library(readxl)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

R Graph Gallery is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code.

abc_poll ⭐⭐⭐

Briefly describe the data
Post Read in Data Summary

From our abc_poll data frame summary, we can see that this data set contains polling results from 527 respondents to an ABC News political poll. The results fall into two broad categories:

Demographic characteristics of the respondents themselves (e.g., language of the poll given to the respondent (Spanish or English), age, educational attainment, gender, ethnicity, household size, income range, marital status, metro category, geographic region, housing status, state, working arrangement, employment status, willingness to have a follow-up interview)
The responses that the individuals gave to 10 questions (there are 5 broad questions, Q1-Q5, but Q1 consists of 6 sub-questions, a-f).

Code

#Filter, rename variables, and mutate values of variables on read-in

abc_poll<-read_csv("_data/abc_poll_2021.csv", skip = 1,  
                   col_names= c("pp_id",  "pp_Language_2",  "delete","pp_age", 
                                "pp_educ_5", "delete", "pp_gender_2", 
                                "pp_ethnicity_5", "pp_hhsize_6", "pp_inc_7", 
                                "pp_marital_5", "pp_metro_cat_2", "pp_region_4",
                                "pp_housing_3", "pp_state", 
                                "pp_working_arrangement_9", 
                                "pp_employment_status_3", "Q1a_3", "Q1b_3", 
                                "Q1c_3",  "Q1d_3","Q1e_3", "Q1f_3","Q2ConcernLevel_4",
                                "Q3_3", "Q4_5",  "Q5Optimism_3", 
                                "pp_political_id_5", "delete", "pp_contact_2",  
                                  "weights_pid"))%>%
  select(!contains("delete"))%>%
  
  #replace "6 or more" in pp_hhsize_6 with "6" so that the column can be
  # converted to double; mutate() (unlike base transform()) keeps the tibble class.
     mutate(pp_hhsize_6 = ifelse(pp_hhsize_6 == "6 or more", "6", pp_hhsize_6)) %>%
    mutate(pp_hhsize_6 = as.numeric(pp_hhsize_6))%>%
  
  #fix the issue with apostrophes in pp_educ_5 values on read in
    mutate(pp_educ_5 = ifelse(str_starts(pp_educ_5,"Bachelor"), 
                           "Bachelor", pp_educ_5))%>%
    mutate(pp_educ_5 = ifelse(str_starts(pp_educ_5, "Master"), "Master", pp_educ_5))

  # reduce lengthy responses to necessary info in nominal variables

  abc_poll$pp_Language_2 = substr(abc_poll$pp_Language_2,1,2)
 
  abc_poll$pp_gender_2 = substr(abc_poll$pp_gender_2,1,1)
  abc_poll$pp_contact_2 = substr(abc_poll$pp_contact_2,1,1)
  
  #reduce lengthy responses of nominal variables using Case When
  
 #pp_political_id_5 
 abc_poll<-mutate(abc_poll, pp_political_id_5  = case_when(
    pp_political_id_5 == "A Democrat" ~ "Dem",
    pp_political_id_5 == "A Republican" ~ "Rep",
    pp_political_id_5 == "An Independent" ~ "Ind",
    pp_political_id_5 == "Something else" ~ "Other",
    pp_political_id_5 == "Skipped" ~ "Skipped"
))%>%
 
 #pp_housing_3
mutate(pp_housing_3 = case_when(
    pp_housing_3 == "Occupied without payment of cash rent" ~ "NonPayment_Occupied",
    pp_housing_3 == "Rented for cash"~ "Payment_Rent",
    pp_housing_3 == "Owned or being bought by you or someone in your household" ~ "Payment_Own"))%>%

 
 
# pp_working_arrangement_9
 mutate(pp_working_arrangement_9 = case_when(
          pp_working_arrangement_9 == "Other" ~ "Other",
          pp_working_arrangement_9 =="Retired" ~ "Retired",
          pp_working_arrangement_9 == "Homemaker" ~ "Homemaker",
          pp_working_arrangement_9 == "Student" ~ "Student",
          pp_working_arrangement_9 == "Currently laid off" ~ "Laid Off",
          pp_working_arrangement_9 == "On furlough"~ "Furlough",
          pp_working_arrangement_9 == "Employed part-time (by someone else)" ~ "Employed_PT",
          pp_working_arrangement_9 =="Self-employed" ~ "Emp_Self",
          pp_working_arrangement_9 == "Employed full-time (by someone else)"~ "Employed_FT"))%>%
   
    #pp_ethnicity_5
  mutate( pp_ethnicity_5 = case_when(
    pp_ethnicity_5 == "2+ Races, Non-Hispanic" ~ "2+ \n NH",
    pp_ethnicity_5 == "Black, Non-Hispanic" ~ "Bl \n NH",
    pp_ethnicity_5 == "Hispanic" ~ "Hisp",
    pp_ethnicity_5 == "Other, Non-Hispanic" ~ "Ot \n NH",
    pp_ethnicity_5 == "White, Non-Hispanic" ~ "Wh \n NH"

))
 


 
  
  abc_poll

Code

View(abc_poll)

Code

print(summarytools::dfSummary(abc_poll,
                         varnumbers = FALSE,
                         plain.ascii  = FALSE,
                         style        = "grid",
                         graph.magnif = 0.70,
                        valid.col    = FALSE),
       method = 'render',
       table.classes = 'table-condensed')

Data Frame Summary

abc_poll

Dimensions: 527 x 28
Duplicates: 0

Variable | Stats / Values: Freqs (% of Valid) | Missing
--- | --- | ---
pp_id [numeric] | Mean (sd): 7230264 (152.3); min ≤ med ≤ max: 7230001 ≤ 7230264 ≤ 7230527; IQR (CV): 263 (0); 527 distinct values | 0 (0.0%)
pp_Language_2 [character] | 1. En: 514 (97.5%); 2. Sp: 13 (2.5%) | 0 (0.0%)
pp_age [numeric] | Mean (sd): 53.4 (17.1); min ≤ med ≤ max: 18 ≤ 55 ≤ 91; IQR (CV): 27 (0.3); 72 distinct values | 0 (0.0%)
pp_educ_5 [character] | 1. Bachelor: 108 (20.5%); 2. High school graduate (hig: 133 (25.2%); 3. Master: 99 (18.8%); 4. No high school diploma or: 29 (5.5%); 5. Some college or Associate: 158 (30.0%) | 0 (0.0%)
pp_gender_2 [character] | 1. F: 254 (48.2%); 2. M: 273 (51.8%) | 0 (0.0%)
pp_ethnicity_5 [character] | 1. 2+: 21 (4.0%); 2. Bl: 27 (5.1%); 3. Hisp: 51 (9.7%); 4. Ot: 24 (4.6%); 5. Wh: 404 (76.7%) | 0 (0.0%)
pp_hhsize_6 [numeric] | Mean (sd): 2.6 (1.3); min ≤ med ≤ max: 1 ≤ 2 ≤ 6; IQR (CV): 1 (0.5); 1: 80 (15.2%); 2: 219 (41.6%); 3: 102 (19.4%); 4: 76 (14.4%); 5: 35 (6.6%); 6: 15 (2.8%) | 0 (0.0%)
pp_inc_7 [character] | 1. $10,000 to $24,999: 32 (6.1%); 2. $100,000 to $149,999: 105 (19.9%); 3. $150,000 or more: 137 (26.0%); 4. $25,000 to $49,999: 82 (15.6%); 5. $50,000 to $74,999: 85 (16.1%); 6. $75,000 to $99,999: 69 (13.1%); 7. Less than $10,000: 17 (3.2%) | 0 (0.0%)
pp_marital_5 [character] | 1. Divorced: 43 (8.2%); 2. Never married: 111 (21.1%); 3. Now Married: 337 (63.9%); 4. Separated: 8 (1.5%); 5. Widowed: 28 (5.3%) | 0 (0.0%)
pp_metro_cat_2 [character] | 1. Metro area: 448 (85.0%); 2. Non-metro area: 79 (15.0%) | 0 (0.0%)
pp_region_4 [character] | 1. MidWest: 118 (22.4%); 2. NorthEast: 93 (17.6%); 3. South: 190 (36.1%); 4. West: 126 (23.9%) | 0 (0.0%)
pp_housing_3 [character] | 1. NonPayment_Occupied: 10 (1.9%); 2. Payment_Own: 406 (77.0%); 3. Payment_Rent: 111 (21.1%) | 0 (0.0%)
pp_state [character] | 1. California: 51 (9.7%); 2. Texas: 42 (8.0%); 3. Florida: 34 (6.5%); 4. Pennsylvania: 28 (5.3%); 5. Illinois: 23 (4.4%); 6. New Jersey: 21 (4.0%); 7. Ohio: 21 (4.0%); 8. Michigan: 18 (3.4%); 9. New York: 18 (3.4%); 10. Washington: 18 (3.4%); [ 39 others ]: 253 (48.0%) | 0 (0.0%)
pp_working_arrangement_9 [character] | 1. Emp_Self: 32 (6.2%); 2. Employed_FT: 220 (42.4%); 3. Employed_PT: 31 (6.0%); 4. Furlough: 1 (0.2%); 5. Homemaker: 37 (7.1%); 6. Laid Off: 13 (2.5%); 7. Other: 20 (3.9%); 8. Retired: 165 (31.8%) | 8 (1.5%)
pp_employment_status_3 [character] | 1. Not working: 221 (41.9%); 2. Working full-time: 245 (46.5%); 3. Working part-time: 61 (11.6%) | 0 (0.0%)
Q1a_3 [character] | 1. Approve: 329 (62.4%); 2. Disapprove: 193 (36.6%); 3. Skipped: 5 (0.9%) | 0 (0.0%)
Q1b_3 [character] | 1. Approve: 192 (36.4%); 2. Disapprove: 322 (61.1%); 3. Skipped: 13 (2.5%) | 0 (0.0%)
Q1c_3 [character] | 1. Approve: 272 (51.6%); 2. Disapprove: 248 (47.1%); 3. Skipped: 7 (1.3%) | 0 (0.0%)
Q1d_3 [character] | 1. Approve: 192 (36.4%); 2. Disapprove: 321 (60.9%); 3. Skipped: 14 (2.7%) | 0 (0.0%)
Q1e_3 [character] | 1. Approve: 212 (40.2%); 2. Disapprove: 301 (57.1%); 3. Skipped: 14 (2.7%) | 0 (0.0%)
Q1f_3 [character] | 1. Approve: 281 (53.3%); 2. Disapprove: 230 (43.6%); 3. Skipped: 16 (3.0%) | 0 (0.0%)
Q2ConcernLevel_4 [character] | 1. Not concerned at all: 65 (12.3%); 2. Not so concerned: 147 (27.9%); 3. Somewhat concerned: 221 (41.9%); 4. Very concerned: 94 (17.8%) | 0 (0.0%)
Q3_3 [character] | 1. No: 107 (20.3%); 2. Skipped: 5 (0.9%); 3. Yes: 415 (78.7%) | 0 (0.0%)
Q4_5 [character] | 1. Excellent: 60 (11.4%); 2. Good: 215 (40.8%); 3. Not so good: 97 (18.4%); 4. Poor: 149 (28.3%); 5. Skipped: 6 (1.1%) | 0 (0.0%)
Q5Optimism_3 [character] | 1. Optimistic: 229 (43.5%); 2. Pessimistic: 295 (56.0%); 3. Skipped: 3 (0.6%) | 0 (0.0%)
pp_political_id_5 [character] | 1. Dem: 176 (33.4%); 2. Ind: 168 (31.9%); 3. Other: 28 (5.3%); 4. Rep: 152 (28.8%); 5. Skipped: 3 (0.6%) | 0 (0.0%)
pp_contact_2 [character] | 1. N: 355 (67.4%); 2. Y: 172 (32.6%) | 0 (0.0%)
weights_pid [numeric] | Mean (sd): 1 (0.6); min ≤ med ≤ max: 0.3 ≤ 0.8 ≤ 6.3; IQR (CV): 0.5 (0.6); 453 distinct values | 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-21

On read-in, I chose to do the following.

Filter (drop these columns):

complete_status: every respondent was qualified
ppeducat: this categorization of ppeduc5 can be rebuilt in the data frame using case_when() and factoring
ABCAGE: this qualitative age-range variable can be rebuilt from the data in the ppage variable with case_when(); one might also want to examine different age ranges.

Rename

I renamed all of the variables corresponding to demographic characteristics of the poll participant to begin with pp_.
I renamed all of the variables corresponding to survey question responses from the participants to begin with Q.
If a variable had a fixed number of possible responses (which I could see from the summary), e.g., pp_marital had 5 possible responses, I appended the number of “categories” or possible responses to the variable name, preceded by an underscore: pp_marital_5.

Mutate

I replaced the pp_hhsize_6 value of “6 or more” with 6, so that the column could be of double data type.
I mutated the pp_educ_5 column to remove the apostrophes from “Bachelor’s” and “Master’s” that were producing “\x92” in the values on read-in.
If a nominal variable had lengthy values, I reduced them to the key info using mutate, substr, and case_when.
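The two value fixes described in the Mutate list can be sketched on toy data. The column names and sample values below are hypothetical stand-ins, not rows from the poll file, and the sketch is only one way to write the clean-up.

```r
library(dplyr)
library(stringr)

# Hypothetical raw values (assumption: not taken from the real file)
toy <- tibble(
  hhsize = c("1", "4", "6 or more"),
  educ   = c("Bachelor's degree", "Master's degree or above", "Some college")
)

toy_clean <- toy %>%
  mutate(
    # top-code "6 or more" as "6", then convert the whole column to double
    hhsize = as.numeric(if_else(hhsize == "6 or more", "6", hhsize)),
    # match on the prefix so whatever character follows "Bachelor" is ignored
    educ = case_when(
      str_starts(educ, "Bachelor") ~ "Bachelor",
      str_starts(educ, "Master")   ~ "Master",
      TRUE                         ~ educ
    )
  )

toy_clean
```

One consequence of the top-coding is that a 6 now stands for “6 or more”, so a mean household size computed from the column may slightly understate the true mean.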

Because our data frame is poll data, it will stay relatively wide. Each polled person (pp_id) represents a unique case, and the values for the case are

the demographic characteristics of the polled person and
the individual’s responses to a given survey question.

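To tidy our data, I pivoted the six Q1 sub-question columns into a “Question 1 part” and “Q1 Response” pair, which leaves six rows per respondent (3162 rows for 527 respondents). I then factored the following ordinal variables:

pp_inc_7: The income level of the polled person
pp_educ_5: The educational attainment level of the polled person
pp_employment_status_3: The employment status of the polled person (not working, working part time, working full time)
Q2ConcernLevel_4: The level of concern reported in question 2
Q4_5: The rating given in question 4 (poor, not so good, good, excellent, skipped)

I also factored the nominal variable pp_political_id_5, which fixes the order in which its categories appear.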

Code

abc_poll<-abc_poll %>%
    pivot_longer(c(starts_with("Q1")), names_to = "Question 1 part", values_to = "Q1 Response")

abc_poll <-mutate(abc_poll, pp_inc_7 = recode_factor(pp_inc_7, 
                                   "Less than $10,000" = "<10,000", 
                                   "$10,000 to $24,999" =  "10,000-\n 24,999",  
                                   "$25,000 to $49,999" = "25,000- \n 49,999", 
                                   "$50,000 to $74,999"= "50,000- \n 74,999", 
                                   "$75,000 to $99,999"= "75,000- \n 99,999", 
                                   "$100,000 to $149,999" = "100,000- \n 149,999",
                                   "$150,000 or more" = "$150,000 +",
                                  .ordered = TRUE))
 #pp_educ_5
 
 abc_poll <-mutate(abc_poll, pp_educ_5 = recode_factor(
   pp_educ_5,
   "No high school diploma or GED" = "No HS",
   "High school graduate (high school diploma or the equivalent GED)" = "HS/GED",
   "Some college or Associate degree" = "Some College",
   "Bachelor"= "Bachelor",
   "Master"= "Master+",
   .ordered = TRUE))
 
 ##pp_political_id_5
 abc_poll <- mutate(abc_poll, pp_political_id_5 = recode_factor(
   pp_political_id_5,
        "Dem" = "Dem",
        "Rep" = "Rep",
        "Ind" = "Ind",
        "Other" = "Other",
        "Skipped"="Skipped",
        .ordered = TRUE))
 

#pp_employment_status_3
 abc_poll <-mutate(abc_poll, pp_employment_status_3 =recode_factor(
   pp_employment_status_3,
   "Not working" = "Not working",
   "Working part-time"= "Working part-time",
   "Working full-time" = "Working full-time",
   .ordered = TRUE))
 
 abc_poll <-mutate(abc_poll, Q2ConcernLevel_4 = recode_factor(
   Q2ConcernLevel_4 ,
   "Not concerned at all" = "Not at all",
   "Not so concerned" = "Not so concerned",
   "Somewhat concerned" = "Somewhat",
   "Very concerned" = "Very concerned",
   .ordered = TRUE))



#Q4_5
abc_poll <-mutate(abc_poll, Q4_5 = recode_factor(
  Q4_5 ,
  "Poor" = "Poor",
  "Not so good" = "Not so good",
  "Good" = "Good",
  "Excellent" = "Excellent",
  "Skipped" = "Skipped",
  .ordered = TRUE))


 abc_poll

Code

 ##Is the data frame arranged "alphabetically" or "ordinally?"
 abc_poll%>%
  arrange(desc(pp_educ_5))
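The question in the chunk above, whether values sort alphabetically or ordinally, can be checked on a small vector. This is a minimal base R sketch that reuses the education labels from the recode step; the vector itself is made up for illustration.

```r
# Education labels as plain character strings
educ_chr <- c("Master+", "No HS", "Bachelor", "HS/GED", "Some College")

# Character vectors sort alphabetically
sort(educ_chr)

# An ordered factor sorts by its declared level order instead
educ_ord <- factor(
  educ_chr,
  levels  = c("No HS", "HS/GED", "Some College", "Bachelor", "Master+"),
  ordered = TRUE
)
sort(educ_ord)

# Ordered factors also support comparisons between levels
educ_ord[1] > educ_ord[3]   # is "Master+" above "Bachelor"?
```

The same level order is what arrange() and ggplot2 axes use once a column has been factored.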

Code

  print(summarytools::dfSummary(abc_poll,
                         varnumbers = FALSE,
                         plain.ascii  = FALSE,
                         style        = "grid",
                         graph.magnif = 0.70,
                        valid.col    = FALSE),
       method = 'render',
       table.classes = 'table-condensed')

Data Frame Summary

abc_poll

Dimensions: 3162 x 24
Duplicates: 0

Variable | Stats / Values: Freqs (% of Valid) | Missing
--- | --- | ---
pp_id [numeric] | Mean (sd): 7230264 (152.2); min ≤ med ≤ max: 7230001 ≤ 7230264 ≤ 7230527; IQR (CV): 264 (0); 527 distinct values | 0 (0.0%)
pp_Language_2 [character] | 1. En: 3084 (97.5%); 2. Sp: 78 (2.5%) | 0 (0.0%)
pp_age [numeric] | Mean (sd): 53.4 (17.1); min ≤ med ≤ max: 18 ≤ 55 ≤ 91; IQR (CV): 27 (0.3); 72 distinct values | 0 (0.0%)
pp_educ_5 [ordered, factor] | 1. No HS: 174 (5.5%); 2. HS/GED: 798 (25.2%); 3. Some College: 948 (30.0%); 4. Bachelor: 648 (20.5%); 5. Master+: 594 (18.8%) | 0 (0.0%)
pp_gender_2 [character] | 1. F: 1524 (48.2%); 2. M: 1638 (51.8%) | 0 (0.0%)
pp_ethnicity_5 [character] | 1. 2+: 126 (4.0%); 2. Bl: 162 (5.1%); 3. Hisp: 306 (9.7%); 4. Ot: 144 (4.6%); 5. Wh: 2424 (76.7%) | 0 (0.0%)
pp_hhsize_6 [numeric] | Mean (sd): 2.6 (1.2); min ≤ med ≤ max: 1 ≤ 2 ≤ 6; IQR (CV): 1 (0.5); 1: 480 (15.2%); 2: 1314 (41.6%); 3: 612 (19.4%); 4: 456 (14.4%); 5: 210 (6.6%); 6: 90 (2.8%) | 0 (0.0%)
pp_inc_7 [ordered, factor] | 1. <10,000: 102 (3.2%); 2. 10,000-24,999: 192 (6.1%); 3. 25,000-49,999: 492 (15.6%); 4. 50,000-74,999: 510 (16.1%); 5. 75,000-99,999: 414 (13.1%); 6. 100,000-149,999: 630 (19.9%); 7. $150,000 +: 822 (26.0%) | 0 (0.0%)
pp_marital_5 [character] | 1. Divorced: 258 (8.2%); 2. Never married: 666 (21.1%); 3. Now Married: 2022 (63.9%); 4. Separated: 48 (1.5%); 5. Widowed: 168 (5.3%) | 0 (0.0%)
pp_metro_cat_2 [character] | 1. Metro area: 2688 (85.0%); 2. Non-metro area: 474 (15.0%) | 0 (0.0%)
pp_region_4 [character] | 1. MidWest: 708 (22.4%); 2. NorthEast: 558 (17.6%); 3. South: 1140 (36.1%); 4. West: 756 (23.9%) | 0 (0.0%)
pp_housing_3 [character] | 1. NonPayment_Occupied: 60 (1.9%); 2. Payment_Own: 2436 (77.0%); 3. Payment_Rent: 666 (21.1%) | 0 (0.0%)
pp_state [character] | 1. California: 306 (9.7%); 2. Texas: 252 (8.0%); 3. Florida: 204 (6.5%); 4. Pennsylvania: 168 (5.3%); 5. Illinois: 138 (4.4%); 6. New Jersey: 126 (4.0%); 7. Ohio: 126 (4.0%); 8. Michigan: 108 (3.4%); 9. New York: 108 (3.4%); 10. Washington: 108 (3.4%); [ 39 others ]: 1518 (48.0%) | 0 (0.0%)
pp_working_arrangement_9 [character] | 1. Emp_Self: 192 (6.2%); 2. Employed_FT: 1320 (42.4%); 3. Employed_PT: 186 (6.0%); 4. Furlough: 6 (0.2%); 5. Homemaker: 222 (7.1%); 6. Laid Off: 78 (2.5%); 7. Other: 120 (3.9%); 8. Retired: 990 (31.8%) | 48 (1.5%)
pp_employment_status_3 [ordered, factor] | 1. Not working: 1326 (41.9%); 2. Working part-time: 366 (11.6%); 3. Working full-time: 1470 (46.5%) | 0 (0.0%)
Q2ConcernLevel_4 [ordered, factor] | 1. Not at all: 390 (12.3%); 2. Not so concerned: 882 (27.9%); 3. Somewhat: 1326 (41.9%); 4. Very concerned: 564 (17.8%) | 0 (0.0%)
Q3_3 [character] | 1. No: 642 (20.3%); 2. Skipped: 30 (0.9%); 3. Yes: 2490 (78.7%) | 0 (0.0%)
Q4_5 [ordered, factor] | 1. Poor: 894 (28.3%); 2. Not so good: 582 (18.4%); 3. Good: 1290 (40.8%); 4. Excellent: 360 (11.4%); 5. Skipped: 36 (1.1%) | 0 (0.0%)
Q5Optimism_3 [character] | 1. Optimistic: 1374 (43.5%); 2. Pessimistic: 1770 (56.0%); 3. Skipped: 18 (0.6%) | 0 (0.0%)
pp_political_id_5 [ordered, factor] | 1. Dem: 1056 (33.4%); 2. Rep: 912 (28.8%); 3. Ind: 1008 (31.9%); 4. Other: 168 (5.3%); 5. Skipped: 18 (0.6%) | 0 (0.0%)
pp_contact_2 [character] | 1. N: 2130 (67.4%); 2. Y: 1032 (32.6%) | 0 (0.0%)
weights_pid [numeric] | Mean (sd): 1 (0.6); min ≤ med ≤ max: 0.3 ≤ 0.8 ≤ 6.3; IQR (CV): 0.5 (0.6); 453 distinct values | 0 (0.0%)
Question 1 part [character] | 1. Q1a_3: 527 (16.7%); 2. Q1b_3: 527 (16.7%); 3. Q1c_3: 527 (16.7%); 4. Q1d_3: 527 (16.7%); 5. Q1e_3: 527 (16.7%); 6. Q1f_3: 527 (16.7%) | 0 (0.0%)
Q1 Response [character] | 1. Approve: 1478 (46.7%); 2. Disapprove: 1615 (51.1%); 3. Skipped: 69 (2.2%) | 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-21

There were many variables in the abc_poll data for which I could imagine visualizing proportional relationships, both overall and by group.

I explored multiple versions of bar charts to visualize the part-whole relationship between a respondent’s political identification and stated level of concern in poll question 2.

Code

# Count respondents for each combination of the categorical variables
# pp_political_id_5 and Q2ConcernLevel_4

abc_poll_pp_id_q2 <- abc_poll %>% 
  # abc_poll has six rows per respondent after pivot_longer(); keep one row
  # per respondent so that count is a number of respondents
  distinct(pp_id, pp_political_id_5, Q2ConcernLevel_4) %>%
  count(pp_political_id_5, Q2ConcernLevel_4, name = "count")
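Because the Q1 columns were pivoted longer earlier, each respondent occupies six rows of abc_poll (3162 rows for 527 respondents), so counting rows is not the same as counting respondents. A minimal sketch on hypothetical toy data (two Q1 parts instead of six, made-up column names) shows the effect and one way to get back to one row per person:

```r
library(dplyr)
library(tidyr)

# Hypothetical three-respondent poll with two Q1 parts
toy <- tibble(
  pp_id = 1:3,
  party = c("Dem", "Rep", "Dem"),
  Q1a   = c("Approve", "Disapprove", "Approve"),
  Q1b   = c("Approve", "Approve", "Skipped")
)

toy_long <- toy %>%
  pivot_longer(starts_with("Q1"), names_to = "part", values_to = "response")

# Counting rows of the long frame counts each respondent once per Q1 part
inflated <- toy_long %>% count(party, name = "count")

# Keeping one row per respondent first gives true respondent counts
per_person <- toy_long %>%
  distinct(pp_id, party) %>%
  count(party, name = "count")

inflated
per_person
```

Proportions are unaffected by the duplication, since every respondent is repeated the same number of times; only raw counts and axis labels such as “Number of Respondents” are.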

The grouped bar chart shows each of the concern levels broken down by the respondent’s political ID. You can see that “Somewhat” is the most common concern level.

Code

##Grouped Bar Chart political id and concern level

abc_poll_pp_id_q2%>%  
  ggplot(aes(fill=pp_political_id_5, y=count, x=Q2ConcernLevel_4)) + 
    geom_bar(position="dodge", stat="identity") +
  labs(subtitle ="Grouped Bar Chart" ,
       y = "Number of Respondents",
       x= "Concern Level",
       title = "Q2 Concern Level by Political Id",
      caption = "ABC News Political Poll")+
  coord_flip()

The stacked bar chart gives an easier-to-digest view of the comparative level of concern and of the part of each concern level that comes from respondents of each political ID.

Code

## Stacked bar 

abc_poll_pp_id_q2%>%  
  ggplot(aes(fill=pp_political_id_5, y = count, x=Q2ConcernLevel_4)) + 
    geom_bar(position="stack", stat="identity")+
  labs(subtitle = "Stacked Bar Chart",
       y = "Number of Respondents",
       x= "Concern Level",
       title = "Q2 Concern Level by Political Id",
      caption = "ABC News Political Poll") +
  coord_flip()

The percent stacked bar chart allows us to very quickly see the proportion of respondents from each political party that make up a given concern level. This allows us to see how strongly the level of concern seems to relate to political party.

Code

# Percent Stacked bar

abc_poll_pp_id_q2%>%  
  ggplot(aes(fill=pp_political_id_5, y=count, x=Q2ConcernLevel_4)) + 
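    geom_bar(position="fill", stat="identity")+
  # position = "fill" rescales each bar to 0-1; show that axis as percentages
  scale_y_continuous(labels = scales::percent) +
  labs(subtitle ="Percent Stacked Bar Chart" ,
       y = "Percentage of Respondents",
       x= "Concern Level",
       title = "Q2 Proportionate Concern Level by Political Id",
      caption = "ABC News Political Poll",
      # the legend belongs to the fill aesthetic, so it is labelled with fill =
      fill = "Political ID")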
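The rescaling that position = "fill" applies can also be computed by hand, which is useful when the proportions are needed as numbers rather than only as bar heights. A minimal sketch with hypothetical counts (the column names and values are made up for illustration):

```r
library(dplyr)

# Hypothetical counts of respondents by concern level and party
toy_counts <- tibble(
  concern = c("Somewhat", "Somewhat", "Very", "Very"),
  party   = c("Dem", "Rep", "Dem", "Rep"),
  count   = c(30, 10, 15, 5)
)

# Share of each party within a concern level: divide by the level's total
toy_props <- toy_counts %>%
  group_by(concern) %>%
  mutate(prop = count / sum(count)) %>%
  ungroup()

toy_props
```

Within each concern level the prop values sum to 1, which is why every bar in a percent stacked chart has the same height.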

The donut chart shows the distribution of political identification among the poll respondents. I read that donut charts and pie charts are generally not recommended. With only a few groups, I thought one could be acceptable, although it doesn’t allow one to see subtle differences between group sizes the way a “lollipop” chart or a bar chart would.

Code

# Facet Wrap with Doughnut (Facet wrap didn't work...would have to fix this)
 
# Compute percentages
abc_poll_pp_id_q2$fraction = abc_poll_pp_id_q2$count / sum(abc_poll_pp_id_q2$count)

# Compute the cumulative percentages (top of each rectangle)
abc_poll_pp_id_q2$ymax = cumsum(abc_poll_pp_id_q2$fraction)

# Compute the bottom of each rectangle
abc_poll_pp_id_q2$ymin = c(0, head(abc_poll_pp_id_q2$ymax, n=-1))
 
# Compute label position
abc_poll_pp_id_q2$labelPosition <- (abc_poll_pp_id_q2$ymax + abc_poll_pp_id_q2$ymin) / 2

# Compute a good label
abc_poll_pp_id_q2$label <- paste0(abc_poll_pp_id_q2$pp_political_id_5, "\n value: ", abc_poll_pp_id_q2$count)
# Make the plot
ggplot(abc_poll_pp_id_q2, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=pp_political_id_5)) +
     geom_rect() +
 # geom_label( x=3.5, aes(y=labelPosition, label=label), size=6) +
     coord_polar(theta="y") + # Try to remove that to understand how the chart is built initially
     xlim(c(2, 4)) +
  theme_void() +
  theme(legend.position = "right") +
  
  labs(subtitle = "Political ID of Respondents",
       title = "Donut Chart",
       caption = "ABC News Political Poll")

Code

  #facet_wrap(vars(Q2ConcernLevel_4))
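The ring segments in the donut chart come from simple cumulative arithmetic: each group's fraction of the total, stacked end to end between 0 and 1. A minimal base R sketch, using the per-respondent political ID counts from the first data frame summary (the variable names here are illustrative only):

```r
# Respondent counts by political ID, taken from the first data frame summary
counts <- c(Dem = 176, Ind = 168, Other = 28, Rep = 152, Skipped = 3)

fraction <- counts / sum(counts)              # share of the ring for each group
ymax     <- cumsum(fraction)                  # where each segment ends
ymin     <- c(0, unname(head(ymax, n = -1)))  # where each segment starts
label_y  <- (ymin + ymax) / 2                 # midpoint, for placing a label

round(data.frame(fraction, ymin, ymax, label_y), 3)
```

coord_polar(theta = "y") then maps the 0-1 range onto a full circle, so a segment's arc length is proportional to its fraction.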

Questions

How do I change the label of the legend from the name of the “fill” variable?
In what situations, if any, is a pie/donut chart appropriate?

I chose to visualize a “flow relationship” between a respondent’s level of optimism reported in question 5 and several demographic variables. I found the “Skipped” responses to question 5 harder to read in a flow chart than in stacked bar charts or pie charts, so I removed them from these visualizations.

Code

flow_region_educ <- abc_poll %>% 
  select(pp_region_4, Q5Optimism_3)%>%
  mutate(Q5Optimism_3 = na_if(Q5Optimism_3, "Skipped"))

#flow_region_educ

I represented the “flow relationship” using a chord diagram. I chose variables that had a limited number of values as my origin and destination variables.
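A chord diagram is drawn from a table of origin-by-destination counts, so the data preparation is a two-way contingency table. A minimal base R sketch with hypothetical responses (the vectors are made up; no plotting package is needed for this step):

```r
# Hypothetical responses for six people
party    <- c("Dem", "Dem", "Rep", "Rep", "Ind", "Dem")
optimism <- c("Optimistic", "Optimistic", "Pessimistic",
              "Skipped", "Pessimistic", "Pessimistic")

# Treat "Skipped" as missing; table() leaves NA out by default
optimism[optimism == "Skipped"] <- NA

# Rows are origins, columns are destinations, cells are flow sizes
flows <- table(origin = party, destination = optimism)
flows
```

Each non-zero cell becomes one chord whose width reflects the count, which is why variables with few categories make for more readable diagrams.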

Code

# Chord Diagrams 
# Load the circlize library
# (if it is not installed, run install.packages("circlize") once first)
library(circlize)

Political ID to Q5 Optimism Level showed a clear “flow” of Republican and Other party to pessimistic responses and a strong “flow” of Democratic party ID to optimistic responses.

Code

#Q5 Optimism Status vs Political ID
# Gather the "edges" for our flow: origin: Political ID, destination: Q5 Optimism level
flow_pol_id_optimism <- abc_poll %>% 
  select(pp_political_id_5, Q5Optimism_3)%>%
  mutate(Q5Optimism_3 = na_if(Q5Optimism_3, "Skipped"))%>%
  mutate(pp_political_id_5 = na_if(pp_political_id_5, "Skipped"))%>%
  with(table(pp_political_id_5, Q5Optimism_3))%>%
 
# Make the circular plot
 chordDiagram(transparency = 0.5)

Code

title(main = "Political ID to Q5 Optimism Level", sub = "ABC News Political Poll")

Geographic Region to Q5 Optimism Level showed a simple “flow”; however, it was not easy to discern a difference in the proportion of optimistic and pessimistic responses by region.

Code

#Q5 Optimism Status vs Geographic Region
# Gather the "edges" for our flow: origin: Q5 Optimism, destination: Geographic Region
flow_region_educ <- abc_poll %>% 
  select(pp_region_4, Q5Optimism_3)%>%
  mutate(Q5Optimism_3 = na_if(Q5Optimism_3, "Skipped"))%>%
  
  with(table(Q5Optimism_3, pp_region_4))%>%

# Make the circular plot
 chordDiagram(transparency = 0.5)

Code

title(main = "Q5 Optimism Level to Geographic Region", sub = "ABC News Political Poll")

I think there are some interesting things to see in these visuals, but reading them takes too much effort. I imagine the story would be clearer with percent stacked bar charts.

Ethnicity to Geographic Region was overwhelmed visually by the size of the White, non-Hispanic demographic. One could still see the strong connection between the Hispanic and Other demographics and the West and South.

Code

#Ethnicity
# Gather the "edges" for our flow: origin: Ethnicity, destination: Geographic Region
flow_region_educ <- abc_poll %>% 
  select(pp_region_4, pp_ethnicity_5)%>%
  with(table(pp_ethnicity_5, pp_region_4))%>%
# Make the circular plot
 chordDiagram(transparency = 0.5)

Code

title(main = "Ethnicity to Geographic Region", sub = "ABC News Political Poll")

Income Level to Q5 Optimism Status was overwhelmed visually by the number of different income levels and the formatting of their labels. Also, for income levels with relatively few respondents, it is difficult to see the distinction between the optimistic and pessimistic flows. Notably, there does not seem to be a strong relationship between income level and reported optimism in question 5.

Code

#Q5 Optimism Status vs Income Level
# Gather the "edges" for our flow: origin: Income Level, destination: Q5 Optimism level
flow_optimism_inc <- abc_poll %>% 
  select(pp_inc_7, Q5Optimism_3)%>%
  mutate(Q5Optimism_3 = na_if(Q5Optimism_3, "Skipped"))%>%
  with(table(pp_inc_7, Q5Optimism_3))%>%
# Make the circular plot
 chordDiagram(transparency = 0.5)

Code

title(main = "Income Level to Q5 Optimism Level", sub = "ABC News Poll")

Code

#Marital Status
# Gather the "edges" for our flow: origin: Marital Status, destination: Geographic Region
# flow_region_educ <- abc_poll %>% 
#   select(pp_region_4, pp_marital_5)%>%
#   
#   with(table(pp_marital_5, pp_region_4))%>%
#  
# # Make the circular plot
#  chordDiagram(transparency = 0.5)
# title(main = "Chord Diagram", sub = "Marital Status to Geographic Region")

Education Level to Geographic Region was overwhelmingly busy visually. Although it is pretty, I think this information would be easier to parse in a percent stacked bar chart.

Code

#Education level to Geographic Region
# Gather the "edges" for our flow: origin: Education Level, destination: Geographic Region


flow_region_educ <- abc_poll %>% 
  select(pp_region_4, pp_educ_5)%>%
  
  with(table(pp_educ_5, pp_region_4))%>%
 

# Make the circular plot
 chordDiagram(transparency = 0.5)

Code

title(main = "Education Level to Geographic Region", sub = "ABC News Political Poll")

Questions

Why do the colors of my chord diagram change each time I run the chunk?
How do I fix the labels around the circle (other than using “newline”)?
Other than traffic/shipping/migration patterns, what are examples of ideas that are well represented by chord charts?

I noticed heat maps in some of the earlier lesson materials and was just experimenting with them here. There were so many categorical variables in the abc_poll data that I thought some bivariate relationships might have a nice visual story with heat maps.

Code

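# Heat Map Geographic Region, Political ID
abc_poll%>%
  # one row per respondent, so tile counts are respondents, not pivoted rows
  distinct(pp_id, pp_region_4, pp_political_id_5) %>%
  count(pp_region_4, pp_political_id_5) %>%
  ggplot(mapping = aes(x = pp_political_id_5, y = pp_region_4))+
  geom_tile(mapping = aes(fill = n))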

Code

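# Heat Map Income Level, Education Level
abc_poll%>%
  # one row per respondent, so tile counts are respondents, not pivoted rows
  distinct(pp_id, pp_inc_7, pp_educ_5) %>%
  count(pp_inc_7, pp_educ_5) %>%
  ggplot(mapping = aes(x = pp_educ_5, y = pp_inc_7))+
  geom_tile(mapping = aes(fill = n))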

Our data frame did not contain any variables measured over time. However, it does have each poll respondent’s age.

If each respondent were asked the same question every year, we could see the evolution of their responses over time.

Questions

Do analysts ever use “age” as a variable representing time?