Challenge 7

challenge_7

australian_marriage

abc_poll

Visualizing Multiple Dimensions

Author

Ananya Pujary

Published

August 24, 2022

library(tidyverse)
library(ggplot2)
library(skimr)
library(summarytools)
library(readxl)
if(!require(janitor))
  install.packages("janitor",repos = "https://cran.us.r-project.org")
library(purrr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Australian Marriage Dataset

Read in data

Reading in the ‘australian_marriage_tidy.csv’ dataset.

australian_marriage <- read_csv("_data/australian_marriage_tidy.csv")

Briefly describe the data

print(summarytools::dfSummary(australian_marriage, varnumbers = FALSE, plain.ascii = FALSE, graph.magnif = 0.50, style = "grid", valid.col = FALSE), 
      method = 'render', table.classes = 'table-condensed')

Data Frame Summary

australian_marriage

Dimensions: 16 x 4
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

territory [character]

1. Australian Capital Territ

2. New South Wales

3. Northern Territory(b)

4. Queensland

5. South Australia

6. Tasmania

7. Victoria

8. Western Australia

2	(	12.5%	)
2	(	12.5%	)
2	(	12.5%	)
2	(	12.5%	)
2	(	12.5%	)
2	(	12.5%	)
2	(	12.5%	)
2	(	12.5%	)

0 (0.0%)

resp [character]

1. no

2. yes

8	(	50.0%	)
8	(	50.0%	)

0 (0.0%)

count [numeric]

Mean (sd) : 793202.1 (778436.9)

min ≤ med ≤ max:

31690 ≤ 524226 ≤ 2374362

IQR (CV) : 1083581 (1)

16 distinct values

0 (0.0%)

percent [numeric]

Mean (sd) : 50 (14.7)

min ≤ med ≤ max:

26 ≤ 50 ≤ 74

IQR (CV) : 25.5 (0.3)

16 distinct values

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-31

skim(australian_marriage)

Data summary
Name	australian_marriage
Number of rows	16
Number of columns	4
_______________________
Column type frequency:
character	2
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
territory	0	1	8	31	0	8	0
resp	0	1	2	3	0	2	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
count	0	1	793202.1	778436.94	31690	159008.00	524226	1242588.50	2374362	▇▃▁▂▂
percent	0	1	50.0	14.67	26	37.23	50	62.77	74	▂▇▁▇▂

This dataset has 16 rows and 4 columns, of which 2 are character type and 2 are numeric. There are no missing values. It describes the number of people who support/don’t support same sex marriage in 8 Australian regions.

Tidy Data

The entries in the ‘territory’ column can be turned into factors to aid visualization. Also, the ‘resp’ column values of “yes” and “no” can be converted to “2” and “1” respectively.

australian_marriage$territory <- as.factor(australian_marriage$territory)
class(australian_marriage$territory)

[1] "factor"

australian_marriage$resp <- as.numeric(as.factor(australian_marriage$resp))

Visualization with Multiple Dimensions

I used a pie chart to figure out the proportion of married to not-married people in the different Australian states.

australian_marriage$resp <- as.factor(australian_marriage$resp)

marriage_pie <- ggplot(australian_marriage, aes(x="", y=percent, fill=resp)) +
  geom_bar(stat="identity", width=1) +
  coord_polar("y", start=0) + labs(title = "Support for Same-Sex Marriage Across Australian States") + theme_void() + 
  theme(strip.text = element_text(size = 5)) + guides(fill=guide_legend(title="Responses")) + scale_fill_manual(values = unique(australian_marriage$resp), labels = c("No","Yes")) + geom_text(aes(label = paste0(round(percent), "%")),
                position = position_stack(vjust = 0.5))

marriage_pie <- marriage_pie+facet_wrap(~ territory)
marriage_pie

The Australian Capital Territory (c) overall seems to support same-sex marriage the most, while New South Wales has lesser people that support it.

ABC Poll

Read in data

abc_poll <- read_csv("_data/abc_poll_2021.csv")

Briefly describe the data

print(summarytools::dfSummary(abc_poll, varnumbers = FALSE, plain.ascii = FALSE, graph.magnif = 0.50, style = "grid", valid.col = FALSE), 
      method = 'render', table.classes = 'table-condensed')

Data Frame Summary

abc_poll

Dimensions: 527 x 31
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

id [numeric]

Mean (sd) : 7230264 (152.3)

min ≤ med ≤ max:

7230001 ≤ 7230264 ≤ 7230527

IQR (CV) : 263 (0)

527 distinct values

0 (0.0%)

xspanish [character]

1. English

2. Spanish

514	(	97.5%	)
13	(	2.5%	)

0 (0.0%)

complete_status [character]

1. qualified

527

(

100.0%

)

0 (0.0%)

ppage [numeric]

Mean (sd) : 53.4 (17.1)

min ≤ med ≤ max:

18 ≤ 55 ≤ 91

IQR (CV) : 27 (0.3)

72 distinct values

0 (0.0%)

ppeduc5 [character]

1. NA

2. NA

3. High school graduate (hig

4. No high school diploma or

5. Some college or Associate

99	(	18.8%	)
108	(	20.5%	)
133	(	25.2%	)
29	(	5.5%	)
158	(	30.0%	)

0 (0.0%)

ppeducat [character]

1. Bachelors degree or highe

2. High school

3. Less than high school

4. Some college

207	(	39.3%	)
133	(	25.2%	)
29	(	5.5%	)
158	(	30.0%	)

0 (0.0%)

ppgender [character]

1. Female

2. Male

254	(	48.2%	)
273	(	51.8%	)

0 (0.0%)

ppethm [character]

1. 2+ Races, Non-Hispanic

2. Black, Non-Hispanic

3. Hispanic

4. Other, Non-Hispanic

5. White, Non-Hispanic

21	(	4.0%	)
27	(	5.1%	)
51	(	9.7%	)
24	(	4.6%	)
404	(	76.7%	)

0 (0.0%)

pphhsize [character]

1. 1

2. 2

3. 3

4. 4

5. 5

6. 6 or more

80	(	15.2%	)
219	(	41.6%	)
102	(	19.4%	)
76	(	14.4%	)
35	(	6.6%	)
15	(	2.8%	)

0 (0.0%)

ppinc7 [character]

1. $10,000 to $24,999

2. $100,000 to $149,999

3. $150,000 or more

4. $25,000 to $49,999

5. $50,000 to $74,999

6. $75,000 to $99,999

7. Less than $10,000

32	(	6.1%	)
105	(	19.9%	)
137	(	26.0%	)
82	(	15.6%	)
85	(	16.1%	)
69	(	13.1%	)
17	(	3.2%	)

0 (0.0%)

ppmarit5 [character]

1. Divorced

2. Never married

3. Now Married

4. Separated

5. Widowed

43	(	8.2%	)
111	(	21.1%	)
337	(	63.9%	)
8	(	1.5%	)
28	(	5.3%	)

0 (0.0%)

ppmsacat [character]

1. Metro area

2. Non-metro area

448	(	85.0%	)
79	(	15.0%	)

0 (0.0%)

ppreg4 [character]

1. MidWest

2. NorthEast

3. South

4. West

118	(	22.4%	)
93	(	17.6%	)
190	(	36.1%	)
126	(	23.9%	)

0 (0.0%)

pprent [character]

1. Occupied without payment

2. Owned or being bought by

3. Rented for cash

10	(	1.9%	)
406	(	77.0%	)
111	(	21.1%	)

0 (0.0%)

ppstaten [character]

1. California

2. Texas

3. Florida

4. Pennsylvania

5. Illinois

6. New Jersey

7. Ohio

8. Michigan

9. New York

10. Washington

[ 39 others ]

51	(	9.7%	)
42	(	8.0%	)
34	(	6.5%	)
28	(	5.3%	)
23	(	4.4%	)
21	(	4.0%	)
21	(	4.0%	)
18	(	3.4%	)
18	(	3.4%	)
18	(	3.4%	)
253	(	48.0%	)

0 (0.0%)

PPWORKA [character]

1. Currently laid off

2. Employed full-time (by so

3. Employed part-time (by so

4. Full Time Student

5. Homemaker

6. On furlough

7. Other

8. Retired

9. Self-employed

13	(	2.5%	)
220	(	41.7%	)
31	(	5.9%	)
8	(	1.5%	)
37	(	7.0%	)
1	(	0.2%	)
20	(	3.8%	)
165	(	31.3%	)
32	(	6.1%	)

0 (0.0%)

ppemploy [character]

1. Not working

2. Working full-time

3. Working part-time

221	(	41.9%	)
245	(	46.5%	)
61	(	11.6%	)

0 (0.0%)

Q1_a [character]

1. Approve

2. Disapprove

3. Skipped

329	(	62.4%	)
193	(	36.6%	)
5	(	0.9%	)

0 (0.0%)

Q1_b [character]

1. Approve

2. Disapprove

3. Skipped

192	(	36.4%	)
322	(	61.1%	)
13	(	2.5%	)

0 (0.0%)

Q1_c [character]

1. Approve

2. Disapprove

3. Skipped

272	(	51.6%	)
248	(	47.1%	)
7	(	1.3%	)

0 (0.0%)

Q1_d [character]

1. Approve

2. Disapprove

3. Skipped

192	(	36.4%	)
321	(	60.9%	)
14	(	2.7%	)

0 (0.0%)

Q1_e [character]

1. Approve

2. Disapprove

3. Skipped

212	(	40.2%	)
301	(	57.1%	)
14	(	2.7%	)

0 (0.0%)

Q1_f [character]

1. Approve

2. Disapprove

3. Skipped

281	(	53.3%	)
230	(	43.6%	)
16	(	3.0%	)

0 (0.0%)

Q2 [character]

1. Not concerned at all

2. Not so concerned

3. Somewhat concerned

4. Very concerned

65	(	12.3%	)
147	(	27.9%	)
221	(	41.9%	)
94	(	17.8%	)

0 (0.0%)

Q3 [character]

1. No

2. Skipped

3. Yes

107	(	20.3%	)
5	(	0.9%	)
415	(	78.7%	)

0 (0.0%)

Q4 [character]

1. Excellent

2. Good

3. Not so good

4. Poor

5. Skipped

60	(	11.4%	)
215	(	40.8%	)
97	(	18.4%	)
149	(	28.3%	)
6	(	1.1%	)

0 (0.0%)

Q5 [character]

1. Optimistic

2. Pessimistic

3. Skipped

229	(	43.5%	)
295	(	56.0%	)
3	(	0.6%	)

0 (0.0%)

QPID [character]

1. A Democrat

2. A Republican

3. An Independent

4. Skipped

5. Something else

176	(	33.4%	)
152	(	28.8%	)
168	(	31.9%	)
3	(	0.6%	)
28	(	5.3%	)

0 (0.0%)

ABCAGE [character]

1. 18-29

2. 30-49

3. 50-64

4. 65+

60	(	11.4%	)
148	(	28.1%	)
157	(	29.8%	)
162	(	30.7%	)

0 (0.0%)

Contact [character]

1. No, I am not willing to b

2. Yes, I am willing to be i

355	(	67.4%	)
172	(	32.6%	)

0 (0.0%)

weights_pid [numeric]

Mean (sd) : 1 (0.6)

min ≤ med ≤ max:

0.3 ≤ 0.8 ≤ 6.3

IQR (CV) : 0.5 (0.6)

453 distinct values

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-31

skim(abc_poll)

Data summary
Name	abc_poll
Number of rows	527
Number of columns	31
_______________________
Column type frequency:
character	28
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
xspanish	1	7	7	2
complete_status	1	9	9	1
ppeduc5	1	29	64	5
ppeducat	1	11	26	4
ppgender	1	4	6	2
ppethm	1	8	22	5
pphhsize	1	1	9	6
ppinc7	1	16	20	7
ppmarit5	1	7	13	5
ppmsacat	1	10	14	2
ppreg4	1	4	9	4
pprent	1	15	57	3
ppstaten	1	4	20	49
PPWORKA	1	5	36	9
ppemploy	1	11	17	3
Q1_a	1	7	10	3
Q1_b	1	7	10	3
Q1_c	1	7	10	3
Q1_d	1	7	10	3
Q1_e	1	7	10	3
Q1_f	1	7	10	3
Q2	1	14	20	4
Q3	1	2	7	3
Q4	1	4	11	5
Q5	1	7	11	3
QPID	1	7	14	5
ABCAGE	1	3	5	4
Contact	1	35	38	2

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	1	7230264.00	152.28	7230001.00	7230132.50	7230264.00	7230395.50	7230527.00	▇▇▇▇▇
ppage	1	53.39	17.10	18.00	40.00	55.00	67.00	91.00	▅▆▇▇▂
weights_pid	1	1.00	0.60	0.32	0.63	0.85	1.15	6.26	▇▁▁▁▁

This dataset seems to be a national sample survey from 2019 that includes questions related to respondents’ political attitudes and demographics, along with some survey administration questions. There are 527 rows and 31 columns in this dataset, of which 28 are character type and 3 are numeric. No missing values are present. ppeducat seems to have more complete information than ppeduc5, so I’ll be using this column for further visualization of respondents’ educational background.

Tidy Data

First, I’ll be ordering ppeducat values by increasing educational level instead of alphabetically.

education_levels <- unique(abc_poll$ppeducat)
education_levels

[1] "High school"                "Bachelors degree or higher"
[3] "Some college"               "Less than high school"

abc_poll_new <- abc_poll %>%
  mutate(education = factor(ppeducat, levels =  education_levels[c(4,1,3,2)])) %>%
  select(-ppeducat)

table(abc_poll_new$education)


     Less than high school                High school 
                        29                        133 
              Some college Bachelors degree or higher 
                       158                        207

unique(abc_poll$QPID)

[1] "A Democrat"     "An Independent" "Something else" "A Republican"  
[5] "Skipped"

# removing extra characters such as "A" and "An"
abc_poll_new <- abc_poll_new %>%
  mutate(partyid = str_remove(QPID, "A[n]* "),
         partyid = na_if(partyid, "Skipped")) %>%
  select(-QPID)

table(abc_poll_new$partyid)


      Democrat    Independent     Republican Something else 
           176            168            152             28

Then, I replaced all “Skipped” entries across columns with NA:

abc_poll_new <-abc_poll_new%>%
  mutate(across(starts_with("Q"), ~ na_if(.x, "Skipped")))

map(select(abc_poll_new, starts_with("Q")), table)

$Q1_a

   Approve Disapprove 
       329        193 

$Q1_b

   Approve Disapprove 
       192        322 

$Q1_c

   Approve Disapprove 
       272        248 

$Q1_d

   Approve Disapprove 
       192        321 

$Q1_e

   Approve Disapprove 
       212        301 

$Q1_f

   Approve Disapprove 
       281        230 

$Q2

Not concerned at all     Not so concerned   Somewhat concerned 
                  65                  147                  221 
      Very concerned 
                  94 

$Q3

 No Yes 
107 415 

$Q4

  Excellent        Good Not so good        Poor 
         60         215          97         149 

$Q5

 Optimistic Pessimistic 
        229         295

Finally, I shortened some entries in the ppethm column by removing the redundant “Non-Hispanic” part.

unique(abc_poll_new$ppethm)

[1] "White, Non-Hispanic"    "Other, Non-Hispanic"    "Black, Non-Hispanic"   
[4] "Hispanic"               "2+ Races, Non-Hispanic"

# shortening 
abc_poll_new <- abc_poll_new %>%
  mutate(ethnicity = str_remove(ppethm,", Non-Hispanic")) %>%
  select(-ppethm)

unique(abc_poll_new$ethnicity)

[1] "White"    "Other"    "Black"    "Hispanic" "2+ Races"

Visualization with Multiple Dimensions

First, I look at the distribution of party affiliations by age using geom_point() and percentage labels for easy comparison:

abc_poll_viz <- abc_poll_new %>%
  select(partyid,ABCAGE) %>%
  group_by(partyid,ABCAGE) %>%
  tally()
abc_poll_viz

# A tibble: 19 × 3
# Groups:   partyid [5]
   partyid        ABCAGE     n
   <chr>          <chr>  <int>
 1 Democrat       18-29     23
 2 Democrat       30-49     54
 3 Democrat       50-64     49
 4 Democrat       65+       50
 5 Independent    18-29     20
 6 Independent    30-49     44
 7 Independent    50-64     50
 8 Independent    65+       54
 9 Republican     18-29     10
10 Republican     30-49     41
11 Republican     50-64     52
12 Republican     65+       49
13 Something else 18-29      6
14 Something else 30-49      8
15 Something else 50-64      6
16 Something else 65+        8
17 <NA>           18-29      1
18 <NA>           30-49      1
19 <NA>           65+        1

abc_table <- ggplot(abc_poll_viz, aes(partyid, ABCAGE)) + geom_point(aes(size = n), colour = "orange") + xlab("") + ylab("")
abc_table + scale_size_continuous(range=c(10,20)) + geom_text(aes(label = scales::percent(n/sum(n), accuracy = .1, trim = FALSE)), size = 3) + theme(panel.background=element_blank(), panel.grid.major = element_line(color = "red", size = 0.5),legend.position="none") + labs(title = "Distribution of Party Affiliations by Age")

Then, I wanted to see how the proportion of people who were optimistic about the next 12 months with the current party by age and party affiliation, but I wasn’t able to properly visualize it beyond these tables:

abc_poll_table <- abc_poll_new %>%
  tabyl(partyid,ABCAGE,Q5, show_missing_levels = FALSE) %>%
  adorn_totals("row") %>%
  adorn_percentages("all") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns %>%
  adorn_title()

abc_poll_table

$Optimistic
                    ABCAGE                                 
        partyid      18-29      30-49      50-64        65+
       Democrat  7.4% (17) 15.3% (35) 17.5% (40) 17.5% (40)
    Independent  2.6%  (6)  7.4% (17)  8.7% (20)  7.4% (17)
     Republican  0.9%  (2)  3.1%  (7)  3.9%  (9)  3.5%  (8)
 Something else  1.3%  (3)  0.9%  (2)  1.3%  (3)  0.9%  (2)
           <NA>  0.4%  (1)  0.0%  (0)  0.0%  (0)  0.0%  (0)
          Total 12.7% (29) 26.6% (61) 31.4% (72) 29.3% (67)

$Pessimistic
                    ABCAGE                                 
        partyid      18-29      30-49      50-64        65+
       Democrat  2.0%  (6)  6.4% (19)  3.1%  (9)  3.1%  (9)
    Independent  4.7% (14)  9.2% (27) 10.2% (30) 12.5% (37)
     Republican  2.7%  (8) 11.5% (34) 14.6% (43) 13.9% (41)
 Something else  1.0%  (3)  2.0%  (6)  1.0%  (3)  1.7%  (5)
           <NA>  0.0%  (0)  0.3%  (1)  0.0%  (0)  0.0%  (0)
          Total 10.5% (31) 29.5% (87) 28.8% (85) 31.2% (92)

$NA_
                    ABCAGE
        partyid        65+
       Democrat  33.3% (1)
 Something else  33.3% (1)
           <NA>  33.3% (1)
          Total 100.0% (3)

:::