Challenge 7

challenge_7
australian_marriage
abc_poll
Visualizing Multiple Dimensions
Author

Ananya Pujary

Published

August 24, 2022

library(tidyverse)
library(ggplot2)
library(skimr)
library(summarytools)
library(readxl)
if(!require(janitor))
  install.packages("janitor",repos = "https://cran.us.r-project.org")
library(purrr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

Reading in the ‘australian_marriage_tidy.csv’ dataset.

australian_marriage <- read_csv("_data/australian_marriage_tidy.csv")

Briefly describe the data

print(summarytools::dfSummary(australian_marriage, varnumbers = FALSE, plain.ascii = FALSE, graph.magnif = 0.50, style = "grid", valid.col = FALSE), 
      method = 'render', table.classes = 'table-condensed')

Data Frame Summary

australian_marriage

Dimensions: 16 x 4
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
territory [character]
1. Australian Capital Territ
2. New South Wales
3. Northern Territory(b)
4. Queensland
5. South Australia
6. Tasmania
7. Victoria
8. Western Australia
2(12.5%)
2(12.5%)
2(12.5%)
2(12.5%)
2(12.5%)
2(12.5%)
2(12.5%)
2(12.5%)
0 (0.0%)
resp [character]
1. no
2. yes
8(50.0%)
8(50.0%)
0 (0.0%)
count [numeric]
Mean (sd) : 793202.1 (778436.9)
min ≤ med ≤ max:
31690 ≤ 524226 ≤ 2374362
IQR (CV) : 1083581 (1)
16 distinct values 0 (0.0%)
percent [numeric]
Mean (sd) : 50 (14.7)
min ≤ med ≤ max:
26 ≤ 50 ≤ 74
IQR (CV) : 25.5 (0.3)
16 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-31

skim(australian_marriage)
Data summary
Name australian_marriage
Number of rows 16
Number of columns 4
_______________________
Column type frequency:
character 2
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
territory 0 1 8 31 0 8 0
resp 0 1 2 3 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
count 0 1 793202.1 778436.94 31690 159008.00 524226 1242588.50 2374362 ▇▃▁▂▂
percent 0 1 50.0 14.67 26 37.23 50 62.77 74 ▂▇▁▇▂

This dataset has 16 rows and 4 columns, of which 2 are character type and 2 are numeric. There are no missing values. It describes the number of people who support/don’t support same sex marriage in 8 Australian regions.

Tidy Data

The entries in the ‘territory’ column can be turned into factors to aid visualization. Also, the ‘resp’ column values of “yes” and “no” can be converted to “2” and “1” respectively.

australian_marriage$territory <- as.factor(australian_marriage$territory)
class(australian_marriage$territory)
[1] "factor"
australian_marriage$resp <- as.numeric(as.factor(australian_marriage$resp))

Visualization with Multiple Dimensions

I used a pie chart to figure out the proportion of married to not-married people in the different Australian states.

australian_marriage$resp <- as.factor(australian_marriage$resp)

marriage_pie <- ggplot(australian_marriage, aes(x="", y=percent, fill=resp)) +
  geom_bar(stat="identity", width=1) +
  coord_polar("y", start=0) + labs(title = "Support for Same-Sex Marriage Across Australian States") + theme_void() + 
  theme(strip.text = element_text(size = 5)) + guides(fill=guide_legend(title="Responses")) + scale_fill_manual(values = unique(australian_marriage$resp), labels = c("No","Yes")) + geom_text(aes(label = paste0(round(percent), "%")),
                position = position_stack(vjust = 0.5))

marriage_pie <- marriage_pie+facet_wrap(~ territory)
marriage_pie

The Australian Capital Territory (c) overall seems to support same-sex marriage the most, while New South Wales has lesser people that support it.

ABC Poll

Read in data

abc_poll <- read_csv("_data/abc_poll_2021.csv")

Briefly describe the data

print(summarytools::dfSummary(abc_poll, varnumbers = FALSE, plain.ascii = FALSE, graph.magnif = 0.50, style = "grid", valid.col = FALSE), 
      method = 'render', table.classes = 'table-condensed')

Data Frame Summary

abc_poll

Dimensions: 527 x 31
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
id [numeric]
Mean (sd) : 7230264 (152.3)
min ≤ med ≤ max:
7230001 ≤ 7230264 ≤ 7230527
IQR (CV) : 263 (0)
527 distinct values 0 (0.0%)
xspanish [character]
1. English
2. Spanish
514(97.5%)
13(2.5%)
0 (0.0%)
complete_status [character] 1. qualified
527(100.0%)
0 (0.0%)
ppage [numeric]
Mean (sd) : 53.4 (17.1)
min ≤ med ≤ max:
18 ≤ 55 ≤ 91
IQR (CV) : 27 (0.3)
72 distinct values 0 (0.0%)
ppeduc5 [character]
1. NA
2. NA
3. High school graduate (hig
4. No high school diploma or
5. Some college or Associate
99(18.8%)
108(20.5%)
133(25.2%)
29(5.5%)
158(30.0%)
0 (0.0%)
ppeducat [character]
1. Bachelors degree or highe
2. High school
3. Less than high school
4. Some college
207(39.3%)
133(25.2%)
29(5.5%)
158(30.0%)
0 (0.0%)
ppgender [character]
1. Female
2. Male
254(48.2%)
273(51.8%)
0 (0.0%)
ppethm [character]
1. 2+ Races, Non-Hispanic
2. Black, Non-Hispanic
3. Hispanic
4. Other, Non-Hispanic
5. White, Non-Hispanic
21(4.0%)
27(5.1%)
51(9.7%)
24(4.6%)
404(76.7%)
0 (0.0%)
pphhsize [character]
1. 1
2. 2
3. 3
4. 4
5. 5
6. 6 or more
80(15.2%)
219(41.6%)
102(19.4%)
76(14.4%)
35(6.6%)
15(2.8%)
0 (0.0%)
ppinc7 [character]
1. $10,000 to $24,999
2. $100,000 to $149,999
3. $150,000 or more
4. $25,000 to $49,999
5. $50,000 to $74,999
6. $75,000 to $99,999
7. Less than $10,000
32(6.1%)
105(19.9%)
137(26.0%)
82(15.6%)
85(16.1%)
69(13.1%)
17(3.2%)
0 (0.0%)
ppmarit5 [character]
1. Divorced
2. Never married
3. Now Married
4. Separated
5. Widowed
43(8.2%)
111(21.1%)
337(63.9%)
8(1.5%)
28(5.3%)
0 (0.0%)
ppmsacat [character]
1. Metro area
2. Non-metro area
448(85.0%)
79(15.0%)
0 (0.0%)
ppreg4 [character]
1. MidWest
2. NorthEast
3. South
4. West
118(22.4%)
93(17.6%)
190(36.1%)
126(23.9%)
0 (0.0%)
pprent [character]
1. Occupied without payment
2. Owned or being bought by
3. Rented for cash
10(1.9%)
406(77.0%)
111(21.1%)
0 (0.0%)
ppstaten [character]
1. California
2. Texas
3. Florida
4. Pennsylvania
5. Illinois
6. New Jersey
7. Ohio
8. Michigan
9. New York
10. Washington
[ 39 others ]
51(9.7%)
42(8.0%)
34(6.5%)
28(5.3%)
23(4.4%)
21(4.0%)
21(4.0%)
18(3.4%)
18(3.4%)
18(3.4%)
253(48.0%)
0 (0.0%)
PPWORKA [character]
1. Currently laid off
2. Employed full-time (by so
3. Employed part-time (by so
4. Full Time Student
5. Homemaker
6. On furlough
7. Other
8. Retired
9. Self-employed
13(2.5%)
220(41.7%)
31(5.9%)
8(1.5%)
37(7.0%)
1(0.2%)
20(3.8%)
165(31.3%)
32(6.1%)
0 (0.0%)
ppemploy [character]
1. Not working
2. Working full-time
3. Working part-time
221(41.9%)
245(46.5%)
61(11.6%)
0 (0.0%)
Q1_a [character]
1. Approve
2. Disapprove
3. Skipped
329(62.4%)
193(36.6%)
5(0.9%)
0 (0.0%)
Q1_b [character]
1. Approve
2. Disapprove
3. Skipped
192(36.4%)
322(61.1%)
13(2.5%)
0 (0.0%)
Q1_c [character]
1. Approve
2. Disapprove
3. Skipped
272(51.6%)
248(47.1%)
7(1.3%)
0 (0.0%)
Q1_d [character]
1. Approve
2. Disapprove
3. Skipped
192(36.4%)
321(60.9%)
14(2.7%)
0 (0.0%)
Q1_e [character]
1. Approve
2. Disapprove
3. Skipped
212(40.2%)
301(57.1%)
14(2.7%)
0 (0.0%)
Q1_f [character]
1. Approve
2. Disapprove
3. Skipped
281(53.3%)
230(43.6%)
16(3.0%)
0 (0.0%)
Q2 [character]
1. Not concerned at all
2. Not so concerned
3. Somewhat concerned
4. Very concerned
65(12.3%)
147(27.9%)
221(41.9%)
94(17.8%)
0 (0.0%)
Q3 [character]
1. No
2. Skipped
3. Yes
107(20.3%)
5(0.9%)
415(78.7%)
0 (0.0%)
Q4 [character]
1. Excellent
2. Good
3. Not so good
4. Poor
5. Skipped
60(11.4%)
215(40.8%)
97(18.4%)
149(28.3%)
6(1.1%)
0 (0.0%)
Q5 [character]
1. Optimistic
2. Pessimistic
3. Skipped
229(43.5%)
295(56.0%)
3(0.6%)
0 (0.0%)
QPID [character]
1. A Democrat
2. A Republican
3. An Independent
4. Skipped
5. Something else
176(33.4%)
152(28.8%)
168(31.9%)
3(0.6%)
28(5.3%)
0 (0.0%)
ABCAGE [character]
1. 18-29
2. 30-49
3. 50-64
4. 65+
60(11.4%)
148(28.1%)
157(29.8%)
162(30.7%)
0 (0.0%)
Contact [character]
1. No, I am not willing to b
2. Yes, I am willing to be i
355(67.4%)
172(32.6%)
0 (0.0%)
weights_pid [numeric]
Mean (sd) : 1 (0.6)
min ≤ med ≤ max:
0.3 ≤ 0.8 ≤ 6.3
IQR (CV) : 0.5 (0.6)
453 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-31

skim(abc_poll)
Data summary
Name abc_poll
Number of rows 527
Number of columns 31
_______________________
Column type frequency:
character 28
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
xspanish 0 1 7 7 0 2 0
complete_status 0 1 9 9 0 1 0
ppeduc5 0 1 29 64 0 5 0
ppeducat 0 1 11 26 0 4 0
ppgender 0 1 4 6 0 2 0
ppethm 0 1 8 22 0 5 0
pphhsize 0 1 1 9 0 6 0
ppinc7 0 1 16 20 0 7 0
ppmarit5 0 1 7 13 0 5 0
ppmsacat 0 1 10 14 0 2 0
ppreg4 0 1 4 9 0 4 0
pprent 0 1 15 57 0 3 0
ppstaten 0 1 4 20 0 49 0
PPWORKA 0 1 5 36 0 9 0
ppemploy 0 1 11 17 0 3 0
Q1_a 0 1 7 10 0 3 0
Q1_b 0 1 7 10 0 3 0
Q1_c 0 1 7 10 0 3 0
Q1_d 0 1 7 10 0 3 0
Q1_e 0 1 7 10 0 3 0
Q1_f 0 1 7 10 0 3 0
Q2 0 1 14 20 0 4 0
Q3 0 1 2 7 0 3 0
Q4 0 1 4 11 0 5 0
Q5 0 1 7 11 0 3 0
QPID 0 1 7 14 0 5 0
ABCAGE 0 1 3 5 0 4 0
Contact 0 1 35 38 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 7230264.00 152.28 7230001.00 7230132.50 7230264.00 7230395.50 7230527.00 ▇▇▇▇▇
ppage 0 1 53.39 17.10 18.00 40.00 55.00 67.00 91.00 ▅▆▇▇▂
weights_pid 0 1 1.00 0.60 0.32 0.63 0.85 1.15 6.26 ▇▁▁▁▁

This dataset seems to be a national sample survey from 2019 that includes questions related to respondents’ political attitudes and demographics, along with some survey administration questions. There are 527 rows and 31 columns in this dataset, of which 28 are character type and 3 are numeric. No missing values are present. ppeducat seems to have more complete information than ppeduc5, so I’ll be using this column for further visualization of respondents’ educational background.

Tidy Data

First, I’ll be ordering ppeducat values by increasing educational level instead of alphabetically.

education_levels <- unique(abc_poll$ppeducat)
education_levels
[1] "High school"                "Bachelors degree or higher"
[3] "Some college"               "Less than high school"     
abc_poll_new <- abc_poll %>%
  mutate(education = factor(ppeducat, levels =  education_levels[c(4,1,3,2)])) %>%
  select(-ppeducat)

table(abc_poll_new$education)

     Less than high school                High school 
                        29                        133 
              Some college Bachelors degree or higher 
                       158                        207 
unique(abc_poll$QPID)
[1] "A Democrat"     "An Independent" "Something else" "A Republican"  
[5] "Skipped"       
# removing extra characters such as "A" and "An"
abc_poll_new <- abc_poll_new %>%
  mutate(partyid = str_remove(QPID, "A[n]* "),
         partyid = na_if(partyid, "Skipped")) %>%
  select(-QPID)

table(abc_poll_new$partyid)

      Democrat    Independent     Republican Something else 
           176            168            152             28 

Then, I replaced all “Skipped” entries across columns with NA:

abc_poll_new <-abc_poll_new%>%
  mutate(across(starts_with("Q"), ~ na_if(.x, "Skipped")))

map(select(abc_poll_new, starts_with("Q")), table)
$Q1_a

   Approve Disapprove 
       329        193 

$Q1_b

   Approve Disapprove 
       192        322 

$Q1_c

   Approve Disapprove 
       272        248 

$Q1_d

   Approve Disapprove 
       192        321 

$Q1_e

   Approve Disapprove 
       212        301 

$Q1_f

   Approve Disapprove 
       281        230 

$Q2

Not concerned at all     Not so concerned   Somewhat concerned 
                  65                  147                  221 
      Very concerned 
                  94 

$Q3

 No Yes 
107 415 

$Q4

  Excellent        Good Not so good        Poor 
         60         215          97         149 

$Q5

 Optimistic Pessimistic 
        229         295 

Finally, I shortened some entries in the ppethm column by removing the redundant “Non-Hispanic” part.

unique(abc_poll_new$ppethm)
[1] "White, Non-Hispanic"    "Other, Non-Hispanic"    "Black, Non-Hispanic"   
[4] "Hispanic"               "2+ Races, Non-Hispanic"
# shortening 
abc_poll_new <- abc_poll_new %>%
  mutate(ethnicity = str_remove(ppethm,", Non-Hispanic")) %>%
  select(-ppethm)

unique(abc_poll_new$ethnicity)
[1] "White"    "Other"    "Black"    "Hispanic" "2+ Races"

Visualization with Multiple Dimensions

First, I look at the distribution of party affiliations by age using geom_point() and percentage labels for easy comparison:

abc_poll_viz <- abc_poll_new %>%
  select(partyid,ABCAGE) %>%
  group_by(partyid,ABCAGE) %>%
  tally()
abc_poll_viz
# A tibble: 19 × 3
# Groups:   partyid [5]
   partyid        ABCAGE     n
   <chr>          <chr>  <int>
 1 Democrat       18-29     23
 2 Democrat       30-49     54
 3 Democrat       50-64     49
 4 Democrat       65+       50
 5 Independent    18-29     20
 6 Independent    30-49     44
 7 Independent    50-64     50
 8 Independent    65+       54
 9 Republican     18-29     10
10 Republican     30-49     41
11 Republican     50-64     52
12 Republican     65+       49
13 Something else 18-29      6
14 Something else 30-49      8
15 Something else 50-64      6
16 Something else 65+        8
17 <NA>           18-29      1
18 <NA>           30-49      1
19 <NA>           65+        1
abc_table <- ggplot(abc_poll_viz, aes(partyid, ABCAGE)) + geom_point(aes(size = n), colour = "orange") + xlab("") + ylab("")
abc_table + scale_size_continuous(range=c(10,20)) + geom_text(aes(label = scales::percent(n/sum(n), accuracy = .1, trim = FALSE)), size = 3) + theme(panel.background=element_blank(), panel.grid.major = element_line(color = "red", size = 0.5),legend.position="none") + labs(title = "Distribution of Party Affiliations by Age")

Then, I wanted to see how the proportion of people who were optimistic about the next 12 months with the current party by age and party affiliation, but I wasn’t able to properly visualize it beyond these tables:

abc_poll_table <- abc_poll_new %>%
  tabyl(partyid,ABCAGE,Q5, show_missing_levels = FALSE) %>%
  adorn_totals("row") %>%
  adorn_percentages("all") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns %>%
  adorn_title()

abc_poll_table
$Optimistic
                    ABCAGE                                 
        partyid      18-29      30-49      50-64        65+
       Democrat  7.4% (17) 15.3% (35) 17.5% (40) 17.5% (40)
    Independent  2.6%  (6)  7.4% (17)  8.7% (20)  7.4% (17)
     Republican  0.9%  (2)  3.1%  (7)  3.9%  (9)  3.5%  (8)
 Something else  1.3%  (3)  0.9%  (2)  1.3%  (3)  0.9%  (2)
           <NA>  0.4%  (1)  0.0%  (0)  0.0%  (0)  0.0%  (0)
          Total 12.7% (29) 26.6% (61) 31.4% (72) 29.3% (67)

$Pessimistic
                    ABCAGE                                 
        partyid      18-29      30-49      50-64        65+
       Democrat  2.0%  (6)  6.4% (19)  3.1%  (9)  3.1%  (9)
    Independent  4.7% (14)  9.2% (27) 10.2% (30) 12.5% (37)
     Republican  2.7%  (8) 11.5% (34) 14.6% (43) 13.9% (41)
 Something else  1.0%  (3)  2.0%  (6)  1.0%  (3)  1.7%  (5)
           <NA>  0.0%  (0)  0.3%  (1)  0.0%  (0)  0.0%  (0)
          Total 10.5% (31) 29.5% (87) 28.8% (85) 31.2% (92)

$NA_
                    ABCAGE
        partyid        65+
       Democrat  33.3% (1)
 Something else  33.3% (1)
           <NA>  33.3% (1)
          Total 100.0% (3)

:::