Challenge 5 Marriage in Australia

challenge_5
australian_marriage
Introduction to Visualization
Author

Nanci Kopecky

Published

April 1, 2023

library(tidyverse)
library(ggplot2)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. mutate variables as needed (including sanity checks)
  4. create at least two univariate visualizations
  • try to make them “publication” ready
  • Explain why you choose the specific graph type
  5. Create at least one bivariate visualization
  • try to make them “publication” ready
  • Explain why you choose the specific graph type

R Graph Gallery is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code.

(be sure to only include the category tags for the data you use!)

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • cereal.csv ⭐
  • Total_cost_for_top_15_pathogens_2018.xlsx ⭐
  • Australian Marriage ⭐⭐
  • AB_NYC_2019.csv ⭐⭐⭐
  • StateCounty2012.xls ⭐⭐⭐
  • Public School Characteristics ⭐⭐⭐⭐
  • USA Households ⭐⭐⭐⭐⭐
library(readr)
aussie_marry <- read.csv(file = "_data/australian_marriage_tidy.csv",
                header=TRUE,
                sep = ",")
if (interactive()) View(aussie_marry)  # View() needs a display, so skip it when rendering
head(aussie_marry)
        territory resp   count percent
1 New South Wales  yes 2374362    57.8
2 New South Wales   no 1736838    42.2
3        Victoria  yes 2145629    64.9
4        Victoria   no 1161098    35.1
5      Queensland  yes 1487060    60.7
6      Queensland   no  961015    39.3
ncol(aussie_marry)
[1] 4
nrow(aussie_marry)
[1] 16
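
The file is read above with base R's read.csv. For comparison, here is a minimal sketch of the readr equivalent, read_csv, which returns a tibble and reports the type it guessed for each column. To keep the sketch runnable without the file, it reads two rows retyped from the head() output; the names csv_text and demo are illustrative, and wrapping the string in I() assumes readr 2.0 or later.

```r
library(readr)

# Two rows retyped from the head() output above; I() marks the string as
# literal data rather than a file path (assumes readr 2.0 or later)
csv_text <- "territory,resp,count,percent
New South Wales,yes,2374362,57.8
New South Wales,no,1736838,42.2"

demo <- read_csv(I(csv_text), show_col_types = FALSE)

# Print the tibble and the column types readr guessed
demo
spec(demo)
```

With the real file, the call would be read_csv("_data/australian_marriage_tidy.csv").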

Briefly describe the data

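The data give the count and percent of “yes” and “no” responses in each of 8 Australian states and territories (16 rows: 8 territories × 2 responses). The figures appear to come from the 2017 Australian Marriage Law Postal Survey, which asked whether the law should be changed to allow same-sex couples to marry.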

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

The data is already tidy: each row is one observation (a territory and a response), each cell holds a single value, and there are no missing values. There are 4 variables, 2 categorical and 2 numerical, and 16 rows. I used the pivot_wider function here to see whether a wider table is easier to read and to use later for visualizations.

aussie_marry2 <- aussie_marry %>% pivot_wider(names_from = resp, values_from = c(count, percent))
aussie_marry2
# A tibble: 8 × 5
  territory                       count_yes count_no percent_yes percent_no
  <chr>                               <int>    <int>       <dbl>      <dbl>
1 New South Wales                   2374362  1736838        57.8       42.2
2 Victoria                          2145629  1161098        64.9       35.1
3 Queensland                        1487060   961015        60.7       39.3
4 South Australia                    592528   356247        62.5       37.5
5 Western Australia                  801575   455924        63.7       36.3
6 Tasmania                           191948   109655        63.6       36.4
7 Northern Territory(b)               48686    31690        60.6       39.4
8 Australian Capital Territory(c)    175459    61520        74         26  
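
A sanity check for the pivot could look like the sketch below. It assumes the same column names as above and uses only three territories, retyped from the head() output; the names long and wide are illustrative. The wide table should have half as many rows as the long one, the two percent columns should sum to about 100, and the percent recomputed from the counts should agree with the reported percent.

```r
library(tidyverse)

# Three territories retyped from the head() output above (illustrative subset)
long <- tribble(
  ~territory,        ~resp, ~count,  ~percent,
  "New South Wales", "yes", 2374362, 57.8,
  "New South Wales", "no",  1736838, 42.2,
  "Victoria",        "yes", 2145629, 64.9,
  "Victoria",        "no",  1161098, 35.1,
  "Queensland",      "yes", 1487060, 60.7,
  "Queensland",      "no",   961015, 39.3
)

wide <- long %>%
  pivot_wider(names_from = resp, values_from = c(count, percent))

# One row per territory: rows halve, columns go from 4 to 5
stopifnot(nrow(wide) == nrow(long) / 2, ncol(wide) == 5)

# The two percentages should add to 100, allowing for rounding
stopifnot(all(abs(wide$percent_yes + wide$percent_no - 100) < 0.11))

# Recompute the yes percent from the counts to compare with percent_yes
wide %>%
  mutate(check_yes = round(100 * count_yes / (count_yes + count_no), 1))
```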

Are there any variables that require mutation to be usable in your analysis stream? For example, do you need to calculate new values in order to graph them? Can string values be represented numerically? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

Document your work here.

I rescaled the counts to units of 10,000 so the graphs do not have large numbers on the axes. I overwrote the existing columns rather than adding new ones (so the .after argument was not needed) because I did not want the table to get too wide.

aussie_marry3 <- aussie_marry2 %>% mutate(count_yes = count_yes/10000, 
                         count_no = count_no/10000)

aussie_marry3
# A tibble: 8 × 5
  territory                       count_yes count_no percent_yes percent_no
  <chr>                               <dbl>    <dbl>       <dbl>      <dbl>
1 New South Wales                    237.     174.          57.8       42.2
2 Victoria                           215.     116.          64.9       35.1
3 Queensland                         149.      96.1         60.7       39.3
4 South Australia                     59.3     35.6         62.5       37.5
5 Western Australia                   80.2     45.6         63.7       36.3
6 Tasmania                            19.2     11.0         63.6       36.4
7 Northern Territory(b)                4.87     3.17        60.6       39.4
8 Australian Capital Territory(c)     17.5      6.15        74         26  
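
One way to guard against slips in a rescaling step like this, such as a stray column or a wrong divisor, is to rescale every count column in a single across() call and then check the result. This is a minimal sketch on two territories retyped from the table above; the names wide and scaled are illustrative.

```r
library(dplyr)

# Two territories retyped from the wide table above (illustrative subset)
wide <- tibble(
  territory = c("New South Wales", "Victoria"),
  count_yes = c(2374362, 2145629),
  count_no  = c(1736838, 1161098)
)

# Divide every column whose name starts with "count_" by 10,000
scaled <- wide %>%
  mutate(across(starts_with("count_"), ~ .x / 10000))

# Sanity checks: no columns added or dropped, and multiplying back
# recovers the original counts
stopifnot(identical(names(scaled), names(wide)))
stopifnot(isTRUE(all.equal(scaled$count_yes * 10000, wide$count_yes)))
stopifnot(isTRUE(all.equal(scaled$count_no * 10000, wide$count_no)))

scaled
```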

Univariate Visualizations

I started with a basic geom_histogram of the count of those who responded yes, and then added more detail to the histogram with fill and labels. The second graph is more colorful and easier to read.

ggplot(aussie_marry3, aes(count_yes)) + geom_histogram(bins = 15)

ggplot(aussie_marry3, aes(count_yes, fill = territory)) + 
  geom_histogram(bins = 15) + 
  labs(title = "How Many Said YES?!", x = "Said YES! x 10,000", y = "Frequency") 

Bar Graphs

Here I practiced making bar graphs and explored different ways to show the x axis labels clearly. I had to use geom_col instead of geom_bar because the data was already summarized: each territory already has its percent that said yes, so there is nothing left for geom_bar to count.

barplot(aussie_marry3$percent_yes)

aussie_marry3 %>% ggplot(aes(x = territory, y = percent_yes)) + 
  geom_col(aes(fill = territory)) + 
  labs(title = "Said YES!", x = "Aussie Territory", y = "Percent %") + 
  scale_x_discrete(guide = guide_axis(n.dodge = 3)) +
  NULL

aussie_marry3 %>% ggplot(aes(x = territory, y = percent_yes)) + 
  geom_col(aes(fill = territory)) + 
  labs(title = "Said YES!", x = "Aussie Territory", y = "Percent %") + 
  coord_flip()
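
As a further step toward a “publication” ready bar graph, the bars can be sorted by value by turning territory into an ordered factor. This is a sketch only: the percentages are retyped from the wide table above, the data frame name yes_pct is illustrative, and the single fill color and axis labels are stylistic assumptions. It puts territory on the y axis directly instead of using coord_flip, which assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Percent yes per territory, retyped from the wide table above
yes_pct <- data.frame(
  territory = c("New South Wales", "Victoria", "Queensland", "South Australia",
                "Western Australia", "Tasmania", "Northern Territory(b)",
                "Australian Capital Territory(c)"),
  percent_yes = c(57.8, 64.9, 60.7, 62.5, 63.7, 63.6, 60.6, 74)
)

# reorder() makes territory a factor whose levels follow percent_yes,
# so the bars are drawn from smallest to largest
yes_pct$territory <- reorder(yes_pct$territory, yes_pct$percent_yes)

# Territory on the y axis keeps the long labels readable; one fill color
# replaces the legend, which would only repeat the axis labels
p <- ggplot(yes_pct, aes(x = percent_yes, y = territory)) +
  geom_col(fill = "steelblue") +
  labs(title = "Percent responding yes", x = "Percent (%)", y = NULL)
p
```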

Bivariate Visualization(s)

Any additional comments?

I used the two numerical variables to make a scatterplot. One variable is the count and the other is the percent of the same characteristic, the “yes” responses, so I do not expect the graph to be interesting or informative. And while linear regression would not apply here, I practiced using geom_smooth for future reference.

ggplot(aussie_marry3, aes(count_yes, percent_yes)) + 
  geom_point() + 
  geom_smooth() + 
  labs(title = "YES! Count and Percent on a Scatterplot", x = "Count that Said YES! (x 10,000)", y = "% that said YES!")

sp1<-ggplot(aussie_marry3, aes(x=count_yes, y=percent_yes)) + geom_point() 
sp1
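
A note on the smoother: called with no arguments, geom_smooth fits a loess curve for a data set this small, not a regression line. To practice the linear version, method = "lm" can be passed explicitly, as in the sketch below. The values are retyped from the tables above, the names yes and fit are illustrative, and with only 8 points the fitted line is for practice rather than inference.

```r
library(ggplot2)

# Yes counts (in units of 10,000) and yes percents, retyped from the tables above
yes <- data.frame(
  count_yes   = c(2374362, 2145629, 1487060, 592528,
                  801575, 191948, 48686, 175459) / 10000,
  percent_yes = c(57.8, 64.9, 60.7, 62.5, 63.7, 63.6, 60.6, 74)
)

# method = "lm" draws a straight least-squares line instead of the default loess
p <- ggplot(yes, aes(x = count_yes, y = percent_yes)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Yes count vs. yes percent, with a linear fit",
       x = "Yes responses (x 10,000)", y = "Yes responses (%)")
p

# The same line fitted directly, to inspect the intercept and slope
fit <- lm(percent_yes ~ count_yes, data = yes)
coef(fit)
```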