library(tidyverse)
library(ggplot2)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Challenge 5 Marriage in Australia
Challenge Overview
Today’s challenge is to:
- read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data (as needed, including sanity checks)
- mutate variables as needed (including sanity checks)
- create at least two univariate visualizations
- try to make them “publication” ready
- Explain why you choose the specific graph type
- Create at least one bivariate visualization
- try to make them “publication” ready
- Explain why you choose the specific graph type
R Graph Gallery is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code.
(be sure to only include the category tags for the data you use!)
Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- cereal.csv ⭐
- Total_cost_for_top_15_pathogens_2018.xlsx ⭐
- Australian Marriage ⭐⭐
- AB_NYC_2019.csv ⭐⭐⭐
- StateCounty2012.xls ⭐⭐⭐
- Public School Characteristics ⭐⭐⭐⭐
- USA Households ⭐⭐⭐⭐⭐
library(readr)
aussie_marry <- read.csv(file = "_data/australian_marriage_tidy.csv",
                         header = TRUE,
                         sep = ",")
# View(aussie_marry)  # opens an interactive viewer; commented out because it errors when the document is rendered
head(aussie_marry)
territory resp count percent
1 New South Wales yes 2374362 57.8
2 New South Wales no 1736838 42.2
3 Victoria yes 2145629 64.9
4 Victoria no 1161098 35.1
5 Queensland yes 1487060 60.7
6 Queensland no 961015 39.3
ncol(aussie_marry)
[1] 4
nrow(aussie_marry)
[1] 16
Briefly describe the data
The data records, for each of 8 Australian states and territories, the count and percent of "yes" and "no" responses (the resp column), so each territory contributes two rows.
Tidy Data (as needed)
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
The data is already tidy: each row is one territory-response pair, each cell holds a single value, and there are no missing values. There are 4 variables (2 categorical and 2 numerical) and 16 rows. I used the pivot_wider function here to see whether a wider table is easier to read and to use later for visualizations.
aussie_marry2 <- aussie_marry %>% pivot_wider(names_from = resp, values_from = c(count, percent))
aussie_marry2
# A tibble: 8 × 5
territory count_yes count_no percent_yes percent_no
<chr> <int> <int> <dbl> <dbl>
1 New South Wales 2374362 1736838 57.8 42.2
2 Victoria 2145629 1161098 64.9 35.1
3 Queensland 1487060 961015 60.7 39.3
4 South Australia 592528 356247 62.5 37.5
5 Western Australia 801575 455924 63.7 36.3
6 Tasmania 191948 109655 63.6 36.4
7 Northern Territory(b) 48686 31690 60.6 39.4
8 Australian Capital Territory(c) 175459 61520 74 26
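The sanity check the prompt asks for can be made explicit. The following is a minimal sketch, not the code used in this post: it rebuilds a four-row sample by hand from the head() output above (the names long, wide, and checks are illustrative) and confirms that the widened table has one row per territory and that the two percent columns add up to roughly 100.

```r
library(dplyr)
library(tidyr)

# Four rows in the original long layout, copied from the head() output above
long <- data.frame(
  territory = c("New South Wales", "New South Wales", "Victoria", "Victoria"),
  resp      = c("yes", "no", "yes", "no"),
  count     = c(2374362, 1736838, 2145629, 1161098),
  percent   = c(57.8, 42.2, 64.9, 35.1)
)

wide <- long %>%
  pivot_wider(names_from = resp, values_from = c(count, percent))

# Add a helper column so the percent check is visible in the printed table
checks <- wide %>%
  mutate(percent_total = percent_yes + percent_no)

# One row per territory, and yes + no percentages sum to about 100
stopifnot(nrow(wide) == n_distinct(long$territory))
stopifnot(all(abs(checks$percent_total - 100) < 0.11))
checks
```

The tolerance of 0.11 is an assumption that allows for percentages rounded to one decimal place.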
Are there any variables that require mutation to be usable in your analysis stream? For example, do you need to calculate new values in order to graph them? Can string values be represented numerically? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Document your work here.
I rescaled the counts to units of 10,000 so the graph axes do not show very large numbers. I did not add new columns with the .after argument because I did not want the table to get too wide.
aussie_marry3 <- aussie_marry2 %>% mutate(count_yes = count_yes / 10000,
                                          count_no = count_no / 10000)
aussie_marry3
# A tibble: 8 × 5
territory count_yes count_no percent_yes percent_no
<chr> <dbl> <dbl> <dbl> <dbl>
1 New South Wales 237. 174. 57.8 42.2
2 Victoria 215. 116. 64.9 35.1
3 Queensland 149. 96.1 60.7 39.3
4 South Australia 59.3 35.6 62.5 37.5
5 Western Australia 80.2 45.6 63.7 36.3
6 Tasmania 19.2 11.0 63.6 36.4
7 Northern Territory(b) 4.87 3.17 60.6 39.4
8 Australian Capital Territory(c) 17.5 6.15 74 26
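A thousands separator is easy to type by habit, but in R a comma inside a function call separates arguments, so `10,000` is read as the number 10 followed by an extra unnamed argument. As a hedged sketch of one way to guard against that, the divisor can be stored once (here as the illustrative name scale_factor) and applied to both count columns with across(), followed by a check that no stray column appeared. The two-row data frame is copied by hand from the wide table above.

```r
library(dplyr)

# Two rows in the wide layout, values copied from the table above
wide <- data.frame(
  territory = c("New South Wales", "Victoria"),
  count_yes = c(2374362, 2145629),
  count_no  = c(1736838, 1161098)
)

scale_factor <- 1e4  # 10,000 written without a thousands separator

# Apply the same rescaling to both count columns in one step
scaled <- wide %>%
  mutate(across(c(count_yes, count_no), ~ .x / scale_factor))

# Sanity checks: same columns as before, and values round-trip to the originals
stopifnot(identical(names(scaled), names(wide)))
stopifnot(isTRUE(all.equal(scaled$count_no * scale_factor, wide$count_no)))
scaled
```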
Univariate Visualizations
I started with a basic geom_histogram of the count of those who responded yes, then added more detail to the histogram with a fill and labels. The second graph is more colorful and easier to read.
ggplot(aussie_marry3, aes(count_yes)) + geom_histogram (bins = 15)
ggplot(aussie_marry3, aes(count_yes, fill = territory)) +
geom_histogram(bins = 15) +
labs(title = "How Many Said YES?!", x = "Said YES! x 10,000", y = "Frequency")
Bar Graphs
Here I practiced making bar graphs and explored different ways to show the x-axis labels clearly. I had to use geom_col instead of geom_bar because the data was already summarized: each territory already has its percent that said yes.
barplot(aussie_marry3$percent_yes)
aussie_marry3 %>% ggplot(aes(x = territory, y = percent_yes)) +
geom_col(aes(fill = territory)) +
labs(title = "Said YES!", x = "Aussie Territory", y = "Percent %") +
scale_x_discrete(guide = guide_axis(n.dodge = 3)) +
NULL
aussie_marry3 %>% ggplot(aes(x = territory, y = percent_yes)) +
geom_col(aes(fill = territory)) +
labs(title = "Said YES!", x = "Aussie Territory", y = "Percent %") +
coord_flip()
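The challenge prompt also asks about turning variables into factors and reordering them for graphics. As a possible next step, and only a sketch, the bars can be sorted by value so the ranking of territories is easier to read. The data frame yes_pct below is rebuilt by hand from the percent_yes column printed earlier, and the names yes_pct, ordered_pct, and p_ordered are illustrative.

```r
library(dplyr)
library(ggplot2)

# percent_yes values copied from the wide table earlier in the post
yes_pct <- data.frame(
  territory = c("New South Wales", "Victoria", "Queensland", "South Australia",
                "Western Australia", "Tasmania", "Northern Territory(b)",
                "Australian Capital Territory(c)"),
  percent_yes = c(57.8, 64.9, 60.7, 62.5, 63.7, 63.6, 60.6, 74)
)

# reorder() turns territory into a factor whose levels follow percent_yes
ordered_pct <- yes_pct %>%
  mutate(territory = reorder(territory, percent_yes))

p_ordered <- ggplot(ordered_pct, aes(x = territory, y = percent_yes)) +
  geom_col() +
  coord_flip() +
  labs(title = "Said YES!", x = "Aussie Territory", y = "Percent %")

p_ordered
```

With coord_flip(), the lowest factor level is drawn at the bottom, so the territory with the highest percent ends up at the top of the chart.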
Bivariate Visualization(s)
Any additional comments?
I used the only two numerical variables to make a scatterplot. One variable was the count and the other was the percent of the same characteristic, so I do not expect the graph to be interesting or informative. And while linear regression would not apply here, I practiced using geom_smooth for future reference. Note that with its defaults and a data set this small, geom_smooth fits a loess curve rather than a straight line.
ggplot(aussie_marry3, aes(`count_yes`, `percent_yes`)) +
geom_point( ) +
geom_smooth( ) +
labs(title = "YES! Count and Percent on a Scatterplot", x = "Count that Said YES!", y = "% that said YES!")
sp1 <- ggplot(aussie_marry3, aes(x = count_yes, y = percent_yes)) + geom_point()
sp1
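Since the paragraph above mentions linear regression, here is a small sketch of how a straight-line fit could be requested explicitly for future reference. It is not a claim that a linear model suits these eight points. The data frame yes_df is rebuilt by hand from the counts and percentages shown earlier (counts divided by 10,000), and yes_df and p_lm are illustrative names.

```r
library(ggplot2)

# count_yes (in units of 10,000) and percent_yes copied from the tables above
yes_df <- data.frame(
  count_yes   = c(237.4362, 214.5629, 148.7060, 59.2528,
                  80.1575, 19.1948, 4.8686, 17.5459),
  percent_yes = c(57.8, 64.9, 60.7, 62.5, 63.7, 63.6, 60.6, 74)
)

p_lm <- ggplot(yes_df, aes(x = count_yes, y = percent_yes)) +
  geom_point() +
  # method = "lm" asks for a straight-line fit instead of the default smoother
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Count that said yes (x 10,000)", y = "% that said yes")

p_lm
```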