Code
library(tidyverse)
library(readxl)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Justine Shakespeare
March 8, 2023
First we’ll read in the data. We’ll be focusing on the not-yet-tidy dataset on the Australian Marriage Law Survey. After inspecting the data I can see that this dataset includes some additional rows at the top of the sheet and the column names are in multiple rows. For ease of analysis, I’ll skip the first few lines and rename the rows when reading in this data.
aussieMarriage <- read_excel("_data/australian_marriage_law_postal_survey_2017_-_response_final.xls", sheet = 2, skip = 7, col_names = c("State_territory", "Yes", "Yes %", "No", "No %", "Total", "Total %", "delete", "Response clear", "Response clear %", "Response not clear", "Response not clear %", "Non-responding", "Non-responding %", "Resp Total", "Resp Total %"))
aussieMarriage
# A tibble: 15 × 16
State…¹ Yes `Yes %` No `No %` Total Total…² delete Respo…³ Respo…⁴
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <dbl>
1 New So… 2374362 57.8 1736838 42.2 4.11e6 100 NA 4.11e6 79.2
2 Victor… 2145629 64.9 1161098 35.1 3.31e6 100 NA 3.31e6 81.4
3 Queens… 1487060 60.7 961015 39.3 2.45e6 100 NA 2.45e6 77.7
4 South … 592528 62.5 356247 37.5 9.49e5 100 NA 9.49e5 79.5
5 Wester… 801575 63.7 455924 36.3 1.26e6 100 NA 1.26e6 78.3
6 Tasman… 191948 63.6 109655 36.4 3.02e5 100 NA 3.02e5 79.5
7 Northe… 48686 60.6 31690 39.4 8.04e4 100 NA 8.04e4 58.2
8 Austra… 175459 74 61520 26 2.37e5 100 NA 2.37e5 82.3
9 Austra… 7817247 61.6 4873987 38.4 1.27e7 100 NA 1.27e7 79.3
10 <NA> NA NA NA NA NA NA NA NA NA
11 (a) In… NA NA NA NA NA NA NA NA NA
12 (b) In… NA NA NA NA NA NA NA NA NA
13 (c) In… NA NA NA NA NA NA NA NA NA
14 <NA> NA NA NA NA NA NA NA NA NA
15 © Comm… NA NA NA NA NA NA NA NA NA
# … with 6 more variables: `Response not clear` <dbl>,
# `Response not clear %` <dbl>, `Non-responding` <dbl>,
# `Non-responding %` <dbl>, `Resp Total` <dbl>, `Resp Total %` <dbl>, and
# abbreviated variable names ¹State_territory, ²`Total %`, ³`Response clear`,
# ⁴`Response clear %`
This data shows the results of the Australian Marriage Law Postal Survey from 2017, which was administered to eligible participants throughout Australia. Surprisingly, I could not find the question on the survey in the actual dataset. A quick google search revealed that the question on the survey was, “Should the law be changed to allow same-sex couples to marry?”
This data shows the responses of eligible participants organized by region (state or territory).
Before we attempt any pivoting, we should clean up this dataset a bit more. Assuming we’re interested in the results of the survey and not so much in how many responses were clear, let’s select just the relevant columns. We can also leave off the percentage columns, since we can calculate those ourselves with the raw counts. We’ll also drop the last few lines since they are endnotes.
# A tibble: 9 × 4
State_territory Yes No Total
<chr> <dbl> <dbl> <dbl>
1 New South Wales 2374362 1736838 4111200
2 Victoria 2145629 1161098 3306727
3 Queensland 1487060 961015 2448075
4 South Australia 592528 356247 948775
5 Western Australia 801575 455924 1257499
6 Tasmania 191948 109655 301603
7 Northern Territory(b) 48686 31690 80376
8 Australian Capital Territory(c) 175459 61520 236979
9 Australia 7817247 4873987 12691234
Now that we have a clean dataset we can take a closer look. Using the dim()
and head()
functions we can see that we have 9 observations and 4 variables. The 4 variables include the state or territory variable which indicates the region within Australia a vote is from, including Australia at the end. Then there is a Yes column with the number of “yes” votes, a No column with the number of “no” votes, and finally a Total column with the total number of votes.
[1] 9 4
# A tibble: 9 × 4
State_territory Yes No Total
<chr> <dbl> <dbl> <dbl>
1 New South Wales 2374362 1736838 4111200
2 Victoria 2145629 1161098 3306727
3 Queensland 1487060 961015 2448075
4 South Australia 592528 356247 948775
5 Western Australia 801575 455924 1257499
6 Tasmania 191948 109655 301603
7 Northern Territory(b) 48686 31690 80376
8 Australian Capital Territory(c) 175459 61520 236979
9 Australia 7817247 4873987 12691234
Let’s use the pivot_longer()
function to move all of the yes and no votes to two columns called “Response” and “Count”. We can also calculate the percentage using these counts and the total column.
aussieMarriagePL <- aussieMarriage2 %>%
pivot_longer(col = c("Yes", "No"),
names_to = "Response",
values_to = "Count") %>%
mutate("Percent" = Count/Total)
aussieMarriageFinal <- select(aussieMarriagePL, State_territory, Response, Count, Percent, Total) # reordering the columns so that Total shows up on the far right.
head(aussieMarriageFinal, 20)
# A tibble: 18 × 5
State_territory Response Count Percent Total
<chr> <chr> <dbl> <dbl> <dbl>
1 New South Wales Yes 2374362 0.578 4111200
2 New South Wales No 1736838 0.422 4111200
3 Victoria Yes 2145629 0.649 3306727
4 Victoria No 1161098 0.351 3306727
5 Queensland Yes 1487060 0.607 2448075
6 Queensland No 961015 0.393 2448075
7 South Australia Yes 592528 0.625 948775
8 South Australia No 356247 0.375 948775
9 Western Australia Yes 801575 0.637 1257499
10 Western Australia No 455924 0.363 1257499
11 Tasmania Yes 191948 0.636 301603
12 Tasmania No 109655 0.364 301603
13 Northern Territory(b) Yes 48686 0.606 80376
14 Northern Territory(b) No 31690 0.394 80376
15 Australian Capital Territory(c) Yes 175459 0.740 236979
16 Australian Capital Territory(c) No 61520 0.260 236979
17 Australia Yes 7817247 0.616 12691234
18 Australia No 4873987 0.384 12691234
This new dataset shows the same information but organized differently. Now each observation shows the number of Yes or No votes in a state/territory, along with the Percent those votes represent out of the total in that region.
---
title: "Challenge 3 - Australian Marriage Survey"
author: "Justine Shakespeare"
description: "Tidy Data: Pivoting"
date: "03/08/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_3
- Justine Shakespeare
- australian_marriage
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(readxl)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
### Read in data
First we'll read in the data. We'll be focusing on the not-yet-tidy dataset on the Australian Marriage Law Survey. After inspecting the data I can see that this dataset includes some additional rows at the top of the sheet and the column names are in multiple rows. For ease of analysis, I'll skip the first few lines and rename the rows when reading in this data.
```{r}
aussieMarriage <- read_excel("_data/australian_marriage_law_postal_survey_2017_-_response_final.xls", sheet = 2, skip = 7, col_names = c("State_territory", "Yes", "Yes %", "No", "No %", "Total", "Total %", "delete", "Response clear", "Response clear %", "Response not clear", "Response not clear %", "Non-responding", "Non-responding %", "Resp Total", "Resp Total %"))
aussieMarriage
```
### Data Description
This data shows the results of the Australian Marriage Law Postal Survey from 2017, which was administered to eligible participants throughout Australia. Surprisingly, I could not find the question on the survey in the actual dataset. A [quick google search](https://www.abs.gov.au/ausstats/abs@.nsf/mf/1800.0) revealed that the question on the survey was, "Should the law be changed to allow same-sex couples to marry?"
This data shows the responses of eligible participants organized by region (state or territory).
### Data Cleaning
Before we attempt any pivoting, we should clean up this dataset a bit more. Assuming we're interested in the *results* of the survey and not so much in how many responses were clear, let's select just the relevant columns. We can also leave off the percentage columns, since we can calculate those ourselves with the raw counts. We'll also drop the last few lines since they are endnotes.
```{r}
aussieMarriage1 <- select(aussieMarriage, "State_territory", "Yes", "No", "Total")
aussieMarriage2 <- slice(aussieMarriage1, 1:(nrow(aussieMarriage) - 6))
aussieMarriage2
```
### Data Review and Pivot
Now that we have a clean dataset we can take a closer look. Using the `dim()` and `head()` functions we can see that we have 9 observations and 4 variables. The 4 variables include the *state or territory* variable which indicates the region within Australia a vote is from, including Australia at the end. Then there is a *Yes* column with the number of "yes" votes, a *No* column with the number of "no" votes, and finally a *Total* column with the total number of votes.
```{r}
dim(aussieMarriage2)
head(aussieMarriage2, 10)
```
Let's use the `pivot_longer()` function to move all of the *yes* and *no* votes to two columns called "Response" and "Count". We can also calculate the percentage using these counts and the total column.
```{r}
aussieMarriagePL <- aussieMarriage2 %>%
pivot_longer(col = c("Yes", "No"),
names_to = "Response",
values_to = "Count") %>%
mutate("Percent" = Count/Total)
aussieMarriageFinal <- select(aussieMarriagePL, State_territory, Response, Count, Percent, Total) # reordering the columns so that Total shows up on the far right.
head(aussieMarriageFinal, 20)
```
This new dataset shows the same information but organized differently. Now each observation shows the number of *Yes* or *No* votes in a state/territory, along with the *Percent* those votes represent out of the total in that region.