Tidy Australian Marriage data

challenge_3
australian_marriage
Sahan Prasad Podduturi Reddy
Author

Sahan Prasad Podduturi Reddy

Published

March 18, 2023

Introduction

I was trying to analyze the ‘australian_marriage*.xls’ dataset by splitting the dataset into two separate dataframes and then tidying the data to interpret the results. This dataset summarizes the responses given by eligible participants of the Australian Marriage Law Postal Survey, 2017. We first start by importing the necessary libraries and setting the working directory to point to the location where the spreadsheet is located.

Code
library(tidyverse)
library(readxl)
library(DescTools)
library(kableExtra)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
setwd("D:/MyDocs/Class Slides/DACSS601/601_Spring_2023/posts/_data")

Read in the Data

Looking at the dataset, we can see that the data can be split up into two separate dataframes which makes it much easier to understand the context of the data we are trying to analyze. We use read_excel() to read in the excel sheet and then use “sheet=2” and “skip=6” as parameters. This removes the first six lines of the sheet as we import data into our dataframe. We can then assign a different column name to each column using colnames().We then use slice() to get only the data regarding the states/territories and ignore the remaining text in the spreadsheet.

Code
setwd("D:/MyDocs/Class Slides/DACSS601/601_Spring_2023/posts/_data")
dataframe <- read_excel("australian_marriage_law_postal_survey_2017_-_response_final.xls", sheet = 2, skip = 6)
dataframe <- slice(dataframe, 1:9)
colnames(dataframe) <- c("State/Territory", "Response Clear Yes", "Response Clear Yes %", "Response Clear No", "Response Clear No %", "Response Clear Total", "Response Clear Total %", "Blank", "Response Clear Total Eligible", "Response Clear Total Eligible %", "Response Not Clear Total Eligible", "Response Not Clear Total Eligible %", "No Response", "No Response %", "Total", "Total %")
print(dataframe)
# A tibble: 9 × 16
  State/…¹ Respo…² Respo…³ Respo…⁴ Respo…⁵ Respo…⁶ Respo…⁷ Blank Respo…⁸ Respo…⁹
  <chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <lgl>   <dbl>   <dbl>
1 New Sou… 2374362    57.8 1736838    42.2  4.11e6     100 NA     4.11e6    79.2
2 Victoria 2145629    64.9 1161098    35.1  3.31e6     100 NA     3.31e6    81.4
3 Queensl… 1487060    60.7  961015    39.3  2.45e6     100 NA     2.45e6    77.7
4 South A…  592528    62.5  356247    37.5  9.49e5     100 NA     9.49e5    79.5
5 Western…  801575    63.7  455924    36.3  1.26e6     100 NA     1.26e6    78.3
6 Tasmania  191948    63.6  109655    36.4  3.02e5     100 NA     3.02e5    79.5
7 Norther…   48686    60.6   31690    39.4  8.04e4     100 NA     8.04e4    58.2
8 Austral…  175459    74     61520    26    2.37e5     100 NA     2.37e5    82.3
9 Austral… 7817247    61.6 4873987    38.4  1.27e7     100 NA     1.27e7    79.3
# … with 6 more variables: `Response Not Clear Total Eligible` <dbl>,
#   `Response Not Clear Total Eligible %` <dbl>, `No Response` <dbl>,
#   `No Response %` <dbl>, Total <dbl>, `Total %` <dbl>, and abbreviated
#   variable names ¹​`State/Territory`, ²​`Response Clear Yes`,
#   ³​`Response Clear Yes %`, ⁴​`Response Clear No`, ⁵​`Response Clear No %`,
#   ⁶​`Response Clear Total`, ⁷​`Response Clear Total %`,
#   ⁸​`Response Clear Total Eligible`, ⁹​`Response Clear Total Eligible %`

Brief Description

We first create a subset containing information on total number of 1)clear responses, 2)unclear responses, and 3)no responses by eligible participants in the survey over different states/territories. Then from this subset, we select only the participants from the “Clear Response” column and analyze their responses to the survey in another dataframe.

In our third dataframe, we take the number of clear responses and sort them by ‘Yes’ and ‘No’ Response per each state and calculate how much they account for over the total Response Count. This is the data that we are trying to tidy. We are trying to clean this data because the format can be better. Currently we have separate columns for ‘Yes’ and ‘No’ response counts. We can easily group these into a name column named “Response” holding the response value using pivot_longer() and place the response counts in a separate Response Count column. This makes it much simpler to interpret the data at a glance.

We can use nrow() and ncol() to find that the current dataset we’re working with has 9 rows and 4 cols.

Code
dataframe2 <- select(dataframe, "State/Territory", "Response Clear Total Eligible", "Response Clear Total Eligible %", "Response Not Clear Total Eligible", "Response Not Clear Total Eligible %", "No Response", "No Response %", "Total", "Total %")
colnames(dataframe2) <- c("State/Territory", "Clear Response", "Clear Response %", "Unclear Response", "Unclear Response %", "No Response", "No Response %", "Total Eligible Participants", "Total %")

kable(select(dataframe2, "State/Territory", "Clear Response", "Unclear Response", "No Response", "Total Eligible Participants"), "html") %>% 
  column_spec(2, background = "yellow")
State/Territory Clear Response Unclear Response No Response Total Eligible Participants
New South Wales 4111200 11036 1065445 5187681
Victoria 3306727 11028 743634 4061389
Queensland 2448075 7088 695710 3150873
South Australia 948775 2778 242027 1193580
Western Australia 1257499 3188 346333 1607020
Tasmania 301603 805 77020 379428
Northern Territory(b) 80376 229 57496 138101
Australian Capital Territory(c) 236979 534 50595 288108
Australia 12691234 36686 3278260 16006180
Code
dataframe3 <- select(dataframe, "State/Territory", "Response Clear Yes", "Response Clear Yes %", "Response Clear No", "Response Clear No %")
colnames(dataframe3) <- c("State/Territory", "Yes", "Yes %", "No", "No %")
print(dataframe3)
# A tibble: 9 × 5
  `State/Territory`                   Yes `Yes %`      No `No %`
  <chr>                             <dbl>   <dbl>   <dbl>  <dbl>
1 New South Wales                 2374362    57.8 1736838   42.2
2 Victoria                        2145629    64.9 1161098   35.1
3 Queensland                      1487060    60.7  961015   39.3
4 South Australia                  592528    62.5  356247   37.5
5 Western Australia                801575    63.7  455924   36.3
6 Tasmania                         191948    63.6  109655   36.4
7 Northern Territory(b)             48686    60.6   31690   39.4
8 Australian Capital Territory(c)  175459    74     61520   26  
9 Australia                       7817247    61.6 4873987   38.4
Code
nrow(dataframe3)
[1] 9
Code
ncol(dataframe3)
[1] 5

Anticipate the data

We are first going to pivot the data based on columns “Yes” and “No” using the pivot_longer() command. Using nrow() and ncol(), we find that we have 9 rows and 5 cols in our dataframe. Using pivot_longer, we are pivoting 2 variable names into the names_to column - “Response”. So we will end up with 9*(2) = 18 rows in the dataframe. The cols will remain the same(5 columns) because we are removing 2 cols and adding 2 cols with the “names_to” and “values_to” attributes.

Next we will want to substitute “Yes %” and “No %” with a single column - “Percentage Response Count” for the dataframe such that column value for a particular row is equal to value in “Yes %” if “Response == Yes”, otherwise value will be equal to value in “No %”. Thus overall, we will lose one column and end up with 4 columns.

Initial Dimensions = 9 x 5 Final Prediction = 18 x 4

Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check. We first use the pivot_longer() function to pivot the “Yes” and “No” columns and then apply the mutate() function to get the “Percentage Response Count” column. In the end, we print out the number of rows and columns to verify our prediction.

Code
dataframe4 <- dataframe3 %>% pivot_longer(cols=c("Yes","No"), names_to = "Response", values_to = "Response Count") %>%
  mutate(`Percentage Response Count` = case_when(
    Response == "Yes" ~ `Yes %`,
    Response == "No" ~ `No %`
  )) %>%
  select(-"Yes %", -"No %")

print(dataframe4)
# A tibble: 18 × 4
   `State/Territory`               Response `Response Count` Percentage Respon…¹
   <chr>                           <chr>               <dbl>               <dbl>
 1 New South Wales                 Yes               2374362                57.8
 2 New South Wales                 No                1736838                42.2
 3 Victoria                        Yes               2145629                64.9
 4 Victoria                        No                1161098                35.1
 5 Queensland                      Yes               1487060                60.7
 6 Queensland                      No                 961015                39.3
 7 South Australia                 Yes                592528                62.5
 8 South Australia                 No                 356247                37.5
 9 Western Australia               Yes                801575                63.7
10 Western Australia               No                 455924                36.3
11 Tasmania                        Yes                191948                63.6
12 Tasmania                        No                 109655                36.4
13 Northern Territory(b)           Yes                 48686                60.6
14 Northern Territory(b)           No                  31690                39.4
15 Australian Capital Territory(c) Yes                175459                74  
16 Australian Capital Territory(c) No                  61520                26  
17 Australia                       Yes               7817247                61.6
18 Australia                       No                4873987                38.4
# … with abbreviated variable name ¹​`Percentage Response Count`
Code
nrow(dataframe4)
[1] 18
Code
ncol(dataframe4)
[1] 4