Data Analytics and Computational Social Science: Australian Survey About Marriage Law Change

Tory Bartelloni

Introduction

This page walks through my brief analysis of a survey conducted by the Australian Bureau of Statistics from September to November 2017. The survey was run through the postal service, was completely voluntary, and was sent to all registered voters in Australia to gauge the public’s opinion on changing the law to allow same-sex couples to marry. The intent of this page is to clearly outline my thought process and the coding steps I took to practice reading, wrangling, and operating on a not-so-tidy data set.

Reading and Understanding the Available Data

First, we will read in the data from the results of the survey.

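library(readxl)     # provides read_xls()
library(tidyverse)  # provides %>%, the dplyr verbs, and ggplot2 used later
am_survey <- read_xls("australian_marriage_law_postal_survey_2017_-_response_final.xls")
am_survey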

# A tibble: 23 x 3
   `Australian Bureau of Statistics`                  ...2       ...3 
   <chr>                                              <chr>      <chr>
 1 1800.0 Australian Marriage Law Postal Survey, 2017 <NA>       <NA> 
 2 Released on 15 November 2017                       <NA>       <NA> 
 3 <NA>                                               <NA>       <NA> 
 4 <NA>                                               Contents   <NA> 
 5 <NA>                                               Tables     <NA> 
 6 <NA>                                               Table 1    Resp~
 7 <NA>                                               Table 2    Resp~
 8 <NA>                                               <NA>       <NA> 
 9 <NA>                                               Explanato~ <NA> 
10 <NA>                                               <NA>       <NA> 
# ... with 13 more rows

After reading and viewing the imported data, we notice that this first sheet is a contents page for the rest of the workbook. It lists three links, including “Table 1” and “Table 2”, which are likely of interest. Next we will read in these two sheets and review them as well.
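The sheet names can also be listed directly instead of being read off the contents page. Here is a minimal sketch using `excel_sheets()` from readxl. It runs against an example workbook bundled with readxl (an assumption made so the snippet works without the survey file); pointing it at the survey workbook's path should list that file's sheets in the same way.

```r
library(readxl)

# Example workbook that ships with readxl, used here as a stand-in.
# Swap in the survey file's path to list its sheets instead.
example_path <- readxl_example("datasets.xls")

# excel_sheets() returns the sheet names as a character vector.
sheets <- excel_sheets(example_path)
print(sheets)
```

Any of the returned names can then be passed to the `sheet` argument of `read_xls()`.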

am_survey_tbl1 <- read_xls("australian_marriage_law_postal_survey_2017_-_response_final.xls", sheet = "Table 1")
am_survey_tbl2 <- read_xls("australian_marriage_law_postal_survey_2017_-_response_final.xls", sheet = "Table 2")

am_survey_tbl1

# A tibble: 21 x 16
   `Australian Burea~` ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9 
   <chr>               <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <chr>
 1 1800.0 Australian ~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  NA    <NA> 
 2 Released on 15 Nov~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  NA    <NA> 
 3 Table 1 Response b~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  NA    <NA> 
 4 <NA>                Resp~ <NA>  <NA>  <NA>  <NA>  <NA>  NA    Elig~
 5 <NA>                Yes   <NA>  No    <NA>  Total <NA>  NA    Resp~
 6 <NA>                no.   %     no.   %     no.   %     NA    no.  
 7 New South Wales     2374~ 57.7~ 1736~ 42.2~ 4111~ 100   NA    4111~
 8 Victoria            2145~ 64.9~ 1161~ 35.1~ 3306~ 100   NA    3306~
 9 Queensland          1487~ 60.7~ 9610~ 39.2~ 2448~ 100   NA    2448~
10 South Australia     5925~ 62.5  3562~ 37.5  9487~ 100   NA    9487~
# ... with 11 more rows, and 7 more variables: ...10 <chr>,
#   ...11 <chr>, ...12 <chr>, ...13 <chr>, ...14 <chr>, ...15 <chr>,
#   ...16 <chr>

am_survey_tbl2

# A tibble: 190 x 16
   `Australian Burea~` ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9 
   <chr>               <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <chr>
 1 1800.0 Australian ~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  NA    <NA> 
 2 Released on 15 Nov~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  NA    <NA> 
 3 Table 2 Response b~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  NA    <NA> 
 4 <NA>                Resp~ <NA>  <NA>  <NA>  <NA>  <NA>  NA    Elig~
 5 <NA>                Yes   <NA>  No    <NA>  Total <NA>  NA    Resp~
 6 <NA>                no.   %     no.   %     no.   %     NA    no.  
 7 New South Wales Di~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  NA    <NA> 
 8 Banks               37736 44.8~ 46343 55.1~ 84079 100   NA    84079
 9 Barton              37153 43.6~ 47984 56.3~ 85137 100   NA    85137
10 Bennelong           42943 49.7~ 43215 50.2~ 86158 100   NA    86158
# ... with 180 more rows, and 7 more variables: ...10 <chr>,
#   ...11 <chr>, ...12 <chr>, ...13 <chr>, ...14 <chr>, ...15 <chr>,
#   ...16 <chr>

Reviewing both sheets, we see that Table 1 aggregates the data in Table 2 by State/Territory. From here on we can focus on Table 2, since it holds all of the underlying data that we are interested in.

Wrangling and Cleaning the Data

Focusing on Table 2, we see that several of the first rows are used for description, several more rows are used to describe groups of variables, several columns are duplicates or aggregates, and it includes title and sub-total lines for each State/Territory.

To deal with this we will 1) re-read the data and skip the descriptive rows, 2) select only the columns that we will need, 3) add a State/Territory variable to identify the Divisions, and 4) remove the title and sub-total lines. We will also take this opportunity to assign appropriate names to our variables.

# Read in Table 2, skipping the first 7 rows of miscellaneous information.
# The row after those is used as the header and is renamed below.
am_survey_final <- read_xls("australian_marriage_law_postal_survey_2017_-_response_final.xls", 
                           sheet = "Table 2", skip=7)
# Subset to only the columns we're interested in. Will be keeping: Yes answers, No answers,
# Eligible participants with clear responses, without clear responses, and non-responders.
am_survey_final <- am_survey_final[,c(1,2,4,6,11,13)]

# Add a State_Territory variable (typed as character so text can be assigned to it)
am_survey_final$State_Territory <- NA_character_

# Assign appropriate values to the State_Territory variable
am_survey_final[1:47,]$State_Territory <- "New South Wales"
am_survey_final[51:87,]$State_Territory <- "Victoria"
am_survey_final[91:120,]$State_Territory <- "Queensland"
am_survey_final[124:134,]$State_Territory <- "South Australia"
am_survey_final[138:153,]$State_Territory <- "Western Australia"
am_survey_final[157:161,]$State_Territory <- "Tasmania"
am_survey_final[165:166,]$State_Territory <- "Northern Territory"
am_survey_final[170:171,]$State_Territory <- "Australian Capital Territory"

# Remove all total and title lines
am_survey_final <- am_survey_final %>% filter(!is.na(State_Territory))

# Add clear column names
colnames(am_survey_final) <- c("Division","Yes","No","Clear Responses","Not Clear Responses","Non-Responders","State_Territory")

Alright, let’s check it out.

str(am_survey_final)

tibble [150 x 7] (S3: tbl_df/tbl/data.frame)
 $ Division           : chr [1:150] "Banks" "Barton" "Bennelong" "Berowra" ...
 $ Yes                : num [1:150] 37736 37153 42943 48471 20406 ...
 $ No                 : num [1:150] 46343 47984 43215 40369 57926 ...
 $ Clear Responses    : num [1:150] 84079 85137 86158 88840 78332 ...
 $ Not Clear Responses: num [1:150] 247 226 244 212 220 202 285 263 229 315 ...
 $ Non-Responders     : num [1:150] 20928 24008 19973 16038 25883 ...
 $ State_Territory    : chr [1:150] "New South Wales" "New South Wales" "New South Wales" "New South Wales" ...

rmarkdown::paged_table(am_survey_final)

Yay! We have a decent data set to work with!

Finalized Variable Definitions

Below is a list of variable definitions to better understand what variables we have decided to take.

Division: This is the name of the Federal Electoral Division that the other variables in the row describe.
Yes: This is a count of how many responses from the Division were clearly “Yes”.
No: This is a count of how many responses from the Division were clearly “No”.
Clear Responses: This is a total count of how many responses were clear and used.
Not Clear Responses: This is a total count of how many responses were received, but not clear and therefore not used.
Non-Responders: This is a count of how many eligible persons did not respond to the survey.
State_Territory: This is an identifier of the Federal State/Territory that the other variables are referencing.
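These definitions imply that Yes plus No should equal Clear Responses for every Division. The sketch below checks this on the first three Divisions, using the values shown in the `str()` output above. The data frame and its column names are mine (shortened so they need no backticks), and the `Eligible` column is my own reading of the definitions rather than a figure taken from the survey file.

```r
# First three Divisions, values copied from the str() output above.
checks <- data.frame(
  Division       = c("Banks", "Barton", "Bennelong"),
  Yes            = c(37736, 37153, 42943),
  No             = c(46343, 47984, 43215),
  Clear          = c(84079, 85137, 86158),
  Not_Clear      = c(247, 226, 244),
  Non_Responders = c(20928, 24008, 19973)
)

# Yes + No should reproduce the Clear Responses count.
checks$Clear_OK <- checks$Yes + checks$No == checks$Clear

# Assumption: eligible persons = clear + not clear + non-responders.
checks$Eligible <- checks$Clear + checks$Not_Clear + checks$Non_Responders

print(checks)
```

Running the same comparison on the full data set would be a cheap way to confirm that the column subset picked the intended columns.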

Explore the Data

Now we will explore the data. As a first look, we filter to New South Wales and sort its Divisions by the number of Yes responses, from highest to lowest.

am_survey_NSW <- am_survey_final %>% filter(State_Territory == "New South Wales") %>%
  arrange(desc(Yes))

rmarkdown::paged_table(am_survey_NSW)

And, for fun, let’s plot some of the results. I am interested to see how the population responded in terms of proportion of YES and NO responses and how that distribution may be impacted by the State/Territory of the populations.

So what we will do is calculate the proportions of responses, plot the distribution, and highlight the results by State/Territory to see what, if any, patterns emerge.

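# Each Division is its own group here, so the sums below are per-Division values.
am_survey_ST_grouped <- am_survey_final %>% group_by(State_Territory,Division) %>%
  summarise(Total_Responses = sum(`Clear Responses`,`Not Clear Responses`),
            # Share of clear responses that were Yes (a proportion between 0 and 1)
            Married_Perc = sum(Yes)/sum(Yes+No),
            .groups = "drop") %>%
  arrange(desc(Total_Responses))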
rmarkdown::paged_table(am_survey_ST_grouped)
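To make the `Married_Perc` calculation concrete, here is a small worked example in base R using the first three Divisions from the Table 2 preview above. It is only a sketch, and the vector names are mine rather than part of the analysis.

```r
# Yes and No counts for the first three Divisions shown in the Table 2 preview.
yes <- c(Banks = 37736, Barton = 37153, Bennelong = 42943)
no  <- c(Banks = 46343, Barton = 47984, Bennelong = 43215)

# Per-Division share of clear responses that were Yes.
yes_share <- yes / (yes + no)
print(round(100 * yes_share, 1))

# Pooling the Divisions weights each one by its number of clear responses,
# which is generally not the same as averaging the per-Division shares.
pooled_share <- sum(yes) / sum(yes + no)
print(round(100 * pooled_share, 1))
```

The histogram that follows plots the per-Division shares, so every Division counts once regardless of its size.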

am_survey_ST_grouped %>% ggplot(aes(x=Married_Perc)) +
  geom_histogram(color="black",aes(fill=State_Territory),binwidth = 0.025) +
  # Married_Perc is a proportion, so show the axis ticks as percentages
  scale_x_continuous(labels = scales::percent) +
  labs(title = "Survey of Australians About Marriage Law Change",
       subtitle = "Should the law be changed to allow same sex couples to marry?",
       x="Percent of Clear Responses That Were YES",
       y="Count of Divisions")

Interesting… No conclusions today, and a lot of unanswered questions, but this is a good start to understanding the opinions of eligible Australian voters on the legality of same-sex marriage.

Some notes before we leave.

I would like to better understand and apply formatting within Rmarkdown/Distill documents. This page is not as clear as I would want it to be without additional adjustments.
There are a couple of cleaning/wrangling operations that I would like to learn to perform in a more efficient manner. For instance, assigning the State_Territory variable was entirely manual and not scalable.
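On that second note, one possible approach is to derive the State/Territory from the title rows themselves and carry it down with `tidyr::fill()`. The sketch below runs on a toy data frame with made-up counts, not on the survey file. It assumes that title rows end in " Divisions", that sub-total rows end in "(Total)", and that the first title row is kept as data rather than consumed as the header. All three assumptions would need checking against the real sheet.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for Table 2 (counts are made up for illustration).
toy <- tibble(
  Division = c("New South Wales Divisions", "Banks", "Barton",
               "New South Wales (Total)", NA,
               "Victoria Divisions", "Aston", "Ballarat", "Victoria (Total)"),
  Yes = c(NA, 10, 20, 30, NA, NA, 5, 15, 20)
)

tidy <- toy %>%
  # A title row matches "<State> Divisions" and carries no counts.
  mutate(State_Territory = if_else(
    grepl(" Divisions$", Division) & is.na(Yes),
    sub(" Divisions$", "", Division),
    NA_character_
  )) %>%
  # Carry each state name down to the rows beneath its title row.
  fill(State_Territory) %>%
  # Drop title rows, blank rows, and sub-total rows.
  filter(!is.na(Yes), !grepl("\\(Total\\)$", Division))

print(tidy)
```

Because nothing depends on row positions, the same steps should keep working if Divisions are added or removed, as long as the title and total rows keep their labels.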
