Challenge 3: Australian Marriages

challenge_3
Daniel Hannon
australian_marriage
Author

Daniel Hannon

Published

March 12, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE)

Read in the Data

Code
state_and_territory <-readxl::read_excel("_data/australian_marriage_law_postal_survey_2017_-_response_final.xls", sheet= "Table 1", skip=7, n_max= 9, col_names = c("Division", "Yes Count", "Yes %", "No Count", "No %", "Clear Total Count", "Clear Count %", "White Space", "Clear Count", "Clear %", "Unclear Count", "Unclear %", "Non-response Count", "Non-response %", "Total Count", "Total %"))

federal_electoral_divison <-readxl::read_excel("_data/australian_marriage_law_postal_survey_2017_-_response_final.xls", sheet= "Table 2", skip=7, n_max= 173, col_names = c("Zone", "Yes Count", "Yes %", "No Count", "No %", "Clear Total Count", "Clear Count %", "White Space", "Clear Count", "Clear %", "Unclear Count", "Unclear %", "Non-response Count", "Non-response %", "Total Count", "Total %"))

The data table is a collection of responses to a mailed out survey question to Australians who were registered to vote in 2017. One table is the responses for particular territories and states, while one is the responses for each voting district within each state and territory. Each row shows the number and percent of Yes and No votes of the clear responses. It then breaks down the number and percent of clear responses, the number and percent of unclear responses, and the number and percent of non-responses, or people who didn’t respond at all.

Clean the Data

This data has very unclear columns, with many data points per row and many rows that provide no information. Before I plan the final shape of the tibble I will first remove unnecessary rows and redundant columns. This means removing the column of white space on both sheets, all of the total percentages (its always 100) and the rows showing total counts per division from the second sheet because they are shown on the first sheet. Also removing the rows that show percentages, because they are easy to recover given the Counts and removing the Clear Total number because that can be found by adding the yes and no votes.

Code
#A clean version of the federal_electoral_division data
clean_fed <- federal_electoral_divison %>% 
        select(!c("White Space", "Clear Count", "Clear Total Count") & !contains("%")) %>%
        filter(!grepl("(Total)", Zone)) %>%
        mutate(Division = case_when(str_ends(Zone, "Divisions") ~ Zone,
                                    TRUE ~ NA_character_ )) %>%
        fill(Division, .direction = "down") %>%
        filter(!is.na(`Total Count`))
  
clean_fed
# A tibble: 150 × 7
   Zone      `Yes Count` `No Count` `Unclear Count` Non-respon…¹ Total…² Divis…³
   <chr>           <dbl>      <dbl>           <dbl>        <dbl>   <dbl> <chr>  
 1 Banks           37736      46343             247        20928  105254 New So…
 2 Barton          37153      47984             226        24008  109371 New So…
 3 Bennelong       42943      43215             244        19973  106375 New So…
 4 Berowra         48471      40369             212        16038  105090 New So…
 5 Blaxland        20406      57926             220        25883  104435 New So…
 6 Bradfield       53681      34927             202        17261  106071 New So…
 7 Calare          54091      35779             285        25342  115497 New So…
 8 Chifley         32871      46702             263        28180  108016 New So…
 9 Cook            47505      38804             229        18713  105251 New So…
10 Cowper          57493      38317             315        25197  121322 New So…
# … with 140 more rows, and abbreviated variable names ¹​`Non-response Count`,
#   ²​`Total Count`, ³​Division
Code
#A clean version of the state_and_territory data
clean_sat <- state_and_territory %>% 
              select(!c("White Space", "Clear Count", "Clear Total Count") & !contains("%"))
clean_sat
# A tibble: 9 × 6
  Division                        `Yes Count` `No Count` Uncle…¹ Non-r…² Total…³
  <chr>                                 <dbl>      <dbl>   <dbl>   <dbl>   <dbl>
1 New South Wales                     2374362    1736838   11036 1065445  5.19e6
2 Victoria                            2145629    1161098   11028  743634  4.06e6
3 Queensland                          1487060     961015    7088  695710  3.15e6
4 South Australia                      592528     356247    2778  242027  1.19e6
5 Western Australia                    801575     455924    3188  346333  1.61e6
6 Tasmania                             191948     109655     805   77020  3.79e5
7 Northern Territory(b)                 48686      31690     229   57496  1.38e5
8 Australian Capital Territory(c)      175459      61520     534   50595  2.88e5
9 Australia                           7817247    4873987   36686 3278260  1.60e7
# … with abbreviated variable names ¹​`Unclear Count`, ²​`Non-response Count`,
#   ³​`Total Count`

Reshape the Data

With the data frames clean we can see our starting tibble dimensions are 9x12 for the states and territories (S.a.T.) table and 150x12 for the federal electoral division (F.E.D.) table. For the S.a.T. table in order to make it tidy We want each row to contain one observation. I plan on pivoting the table longer so that each observation consist of the count for one type of response, with 4 responses total (Yes, No, Unclear, Unresponsive, or total) for each Division. This will leave me with 3 total columns and 45 rows.

For the F.E.D. table we will do the same pivot except we will add an extra column for division, so that each electoral zone can be group by which territory division they fall into. This should leave the F.E.D table with dimensions 750x4.

Code
tidy_sat <- clean_sat %>%
        pivot_longer(-c(`Division`),
        names_to = c('Vote Type', '.value'),
        names_sep = ('\\s')
        )

tidy_sat
# A tibble: 45 × 3
   Division        `Vote Type`    Count
   <chr>           <chr>          <dbl>
 1 New South Wales Yes          2374362
 2 New South Wales No           1736838
 3 New South Wales Unclear        11036
 4 New South Wales Non-response 1065445
 5 New South Wales Total        5187681
 6 Victoria        Yes          2145629
 7 Victoria        No           1161098
 8 Victoria        Unclear        11028
 9 Victoria        Non-response  743634
10 Victoria        Total        4061389
# … with 35 more rows
Code
tidy_fed <- clean_fed %>%
        pivot_longer(-c(`Zone`, `Division`),
        names_to = c('Vote Type', '.value'),
        names_sep = ('\\s')
        )

tidy_fed
# A tibble: 750 × 4
   Zone   Division                  `Vote Type`   Count
   <chr>  <chr>                     <chr>         <dbl>
 1 Banks  New South Wales Divisions Yes           37736
 2 Banks  New South Wales Divisions No            46343
 3 Banks  New South Wales Divisions Unclear         247
 4 Banks  New South Wales Divisions Non-response  20928
 5 Banks  New South Wales Divisions Total        105254
 6 Barton New South Wales Divisions Yes           37153
 7 Barton New South Wales Divisions No            47984
 8 Barton New South Wales Divisions Unclear         226
 9 Barton New South Wales Divisions Non-response  24008
10 Barton New South Wales Divisions Total        109371
# … with 740 more rows