HW 3 Continued with the Australian Marriage Dataset

Author

Affiliation

Allyson Beach

 

Published

Oct. 4, 2021

Citation

Beach, 2021

library(tidyverse)
library(readxl)
library(stringr)
library(rmarkdown)
library(here)
data_path <- paste(here(), "/_data/", sep="")
knitr::opts_chunk$set(echo = TRUE)

Our next data wrangling will be with the Australian Marriage Dataset. First we will just read in the excel sheet to see what this shows. Sheet 3 has the marriage data that we are interested in. This is hardcoded and can be adjusted later to be more flexible by referencing the name of the sheet instead of the position.

## Format this data this way:
## Case = division + town 
## Value = only get the counts of 1. yes 2. no 3. no response 4. not clear
aus_data_path <- paste(data_path, "australian_marriage_law_postal_survey_2017_-_response_final.xls", sep="")
aus_data <- read_excel(aus_data_path, sheet=3)
paged_table(aus_data)
ABCDEFGHIJ0123456789
Australian Bureau of Statistics
<chr>
1800.0 Australian Marriage Law Postal Survey, 2017
Released on 15 November 2017
Table 2 Response by Federal Electoral Division(a)
NA
NA
NA
New South Wales Divisions
Banks
Barton
Bennelong

Now that we are able to read in the raw data, we are going to rename the columns to something easier to read and manipulate. We will name the columns that we want to delete “D” plus some number. We can also see that we can skip 6 of the empty rows in the beginning of the data set to make room for these new column names.

aus_data <- read_excel(aus_data_path, sheet=3, skip=6, 
                       col_names = c("Towns", "Yes", "D1", 
                                     "No", "D2", "D3", "D4", 
                                     "D5", "D6","D11", "Response_Not_Clear", 
                                     "D7", "No_Response", "D8", "D9", "D10"))
paged_table(aus_data)
ABCDEFGHIJ0123456789
Towns
<chr>
Yes
<chr>
D1
<chr>
NAno.%
New South Wales DivisionsNANA
Banks3773644.899999999999999
Barton3715343.600000000000001
Bennelong4294349.799999999999997
Berowra4847154.600000000000001
Blaxland2040626.100000000000001
Bradfield5368160.600000000000001
Calare5409160.200000000000003
Chifley3287141.299999999999997

We then check to make sure that our column names align correctly with the correct data. Then we can skip one more row to replace the row that labelled the “no.” and “%” data.

aus_data <- read_excel(aus_data_path, sheet=3, skip=7, 
                       col_names = c("Towns", "Yes", "D1", 
                                     "No", "D2", "D3", "D4", 
                                     "D5", "D6","D11", "Response_Not_Clear", 
                                     "D7", "No_Response", "D8", "D9", "D10"))
paged_table(aus_data)
ABCDEFGHIJ0123456789
Towns
<chr>
Yes
<dbl>
D1
<dbl>
No
<dbl>
D2
<dbl>
New South Wales DivisionsNANANANA
Banks3773644.94634355.1
Barton3715343.64798456.4
Bennelong4294349.84321550.2
Berowra4847154.64036945.4
Blaxland2040626.15792673.9
Bradfield5368160.63492739.4
Calare5409160.23577939.8
Chifley3287141.34670258.7
Cook4750555.03880445.0

Now that we have properly labeled the data, we can remove the columns with the data that we will not use. As mentioned before, we labeled these columns as “D” to make it easier to delete. We will also remove any of the blank rows, the total rows, and the notes starting with “(a)” at the end of the dataset.

aus_data <- aus_data %>% 
  select(!starts_with("d", ignore.case = TRUE)) %>% 
  filter(!grepl("total", Towns, ignore.case = TRUE)) %>% 
  filter(grepl("^[A-Z]", Towns, ignore.case = TRUE)) %>% 
  filter(!is.na(Towns))
paged_table(aus_data)
ABCDEFGHIJ0123456789
Towns
<chr>
Yes
<dbl>
No
<dbl>
Response_Not_Clear
<dbl>
New South Wales DivisionsNANANA
Banks3773646343247
Barton3715347984226
Bennelong4294343215244
Berowra4847140369212
Blaxland2040657926220
Bradfield5368134927202
Calare5409135779285
Chifley3287146702263
Cook4750538804229

Something I learned with using grepl is that the singular * does not work to escape characters. So I had to use a double \* to escape the parentheses in my search for the notes at the bottom of the data set. I am not sure why this is at the moment, but now the dataset has filtered out the subnotes, the total rows, and the columns with the excessive information. Another lesson was that the is.na() takes the name of the columns, not as strings, but as objects.

Now, we should separate the Towns column that contains the divisions and the cities/towns. One approach would be to take any row that contains an NA in “Yes”, “No”, “Response_Not_Clear”, or “No_Response” column. A simpler approach will be to search for anything that contains the word “Divisions” within the Towns column.

aus_data <- aus_data %>% 
  mutate(Divisions = case_when(str_ends(Towns, "Divisions") ~ Towns, TRUE ~ NA_character_)) %>% 
  fill(Divisions) %>% 
  filter(!str_ends(Towns, "Divisions"))

paged_table(aus_data)
ABCDEFGHIJ0123456789
Towns
<chr>
Yes
<dbl>
No
<dbl>
Response_Not_Clear
<dbl>
No_Response
<dbl>
Banks377364634324720928
Barton371534798422624008
Bennelong429434321524419973
Berowra484714036921216038
Blaxland204065792622025883
Bradfield536813492720217261
Calare540913577928525342
Chifley328714670226328180
Cook475053880422918713
Cowper574933831731525197

Now we need to pivot the types of responses into one variable. A case is really defined by the type of response from Town and Division and the count of those responses.

aus_data <- aus_data %>% 
  pivot_longer(c(`Yes`, `No`, `Response_Not_Clear`, `No_Response`), names_to = "Response", values_to = "Count")

paged_table(aus_data)
ABCDEFGHIJ0123456789
Towns
<chr>
Divisions
<chr>
BanksNew South Wales Divisions
BanksNew South Wales Divisions
BanksNew South Wales Divisions
BanksNew South Wales Divisions
BartonNew South Wales Divisions
BartonNew South Wales Divisions
BartonNew South Wales Divisions
BartonNew South Wales Divisions
BennelongNew South Wales Divisions
BennelongNew South Wales Divisions

Footnotes

    Reuse

    Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

    Citation

    For attribution, please cite this work as

    Beach (2021, Oct. 4). DACSS 601 Fall 2021: HW 3 Continued with the Australian Marriage Dataset. Retrieved from https://mrolfe.github.io/DACSS601Fall21/posts/2021-10-04-hw3-continued-allyson-beach/

    BibTeX citation

    @misc{beach2021hw,
      author = {Beach, Allyson},
      title = {DACSS 601 Fall 2021: HW 3 Continued with the Australian Marriage Dataset},
      url = {https://mrolfe.github.io/DACSS601Fall21/posts/2021-10-04-hw3-continued-allyson-beach/},
      year = {2021}
    }