Working with Australian Marriages Laws.
The first thing I need to do is read the data into R. For the sake of spelling I will call this “amarriages” and I’ll use the HERE function. This time I’ll also focus on leaving all of the library functions I need in setup. Since I will use the functions in a more systematic way instead of a explaining my process way.
#I'll name this amarriages due to this being australian marriages
amarriages <- read_excel(here("_data", "australian_marriage_law_postal_survey_2017_-_response_final.xls"))
kable(amarriages [1:4, ], caption = "Here is untidy data of Australian Marriages")
Australian Bureau of Statistics | …2 | …3 |
---|---|---|
1800.0 Australian Marriage Law Postal Survey, 2017 | NA | NA |
Released on 15 November 2017 | NA | NA |
NA | NA | NA |
NA | Contents | NA |
I noticed that what’s happening is I’m reading in the first tab! This is causing some errors about gathering the data I would want. After looking around I found that by default read_excel reads the first sheet of the worksheet. Since we want to focus on tab 3, also ironically named table 2, I’ll redo the read function. This time I’ll use a page table.
#I'll name this marriages3 due to the sheet number
marriages3 <- read_excel(here("_data", "australian_marriage_law_postal_survey_2017_-_response_final.xls"), 3)
paged_table(marriages3)
The paged table worked to show off what I wanted. Yay! However the columns are…a mess, and that’s to put it lightly. Looking at the original file it’s because all the columns are starting differently. Lets try to get rid of the first 6 columns.
#I will name this marriages6 due to the 6 skips
marriages6 <- read_excel(here(
"_data", "australian_marriage_law_postal_survey_2017_-_response_final.xls"),
3, skip=6)
paged_table(marriages6)
It seems this is working out correctly! However we just lost a lot of information that was important. We don’t know what the columns mean anymore. Luckily if we go to “marriages3” we can see what everything was! Lets go ahead and pipe in column names. Also a FYI putting sheet makes it easier to understand what sheet you’re using!
#marriage7 for the sake of keeping different databases
marriages7 <- read_excel(here("_data", "australian_marriage_law_postal_survey_2017_-_response_final.xls"),
sheet= 3, skip= 7,
col_names= c("Cities", "Response Clear Yes Number", "Response Clear Yes Percantage",
"Response Clear No Number", "Response Clear No Percantage",
"Total Number Response Clear", "Total Response Percantage", "d1",
"Eligible Participants Yes Number","Eligible Participants Yes Percantage",
"Eligible Participants Not Clear Number", "Eligible Participants Not Clear Percantage",
"Non-Responding Number", "Non-Responding Percantage",
"Total Number Eligable Participants", "Total Eligable Percent" ))
paged_table(marriages7)
So after forgetting you needed to use col_names not colnames…I finally made a table that has the correct column names! This means I can go ahead and remove the one more row with the % and no. in it. However looking at the table there’s a glaring problem, we have divisions, but not a way to deal with the different divisions. This will be a later problem. In this section lets do some cleaning up. Lets select out d1 using pipping and filter out any n/a columns or (Total) columns in “Divisions”.
marriages7 <- marriages7 %>%
select(!starts_with("d1")) %>%
filter(!str_detect(Cities, "NA")) %>%
filter(!str_detect(Cities, "(Total)"))
I noticed there’s also sub notes in this column and the divisions so there’s no time like the present than to try to tackle this. Lets try to add columns using mutate. Realizing my mistake of calling the cities divisions I went back to rename the column “Cities”.
marriages7 <- marriages7 %>%
mutate(Footnotes = case_when(
endsWith(Cities, "c)") ~ "(c) Includes Christmas Island and the Cocos (Keeling) Islands",
endsWith(Cities, "d)") ~ "(d) Includes Norfolk Island",
endsWith(Cities, "e)") ~ "(e) Includes Jervis Bay",
))
paged_table(marriages7)
What I would like to do is add a column for the divisions in a way you’d do it in excel. You would add a column and then between the two “divisions” number you would add the division to the side so it could still be there. However this doesn’t seem to be useful information for now and it doesn’t seem easy after looking around so for now I’ll remove the divisions columns. I also attempted to use filter for the footnotes but some of them weren’t deleting so I ended up just removing the last five columns using slice.
marriages7 <- marriages7 %>%
filter(!str_detect(Cities, "Divisions")) %>%
filter(!str_detect(Cities, "Australia")) %>%
filter(!str_detect(Cities, "(a) The Federal Electoral Division boundaries are current as at 24 August 2017
")) %>%
slice(-c(151:155))
paged_table(marriages7)
I want to replace the NAs in the Footnote column with nothing due to the fact you wouldn’t see a na most of the time when it’s only referencing footnotes.
marriages <- marriages7 %>%
replace_na(list(Footnotes = " "))
paged_table(marriages)
Yay now it looks somewhat clean! Lets see if I can use these two functions. I have renamed this to “marriages” since this is the final copy for this homework.
marriages %>%
arrange(marriages, "Cities")
# A tibble: 150 x 16
Cities `Response Clear Ye~ `Response Clear Ye~ `Response Clear ~
<chr> <dbl> <dbl> <dbl>
1 Adelaide 62769 70.1 26771
2 Aston 48455 62 29730
3 Ballarat 65613 70.5 27405
4 Banks 37736 44.9 46343
5 Barker 42498 52.3 38827
6 Barton 37153 43.6 47984
7 Bass 36249 61.7 22510
8 Batman 66383 71.2 26906
9 Bendigo 63412 68.7 28852
10 Bennelong 42943 49.8 43215
# ... with 140 more rows, and 12 more variables:
# Response Clear No Percantage <dbl>,
# Total Number Response Clear <dbl>,
# Total Response Percantage <dbl>,
# Eligible Participants Yes Number <dbl>,
# Eligible Participants Yes Percantage <dbl>,
# Eligible Participants Not Clear Number <dbl>, ...
marriages %>%
select(contains("number")) %>%
summarize_all(median, na.rm = TRUE)
# A tibble: 1 x 7
`Response Clear ~ `Response Clear~ `Total Number R~ `Eligible Parti~
<dbl> <dbl> <dbl> <dbl>
1 51782. 31654. 85726. 85726.
# ... with 3 more variables:
# Eligible Participants Not Clear Number <dbl>,
# Non-Responding Number <dbl>,
# Total Number Eligable Participants <dbl>
marriages %>%
select(ends_with("e")) %>%
summarise_all(median, na.rm = TRUE)
# A tibble: 1 x 6
`Response Clear ~ `Response Clear~ `Total Response~ `Eligible Parti~
<dbl> <dbl> <dbl> <dbl>
1 61.8 38.2 100 79.7
# ... with 2 more variables:
# Eligible Participants Not Clear Percantage <dbl>,
# Non-Responding Percantage <dbl>
Summarise () was by hardest for this homework to use. It didn’t seem to work easily and it was frustrating to get to use at all. After reading the tutorial it got a lot easier but I still think I need to figure out where this data would be useful for this specific dataset. I think I bit off more than I could chew for this and next time I’ll pick a data set that’s a bit easier to read.
From what I understand of this data it’s trying to understand who is married based on the census taken in Australian. The government also broke them into two categories due to some people selecting more than one box, ignoring the question, or answering properly. The way I would deal with this is look more into the Eligible Participants Tab. This is because I noticed that all cities in “Response not clear” is between .1%-.3% which seem odd for multiple areas of their country. It also may be interesting to see a graph of the Responses that were clear for yes and no by division. Unfortunately I wouldn’t know how to do that at this point.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Hackbarth (2021, Sept. 23). DACSS 601 Fall 2021: Homework 3. Retrieved from https://mrolfe.github.io/DACSS601Fall21/posts/2021-09-23-homework-3-molly-hackbarth/
BibTeX citation
@misc{hackbarth2021homework, author = {Hackbarth, Molly}, title = {DACSS 601 Fall 2021: Homework 3}, url = {https://mrolfe.github.io/DACSS601Fall21/posts/2021-09-23-homework-3-molly-hackbarth/}, year = {2021} }