DACSS 601 Fall 2021: Homework 3

Molly Hackbarth

knitr::opts_chunk$set(echo = FALSE)
library(tidyverse)
library(here)
library(readxl)
library(stringr)
library(rmarkdown)
library(knitr)
library(formattable)

Reading Data into R

The first thing I need to do is read the data into R. For the sake of spelling I will call this “amarriages” and I’ll use the HERE function. This time I’ll also focus on leaving all of the library functions I need in setup. Since I will use the functions in a more systematic way instead of a explaining my process way.

#I'll name this amarriages due to this being australian marriages
amarriages <- read_excel(here("_data", "australian_marriage_law_postal_survey_2017_-_response_final.xls"))
kable(amarriages [1:4, ], caption = "Here is untidy data of Australian Marriages")

(#tab:reading in data)Here is untidy data of Australian Marriages
Australian Bureau of Statistics	…2	…3
1800.0 Australian Marriage Law Postal Survey, 2017	NA	NA
Released on 15 November 2017	NA	NA
NA	NA	NA
NA	Contents	NA

Issues #1 out of probably very many - The wrong sheet

I noticed that what’s happening is I’m reading in the first tab! This is causing some errors about gathering the data I would want. After looking around I found that by default read_excel reads the first sheet of the worksheet. Since we want to focus on tab 3, also ironically named table 2, I’ll redo the read function. This time I’ll use a page table.

#I'll name this marriages3 due to the sheet number
marriages3 <- read_excel(here("_data", "australian_marriage_law_postal_survey_2017_-_response_final.xls"), 3)
paged_table(marriages3)

Issue #2 out of too many to count - Columns are all wrong

The paged table worked to show off what I wanted. Yay! However the columns are…a mess, and that’s to put it lightly. Looking at the original file it’s because all the columns are starting differently. Lets try to get rid of the first 6 columns.

#I will name this marriages6 due to the 6 skips
marriages6 <- read_excel(here(
  "_data", "australian_marriage_law_postal_survey_2017_-_response_final.xls"), 
                         3, skip=6)
paged_table(marriages6)

Issue inside the table - the loss of knowing column names

It seems this is working out correctly! However we just lost a lot of information that was important. We don’t know what the columns mean anymore. Luckily if we go to “marriages3” we can see what everything was! Lets go ahead and pipe in column names. Also a FYI putting sheet makes it easier to understand what sheet you’re using!

#marriage7 for the sake of keeping different databases
marriages7 <- read_excel(here("_data", "australian_marriage_law_postal_survey_2017_-_response_final.xls"), 
                         sheet= 3, skip= 7,
  col_names= c("Cities", "Response Clear Yes Number", "Response Clear Yes Percantage", 
               "Response Clear No Number", "Response Clear No Percantage", 
               "Total Number Response Clear", "Total Response Percantage", "d1", 
               "Eligible Participants Yes Number","Eligible Participants Yes Percantage", 
               "Eligible Participants Not Clear Number", "Eligible Participants Not Clear Percantage", 
               "Non-Responding Number", "Non-Responding Percantage", 
               "Total Number Eligable Participants", "Total Eligable Percent" ))
paged_table(marriages7)

More Issues inside the table - extra columns and rows

So after forgetting you needed to use col_names not colnames…I finally made a table that has the correct column names! This means I can go ahead and remove the one more row with the % and no. in it. However looking at the table there’s a glaring problem, we have divisions, but not a way to deal with the different divisions. This will be a later problem. In this section lets do some cleaning up. Lets select out d1 using pipping and filter out any n/a columns or (Total) columns in “Divisions”.

marriages7 <- marriages7 %>%
  select(!starts_with("d1")) %>%
  filter(!str_detect(Cities, "NA")) %>%
  filter(!str_detect(Cities, "(Total)"))

The neverending issues in the table

I noticed there’s also sub notes in this column and the divisions so there’s no time like the present than to try to tackle this. Lets try to add columns using mutate. Realizing my mistake of calling the cities divisions I went back to rename the column “Cities”.

marriages7 <- marriages7 %>% 
  mutate(Footnotes = case_when(
    endsWith(Cities, "c)") ~ "(c) Includes Christmas Island and the Cocos (Keeling) Islands",
    endsWith(Cities, "d)") ~ "(d) Includes Norfolk Island",
    endsWith(Cities, "e)") ~ "(e) Includes Jervis Bay",
    )) 
paged_table(marriages7)

Cannot figure out how to do divisions - give up

What I would like to do is add a column for the divisions in a way you’d do it in excel. You would add a column and then between the two “divisions” number you would add the division to the side so it could still be there. However this doesn’t seem to be useful information for now and it doesn’t seem easy after looking around so for now I’ll remove the divisions columns. I also attempted to use filter for the footnotes but some of them weren’t deleting so I ended up just removing the last five columns using slice.

marriages7 <- marriages7 %>%
  filter(!str_detect(Cities, "Divisions")) %>%
  filter(!str_detect(Cities, "Australia")) %>%
  filter(!str_detect(Cities, "(a) The Federal Electoral Division boundaries are current as at 24 August 2017
")) %>%
slice(-c(151:155))
paged_table(marriages7)

Replace NAs in the footnote column

I want to replace the NAs in the Footnote column with nothing due to the fact you wouldn’t see a na most of the time when it’s only referencing footnotes.

marriages <- marriages7 %>%
  replace_na(list(Footnotes = " "))
paged_table(marriages)

Summarize and Arrange

Yay now it looks somewhat clean! Lets see if I can use these two functions. I have renamed this to “marriages” since this is the final copy for this homework.

marriages %>%
arrange(marriages, "Cities")

# A tibble: 150 x 16
   Cities    `Response Clear Ye~ `Response Clear Ye~ `Response Clear ~
   <chr>                   <dbl>               <dbl>             <dbl>
 1 Adelaide                62769                70.1             26771
 2 Aston                   48455                62               29730
 3 Ballarat                65613                70.5             27405
 4 Banks                   37736                44.9             46343
 5 Barker                  42498                52.3             38827
 6 Barton                  37153                43.6             47984
 7 Bass                    36249                61.7             22510
 8 Batman                  66383                71.2             26906
 9 Bendigo                 63412                68.7             28852
10 Bennelong               42943                49.8             43215
# ... with 140 more rows, and 12 more variables:
#   Response Clear No Percantage <dbl>,
#   Total Number Response Clear <dbl>,
#   Total Response Percantage <dbl>,
#   Eligible Participants Yes Number <dbl>,
#   Eligible Participants Yes Percantage <dbl>,
#   Eligible Participants Not Clear Number <dbl>, ...

marriages %>%
  select(contains("number")) %>%
  summarize_all(median, na.rm = TRUE)

# A tibble: 1 x 7
  `Response Clear ~ `Response Clear~ `Total Number R~ `Eligible Parti~
              <dbl>            <dbl>            <dbl>            <dbl>
1            51782.           31654.           85726.           85726.
# ... with 3 more variables:
#   Eligible Participants Not Clear Number <dbl>,
#   Non-Responding Number <dbl>,
#   Total Number Eligable Participants <dbl>

marriages %>%
  select(ends_with("e")) %>%
  summarise_all(median, na.rm = TRUE)

# A tibble: 1 x 6
  `Response Clear ~ `Response Clear~ `Total Response~ `Eligible Parti~
              <dbl>            <dbl>            <dbl>            <dbl>
1              61.8             38.2              100             79.7
# ... with 2 more variables:
#   Eligible Participants Not Clear Percantage <dbl>,
#   Non-Responding Percantage <dbl>

Conclusion

Summarise () was by hardest for this homework to use. It didn’t seem to work easily and it was frustrating to get to use at all. After reading the tutorial it got a lot easier but I still think I need to figure out where this data would be useful for this specific dataset. I think I bit off more than I could chew for this and next time I’ll pick a data set that’s a bit easier to read.

From what I understand of this data it’s trying to understand who is married based on the census taken in Australian. The government also broke them into two categories due to some people selecting more than one box, ignoring the question, or answering properly. The way I would deal with this is look more into the Eligible Participants Tab. This is because I noticed that all cities in “Response not clear” is between .1%-.3% which seem odd for multiple areas of their country. It also may be interesting to see a graph of the Responses that were clear for yes and no by division. Unfortunately I wouldn’t know how to do that at this point.

Comment on this article Share:

Homework 3