challenge_3
Australian_marriage_law
Author

Mani Shanker Kamarapu

Published

August 17, 2022

Code
library(tidyverse)
library(readxl)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

Code
vote_response <- read_excel("_data/australian_marriage_law_postal_survey_2017_-_response_final.xls", sheet = "Table 2", skip = 7, col_names = c("Area", "Yes", "Yes_P", "No", "No_P", "Totalclear", "Totalclear_P", "Empty", "clear", "clear_P", "Not_Eligible", "Not_Eligible_P", "NoResponse", "NoResponse_P", "Total", "Total_P")) %>%
  select("Area", "Yes", "No", "Not_Eligible", "NoResponse", "Total") %>%
  drop_na("Area") %>%
  filter(!grepl("Total", Area))
R <- nrow(vote_response)-7
vote_response <- slice(vote_response, 1:R)
vote_response
# A tibble: 158 × 6
   Area                        Yes    No Not_Eligible NoResponse  Total
   <chr>                     <dbl> <dbl>        <dbl>      <dbl>  <dbl>
 1 New South Wales Divisions    NA    NA           NA         NA     NA
 2 Banks                     37736 46343          247      20928 105254
 3 Barton                    37153 47984          226      24008 109371
 4 Bennelong                 42943 43215          244      19973 106375
 5 Berowra                   48471 40369          212      16038 105090
 6 Blaxland                  20406 57926          220      25883 104435
 7 Bradfield                 53681 34927          202      17261 106071
 8 Calare                    54091 35779          285      25342 115497
 9 Chifley                   32871 46702          263      28180 108016
10 Cook                      47505 38804          229      18713 105251
# … with 148 more rows
# ℹ Use `print(n = ...)` to see more rows
Code
dim(vote_response)
[1] 158   6

The dimensions of the data set are 158 rows and 6 columns.

The data is on the postal survey of Australian Electoral Roll. It contains the data of the eligible participants and responses of the participants in federal election as at 24 August 2017. It has data based on different federal electoral divisions survey by the Australian Election Commission. It is basically the total number of response we got in each division during the federal election. The data set is untidy and variables are not defined correctly and there are a lot of extra variables which are not required and a bunch of NA values. So using different R functions to remove unwanted variables and drop some NA values and also remove extra rows.

Separating different divisions and areas

Code
vote_response <- vote_response %>%
  mutate(Division = case_when(
    str_ends(Area, "Divisions") ~ Area,
    TRUE ~ NA_character_ )) %>%
  fill(Division) %>%
  drop_na("Yes")
vote_response
# A tibble: 150 × 7
   Area        Yes    No Not_Eligible NoResponse  Total Division                
   <chr>     <dbl> <dbl>        <dbl>      <dbl>  <dbl> <chr>                   
 1 Banks     37736 46343          247      20928 105254 New South Wales Divisio…
 2 Barton    37153 47984          226      24008 109371 New South Wales Divisio…
 3 Bennelong 42943 43215          244      19973 106375 New South Wales Divisio…
 4 Berowra   48471 40369          212      16038 105090 New South Wales Divisio…
 5 Blaxland  20406 57926          220      25883 104435 New South Wales Divisio…
 6 Bradfield 53681 34927          202      17261 106071 New South Wales Divisio…
 7 Calare    54091 35779          285      25342 115497 New South Wales Divisio…
 8 Chifley   32871 46702          263      28180 108016 New South Wales Divisio…
 9 Cook      47505 38804          229      18713 105251 New South Wales Divisio…
10 Cowper    57493 38317          315      25197 121322 New South Wales Divisio…
# … with 140 more rows
# ℹ Use `print(n = ...)` to see more rows
Code
dim(vote_response)
[1] 150   7

After we read the data and did some hard coding along with read, now we have separated the divisions row from the area column and formed a new column so we can use it as a grouping variable.

Using Pivot to tidy data

Code
vote_response <- pivot_longer(vote_response, Yes:Total, names_to = "Response", values_to = "Count")
vote_response
# A tibble: 750 × 4
   Area   Division                  Response      Count
   <chr>  <chr>                     <chr>         <dbl>
 1 Banks  New South Wales Divisions Yes           37736
 2 Banks  New South Wales Divisions No            46343
 3 Banks  New South Wales Divisions Not_Eligible    247
 4 Banks  New South Wales Divisions NoResponse    20928
 5 Banks  New South Wales Divisions Total        105254
 6 Barton New South Wales Divisions Yes           37153
 7 Barton New South Wales Divisions No            47984
 8 Barton New South Wales Divisions Not_Eligible    226
 9 Barton New South Wales Divisions NoResponse    24008
10 Barton New South Wales Divisions Total        109371
# … with 740 more rows
# ℹ Use `print(n = ...)` to see more rows
Code
dim(vote_response)
[1] 750   4

Now the data set has become longer and final dimensions are 750 rows and 4 columns.

I have used pivot_longer to tidy the data set where I have collapse the responses(Yes, No, Not_Eligible, NoResponse and Total) into one column so we can vary them easily and really make sense. So now we can get the divisional level count or area level count easily and we can also plot it easily now in different ways.