library(tidyverse)
library(readxl)
library(tidyr)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Challenge 3
Read in data
For this challenge, we will be using the following dataset:
- australian_marriage*.xls ⭐⭐⭐
Code
<- read_excel("_data/australian_marriage_law_postal_survey_2017_-_response_final.xls",
votes sheet="Table 2",
skip=6)
colnames(votes) <- c("District", "Yes", "del1", "No", rep("del2", 6), "Illegible", "del3", "No Response", rep("del4", 3))
<- votes[, !grepl("del", names(votes))]
votes votes
# A tibble: 184 × 5
District Yes No Illegible `No Response`
<chr> <dbl> <dbl> <dbl> <dbl>
1 New South Wales Divisions NA NA NA NA
2 Banks 37736 46343 247 20928
3 Barton 37153 47984 226 24008
4 Bennelong 42943 43215 244 19973
5 Berowra 48471 40369 212 16038
6 Blaxland 20406 57926 220 25883
7 Bradfield 53681 34927 202 17261
8 Calare 54091 35779 285 25342
9 Chifley 32871 46702 263 28180
10 Cook 47505 38804 229 18713
# ℹ 174 more rows
We can see that this dataset will require some cleaning before it can be used. We will start with removing the NA values from the “District” column.
Code
<- votes[!is.na(votes$District),]
votes votes
# A tibble: 174 × 5
District Yes No Illegible `No Response`
<chr> <dbl> <dbl> <dbl> <dbl>
1 New South Wales Divisions NA NA NA NA
2 Banks 37736 46343 247 20928
3 Barton 37153 47984 226 24008
4 Bennelong 42943 43215 244 19973
5 Berowra 48471 40369 212 16038
6 Blaxland 20406 57926 220 25883
7 Bradfield 53681 34927 202 17261
8 Calare 54091 35779 285 25342
9 Chifley 32871 46702 263 28180
10 Cook 47505 38804 229 18713
# ℹ 164 more rows
Now we can remove all the rows where the District value contains the string “(Total)” or starts with the character ‘(’. These rows provide aggregated data which can be calculated using the rest of the dataset.
Code
<- votes[!grepl("(Total)", votes$District), ]
votes <- votes[!grepl("^\\(", votes$District), ]
votes votes
# A tibble: 160 × 5
District Yes No Illegible `No Response`
<chr> <dbl> <dbl> <dbl> <dbl>
1 New South Wales Divisions NA NA NA NA
2 Banks 37736 46343 247 20928
3 Barton 37153 47984 226 24008
4 Bennelong 42943 43215 244 19973
5 Berowra 48471 40369 212 16038
6 Blaxland 20406 57926 220 25883
7 Bradfield 53681 34927 202 17261
8 Calare 54091 35779 285 25342
9 Chifley 32871 46702 263 28180
10 Cook 47505 38804 229 18713
# ℹ 150 more rows
We can observe that some values in the District column contain Divisions. These rows define the Divisions of the Districts given in the underlying rows.
Code
grep("Divisions", votes$District), ] votes[
# A tibble: 8 × 5
District Yes No Illegible `No Response`
<chr> <dbl> <dbl> <dbl> <dbl>
1 New South Wales Divisions NA NA NA NA
2 Victoria Divisions NA NA NA NA
3 Queensland Divisions NA NA NA NA
4 South Australia Divisions NA NA NA NA
5 Western Australia Divisions NA NA NA NA
6 Tasmania Divisions NA NA NA NA
7 Northern Territory Divisions NA NA NA NA
8 Australian Capital Territory Divisions NA NA NA NA
Ideally these values shouldn’t exist in the Districts column, since they aren’t districts. However, we cannot remove these rows either as they contain useful information that we might want to retain. For this reason we will create a new variable (column) which will store the Division of the DIstrict in the row.
Code
$Division <- ifelse(grepl("Divisions", votes$District), votes$District, NA)
votes#votes$Division <- tidyr::fill(votes$Division, .direction = "down")
for (i in 2:length(votes$Division)) {
if (is.na(votes$Division[i])) {
$Division[i] <- votes$Division[i-1]
votes
}
}<- votes[!grepl("Divisions", votes$District), ]
votes <- na.omit(votes)
votes votes
# A tibble: 150 × 6
District Yes No Illegible `No Response` Division
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Banks 37736 46343 247 20928 New South Wales Divisions
2 Barton 37153 47984 226 24008 New South Wales Divisions
3 Bennelong 42943 43215 244 19973 New South Wales Divisions
4 Berowra 48471 40369 212 16038 New South Wales Divisions
5 Blaxland 20406 57926 220 25883 New South Wales Divisions
6 Bradfield 53681 34927 202 17261 New South Wales Divisions
7 Calare 54091 35779 285 25342 New South Wales Divisions
8 Chifley 32871 46702 263 28180 New South Wales Divisions
9 Cook 47505 38804 229 18713 New South Wales Divisions
10 Cowper 57493 38317 315 25197 New South Wales Divisions
# ℹ 140 more rows
Briefly describe the data
This data set provides results of a survey held in Australia regarding same sex marriages. Participants had to give either a “Yes” or a “No” response. Many participants were unclear on their response hence the data set also has a column for that. Many participants failed to respond and hence that is also considered in the data set. The voting results have been grouped by Divisions and Districts in Australia.
After some initial pre-processing, we can see that the data set has 6 columns, and 150 rows. We can observe that the following columns: Yes, No, Illegible, No Response; are the results of the same survey and hence can be pivoted into a single column.
Anticipate the End Result
Our dataset has \(150\) rows and \(6\) variables. 2 of the variables are used to identify a case, so we will be pivoting \(6-2\) variables into a longer format where the \(6-2\) variable names will move into the names_to
variable and the current values in each of those columns will move into the values_to
variable. Therefore, we would expect \(150 * (6-2) = 600\) rows in the pivoted dataframe. We will have \(2\) unpivoted variables, plus the two new variable defined by the names_to
and the values_to
parameter of the pivot function. Hence we end up with \(4\) variables in the pivoted dataset.
Therefore, we expect that the dimensions of the pivoted dataset will be \(600 * 4\).
Pivot the Data
Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.
Code
<-pivot_longer(votes, cols = Yes:`No Response`,
votes_pivnames_to="Response",
values_to = "Count")
votes_piv
# A tibble: 600 × 4
District Division Response Count
<chr> <chr> <chr> <dbl>
1 Banks New South Wales Divisions Yes 37736
2 Banks New South Wales Divisions No 46343
3 Banks New South Wales Divisions Illegible 247
4 Banks New South Wales Divisions No Response 20928
5 Barton New South Wales Divisions Yes 37153
6 Barton New South Wales Divisions No 47984
7 Barton New South Wales Divisions Illegible 226
8 Barton New South Wales Divisions No Response 24008
9 Bennelong New South Wales Divisions Yes 42943
10 Bennelong New South Wales Divisions No 43215
# ℹ 590 more rows
Dimensions of the pivoted dataset:
Code
dim(votes_piv)
[1] 600 4
Yes, once it is pivoted long, our resulting data are \(600*4\) - exactly what we expected!