Challenge 3

challenge_3

Shaunak Padhye

australian_marriage

Tidy Data: Pivoting

Author

Shaunak Padhye

Published

May 12, 2023

library(tidyverse)
library(readxl)
library(tidyr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

For this challenge, we will be using the following dataset:

australian_marriage*.xls ⭐⭐⭐

Code

votes <- read_excel("_data/australian_marriage_law_postal_survey_2017_-_response_final.xls",
                        sheet="Table 2",
                        skip=6)
colnames(votes) <- c("District", "Yes", "del1", "No", rep("del2", 6), "Illegible", "del3", "No Response", rep("del4", 3))
votes <- votes[, !grepl("del", names(votes))]
votes

# A tibble: 184 × 5
   District                    Yes    No Illegible `No Response`
   <chr>                     <dbl> <dbl>     <dbl>         <dbl>
 1 New South Wales Divisions    NA    NA        NA            NA
 2 Banks                     37736 46343       247         20928
 3 Barton                    37153 47984       226         24008
 4 Bennelong                 42943 43215       244         19973
 5 Berowra                   48471 40369       212         16038
 6 Blaxland                  20406 57926       220         25883
 7 Bradfield                 53681 34927       202         17261
 8 Calare                    54091 35779       285         25342
 9 Chifley                   32871 46702       263         28180
10 Cook                      47505 38804       229         18713
# ℹ 174 more rows

We can see that this dataset will require some cleaning before it can be used. We will start with removing the NA values from the “District” column.

Code

votes <- votes[!is.na(votes$District),]
votes

# A tibble: 174 × 5
   District                    Yes    No Illegible `No Response`
   <chr>                     <dbl> <dbl>     <dbl>         <dbl>
 1 New South Wales Divisions    NA    NA        NA            NA
 2 Banks                     37736 46343       247         20928
 3 Barton                    37153 47984       226         24008
 4 Bennelong                 42943 43215       244         19973
 5 Berowra                   48471 40369       212         16038
 6 Blaxland                  20406 57926       220         25883
 7 Bradfield                 53681 34927       202         17261
 8 Calare                    54091 35779       285         25342
 9 Chifley                   32871 46702       263         28180
10 Cook                      47505 38804       229         18713
# ℹ 164 more rows

Now we can remove all the rows where the District value contains the string “(Total)” or starts with the character ‘(’. These rows provide aggregated data which can be calculated using the rest of the dataset.

Code

votes <- votes[!grepl("(Total)", votes$District), ]
votes <- votes[!grepl("^\\(", votes$District), ]
votes

# A tibble: 160 × 5
   District                    Yes    No Illegible `No Response`
   <chr>                     <dbl> <dbl>     <dbl>         <dbl>
 1 New South Wales Divisions    NA    NA        NA            NA
 2 Banks                     37736 46343       247         20928
 3 Barton                    37153 47984       226         24008
 4 Bennelong                 42943 43215       244         19973
 5 Berowra                   48471 40369       212         16038
 6 Blaxland                  20406 57926       220         25883
 7 Bradfield                 53681 34927       202         17261
 8 Calare                    54091 35779       285         25342
 9 Chifley                   32871 46702       263         28180
10 Cook                      47505 38804       229         18713
# ℹ 150 more rows

We can observe that some values in the District column contain Divisions. These rows define the Divisions of the Districts given in the underlying rows.

Code

votes[grep("Divisions", votes$District), ]

# A tibble: 8 × 5
  District                                 Yes    No Illegible `No Response`
  <chr>                                  <dbl> <dbl>     <dbl>         <dbl>
1 New South Wales Divisions                 NA    NA        NA            NA
2 Victoria Divisions                        NA    NA        NA            NA
3 Queensland Divisions                      NA    NA        NA            NA
4 South Australia Divisions                 NA    NA        NA            NA
5 Western Australia Divisions               NA    NA        NA            NA
6 Tasmania Divisions                        NA    NA        NA            NA
7 Northern Territory Divisions              NA    NA        NA            NA
8 Australian Capital Territory Divisions    NA    NA        NA            NA

Ideally these values shouldn’t exist in the Districts column, since they aren’t districts. However, we cannot remove these rows either as they contain useful information that we might want to retain. For this reason we will create a new variable (column) which will store the Division of the DIstrict in the row.

Code

votes$Division <- ifelse(grepl("Divisions", votes$District), votes$District, NA)
#votes$Division <- tidyr::fill(votes$Division, .direction = "down")
for (i in 2:length(votes$Division)) {
  if (is.na(votes$Division[i])) {
    votes$Division[i] <- votes$Division[i-1]
  }
}
votes <- votes[!grepl("Divisions", votes$District), ]
votes <- na.omit(votes)
votes

# A tibble: 150 × 6
   District    Yes    No Illegible `No Response` Division                 
   <chr>     <dbl> <dbl>     <dbl>         <dbl> <chr>                    
 1 Banks     37736 46343       247         20928 New South Wales Divisions
 2 Barton    37153 47984       226         24008 New South Wales Divisions
 3 Bennelong 42943 43215       244         19973 New South Wales Divisions
 4 Berowra   48471 40369       212         16038 New South Wales Divisions
 5 Blaxland  20406 57926       220         25883 New South Wales Divisions
 6 Bradfield 53681 34927       202         17261 New South Wales Divisions
 7 Calare    54091 35779       285         25342 New South Wales Divisions
 8 Chifley   32871 46702       263         28180 New South Wales Divisions
 9 Cook      47505 38804       229         18713 New South Wales Divisions
10 Cowper    57493 38317       315         25197 New South Wales Divisions
# ℹ 140 more rows

Briefly describe the data

This data set provides results of a survey held in Australia regarding same sex marriages. Participants had to give either a “Yes” or a “No” response. Many participants were unclear on their response hence the data set also has a column for that. Many participants failed to respond and hence that is also considered in the data set. The voting results have been grouped by Divisions and Districts in Australia.

After some initial pre-processing, we can see that the data set has 6 columns, and 150 rows. We can observe that the following columns: Yes, No, Illegible, No Response; are the results of the same survey and hence can be pivoted into a single column.

Anticipate the End Result

Our dataset has \(150\) rows and \(6\) variables. 2 of the variables are used to identify a case, so we will be pivoting \(6-2\) variables into a longer format where the \(6-2\) variable names will move into the names_to variable and the current values in each of those columns will move into the values_to variable. Therefore, we would expect \(150 * (6-2) = 600\) rows in the pivoted dataframe. We will have \(2\) unpivoted variables, plus the two new variable defined by the names_to and the values_to parameter of the pivot function. Hence we end up with \(4\) variables in the pivoted dataset.

Therefore, we expect that the dimensions of the pivoted dataset will be \(600 * 4\).

Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.

Code

votes_piv<-pivot_longer(votes, cols = Yes:`No Response`,
                 names_to="Response",
                 values_to = "Count")
votes_piv

# A tibble: 600 × 4
   District  Division                  Response    Count
   <chr>     <chr>                     <chr>       <dbl>
 1 Banks     New South Wales Divisions Yes         37736
 2 Banks     New South Wales Divisions No          46343
 3 Banks     New South Wales Divisions Illegible     247
 4 Banks     New South Wales Divisions No Response 20928
 5 Barton    New South Wales Divisions Yes         37153
 6 Barton    New South Wales Divisions No          47984
 7 Barton    New South Wales Divisions Illegible     226
 8 Barton    New South Wales Divisions No Response 24008
 9 Bennelong New South Wales Divisions Yes         42943
10 Bennelong New South Wales Divisions No          43215
# ℹ 590 more rows

Dimensions of the pivoted dataset:

Code

dim(votes_piv)

[1] 600   4

Yes, once it is pivoted long, our resulting data are \(600*4\) - exactly what we expected!