Code
library(tidyverse)
library(readxl)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Mani Shanker Kamarapu
August 17, 2022
vote_response <- read_excel("_data/australian_marriage_law_postal_survey_2017_-_response_final.xls", sheet = "Table 2", skip = 7, col_names = c("Area", "Yes", "Yes_P", "No", "No_P", "Totalclear", "Totalclear_P", "Empty", "clear", "clear_P", "Not_Eligible", "Not_Eligible_P", "NoResponse", "NoResponse_P", "Total", "Total_P")) %>%
select("Area", "Yes", "No", "Not_Eligible", "NoResponse", "Total") %>%
drop_na("Area") %>%
filter(!grepl("Total", Area))
R <- nrow(vote_response)-7
vote_response <- slice(vote_response, 1:R)
vote_response
# A tibble: 158 × 6
Area Yes No Not_Eligible NoResponse Total
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 New South Wales Divisions NA NA NA NA NA
2 Banks 37736 46343 247 20928 105254
3 Barton 37153 47984 226 24008 109371
4 Bennelong 42943 43215 244 19973 106375
5 Berowra 48471 40369 212 16038 105090
6 Blaxland 20406 57926 220 25883 104435
7 Bradfield 53681 34927 202 17261 106071
8 Calare 54091 35779 285 25342 115497
9 Chifley 32871 46702 263 28180 108016
10 Cook 47505 38804 229 18713 105251
# … with 148 more rows
# ℹ Use `print(n = ...)` to see more rows
The dimensions of the data set are 158
rows and 6
columns.
The data is on the postal survey of Australian Electoral Roll. It contains the data of the eligible participants and responses of the participants in federal election as at 24 August 2017. It has data based on different federal electoral divisions survey by the Australian Election Commission. It is basically the total number of response we got in each division during the federal election. The data set is untidy and variables are not defined correctly and there are a lot of extra variables which are not required and a bunch of NA values. So using different R functions to remove unwanted variables and drop some NA values and also remove extra rows.
# A tibble: 150 × 7
Area Yes No Not_Eligible NoResponse Total Division
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Banks 37736 46343 247 20928 105254 New South Wales Divisio…
2 Barton 37153 47984 226 24008 109371 New South Wales Divisio…
3 Bennelong 42943 43215 244 19973 106375 New South Wales Divisio…
4 Berowra 48471 40369 212 16038 105090 New South Wales Divisio…
5 Blaxland 20406 57926 220 25883 104435 New South Wales Divisio…
6 Bradfield 53681 34927 202 17261 106071 New South Wales Divisio…
7 Calare 54091 35779 285 25342 115497 New South Wales Divisio…
8 Chifley 32871 46702 263 28180 108016 New South Wales Divisio…
9 Cook 47505 38804 229 18713 105251 New South Wales Divisio…
10 Cowper 57493 38317 315 25197 121322 New South Wales Divisio…
# … with 140 more rows
# ℹ Use `print(n = ...)` to see more rows
After we read the data and did some hard coding along with read, now we have separated the divisions row from the area column and formed a new column so we can use it as a grouping variable.
# A tibble: 750 × 4
Area Division Response Count
<chr> <chr> <chr> <dbl>
1 Banks New South Wales Divisions Yes 37736
2 Banks New South Wales Divisions No 46343
3 Banks New South Wales Divisions Not_Eligible 247
4 Banks New South Wales Divisions NoResponse 20928
5 Banks New South Wales Divisions Total 105254
6 Barton New South Wales Divisions Yes 37153
7 Barton New South Wales Divisions No 47984
8 Barton New South Wales Divisions Not_Eligible 226
9 Barton New South Wales Divisions NoResponse 24008
10 Barton New South Wales Divisions Total 109371
# … with 740 more rows
# ℹ Use `print(n = ...)` to see more rows
Now the data set has become longer and final dimensions are 750
rows and 4
columns.
I have used pivot_longer to tidy the data set where I have collapse the responses(Yes, No, Not_Eligible, NoResponse and Total) into one column so we can vary them easily and really make sense. So now we can get the divisional level count or area level count easily and we can also plot it easily now in different ways.
---
title: "Challenge 3"
author: "Mani Shanker Kamarapu"
desription: "Tidy Data: Pivoting"
date: "08/17/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_3
- Australian_marriage_law
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(readxl)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Read in data
```{r}
vote_response <- read_excel("_data/australian_marriage_law_postal_survey_2017_-_response_final.xls", sheet = "Table 2", skip = 7, col_names = c("Area", "Yes", "Yes_P", "No", "No_P", "Totalclear", "Totalclear_P", "Empty", "clear", "clear_P", "Not_Eligible", "Not_Eligible_P", "NoResponse", "NoResponse_P", "Total", "Total_P")) %>%
select("Area", "Yes", "No", "Not_Eligible", "NoResponse", "Total") %>%
drop_na("Area") %>%
filter(!grepl("Total", Area))
R <- nrow(vote_response)-7
vote_response <- slice(vote_response, 1:R)
vote_response
```
```{r}
dim(vote_response)
```
The dimensions of the data set are `158` rows and `6` columns.
The data is on the postal survey of Australian Electoral Roll. It contains the data of the eligible participants and responses of the participants in federal election as at 24 August 2017. It has data based on different federal electoral divisions survey by the Australian Election Commission. It is basically the total number of response we got in each division during the federal election. The data set is untidy and variables are not defined correctly and there are a lot of extra variables which are not required and a bunch of NA values. So using different R functions to remove unwanted variables and drop some NA values and also remove extra rows.
## Separating different divisions and areas
```{r}
vote_response <- vote_response %>%
mutate(Division = case_when(
str_ends(Area, "Divisions") ~ Area,
TRUE ~ NA_character_ )) %>%
fill(Division) %>%
drop_na("Yes")
vote_response
```
```{r}
dim(vote_response)
```
After we read the data and did some hard coding along with read, now we have separated the divisions row from the area column and formed a new column so we can use it as a grouping variable.
## Using Pivot to tidy data
```{r}
vote_response <- pivot_longer(vote_response, Yes:Total, names_to = "Response", values_to = "Count")
vote_response
```
```{r}
dim(vote_response)
```
Now the data set has become longer and final dimensions are `750` rows and `4` columns.
I have used pivot_longer to tidy the data set where I have collapse the responses(Yes, No, Not_Eligible, NoResponse and Total) into one column so we can vary them easily and really make sense. So now we can get the divisional level count or area level count easily and we can also plot it easily now in different ways.