Code
library(tidyverse)
library(readxl)
library(DescTools)
library(kableExtra)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
knitrsetwd("D:/MyDocs/Class Slides/DACSS601/601_Spring_2023/posts/_data")
Sahan Prasad Podduturi Reddy
March 18, 2023
I was trying to analyze the ‘australian_marriage*.xls’ dataset by splitting the dataset into two separate dataframes and then tidying the data to interpret the results. This dataset summarizes the responses given by eligible participants of the Australian Marriage Law Postal Survey, 2017. We first start by importing the necessary libraries and setting the working directory to point to the location where the spreadsheet is located.
Looking at the dataset, we can see that the data can be split up into two separate dataframes which makes it much easier to understand the context of the data we are trying to analyze. We use read_excel() to read in the excel sheet and then use “sheet=2” and “skip=6” as parameters. This removes the first six lines of the sheet as we import data into our dataframe. We can then assign a different column name to each column using colnames().We then use slice() to get only the data regarding the states/territories and ignore the remaining text in the spreadsheet.
setwd("D:/MyDocs/Class Slides/DACSS601/601_Spring_2023/posts/_data")
dataframe <- read_excel("australian_marriage_law_postal_survey_2017_-_response_final.xls", sheet = 2, skip = 6)
dataframe <- slice(dataframe, 1:9)
colnames(dataframe) <- c("State/Territory", "Response Clear Yes", "Response Clear Yes %", "Response Clear No", "Response Clear No %", "Response Clear Total", "Response Clear Total %", "Blank", "Response Clear Total Eligible", "Response Clear Total Eligible %", "Response Not Clear Total Eligible", "Response Not Clear Total Eligible %", "No Response", "No Response %", "Total", "Total %")
print(dataframe)
# A tibble: 9 × 16
State/…¹ Respo…² Respo…³ Respo…⁴ Respo…⁵ Respo…⁶ Respo…⁷ Blank Respo…⁸ Respo…⁹
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <dbl>
1 New Sou… 2374362 57.8 1736838 42.2 4.11e6 100 NA 4.11e6 79.2
2 Victoria 2145629 64.9 1161098 35.1 3.31e6 100 NA 3.31e6 81.4
3 Queensl… 1487060 60.7 961015 39.3 2.45e6 100 NA 2.45e6 77.7
4 South A… 592528 62.5 356247 37.5 9.49e5 100 NA 9.49e5 79.5
5 Western… 801575 63.7 455924 36.3 1.26e6 100 NA 1.26e6 78.3
6 Tasmania 191948 63.6 109655 36.4 3.02e5 100 NA 3.02e5 79.5
7 Norther… 48686 60.6 31690 39.4 8.04e4 100 NA 8.04e4 58.2
8 Austral… 175459 74 61520 26 2.37e5 100 NA 2.37e5 82.3
9 Austral… 7817247 61.6 4873987 38.4 1.27e7 100 NA 1.27e7 79.3
# … with 6 more variables: `Response Not Clear Total Eligible` <dbl>,
# `Response Not Clear Total Eligible %` <dbl>, `No Response` <dbl>,
# `No Response %` <dbl>, Total <dbl>, `Total %` <dbl>, and abbreviated
# variable names ¹`State/Territory`, ²`Response Clear Yes`,
# ³`Response Clear Yes %`, ⁴`Response Clear No`, ⁵`Response Clear No %`,
# ⁶`Response Clear Total`, ⁷`Response Clear Total %`,
# ⁸`Response Clear Total Eligible`, ⁹`Response Clear Total Eligible %`
We first create a subset containing information on total number of 1)clear responses, 2)unclear responses, and 3)no responses by eligible participants in the survey over different states/territories. Then from this subset, we select only the participants from the “Clear Response” column and analyze their responses to the survey in another dataframe.
In our third dataframe, we take the number of clear responses and sort them by ‘Yes’ and ‘No’ Response per each state and calculate how much they account for over the total Response Count. This is the data that we are trying to tidy. We are trying to clean this data because the format can be better. Currently we have separate columns for ‘Yes’ and ‘No’ response counts. We can easily group these into a name column named “Response” holding the response value using pivot_longer() and place the response counts in a separate Response Count column. This makes it much simpler to interpret the data at a glance.
We can use nrow() and ncol() to find that the current dataset we’re working with has 9 rows and 4 cols.
dataframe2 <- select(dataframe, "State/Territory", "Response Clear Total Eligible", "Response Clear Total Eligible %", "Response Not Clear Total Eligible", "Response Not Clear Total Eligible %", "No Response", "No Response %", "Total", "Total %")
colnames(dataframe2) <- c("State/Territory", "Clear Response", "Clear Response %", "Unclear Response", "Unclear Response %", "No Response", "No Response %", "Total Eligible Participants", "Total %")
kable(select(dataframe2, "State/Territory", "Clear Response", "Unclear Response", "No Response", "Total Eligible Participants"), "html") %>%
column_spec(2, background = "yellow")
State/Territory | Clear Response | Unclear Response | No Response | Total Eligible Participants |
---|---|---|---|---|
New South Wales | 4111200 | 11036 | 1065445 | 5187681 |
Victoria | 3306727 | 11028 | 743634 | 4061389 |
Queensland | 2448075 | 7088 | 695710 | 3150873 |
South Australia | 948775 | 2778 | 242027 | 1193580 |
Western Australia | 1257499 | 3188 | 346333 | 1607020 |
Tasmania | 301603 | 805 | 77020 | 379428 |
Northern Territory(b) | 80376 | 229 | 57496 | 138101 |
Australian Capital Territory(c) | 236979 | 534 | 50595 | 288108 |
Australia | 12691234 | 36686 | 3278260 | 16006180 |
# A tibble: 9 × 5
`State/Territory` Yes `Yes %` No `No %`
<chr> <dbl> <dbl> <dbl> <dbl>
1 New South Wales 2374362 57.8 1736838 42.2
2 Victoria 2145629 64.9 1161098 35.1
3 Queensland 1487060 60.7 961015 39.3
4 South Australia 592528 62.5 356247 37.5
5 Western Australia 801575 63.7 455924 36.3
6 Tasmania 191948 63.6 109655 36.4
7 Northern Territory(b) 48686 60.6 31690 39.4
8 Australian Capital Territory(c) 175459 74 61520 26
9 Australia 7817247 61.6 4873987 38.4
[1] 9
[1] 5
We are first going to pivot the data based on columns “Yes” and “No” using the pivot_longer() command. Using nrow() and ncol(), we find that we have 9 rows and 5 cols in our dataframe. Using pivot_longer, we are pivoting 2 variable names into the names_to column - “Response”. So we will end up with 9*(2) = 18 rows in the dataframe. The cols will remain the same(5 columns) because we are removing 2 cols and adding 2 cols with the “names_to” and “values_to” attributes.
Next we will want to substitute “Yes %” and “No %” with a single column - “Percentage Response Count” for the dataframe such that column value for a particular row is equal to value in “Yes %” if “Response == Yes”, otherwise value will be equal to value in “No %”. Thus overall, we will lose one column and end up with 4 columns.
Initial Dimensions = 9 x 5 Final Prediction = 18 x 4
Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check. We first use the pivot_longer() function to pivot the “Yes” and “No” columns and then apply the mutate() function to get the “Percentage Response Count” column. In the end, we print out the number of rows and columns to verify our prediction.
# A tibble: 18 × 4
`State/Territory` Response `Response Count` Percentage Respon…¹
<chr> <chr> <dbl> <dbl>
1 New South Wales Yes 2374362 57.8
2 New South Wales No 1736838 42.2
3 Victoria Yes 2145629 64.9
4 Victoria No 1161098 35.1
5 Queensland Yes 1487060 60.7
6 Queensland No 961015 39.3
7 South Australia Yes 592528 62.5
8 South Australia No 356247 37.5
9 Western Australia Yes 801575 63.7
10 Western Australia No 455924 36.3
11 Tasmania Yes 191948 63.6
12 Tasmania No 109655 36.4
13 Northern Territory(b) Yes 48686 60.6
14 Northern Territory(b) No 31690 39.4
15 Australian Capital Territory(c) Yes 175459 74
16 Australian Capital Territory(c) No 61520 26
17 Australia Yes 7817247 61.6
18 Australia No 4873987 38.4
# … with abbreviated variable name ¹`Percentage Response Count`
[1] 18
[1] 4
---
title: "Tidy Australian Marriage data"
author: "Sahan Prasad Podduturi Reddy"
desription: "Tidy Data: Pivoting"
date: "03/18/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_3
- australian_marriage
- Sahan Prasad Podduturi Reddy
---
## Introduction
I was trying to analyze the 'australian_marriage\*.xls' dataset by splitting the dataset into two separate dataframes and then tidying the data to interpret the results. This dataset summarizes the responses given by eligible participants of the Australian Marriage Law Postal Survey, 2017. We first start by importing the necessary libraries and setting the working directory to point to the location where the spreadsheet is located.
```{r}
#| label: Introduction
#| warning: false
#| message: false
library(tidyverse)
library(readxl)
library(DescTools)
library(kableExtra)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
setwd("D:/MyDocs/Class Slides/DACSS601/601_Spring_2023/posts/_data")
```
## Read in the Data
Looking at the dataset, we can see that the data can be split up into two separate dataframes which makes it much easier to understand the context of the data we are trying to analyze. We use read_excel() to read in the excel sheet and then use "sheet=2" and "skip=6" as parameters. This removes the first six lines of the sheet as we import data into our dataframe. We can then assign a different column name to each column using colnames().We then use slice() to get only the data regarding the states/territories and ignore the remaining text in the spreadsheet.
```{r}
#| label: ReadFile
#| warning: false
#| message: false
setwd("D:/MyDocs/Class Slides/DACSS601/601_Spring_2023/posts/_data")
dataframe <- read_excel("australian_marriage_law_postal_survey_2017_-_response_final.xls", sheet = 2, skip = 6)
dataframe <- slice(dataframe, 1:9)
colnames(dataframe) <- c("State/Territory", "Response Clear Yes", "Response Clear Yes %", "Response Clear No", "Response Clear No %", "Response Clear Total", "Response Clear Total %", "Blank", "Response Clear Total Eligible", "Response Clear Total Eligible %", "Response Not Clear Total Eligible", "Response Not Clear Total Eligible %", "No Response", "No Response %", "Total", "Total %")
print(dataframe)
```
## Brief Description
We first create a subset containing information on total number of 1)clear responses, 2)unclear responses, and 3)no responses by eligible participants in the survey over different states/territories. Then from this subset, we select only the participants from the "Clear Response" column and analyze their responses to the survey in another dataframe.
In our third dataframe, we take the number of clear responses and sort them by 'Yes' and 'No' Response per each state and calculate how much they account for over the total Response Count. This is the data that we are trying to tidy. We are trying to clean this data because the format can be better. Currently we have separate columns for 'Yes' and 'No' response counts. We can easily group these into a name column named "Response" holding the response value using pivot_longer() and place the response counts in a separate Response Count column. This makes it much simpler to interpret the data at a glance.
We can use nrow() and ncol() to find that the current dataset we're working with has 9 rows and 4 cols.
```{r}
#| label: Description
#| warning: false
#| message: false
dataframe2 <- select(dataframe, "State/Territory", "Response Clear Total Eligible", "Response Clear Total Eligible %", "Response Not Clear Total Eligible", "Response Not Clear Total Eligible %", "No Response", "No Response %", "Total", "Total %")
colnames(dataframe2) <- c("State/Territory", "Clear Response", "Clear Response %", "Unclear Response", "Unclear Response %", "No Response", "No Response %", "Total Eligible Participants", "Total %")
kable(select(dataframe2, "State/Territory", "Clear Response", "Unclear Response", "No Response", "Total Eligible Participants"), "html") %>%
column_spec(2, background = "yellow")
dataframe3 <- select(dataframe, "State/Territory", "Response Clear Yes", "Response Clear Yes %", "Response Clear No", "Response Clear No %")
colnames(dataframe3) <- c("State/Territory", "Yes", "Yes %", "No", "No %")
print(dataframe3)
nrow(dataframe3)
ncol(dataframe3)
```
## Anticipate the data
We are first going to pivot the data based on columns "Yes" and "No" using the pivot_longer() command.
Using nrow() and ncol(), we find that we have 9 rows and 5 cols in our dataframe. Using pivot_longer, we are pivoting 2 variable names into the names_to column - "Response". So we will end up with 9*(2) = 18 rows in the dataframe. The cols will remain the same(5 columns) because we are removing 2 cols and adding 2 cols with the "names_to" and "values_to" attributes.
Next we will want to substitute "Yes %" and "No %" with a single column - "Percentage Response Count" for the dataframe such that column value for a particular row is equal to value in "Yes %" if "Response == Yes", otherwise value will be equal to value in
"No %". Thus overall, we will lose one column and end up with 4 columns.
Initial Dimensions = 9 x 5
Final Prediction = 18 x 4
## Pivot the Data
Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a "sanity" check.
We first use the pivot_longer() function to pivot the "Yes" and "No" columns and then apply the mutate() function to get the "Percentage Response Count" column. In the end, we print out the number of rows and columns to verify our prediction.
```{r}
#| label: Pivot
#| warning: false
#| message: false
dataframe4 <- dataframe3 %>% pivot_longer(cols=c("Yes","No"), names_to = "Response", values_to = "Response Count") %>%
mutate(`Percentage Response Count` = case_when(
Response == "Yes" ~ `Yes %`,
Response == "No" ~ `No %`
)) %>%
select(-"Yes %", -"No %")
print(dataframe4)
nrow(dataframe4)
ncol(dataframe4)
```