Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Mariia Dubyk
November 1, 2022
library(readxl)
marriage <- read_excel("_data/australian_marriage_law_postal_survey_2017_-_response_final.xls",
sheet = "Table 2",
col_names = c("Division", "Yes", "Yes_percentage", "No", "No_percentage", "Total_clear", "Total_clear_percentage", "Delete", "Delete", "Delete", "Response_not_clear", "Response_not_clear_percentage", "Non_responding", "Non_responding_percentage", "Total", "Total_percentage"))
marriage <- slice(marriage, 8:180)
marriage <- select(marriage, "Division", "Yes", "Yes_percentage", "No", "No_percentage", "Total_clear", "Total_clear_percentage", "Response_not_clear", "Response_not_clear_percentage", "Non_responding", "Non_responding_percentage", "Total", "Total_percentage")
marriage
# A tibble: 173 × 13
Division Yes Yes_p…¹ No No_pe…² Total…³ Total…⁴ Respo…⁵ Respo…⁶ Non_r…⁷
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 New Sout… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 Banks 37736 44.899… 46343 55.100… 84079 100 247 0.2000… 20928
3 Barton 37153 43.600… 47984 56.399… 85137 100 226 0.2000… 24008
4 Bennelong 42943 49.799… 43215 50.200… 86158 100 244 0.2000… 19973
5 Berowra 48471 54.600… 40369 45.399… 88840 100 212 0.2000… 16038
6 Blaxland 20406 26.100… 57926 73.900… 78332 100 220 0.2000… 25883
7 Bradfield 53681 60.600… 34927 39.399… 88608 100 202 0.2000… 17261
8 Calare 54091 60.200… 35779 39.799… 89870 100 285 0.2000… 25342
9 Chifley 32871 41.299… 46702 58.700… 79573 100 263 0.2000… 28180
10 Cook 47505 55 38804 45 86309 100 229 0.2000… 18713
# … with 163 more rows, 3 more variables: Non_responding_percentage <chr>,
# Total <chr>, Total_percentage <chr>, and abbreviated variable names
# ¹Yes_percentage, ²No_percentage, ³Total_clear, ⁴Total_clear_percentage,
# ⁵Response_not_clear, ⁶Response_not_clear_percentage, ⁷Non_responding
The dataset shows responses to the question in Australian Marriage Law Postal Survey 2017. We can see data gathered in different divisions of States and Territories in Australia. The responses are “Yes”, “No”, “Not clear” and “No response”. The table also shows percentage of each answer and total numbers.
To tidy data I will leave only divisions in the first column. I am going to move State and Territories to another column by adding a variable (as it was shown during the class).I will also remove percentages and total numbers from the table.
marriage <- marriage%>%
mutate(StateTerritory = case_when(str_ends(Division, "Divisions") ~ Division), NA_character_)%>%
fill(StateTerritory, .direction = "down")
marriage <- filter(marriage, !str_detect(Division, "Division|Australia"))
marriage <- select(marriage, "Division", "Yes", "No", "Response_not_clear", "Non_responding", "StateTerritory")
marriage
# A tibble: 155 × 6
Division Yes No Response_not_clear Non_responding StateTerritory
<chr> <chr> <chr> <chr> <chr> <chr>
1 Banks 37736 46343 247 20928 New South Wales Divi…
2 Barton 37153 47984 226 24008 New South Wales Divi…
3 Bennelong 42943 43215 244 19973 New South Wales Divi…
4 Berowra 48471 40369 212 16038 New South Wales Divi…
5 Blaxland 20406 57926 220 25883 New South Wales Divi…
6 Bradfield 53681 34927 202 17261 New South Wales Divi…
7 Calare 54091 35779 285 25342 New South Wales Divi…
8 Chifley 32871 46702 263 28180 New South Wales Divi…
9 Cook 47505 38804 229 18713 New South Wales Divi…
10 Cowper 57493 38317 315 25197 New South Wales Divi…
# … with 145 more rows
In the final dataframe we will have 4 rows (with answers “Yes”, “No”,“Response_not_clear”, “Non_responding”) for each division.
Lets see if this works with a simple example.
# A tibble: 6 × 5
country year trade outgoing incoming
<chr> <dbl> <chr> <dbl> <dbl>
1 Mexico 1980 NAFTA 217. 1102.
2 USA 1990 NAFTA -585. 1869.
3 France 1980 EU 1351. 369.
4 Mexico 1990 NAFTA 846. 1458.
5 USA 1980 NAFTA 551. 1212.
6 France 1990 EU 1932. 749.
[1] 6
[1] 5
[1] 12
[1] 5
Or simple example has \(n = 6\) rows and \(k - 3 = 2\) variables being pivoted, so we expect a new dataframe to have \(n * 2 = 12\) rows x \(3 + 2 = 5\) columns.
Any additional comments?
Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.
# A tibble: 12 × 5
country year trade trade_direction trade_value
<chr> <dbl> <chr> <chr> <dbl>
1 Mexico 1980 NAFTA outgoing 217.
2 Mexico 1980 NAFTA incoming 1102.
3 USA 1990 NAFTA outgoing -585.
4 USA 1990 NAFTA incoming 1869.
5 France 1980 EU outgoing 1351.
6 France 1980 EU incoming 369.
7 Mexico 1990 NAFTA outgoing 846.
8 Mexico 1990 NAFTA incoming 1458.
9 USA 1980 NAFTA outgoing 551.
10 USA 1980 NAFTA incoming 1212.
11 France 1990 EU outgoing 1932.
12 France 1990 EU incoming 749.
Yes, once it is pivoted long, our resulting data are \(12x5\) - exactly what we expected!
After pivoting the case is number of certain response in a division. I think it may be easier to visualize the data using this type of table. So we may say that data is tidy because it is easier to make next steps in analyzing data.
# A tibble: 620 × 4
Division StateTerritory Response_type Number
<chr> <chr> <chr> <chr>
1 Banks New South Wales Divisions Yes 37736
2 Banks New South Wales Divisions No 46343
3 Banks New South Wales Divisions Response_not_clear 247
4 Banks New South Wales Divisions Non_responding 20928
5 Barton New South Wales Divisions Yes 37153
6 Barton New South Wales Divisions No 47984
7 Barton New South Wales Divisions Response_not_clear 226
8 Barton New South Wales Divisions Non_responding 24008
9 Bennelong New South Wales Divisions Yes 42943
10 Bennelong New South Wales Divisions No 43215
# … with 610 more rows
---
title: "Challenge 3"
author: "Mariia Dubyk"
desription: "Tidy Data: Pivoting"
date: "11/01/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_3
- animal_weights
- eggs
- australian_marriage
- usa_households
- sce_labor
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
```{r}
library(readxl)
marriage <- read_excel("_data/australian_marriage_law_postal_survey_2017_-_response_final.xls",
sheet = "Table 2",
col_names = c("Division", "Yes", "Yes_percentage", "No", "No_percentage", "Total_clear", "Total_clear_percentage", "Delete", "Delete", "Delete", "Response_not_clear", "Response_not_clear_percentage", "Non_responding", "Non_responding_percentage", "Total", "Total_percentage"))
marriage <- slice(marriage, 8:180)
marriage <- select(marriage, "Division", "Yes", "Yes_percentage", "No", "No_percentage", "Total_clear", "Total_clear_percentage", "Response_not_clear", "Response_not_clear_percentage", "Non_responding", "Non_responding_percentage", "Total", "Total_percentage")
marriage
```
### Briefly describe the data
The dataset shows responses to the question in Australian Marriage Law Postal Survey 2017. We can see data gathered in different divisions of States and Territories in Australia. The responses are "Yes", "No", "Not clear" and "No response". The table also shows percentage of each answer and total numbers.
To tidy data I will leave only divisions in the first column. I am going to move State and Territories to another column by adding a variable (as it was shown during the class).I will also remove percentages and total numbers from the table.
```{r}
marriage <- marriage%>%
mutate(StateTerritory = case_when(str_ends(Division, "Divisions") ~ Division), NA_character_)%>%
fill(StateTerritory, .direction = "down")
marriage <- filter(marriage, !str_detect(Division, "Division|Australia"))
marriage <- select(marriage, "Division", "Yes", "No", "Response_not_clear", "Non_responding", "StateTerritory")
marriage
```
## Anticipate the End Result
In the final dataframe we will have 4 rows (with answers "Yes", "No","Response_not_clear", "Non_responding") for each division.
### Example: find current and future data dimensions
Lets see if this works with a simple example.
```{r}
#| tbl-cap: Example
df<-tibble(country = rep(c("Mexico", "USA", "France"),2),
year = rep(c(1980,1990), 3),
trade = rep(c("NAFTA", "NAFTA", "EU"),2),
outgoing = rnorm(6, mean=1000, sd=500),
incoming = rlogis(6, location=1000,
scale = 400))
df
#existing rows/cases
nrow(df)
#existing columns/cases
ncol(df)
#expected rows/cases
nrow(df) * (ncol(df)-3)
# expected columns
3 + 2
```
Or simple example has $n = 6$ rows and $k - 3 = 2$ variables being pivoted, so we expect a new dataframe to have $n * 2 = 12$ rows x $3 + 2 = 5$ columns.
### Challenge: Describe the final dimensions
```{r}
```
Any additional comments?
## Pivot the Data
Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a "sanity" check.
### Example
```{r}
#| tbl-cap: Pivoted Example
df<-pivot_longer(df, col = c(outgoing, incoming),
names_to="trade_direction",
values_to = "trade_value")
df
```
Yes, once it is pivoted long, our resulting data are $12x5$ - exactly what we expected!
### Challenge: Pivot the Chosen Data
After pivoting the case is number of certain response in a division. I think it may be easier to visualize the data using this type of table. So we may say that data is tidy because it is easier to make next steps in analyzing data.
```{r}
marriage_new<-pivot_longer(marriage, col = c(Yes, No, Response_not_clear, Non_responding),
names_to="Response_type",
values_to = "Number")
marriage_new
```