---
title: "Challenge 3"
author: "Priya Marla"
description: "Tidy Data: Pivoting"
date: "01/12/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_3
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(readxl)
library(stringr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2. identify what needs to be done to tidy the current data
3. anticipate the shape of pivoted data
4. pivot the data into tidy format using `pivot_longer`
## Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- animal_weights.csv ⭐
- eggs_tidy.csv ⭐⭐ or organicpoultry.xls ⭐⭐⭐
- australian_marriage\*.xlsx ⭐⭐⭐
- USA Households\*.xlsx ⭐⭐⭐⭐
- sce_labor_chart_data_public.csv 🌟🌟🌟🌟🌟
```{r}
# loading the data from the posts/_data folder
vote_states <- read_excel("_data/australian_marriage_law_postal_survey_2017_-_response_final.xls",
                          sheet = "Table 2", skip = 7,
                          col_names = c("Location","Yes","rem","No","rem","rem","rem","rem","rem","rem","Response_not_clear","rem","No_response","rem","rem","rem"))
#dataset
vote_states
# Cleaning the dataset
votes <- vote_states %>%
  select(!contains("rem")) %>%                   # remove the percentage columns
  mutate(State = case_when(str_ends(Location, "Divisions") ~ Location),
         .after = "Location") %>%                # add a new State column from the "... Divisions" heading rows
  fill(State, .direction = "down") %>%           # fill the State value down to each division
  filter(!str_detect(Location, 'Total')) %>%     # remove rows holding total counts
  filter(!str_detect(Location, 'Divisions')) %>% # remove the division heading rows
  drop_na()                                      # remove rows with null values
votes
```
### Briefly describe the data
Describe the data, and be sure to comment on why you are planning to pivot it to make it "tidy"
### Description
This is the Australian Marriage Law Postal Survey dataset. A survey was conducted in Australia to determine whether people support same-gender marriage, with participants asked to respond either "yes" or "no".
An eligible participant's response is sometimes clear, sometimes unclear, and sometimes the participant does not respond at all. When the response is clear, it is either "yes" or "no". For each location, the dataset records the counts of "yes", "no", "not clear", and "no response".
Some columns and rows are unnecessary for the analysis. While cleaning the dataset, the first seven rows (header material) were skipped on import, and the trailing rows with NULL values were dropped. For each location, the corresponding division heading was carried down into a new "State" column. The percentage columns were removed to get a tidy view of the counts. The counts of all possible responses for each location are now recorded in the columns "Yes", "No", "Response_not_clear", and "No_response".
To make the dataset tidier, I'll use `pivot_longer` to put each response type for a specific location in a separate row.
## Anticipate the End Result
The first step in pivoting the data is to try to come up with a concrete vision of what the end product *should* look like - that way you will know whether or not your pivoting was successful.
One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.
Suppose you have a dataset with $n$ rows and $k$ variables. In our example, 3 of the variables are used to identify a case, so you will be pivoting $k-3$ variables into a longer format where the $k-3$ variable names will move into the `names_to` variable and the current values in each of those columns will move into the `values_to` variable. Therefore, we would expect $n * (k-3)$ rows in the pivoted dataframe!
### Example: find current and future data dimensions
Let's see if this works with a simple example.
```{r}
#| tbl-cap: Example
df <- tibble(country = rep(c("Mexico", "USA", "France"), 2),
             year = rep(c(1980, 1990), 3),
             trade = rep(c("NAFTA", "NAFTA", "EU"), 2),
             outgoing = rnorm(6, mean = 1000, sd = 500),
             incoming = rlogis(6, location = 1000,
                               scale = 400))
df
# existing rows/cases
nrow(df)
# existing columns/variables
ncol(df)
# expected rows/cases after pivoting
nrow(df) * (ncol(df) - 3)
# expected columns after pivoting
3 + 2
```
Our simple example has $n = 6$ rows and $k - 3 = 2$ variables being pivoted, so we expect the new dataframe to have $n * 2 = 12$ rows and $3 + 2 = 5$ columns.
### Challenge: Describe the final dimensions
Document your work here.
```{r}
# dim() gives the dimensions (number of rows and columns) of the dataset
dim(votes)
# column names in the dataset
colnames(votes)
# existing rows/cases
nrow(votes)
# existing columns/variables
ncol(votes)
# expected rows/cases after pivoting (2 identifier columns stay)
nrow(votes) * (ncol(votes) - 2)
# expected columns after pivoting
2 + 2
```
Currently there are 150 rows and 6 columns. Two of the columns ("Location" and "State") identify a case, so the remaining four response columns will be pivoted. After using `pivot_longer`, the number of rows will increase and the number of columns will decrease.
Expected number of rows = 150 \* 4 = 600
Expected number of columns = 2 + 2 = 4
Any additional comments?
## Pivot the Data
Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a "sanity" check.
### Example
```{r}
#| tbl-cap: Pivoted Example
df <- pivot_longer(df, cols = c(outgoing, incoming),
                   names_to = "trade_direction",
                   values_to = "trade_value")
df
```
Yes, once it is pivoted long, our resulting data are $12 \times 5$ - exactly what we expected!
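We can also confirm this programmatically with a quick check on the pivoted `df` from the chunk above:

```{r}
# expect 12 rows and 5 columns after the pivot
dim(df)
```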
### Challenge: Pivot the Chosen Data
Document your work here. What will a new "case" be once you have pivoted the data? How does it meet requirements for tidy data?
```{r}
votes_tidy <- pivot_longer(votes, cols = c("Yes", "No", "Response_not_clear", "No_response"),
                           names_to = "Response_type",
                           values_to = "Stats")
votes_tidy
```
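As a sanity check, we can confirm that the pivoted dimensions match the expectation calculated earlier (600 rows and 4 columns):

```{r}
# the pivoted tibble should have 150 * 4 = 600 rows and 2 + 2 = 4 columns
dim(votes_tidy)
```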
After pivoting, each case is a single response type ("Yes", "No", "Response_not_clear", or "No_response") for a specific location, with its count stored in "Stats". This meets the requirements for tidy data: each variable is a column, each observation is a row, and each value is a cell. The long format also makes it easier to compare the number of people who voted in each category, to calculate statistics such as the mean, and to build visualizations. The dataset is now easier to understand.
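For example, with the data in long form, a per-response-type summary is a short `group_by()` pipeline. This is just a sketch using the `votes_tidy` tibble created above; the summary column names are illustrative:

```{r}
# total and average count per response type across all divisions
votes_tidy %>%
  group_by(Response_type) %>%
  summarise(total_votes = sum(Stats),
            mean_per_division = mean(Stats))
```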
Any additional comments?