---
title: "Challenge 3"
author: "Shriya Sehgal"
description: "Tidy Data: Pivoting"
date: "11/17/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_3
- animal_weights
- eggs
- australian_marriage
- usa_households
- sce_labor
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2. identify what needs to be done to tidy the current data
3. anticipate the shape of pivoted data
4. pivot the data into tidy format using `pivot_longer`
## Read in data
```{r}
dataset <- read_csv("_data/eggs_tidy.csv")
dataset
```
### Briefly describe the data
```{r}
summary(dataset)
```
```{r}
head(dataset)
```
There are 120 rows and 6 columns, containing monthly data from 2004 to 2013. The first 2 columns are the month and year, and the remaining 4 give the average price for each combination of egg size and quantity. The average prices range from roughly 120 to 290 cents (a quick check of this range is sketched after the list below). The price columns are as follows.
a. large_half_dozen
b. extra_large_half_dozen
c. large_dozen
d. extra_large_dozen
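As a quick check of that price range, here is a minimal sketch (using the column names listed above) that computes the minimum and maximum of each price column:
```{r}
# Min and max of each of the four price columns, as a quick range check
dataset %>%
  summarise(across(contains("large"), list(min = min, max = max)))
```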
## Anticipate the End Result
The first step in pivoting the data is to try to come up with a concrete vision of what the end product *should* look like - that way you will know whether or not your pivoting was successful.
One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.
Suppose you have a dataset with $n$ rows and $k$ variables. In our example, 2 of the variables (month and year) are used to identify a case, so you will be pivoting $k-2$ variables into a longer format where the $k-2$ variable names will move into the `names_to` variable and the current values in each of those columns will move into the `values_to` variable. Therefore, we would expect $n * (k-2)$ rows in the pivoted dataframe! Here, $n = 120$ and $k = 6$, so the pivoted data should have $120 * (6-2) = 480$ rows.
```{r}
# existing rows/cases
nrow(dataset)
# existing columns
ncol(dataset)
# expected rows: each of the 4 price columns contributes one row per case
nrow(dataset) * (ncol(dataset) - 2)
# expected columns: 2 identifiers (month, year) + size + quantity + cost
3 + 2
```
### Challenge: Describe the final dimensions
The original dataset has 120 rows and 6 columns. After pivoting, the columns will be month, year, size of the egg (large/extra large), and quantity of eggs (half dozen/dozen), which makes it much easier to observe how prices changed from 2004 to 2013 throughout the year for each egg size and quantity sold.
The pivoted dataset will be 4 times longer than the original, since each size-quantity pair that currently occupies its own column becomes its own row. A new column containing the average price of the eggs is added, while the 4 original price columns are removed, so the total number of columns decreases by 1. The resulting columns are month, year, size, quantity, and average price (cost). A hand-built sketch of this anticipated shape is shown below.
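As a concrete (purely illustrative) sketch of that target shape, here is a small hand-built tibble using the January 2004 values shown by `head(dataset)` above; the actual pivot is done in the next section.
```{r}
# Illustrative sketch of the anticipated tidy shape, hand-built from the
# January 2004 row of the original data (not used in the pivot below)
tribble(
  ~month,    ~year, ~size,         ~quantity,    ~cost,
  "January", 2004,  "large",       "half_dozen", 126,
  "January", 2004,  "large",       "dozen",      230,
  "January", 2004,  "extra_large", "half_dozen", 132,
  "January", 2004,  "extra_large", "dozen",      230
)
```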
## Pivot the Data
Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a "sanity" check.
### Example
```{r}
#| tbl-cap: Pivoted Example
dataset_long <- dataset %>%
  pivot_longer(cols = contains("large"),
               # names_sep = "_" would split "large_half_dozen" into too many
               # pieces, so use names_pattern to capture size and quantity
               names_to = c("size", "quantity"),
               names_pattern = "^(large|extra_large)_(half_dozen|dozen)$",
               values_to = "cost"
  )
dataset_long
```
```{r}
# rows after the pivot
nrow(dataset_long)
# columns after the pivot
ncol(dataset_long)
```
It can be observed that the pivoted data is 4 times longer than the original dataset (120 to 480 rows) and has one fewer column. Pivoting gives a single observation per row, which makes the data easier to understand and to work with in future analysis.
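As a final sanity check (a small sketch layered on the counts above), we can assert that the pivoted dimensions match the earlier prediction and list the size-quantity combinations the pivot created:
```{r}
# Assert the pivoted dimensions match the prediction made earlier
stopifnot(nrow(dataset_long) == nrow(dataset) * (ncol(dataset) - 2),
          ncol(dataset_long) == ncol(dataset) - 1)

# The four size-quantity combinations produced by the pivot
dataset_long %>%
  distinct(size, quantity)
```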