Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Ishan Bhardwaj
May 20, 2023
Today’s challenge is to:
pivot_longer
Read in one (or more) of the following datasets, using the correct R package and command.
# A tibble: 120 × 6
month year large_half_dozen large_dozen extra_large_half_dozen
<chr> <dbl> <dbl> <dbl> <dbl>
1 January 2004 126 230 132
2 February 2004 128. 226. 134.
3 March 2004 131 225 137
4 April 2004 131 225 137
5 May 2004 131 225 137
6 June 2004 134. 231. 137
7 July 2004 134. 234. 137
8 August 2004 134. 234. 137
9 September 2004 130. 234. 136.
10 October 2004 128. 234. 136.
# ℹ 110 more rows
# ℹ 1 more variable: extra_large_dozen <dbl>
Describe the data, and be sure to comment on why you are planning to pivot it to make it “tidy”
year
2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
12 12 12 12 12 12 12 12 12 12
This dataset provides values (most likely prices, although units are not provided) for different set quantities of eggs: large half dozen, large dozen, extra large half dozen, and extra large dozen. It records this data for every month of a specified year for the years 2004 to 2013. One issue with this dataset is that it has different columns for different dozen sizes, but these sizes are really all observances of the same variable: the dozen size.
The first step in pivoting the data is to try to come up with a concrete vision of what the end product should look like - that way you will know whether or not your pivoting was successful.
One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.
Suppose you have a dataset with \(n\) rows and \(k\) variables. In our example, 3 of the variables are used to identify a case, so you will be pivoting \(k-3\) variables into a longer format where the \(k-3\) variable names will move into the names_to
variable and the current values in each of those columns will move into the values_to
variable. Therefore, we would expect \(n * (k-3)\) rows in the pivoted dataframe!
Document your work here.
[1] 120 6
[1] 480
[1] 4
From the original dataset, we want to collapse the four dozen size columns into one, and that new column’s values will be in another column that carries all the cell values from the four separate columns. Hence, we subtract four and add two to our original number of columns. This also means that there will be four times the number of observances in the dataset, since each month will have to deal with four dozen types. Hence, we multiply the number of rows by four.
Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.
Document your work here. What will a new “case” be once you have pivoted the data? How does it meet requirements for tidy data?
# A tibble: 480 × 4
month year dozen_type Value
<chr> <dbl> <chr> <dbl>
1 January 2004 large_half_dozen 126
2 January 2004 large_dozen 230
3 January 2004 extra_large_half_dozen 132
4 January 2004 extra_large_dozen 230
5 February 2004 large_half_dozen 128.
6 February 2004 large_dozen 226.
7 February 2004 extra_large_half_dozen 134.
8 February 2004 extra_large_dozen 230
9 March 2004 large_half_dozen 131
10 March 2004 large_dozen 225
# ℹ 470 more rows
[1] 480 4
The new pivoted data has the expected dimensions: 480 rows and 4 columns. A “case” in this new dataset is the value of one of four egg dozen types for a particular month in a particular year. This dataset is tidy because each case has its own row, each relevant variable its own column, and each value its own cell.
---
title: "Challenge 3"
author: "Ishan Bhardwaj"
description: "Tidy data transformations"
date: "05/20/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_3
- Ishan Bhardwaj
- eggs
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2. identify what needs to be done to tidy the current data
3. anticipate the shape of pivoted data
4. pivot the data into tidy format using `pivot_longer`
## Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- animal_weights.csv ⭐
- eggs_tidy.csv ⭐⭐ or organiceggpoultry.xls ⭐⭐⭐
- australian_marriage\*.xls ⭐⭐⭐
- USA Households\*.xlsx ⭐⭐⭐⭐
- sce_labor_chart_data_public.xlsx 🌟🌟🌟🌟🌟
```{r}
eggs <- read_csv("_data/eggs_tidy.csv")
eggs
```
### Briefly describe the data
Describe the data, and be sure to comment on why you are planning to pivot it to make it "tidy"
```{r}
table(select(eggs, "year"))
```
This dataset provides values (most likely prices, although units are not provided) for different set quantities of eggs: large half dozen, large dozen, extra large half dozen, and extra large dozen. It records this data for every month of a specified year for the years 2004 to 2013. One issue with this dataset is that it has different columns for different dozen sizes, but these sizes are really all observances of the same variable: the dozen size.
## Anticipate the End Result
The first step in pivoting the data is to try to come up with a concrete vision of what the end product *should* look like - that way you will know whether or not your pivoting was successful.
One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.
Suppose you have a dataset with $n$ rows and $k$ variables. In our example, 3 of the variables are used to identify a case, so you will be pivoting $k-3$ variables into a longer format where the $k-3$ variable names will move into the `names_to` variable and the current values in each of those columns will move into the `values_to` variable. Therefore, we would expect $n * (k-3)$ rows in the pivoted dataframe!
### Challenge: Describe the final dimensions
Document your work here.
```{r}
dim(eggs)
new_rows <- nrow(eggs) * 4
new_cols <- ncol(eggs) - 4 + 2
# Expected number of rows
new_rows
# Expected number of columns
new_cols
```
From the original dataset, we want to collapse the four dozen size columns into one, and that new column's values will be in another column that carries all the cell values from the four separate columns. Hence, we subtract four and add two to our original number of columns. This also means that there will be four times the number of observances in the dataset, since each month will have to deal with four dozen types. Hence, we multiply the number of rows by four.
## Pivot the Data
Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a "sanity" check.
### Challenge: Pivot the Chosen Data
Document your work here. What will a new "case" be once you have pivoted the data? How does it meet requirements for tidy data?
```{r}
eggs_tidy <- pivot_longer(eggs, 3:6, names_to = "dozen_type", values_to = "Value")
eggs_tidy
dim(eggs_tidy)
```
The new pivoted data has the expected dimensions: 480 rows and 4 columns. A "case" in this new dataset is the value of one of four egg dozen types for a particular month in a particular year. This dataset is tidy because each case has its own row, each relevant variable its own column, and each value its own cell.