Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Sai Pranav Kurly
April 12, 2023
I have decided to use the eggs_tidy.csv dataset
# A tibble: 120 × 6
month year large_half_dozen large_dozen extra_large_half_dozen extra_l…¹
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 January 2004 126 230 132 230
2 February 2004 128. 226. 134. 230
3 March 2004 131 225 137 230
4 April 2004 131 225 137 234.
5 May 2004 131 225 137 236
6 June 2004 134. 231. 137 241
7 July 2004 134. 234. 137 241
8 August 2004 134. 234. 137 241
9 September 2004 130. 234. 136. 241
10 October 2004 128. 234. 136. 241
# … with 110 more rows, and abbreviated variable name ¹extra_large_dozen
Describe the data, and be sure to comment on why you are planning to pivot it to make it “tidy”
month year large_half_dozen large_dozen
Length:120 Min. :2004 Min. :126.0 Min. :225.0
Class :character 1st Qu.:2006 1st Qu.:129.4 1st Qu.:233.5
Mode :character Median :2008 Median :174.5 Median :267.5
Mean :2008 Mean :155.2 Mean :254.2
3rd Qu.:2011 3rd Qu.:174.5 3rd Qu.:268.0
Max. :2013 Max. :178.0 Max. :277.5
extra_large_half_dozen extra_large_dozen
Min. :132.0 Min. :230.0
1st Qu.:135.8 1st Qu.:241.5
Median :185.5 Median :285.5
Mean :164.2 Mean :266.8
3rd Qu.:185.5 3rd Qu.:285.5
Max. :188.1 Max. :290.0
# A tibble: 6 × 6
month year large_half_dozen large_dozen extra_large_half_dozen extra_lar…¹
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 January 2004 126 230 132 230
2 February 2004 128. 226. 134. 230
3 March 2004 131 225 137 230
4 April 2004 131 225 137 234.
5 May 2004 131 225 137 236
6 June 2004 134. 231. 137 241
# … with abbreviated variable name ¹extra_large_dozen
The 120 rows and 6 columns of the dataset, which contain data for each and every month of the year (there are 12 months in a year) from 2004 to 2013, are readily apparent from the dataset’s summary. The month and year are given in the first two columns, and the average price for the given egg size and quantity is given in the remaining four columns. The column names, such as large_half_dozen, extra_large_half_dozen, large_dozen, and extra_large_dozen, combine the size with the quantity. I’ve seen that the typical pricing is between 12 and 290 cents.
Thus, our resulting data is of the dimension 480*5.
Lets see if this works with a simple example.
# A tibble: 6 × 5
country year trade outgoing incoming
<chr> <dbl> <chr> <dbl> <dbl>
1 Mexico 1980 NAFTA 119. -1009.
2 USA 1990 NAFTA 1356. 1491.
3 France 1980 EU 1266. 663.
4 Mexico 1990 NAFTA 1198. -852.
5 USA 1980 NAFTA 348. 729.
6 France 1990 EU 1492. 1595.
[1] 6
[1] 5
[1] 12
[1] 5
Or simple example has \(n = 6\) rows and \(k - 3 = 2\) variables being pivoted, so we expect a new dataframe to have \(n * 2 = 12\) rows x \(3 + 2 = 5\) columns.
Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.
# A tibble: 480 × 5
month year size quantity cost
<chr> <dbl> <chr> <chr> <dbl>
1 January 2004 large half 126
2 January 2004 large dozen 230
3 January 2004 extra large 132
4 January 2004 extra large 230
5 February 2004 large half 128.
6 February 2004 large dozen 226.
7 February 2004 extra large 134.
8 February 2004 extra large 230
9 March 2004 large half 131
10 March 2004 large dozen 225
# … with 470 more rows
It matches our previous calculation.
---
title: "Challenge 3 Instructions"
author: "Sai Pranav Kurly"
description: "Tidy Data: Pivoting"
date: "04/12/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_3
- animal_weights
- eggs
- australian_marriage
- usa_households
- sce_labor
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Read in data
I have decided to use the eggs_tidy.csv dataset
```{r}
eggs_dataset <- read_csv("_data/eggs_tidy.csv")
eggs_dataset
```
### Briefly describe the data
Describe the data, and be sure to comment on why you are planning to pivot it to make it "tidy"
```{r}
summary(eggs_dataset)
```
```{r}
head(eggs_dataset)
```
```{r}
nrow(eggs_dataset)
```
```{r}
ncol(eggs_dataset)
```
The 120 rows and 6 columns of the dataset, which contain data for each and every month of the year (there are 12 months in a year) from 2004 to 2013, are readily apparent from the dataset's summary. The month and year are given in the first two columns, and the average price for the given egg size and quantity is given in the remaining four columns. The column names, such as large_half_dozen, extra_large_half_dozen, large_dozen, and extra_large_dozen, combine the size with the quantity. I've seen that the typical pricing is between 12 and 290 cents.
## Anticipate the End Result
```{r}
nrow(eggs_dataset) * (ncol(eggs_dataset)-2)
```
```{r}
# expected columns in the eggs dataset after pivoting
3 + 2
```
Thus, our resulting data is of the dimension 480*5.
### Example: find current and future data dimensions
Lets see if this works with a simple example.
```{r}
#| tbl-cap: Example
df<-tibble(country = rep(c("Mexico", "USA", "France"),2),
year = rep(c(1980,1990), 3),
trade = rep(c("NAFTA", "NAFTA", "EU"),2),
outgoing = rnorm(6, mean=1000, sd=500),
incoming = rlogis(6, location=1000,
scale = 400))
df
#existing rows/cases
nrow(df)
#existing columns/cases
ncol(df)
#expected rows/cases
nrow(df) * (ncol(df)-3)
# expected columns
3 + 2
```
Or simple example has $n = 6$ rows and $k - 3 = 2$ variables being pivoted, so we expect a new dataframe to have $n * 2 = 12$ rows x $3 + 2 = 5$ columns.
## Pivot the Data
Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a "sanity" check.
### Example
```{r}
eggs_pivot <- eggs_dataset%>%
pivot_longer(cols=contains("large"),
names_to = c("size", "quantity"),
names_sep="_",
values_to = "cost"
)
eggs_pivot
```
```{r}
nrow(eggs_pivot)
```
```{r}
ncol(eggs_pivot)
```
It matches our previous calculation.