library(tidyverse)
library(readr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
Challenge 3
Challenge Overview
Today’s challenge is to:
- read in a data set, and describe the data set using both words and any supporting information (e.g., tables)
- identify what needs to be done to tidy the current data
- anticipate the shape of pivoted data
- pivot the data into tidy format using pivot_longer
Read in data
data <- read_csv("_data/eggs_tidy.csv")
head(data)
# A tibble: 6 × 6
month year large_half_dozen large_dozen extra_large_half_dozen extra_lar…¹
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 January 2004 126 230 132 230
2 February 2004 128. 226. 134. 230
3 March 2004 131 225 137 230
4 April 2004 131 225 137 234.
5 May 2004 131 225 137 236
6 June 2004 134. 231. 137 241
# … with abbreviated variable name ¹extra_large_dozen
Data Description
The summary of the data is as follows:
summary(data)
month year large_half_dozen large_dozen
Length:120 Min. :2004 Min. :126.0 Min. :225.0
Class :character 1st Qu.:2006 1st Qu.:129.4 1st Qu.:233.5
Mode :character Median :2008 Median :174.5 Median :267.5
Mean :2008 Mean :155.2 Mean :254.2
3rd Qu.:2011 3rd Qu.:174.5 3rd Qu.:268.0
Max. :2013 Max. :178.0 Max. :277.5
extra_large_half_dozen extra_large_dozen
Min. :132.0 Min. :230.0
1st Qu.:135.8 1st Qu.:241.5
Median :185.5 Median :285.5
Mean :164.2 Mean :266.8
3rd Qu.:185.5 3rd Qu.:285.5
Max. :188.1 Max. :290.0
The dataset records the price of eggs for each month from 2004 to 2013. Prices are given for two egg sizes (large and extra large), each sold in two quantities (half dozen and dozen).
Pivot Longer
In the dataset above, the four size-and-quantity combinations are stored as separate columns, so we can use pivot_longer to collapse them into a single column.
The dimensions of the dataset are as follows:
dim(data)
[1] 120 6
The dataset comprises 120 rows and 6 columns.
The different columns in the dataset are:
names(data)
[1] "month" "year" "large_half_dozen"
[4] "large_dozen" "extra_large_half_dozen" "extra_large_dozen"
So the goal is to collapse the columns “large_half_dozen”, “large_dozen”, “extra_large_half_dozen”, and “extra_large_dozen” into a single column, “quantity”, with their values moved into a new “price” column.
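reduced_cols <- 4                  # number of columns to pivot
rows <- nrow(data)
cols <- ncol(data)
new_rows <- rows*reduced_cols      # one row per original row and pivoted column
new_cols <- cols-reduced_cols+2    # pivoted columns become quantity + price
new_rows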
[1] 480
new_cols
[1] 4
So the pivoted data will have \(120\times4 = 480\) rows and \(6-4+2 = 4\) columns, i.e., the target dimensions are \(480\times4\).
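The same arithmetic can be checked on a small example before touching the full dataset. The sketch below is only illustrative: `expected_long_dim()` is a hypothetical helper (it is not part of tidyr or of the original analysis), and `toy` holds just the January and March 2004 rows copied from the head() output above.

```r
library(tidyr)
library(tibble)

# Hypothetical helper (an assumption for illustration): expected dimensions
# after pivoting `n_pivoted` columns into one name column and one value column.
expected_long_dim <- function(n_rows, n_cols, n_pivoted) {
  c(n_rows * n_pivoted, n_cols - n_pivoted + 2)
}

# Two rows copied from the head() output shown earlier.
toy <- tibble(
  month = c("January", "March"),
  year = c(2004, 2004),
  large_half_dozen = c(126, 131),
  large_dozen = c(230, 225),
  extra_large_half_dozen = c(132, 137),
  extra_large_dozen = c(230, 230)
)

# Pivot the four "dozen" columns into a name column and a value column.
toy_long <- pivot_longer(toy, contains("dozen"),
                         names_to = "quantity", values_to = "price")

expected_long_dim(nrow(toy), ncol(toy), 4)  # 8 4
dim(toy_long)                               # 8 4
```

Calling `expected_long_dim(120, 6, 4)` gives the same \(480\times4\) target worked out by hand.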
Pivot the data and current dimensions
Let's see if this works.
The dataset can be pivoted as follows:
pivotted_data <- pivot_longer(data, contains("dozen"), names_to = c("quantity"), values_to = c("price"))
head(pivotted_data)
# A tibble: 6 × 4
month year quantity price
<chr> <dbl> <chr> <dbl>
1 January 2004 large_half_dozen 126
2 January 2004 large_dozen 230
3 January 2004 extra_large_half_dozen 132
4 January 2004 extra_large_dozen 230
5 February 2004 large_half_dozen 128.
6 February 2004 large_dozen 226.
The dimensions of the pivoted data are as follows:
dim(pivotted_data)
[1] 480 4
Yes, once it is pivoted long, our resulting data are \(480\times4\) - exactly what we expected!
Challenge: Pivot the Chosen Data
Document your work here. What will a new “case” be once you have pivoted the data? How does it meet requirements for tidy data?
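After pivoting, each case is the price of one size-and-quantity combination of eggs in a given month and year. This meets the requirements for tidy data: each variable (month, year, quantity, price) has its own column, and each observation has its own row. Collapsing the four price columns into one also makes the data easier to read, filter, and summarise.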
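One possible refinement, offered as a sketch and not as part of the original solution: the “quantity” values still combine egg size and carton quantity in a single string. Assuming the column names always follow the pattern size_quantity, the names_pattern argument of pivot_longer can split them into two columns. The `toy` tibble and the column name “size” are illustrative choices; the two rows are copied from the head() output earlier in the post.

```r
library(tidyr)
library(tibble)

# Two rows copied from the head() output shown earlier.
toy <- tibble(
  month = c("January", "March"),
  year = c(2004, 2004),
  large_half_dozen = c(126, 131),
  large_dozen = c(230, 225),
  extra_large_half_dozen = c(132, 137),
  extra_large_dozen = c(230, 230)
)

# Each capture group in names_pattern fills one of the names_to columns:
# the first group is the egg size, the second is the carton quantity.
toy_split <- pivot_longer(
  toy,
  cols = contains("dozen"),
  names_to = c("size", "quantity"),
  names_pattern = "^(extra_large|large)_(half_dozen|dozen)$",
  values_to = "price"
)

dim(toy_split)  # 8 5
```

With this layout, size and quantity can be filtered or grouped independently, at the cost of one extra column.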
Any additional comments?