Challenge 3

challenge_3
eggs
Tidy Data: Pivoting
Author

Pranav Bharadwaj Komaravolu

Published

March 27, 2023

library(tidyverse)
library(readr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. identify what needs to be done to tidy the current data
  3. anticipate the shape of pivoted data
  4. pivot the data into tidy format using pivot_longer

Read in data

data <- read_csv("_data/eggs_tidy.csv")
head(data)
# A tibble: 6 × 6
  month     year large_half_dozen large_dozen extra_large_half_dozen extra_lar…¹
  <chr>    <dbl>            <dbl>       <dbl>                  <dbl>       <dbl>
1 January   2004             126         230                    132         230 
2 February  2004             128.        226.                   134.        230 
3 March     2004             131         225                    137         230 
4 April     2004             131         225                    137         234.
5 May       2004             131         225                    137         236 
6 June      2004             134.        231.                   137         241 
# … with abbreviated variable name ¹​extra_large_dozen

Data Description

The summary of the data is as follows:

summary(data)
    month                year      large_half_dozen  large_dozen   
 Length:120         Min.   :2004   Min.   :126.0    Min.   :225.0  
 Class :character   1st Qu.:2006   1st Qu.:129.4    1st Qu.:233.5  
 Mode  :character   Median :2008   Median :174.5    Median :267.5  
                    Mean   :2008   Mean   :155.2    Mean   :254.2  
                    3rd Qu.:2011   3rd Qu.:174.5    3rd Qu.:268.0  
                    Max.   :2013   Max.   :178.0    Max.   :277.5  
 extra_large_half_dozen extra_large_dozen
 Min.   :132.0          Min.   :230.0    
 1st Qu.:135.8          1st Qu.:241.5    
 Median :185.5          Median :285.5    
 Mean   :164.2          Mean   :266.8    
 3rd Qu.:185.5          3rd Qu.:285.5    
 Max.   :188.1          Max.   :290.0    

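The dataset records egg prices by month and year for 2004 to 2013; its 120 rows are consistent with one row per month over ten years. Each row gives the price of two egg sizes (large and extra large) sold in two quantities (half dozen and dozen), so each of the four price columns is one combination of size and quantity.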
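One way to check how the rows are organised is to count rows per year and look for repeated month-year pairs. The sketch below is only an illustration: `toy` is a hypothetical stand-in for `data` with made-up prices, and the same two calls could be run on the real table.

```r
library(tidyverse)

# Hypothetical stand-in for `data`: two years of monthly rows, made-up prices.
toy <- tibble(
  month = rep(month.name, times = 2),
  year = rep(c(2004, 2005), each = 12),
  large_dozen = seq(225, by = 1, length.out = 24)
)

# If there is one row per month, every year contributes 12 rows.
rows_per_year <- count(toy, year)
rows_per_year

# No month-year pair should appear more than once.
duplicates <- toy %>%
  count(month, year) %>%
  filter(n > 1)
nrow(duplicates)
```

If every year shows 12 rows and no pair is duplicated, month and year together identify a row, which is what the pivot relies on.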

Pivot Longer

In the dataset above, the four size-and-quantity combinations are spread across separate columns, so we can use pivot_longer to collect their names into a single column and their prices into another.

The dimensions of the dataset are as follows:

dim(data)
[1] 120   6

The dataset comprises 120 rows and 6 columns.

The different columns in the dataset are:

names(data)
[1] "month"                  "year"                   "large_half_dozen"      
[4] "large_dozen"            "extra_large_half_dozen" "extra_large_dozen"     

So the target is to collapse the columns “large_half_dozen”, “large_dozen”, “extra_large_half_dozen”, “extra_large_dozen” into two columns: “quantity”, which holds the former column names, and “price”, which holds their values.

reduced_cols <- 4
rows <- nrow(data)
cols <- ncol(data)
new_rows <- rows*reduced_cols
new_cols <- cols-reduced_cols+2
new_rows
[1] 480
new_cols
[1] 4

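Each of the 120 rows becomes 4 rows, one per pivoted column, and the 4 pivoted columns are replaced by 2 new ones. So the pivoted data should have \(120\times4=480\) rows and \(6-4+2=4\) columns, i.e., target dimensions of \(480\times4\).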
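The same arithmetic can be wrapped in a small function so it is reusable for other tables. This is a minimal sketch; `expected_long_dim` is a hypothetical helper name, not part of tidyr, and it assumes the pivot produces exactly one names column and one values column.

```r
# Hypothetical helper: expected dimensions after pivoting `n_pivot` columns
# into one names column and one values column.
expected_long_dim <- function(n_rows, n_cols, n_pivot) {
  c(rows = n_rows * n_pivot, cols = n_cols - n_pivot + 2)
}

# 120 rows, 6 columns, 4 of which are pivoted.
expected_long_dim(n_rows = 120, n_cols = 6, n_pivot = 4)
```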

Pivot the data and current dimensions

Let's check this against the actual data. The dataset can be pivoted as follows:

pivotted_data <- pivot_longer(data, cols = contains("dozen"), names_to = "quantity", values_to = "price")
head(pivotted_data)
# A tibble: 6 × 4
  month     year quantity               price
  <chr>    <dbl> <chr>                  <dbl>
1 January   2004 large_half_dozen        126 
2 January   2004 large_dozen             230 
3 January   2004 extra_large_half_dozen  132 
4 January   2004 extra_large_dozen       230 
5 February  2004 large_half_dozen        128.
6 February  2004 large_dozen             226.

The dimensions of the pivoted data are as follows:

dim(pivotted_data)
[1] 480   4

Yes, once it is pivoted long, the resulting data are \(480\times4\), exactly what we expected!
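Beyond matching dimensions, a round trip through pivot_wider is one way to confirm that no values were lost or reordered. The sketch below uses `toy`, a hypothetical two-row table with the same column names and made-up prices, rather than the real file.

```r
library(tidyverse)

# Hypothetical two-row stand-in for the eggs data (prices are made up).
toy <- tibble(
  month = c("January", "February"),
  year = c(2004, 2004),
  large_half_dozen = c(126, 128),
  large_dozen = c(230, 226),
  extra_large_half_dozen = c(132, 134),
  extra_large_dozen = c(230, 230)
)

toy_long <- pivot_longer(
  toy,
  cols = contains("dozen"),
  names_to = "quantity",
  values_to = "price"
)

# 2 rows x 4 pivoted columns gives 8 rows; 6 - 4 + 2 gives 4 columns.
dim(toy_long)

# Pivoting back to wide should recover the original table.
toy_wide <- pivot_wider(toy_long, names_from = quantity, values_from = price)
all.equal(toy, toy_wide)
```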

Challenge: Pivot the Chosen Data

Document your work here. What will a new “case” be once you have pivoted the data? How does it meet requirements for tidy data?

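After pivoting, a case is the price of one size-and-quantity combination of eggs in a given month and year. Each row now holds a single observation and each variable (month, year, quantity, price) has its own column, which is what tidy data requires. Collapsing the four price columns into one also makes the data easier to read, filter, and group.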
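A possible further step, not taken in this post, is to split the combined label into separate size and quantity columns during the pivot. The sketch below assumes the column names always follow the `<size>_<quantity>` form seen here; `toy` is a hypothetical one-row table with made-up prices, and the regular expression is an illustration rather than the only way to split the names.

```r
library(tidyverse)

# Hypothetical one-row stand-in for the eggs data (prices are made up).
toy <- tibble(
  month = "January",
  year = 2004,
  large_half_dozen = 126,
  large_dozen = 230,
  extra_large_half_dozen = 132,
  extra_large_dozen = 230
)

# Each capture group in names_pattern fills one of the names_to columns.
toy_split <- pivot_longer(
  toy,
  cols = contains("dozen"),
  names_to = c("size", "quantity"),
  names_pattern = "^(extra_large|large)_(half_dozen|dozen)$",
  values_to = "price"
)
toy_split
```

With size and quantity in their own columns, prices can be grouped or filtered by either variable without parsing strings.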
