Challenge 3

challenge_3

eggs_tidy

pivot_longer

audrey_bertin

Tidy Data: Pivoting

Author

Audrey Bertin

Published

June 6, 2023

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Data Overview

For this challenge, I’ll be reading and tidying the eggs_tidy.csv ⭐⭐ dataset:

eggs <- readr::read_csv("_data/eggs_tidy.csv")

Looking at this data before tidying, we see the following:

glimpse(eggs)

Rows: 120
Columns: 6
$ month                  <chr> "January", "February", "March", "April", "May",…
$ year                   <dbl> 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004,…
$ large_half_dozen       <dbl> 126.00, 128.50, 131.00, 131.00, 131.00, 133.50,…
$ large_dozen            <dbl> 230.000, 226.250, 225.000, 225.000, 225.000, 23…
$ extra_large_half_dozen <dbl> 132.000, 134.500, 137.000, 137.000, 137.000, 13…
$ extra_large_dozen      <dbl> 230.0, 230.0, 230.0, 234.5, 236.0, 241.0, 241.0…

The dataset has 120 rows and 6 columns. Each row appears to represent a specific month of a specific year, with the remaining variables showing how many egg packages (or how much money’s worth) of each size were either sold or produced within that month.

The two variables that uniquely identify each row are month and year, so these will not be pivoted and must stay as they are. The other four variables (large_half_dozen, large_dozen, extra_large_half_dozen and extra_large_dozen) can be pivoted together because they all store the same type of information: a numeric amount for one carton type.

When we pivot the data, we want to reformat so that we have the following columns:

month
year
carton_type
amount

Note: Without more information about the dataset, it is not clear exactly what the amount represents, so we will use a generic name for now.

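We are starting with n = 120 rows and k = 6 columns and pivoting 4 (or k - 2) variables. Each original row becomes one row per pivoted variable, and the 4 pivoted columns are replaced by 2 new ones (carton_type and amount). This means we should expect our final dataset to have n * (k - 2) = 120 * 4 = 480 rows and 2 + 2 = 4 columns.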
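The arithmetic above can be sketched in code. This is a minimal illustration, assuming a made-up toy_eggs tibble with the same six column names as eggs (the values and the names toy_eggs, n_id, expected_rows and expected_cols are hypothetical, not part of the original analysis):

```r
library(tidyverse)

# Hypothetical toy data with the same column layout as eggs (values are made up)
toy_eggs <- tibble(
  month = c("January", "February", "March"),
  year = c(2004, 2004, 2004),
  large_half_dozen = c(1, 2, 3),
  large_dozen = c(4, 5, 6),
  extra_large_half_dozen = c(7, 8, 9),
  extra_large_dozen = c(10, 11, 12)
)

n <- nrow(toy_eggs)  # rows before pivoting
k <- ncol(toy_eggs)  # columns before pivoting
n_id <- 2            # month and year are not pivoted

# One output row per original row per pivoted column
expected_rows <- n * (k - n_id)
# The id columns are kept; the pivoted columns collapse into a names column and a values column
expected_cols <- n_id + 2

c(rows = expected_rows, cols = expected_cols)
```

Substituting n = 120 and k = 6 reproduces the 480 rows and 4 columns predicted above.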

Conducting the Pivot

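eggs_pivoted <- pivot_longer(
  eggs,
  cols = large_half_dozen:extra_large_dozen, # the four carton columns; month and year stay as-is
  names_to = "carton_type",                  # old column names become values of carton_type
  values_to = "amount"                       # old cell values move into amount
)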
eggs_pivoted

Once we conduct this pivot, we see that the result has the predicted number of rows and columns (480 and 4), as well as the variables we want.

Before, a case/row represented a month in a particular year (and all cartons from within that month). Now a case represents only a specific egg carton type within a specific month/year, and there are four cases per month/year.
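One way to confirm this change in what a case represents is to check the dimensions and count the rows per month/year after pivoting. The sketch below uses a small made-up tibble (toy_eggs, toy_pivoted and cases_per_month are hypothetical names) rather than the real eggs data, so it runs on its own:

```r
library(tidyverse)

# Hypothetical two-month stand-in for eggs (values are made up)
toy_eggs <- tibble(
  month = c("January", "February"),
  year = c(2004, 2004),
  large_half_dozen = c(1, 2),
  large_dozen = c(3, 4),
  extra_large_half_dozen = c(5, 6),
  extra_large_dozen = c(7, 8)
)

toy_pivoted <- pivot_longer(
  toy_eggs,
  cols = large_half_dozen:extra_large_dozen,
  names_to = "carton_type",
  values_to = "amount"
)

# Rows and columns after pivoting
dim(toy_pivoted)

# Number of cases per month/year; each group should contain four carton types
cases_per_month <- count(toy_pivoted, month, year)
cases_per_month
```

Running the same count() call on eggs_pivoted should show n = 4 for every month/year combination if the pivot behaved as described.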

The three conditions of tidy data (according to Hadley Wickham) are:

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

We can see that all of these are true. Variables are in their own unique columns, observations are separated out and have one unique row each, and there are not multiple values per cell.
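Two of these conditions can also be checked programmatically rather than by eye. The following is only a rough sketch on made-up data (toy_pivoted, keys and cells_atomic are hypothetical names): it tests that no month/year/carton_type combination is repeated and that no column is a list column holding several values per cell.

```r
library(tidyverse)

# Hypothetical already-pivoted data in the same shape as eggs_pivoted (values are made up)
toy_pivoted <- tibble(
  month = rep(c("January", "February"), each = 2),
  year = 2004,
  carton_type = rep(c("large_dozen", "extra_large_dozen"), times = 2),
  amount = c(1, 2, 3, 4)
)

# Each observation has its own row: no duplicated key combinations
keys <- select(toy_pivoted, month, year, carton_type)
anyDuplicated(keys) == 0

# Each value has its own cell: every column is an atomic vector, not a list column
cells_atomic <- map_lgl(toy_pivoted, is.atomic)
all(cells_atomic)
```

These checks cannot tell whether the columns are the right variables; that judgment still depends on understanding the data.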

The data is also in a much easier format for plotting. With carton_type as a variable, we can use something like facet_wrap to compare across carton types in a single call, rather than drawing four separate plots.
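As a rough illustration of that plotting advantage, the sketch below facets a line chart by carton_type. The data are made up, and the object names (toy_pivoted, p) and the choice of geom_line are assumptions for the example, not part of the original analysis:

```r
library(tidyverse)

# Hypothetical long-format data in the same shape as eggs_pivoted (values are made up)
toy_pivoted <- tibble(
  month = rep(c("January", "February", "March"), each = 2),
  year = 2004,
  carton_type = rep(c("large_dozen", "extra_large_dozen"), times = 3),
  amount = c(10, 12, 11, 13, 12, 14)
)

p <- toy_pivoted %>%
  # Order months by the calendar instead of alphabetically
  mutate(month = factor(month, levels = month.name)) %>%
  ggplot(aes(x = month, y = amount, group = 1)) +
  geom_line() +
  # One panel per carton type, made possible by the long format
  facet_wrap(~ carton_type)

p
```

With the original wide data, each carton column would need its own plot or its own geom_line() layer.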