challenge_3
eggs
Joseph Vincent
Tidy Data: Pivoting
Author

Joseph Vincent

Published

March 8, 2023

Code
library(tidyverse)
library(summarytools)
library(readxl)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

Reading in: - eggs_tidy.csv

Code
eggs <- read_csv("_data/eggs_tidy.csv")
head(eggs)
# A tibble: 6 × 6
  month     year large_half_dozen large_dozen extra_large_half_dozen extra_lar…¹
  <chr>    <dbl>            <dbl>       <dbl>                  <dbl>       <dbl>
1 January   2004             126         230                    132         230 
2 February  2004             128.        226.                   134.        230 
3 March     2004             131         225                    137         230 
4 April     2004             131         225                    137         234.
5 May       2004             131         225                    137         236 
6 June      2004             134.        231.                   137         241 
# … with abbreviated variable name ¹​extra_large_dozen

Briefly describe the data

Code
print(summarytools::dfSummary(eggs,
                              valid.col=FALSE), 
      method = 'render')

Data Frame Summary

eggs

Dimensions: 120 x 6
Duplicates: 0
No Variable Stats / Values Freqs (% of Valid) Graph Missing
1 month [character]
1. April
2. August
3. December
4. February
5. January
6. July
7. June
8. March
9. May
10. November
[ 2 others ]
10 ( 8.3% )
10 ( 8.3% )
10 ( 8.3% )
10 ( 8.3% )
10 ( 8.3% )
10 ( 8.3% )
10 ( 8.3% )
10 ( 8.3% )
10 ( 8.3% )
10 ( 8.3% )
20 ( 16.7% )
0 (0.0%)
2 year [numeric]
Mean (sd) : 2008.5 (2.9)
min ≤ med ≤ max:
2004 ≤ 2008.5 ≤ 2013
IQR (CV) : 5 (0)
2004 : 12 ( 10.0% )
2005 : 12 ( 10.0% )
2006 : 12 ( 10.0% )
2007 : 12 ( 10.0% )
2008 : 12 ( 10.0% )
2009 : 12 ( 10.0% )
2010 : 12 ( 10.0% )
2011 : 12 ( 10.0% )
2012 : 12 ( 10.0% )
2013 : 12 ( 10.0% )
0 (0.0%)
3 large_half_dozen [numeric]
Mean (sd) : 155.2 (22.6)
min ≤ med ≤ max:
126 ≤ 174.5 ≤ 178
IQR (CV) : 45.1 (0.1)
126.00   : 1 ( 0.8% )
128.50   : 29 ( 24.2% )
129.75   : 1 ( 0.8% )
131.00   : 3 ( 2.5% )
131.12 ! : 1 ( 0.8% )
132.00   : 15 ( 12.5% )
133.50   : 3 ( 2.5% )
173.25   : 6 ( 5.0% )
174.50   : 47 ( 39.2% )
178.00   : 14 ( 11.7% )
! rounded
0 (0.0%)
4 large_dozen [numeric]
Mean (sd) : 254.2 (18.5)
min ≤ med ≤ max:
225 ≤ 267.5 ≤ 277.5
IQR (CV) : 34.5 (0.1)
12 distinct values 0 (0.0%)
5 extra_large_half_dozen [numeric]
Mean (sd) : 164.2 (24.7)
min ≤ med ≤ max:
132 ≤ 185.5 ≤ 188.1
IQR (CV) : 49.7 (0.2)
132.00   : 1 ( 0.8% )
134.50   : 1 ( 0.8% )
135.50   : 28 ( 23.3% )
135.88 ! : 1 ( 0.8% )
137.00   : 6 ( 5.0% )
138.12 ! : 1 ( 0.8% )
139.00   : 15 ( 12.5% )
185.50   : 53 ( 44.2% )
188.13   : 14 ( 11.7% )
! rounded
0 (0.0%)
6 extra_large_dozen [numeric]
Mean (sd) : 266.8 (22.8)
min ≤ med ≤ max:
230 ≤ 285.5 ≤ 290
IQR (CV) : 44 (0.1)
11 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-03-08

This data set consists of average prices of eggs per pound (in cents) over a 9 year period, from 2004 through 2013. It is broken down further by month.

The prices vary in any given year by carton and egg size. There are both large and extra large eggs, and dozen or half dozen carton varieties. The per pound price is different depending on the combination of these qualities.

Before performing more analysis, we will tidy-up the data set by moving egg price into its own column/variable, and using the carton type to describe the case.

Challenge: Describe the final dimensions

Finding the existing dimensions of “eggs”

Code
#existing rows
nrow(eggs)
[1] 120
Code
#existing columns
ncol(eggs)
[1] 6
Code
#expected rows/cases
nrow(eggs) * (ncol(eggs)-2)
[1] 480
Code
#expected columns 
2 + 2
[1] 4

There are 120 rows in the current dataset, each representing a specific month-year. However, this structure means that there are four different prices in each row. We would like for each row/case to only contain one price, in accordance with Tidydata standards.

There are currently 6 columns. 2 of these describe the case (Year and Month), and 4 of these are describing the carton and size of the eggs.

After combining the 4 price columns into a single “Price per Pound” column, we would expect to see 480 rows.

There will be 4 columns in the final data set, 2 existing descriptors (Year and Month) and 2 new columns (Carton Type and Price per Pound).

Challenge: Pivot the Chosen Data

Code
eggs_pivoted <- eggs %>%
  pivot_longer(col = c(large_half_dozen, large_dozen, extra_large_half_dozen, extra_large_dozen),
               names_to = "carton_type",
               values_to = "price_per_pound")
eggs_pivoted
# A tibble: 480 × 4
   month     year carton_type            price_per_pound
   <chr>    <dbl> <chr>                            <dbl>
 1 January   2004 large_half_dozen                  126 
 2 January   2004 large_dozen                       230 
 3 January   2004 extra_large_half_dozen            132 
 4 January   2004 extra_large_dozen                 230 
 5 February  2004 large_half_dozen                  128.
 6 February  2004 large_dozen                       226.
 7 February  2004 extra_large_half_dozen            134.
 8 February  2004 extra_large_dozen                 230 
 9 March     2004 large_half_dozen                  131 
10 March     2004 large_dozen                       225 
# … with 470 more rows

As you can see, the final data set has the dimensions we expected (480 rows x 4 columns). Each row now describes a single case, which in this case is an average price for specific carton type in a month and year.

Doing some summary analysis on egg price by carton type

Code
eggs_pivoted %>%
  group_by(carton_type) %>%
  summarize(Mean = mean(price_per_pound),
            Median = median(price_per_pound),
            Max = max(price_per_pound),
            Min = min(price_per_pound))
# A tibble: 4 × 5
  carton_type             Mean Median   Max   Min
  <chr>                  <dbl>  <dbl> <dbl> <dbl>
1 extra_large_dozen       267.   286.  290    230
2 extra_large_half_dozen  164.   186.  188.   132
3 large_dozen             254.   268.  278.   225
4 large_half_dozen        155.   174.  178    126

The average price per pound across all years was greatest for extra large eggs when sold in cartons of a dozen eggs.