DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Sarah McAlpine - Challenge 3

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Pivoting “Tidy” Egg Data
  • Predict Dimensions of Tidy Data
  • Conclusion

Sarah McAlpine - Challenge 3

  • Show All Code
  • Hide All Code

  • View Source
challenge_3
sarahmcalpine
eggs
pivot_longer()
rename()
mutate()
Author

Sarah McAlpine

Published

September 26, 2022

Code
library(tidyverse)
library(summarytools)

knitr::opts_chunk$set(echo = TRUE)

Pivoting “Tidy” Egg Data

For this challenge, I will read in and pivot the eggs_tidy.csv data to prepare it for analysis.

After looking at the rendered data frame summary, I know that there are 6 columns of data and 120 rows, containing all 12 months of the year, and the years 2004-2013. The remaining 4 columns are large_half_dozen, large_dozen, extra_large_half_dozen, and extra_large_dozen, all with their values ranging from 126-290. By researching the source of this data, I know that the values correspond to the price in cents, meaning that egg prices range from $1.26 to $2.90 for large and extra large half-dozens and dozens of eggs. While I read in the data, I will also rename the columns to separate the size and quantity variables for my next step. This will not change the dimensions of the data.

Code
# read in egg data and assign a name
eggs <- read_csv("_data/eggs_tidy.csv") %>%
  rename("xlarge_halfdozen" = "extra_large_half_dozen", 
         "xlarge_dozen" = "extra_large_dozen", 
         "large_halfdozen" = "large_half_dozen") 
Rows: 120 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): month
dbl (5): year, large_half_dozen, large_dozen, extra_large_half_dozen, extra_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
print(dfSummary(eggs))
Data Frame Summary  
eggs  
Dimensions: 120 x 6  
Duplicates: 0  

-------------------------------------------------------------------------------------------------------------------
No   Variable           Stats / Values             Freqs (% of Valid)    Graph                 Valid      Missing  
---- ------------------ -------------------------- --------------------- --------------------- ---------- ---------
1    month              1. April                   10 ( 8.3%)            I                     120        0        
     [character]        2. August                  10 ( 8.3%)            I                     (100.0%)   (0.0%)   
                        3. December                10 ( 8.3%)            I                                         
                        4. February                10 ( 8.3%)            I                                         
                        5. January                 10 ( 8.3%)            I                                         
                        6. July                    10 ( 8.3%)            I                                         
                        7. June                    10 ( 8.3%)            I                                         
                        8. March                   10 ( 8.3%)            I                                         
                        9. May                     10 ( 8.3%)            I                                         
                        10. November               10 ( 8.3%)            I                                         
                        [ 2 others ]               20 (16.7%)            III                                       

2    year               Mean (sd) : 2008.5 (2.9)   2004 : 12 (10.0%)     II                    120        0        
     [numeric]          min < med < max:           2005 : 12 (10.0%)     II                    (100.0%)   (0.0%)   
                        2004 < 2008.5 < 2013       2006 : 12 (10.0%)     II                                        
                        IQR (CV) : 5 (0)           2007 : 12 (10.0%)     II                                        
                                                   2008 : 12 (10.0%)     II                                        
                                                   2009 : 12 (10.0%)     II                                        
                                                   2010 : 12 (10.0%)     II                                        
                                                   2011 : 12 (10.0%)     II                                        
                                                   2012 : 12 (10.0%)     II                                        
                                                   2013 : 12 (10.0%)     II                                        

3    large_halfdozen    Mean (sd) : 155.2 (22.6)   126.00 :  1 ( 0.8%)                         120        0        
     [numeric]          min < med < max:           128.50 : 29 (24.2%)   IIII                  (100.0%)   (0.0%)   
                        126 < 174.5 < 178          129.75 :  1 ( 0.8%)                                             
                        IQR (CV) : 45.1 (0.1)      131.00 :  3 ( 2.5%)                                             
                                                   131.12!:  1 ( 0.8%)                                             
                                                   132.00 : 15 (12.5%)   II                                        
                                                   133.50 :  3 ( 2.5%)                                             
                                                   173.25 :  6 ( 5.0%)   I                                         
                                                   174.50 : 47 (39.2%)   IIIIIII                                   
                                                   178.00 : 14 (11.7%)   II                                        
                                                   ! rounded                                                       

4    large_dozen        Mean (sd) : 254.2 (18.5)   12 distinct values                    :     120        0        
     [numeric]          min < med < max:                                                 :     (100.0%)   (0.0%)   
                        225 < 267.5 < 277.5                                :             :                         
                        IQR (CV) : 34.5 (0.1)                              : .           : .                       
                                                                         . : :           : :                       

5    xlarge_halfdozen   Mean (sd) : 164.2 (24.7)   132.00 :  1 ( 0.8%)                         120        0        
     [numeric]          min < med < max:           134.50 :  1 ( 0.8%)                         (100.0%)   (0.0%)   
                        132 < 185.5 < 188.1        135.50 : 28 (23.3%)   IIII                                      
                        IQR (CV) : 49.7 (0.2)      135.88!:  1 ( 0.8%)                                             
                                                   137.00 :  6 ( 5.0%)   I                                         
                                                   138.12!:  1 ( 0.8%)                                             
                                                   139.00 : 15 (12.5%)   II                                        
                                                   185.50 : 53 (44.2%)   IIIIIIII                                  
                                                   188.13 : 14 (11.7%)   II                                        
                                                   ! rounded                                                       

6    xlarge_dozen       Mean (sd) : 266.8 (22.8)   11 distinct values              :           120        0        
     [numeric]          min < med < max:                                   .       :           (100.0%)   (0.0%)   
                        230 < 285.5 < 290                                  :       :                               
                        IQR (CV) : 44 (0.1)                                :       :                               
                                                                         . :       :                               
-------------------------------------------------------------------------------------------------------------------
Code
# quickly find mins, maxes, and ranges of data
summary(eggs)
    month                year      large_halfdozen  large_dozen   
 Length:120         Min.   :2004   Min.   :126.0   Min.   :225.0  
 Class :character   1st Qu.:2006   1st Qu.:129.4   1st Qu.:233.5  
 Mode  :character   Median :2008   Median :174.5   Median :267.5  
                    Mean   :2008   Mean   :155.2   Mean   :254.2  
                    3rd Qu.:2011   3rd Qu.:174.5   3rd Qu.:268.0  
                    Max.   :2013   Max.   :178.0   Max.   :277.5  
 xlarge_halfdozen  xlarge_dozen  
 Min.   :132.0    Min.   :230.0  
 1st Qu.:135.8    1st Qu.:241.5  
 Median :185.5    Median :285.5  
 Mean   :164.2    Mean   :266.8  
 3rd Qu.:185.5    3rd Qu.:285.5  
 Max.   :188.1    Max.   :290.0  
Code
# preview first several rows
head(eggs)
# A tibble: 6 × 6
  month     year large_halfdozen large_dozen xlarge_halfdozen xlarge_dozen
  <chr>    <dbl>           <dbl>       <dbl>            <dbl>        <dbl>
1 January   2004            126         230              132          230 
2 February  2004            128.        226.             134.         230 
3 March     2004            131         225              137          230 
4 April     2004            131         225              137          234.
5 May       2004            131         225              137          236 
6 June      2004            134.        231.             137          241 

Predict Dimensions of Tidy Data

In considering this data set, I can expect one observation to include a month, a year, an egg size, and a quantity of eggs. Arranging the data in this way will allow analysis of price changes throughout the year and over time for both large and extra large eggs, as well as whether they are sold in dozens or half dozens.

I expect the resulting data set to be four times as long since each size-quantity pairing will appear in its own row (rather than the initial 4 columns after month and year). I expect the total number of columns to reduce by one, since I will remove the 4 size-quantity columns names and replace them with a column each for size, quantity, and average price. See below for this newly rearranged data.

Code
eggs_longer <- eggs %>%
   pivot_longer(cols = contains("large"),
               names_to = c("size", "quantity"),
               names_sep = "_",
               values_to = "avg_price")

eggs_longer
# A tibble: 480 × 5
   month     year size   quantity  avg_price
   <chr>    <dbl> <chr>  <chr>         <dbl>
 1 January   2004 large  halfdozen      126 
 2 January   2004 large  dozen          230 
 3 January   2004 xlarge halfdozen      132 
 4 January   2004 xlarge dozen          230 
 5 February  2004 large  halfdozen      128.
 6 February  2004 large  dozen          226.
 7 February  2004 xlarge halfdozen      134.
 8 February  2004 xlarge dozen          230 
 9 March     2004 large  halfdozen      131 
10 March     2004 large  dozen          225 
# … with 470 more rows

Conclusion

As I predicted, I now have four times as many rows (120 became 480), and one column fewer (6 became 5). Now I have a single observation per row and am ready to begin analysis. I may wish to mutate the cents into dollars depending on the nature of my analysis or if there are other tables to bring in at some point. The preview tibble below rounds to whole cents, but the data retains the precise decimals.

Code
eggs_USD <- mutate(eggs_longer, 
       avg_USD = avg_price / 100
       )%>%
  select(!contains ("price"))
eggs_USD
# A tibble: 480 × 5
   month     year size   quantity  avg_USD
   <chr>    <dbl> <chr>  <chr>       <dbl>
 1 January   2004 large  halfdozen    1.26
 2 January   2004 large  dozen        2.3 
 3 January   2004 xlarge halfdozen    1.32
 4 January   2004 xlarge dozen        2.3 
 5 February  2004 large  halfdozen    1.28
 6 February  2004 large  dozen        2.26
 7 February  2004 xlarge halfdozen    1.34
 8 February  2004 xlarge dozen        2.3 
 9 March     2004 large  halfdozen    1.31
10 March     2004 large  dozen        2.25
# … with 470 more rows
Source Code
---
title: "Sarah McAlpine - Challenge 3"
author: "Sarah McAlpine"
desription: #Pivoting Data#"
date: "09/26/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_3
  - sarahmcalpine
  - eggs
  - pivot_longer()
  - rename()
  - mutate()
---

```{r}
#| label: setup
#| warning: false

library(tidyverse)
library(summarytools)

knitr::opts_chunk$set(echo = TRUE)
```

## Pivoting "Tidy" Egg Data

For this challenge, I will read in and pivot the eggs_tidy.csv data to prepare it for analysis.

After looking at the rendered data frame summary, I know that there are 6 columns of data and 120 rows, containing all 12 months of the year, and the years 2004-2013. The remaining 4 columns are large_half_dozen, large_dozen, extra_large_half_dozen, and extra_large_dozen, all with their values ranging from 126-290. By researching the source of this data, I know that the values correspond to the price in cents, meaning that egg prices range from $1.26 to $2.90 for large and extra large half-dozens and dozens of eggs. While I read in the data, I will also rename the columns to separate the size and quantity variables for my next step. This will not change the dimensions of the data.

```{r}
# read in egg data and assign a name
eggs <- read_csv("_data/eggs_tidy.csv") %>%
  rename("xlarge_halfdozen" = "extra_large_half_dozen", 
         "xlarge_dozen" = "extra_large_dozen", 
         "large_halfdozen" = "large_half_dozen") 
print(dfSummary(eggs))
# quickly find mins, maxes, and ranges of data
summary(eggs)
# preview first several rows
head(eggs)

```

## Predict Dimensions of Tidy Data 
In considering this data set, I can expect one observation to include a month, a year, an egg size, and a quantity of eggs. Arranging the data in this way will allow analysis of price changes throughout the year and over time for both large and extra large eggs, as well as whether they are sold in dozens or half dozens. 

I expect the resulting data set to be four times as long since each size-quantity pairing will appear in its own row (rather than the initial 4 columns after month and year). I expect the total number of columns to reduce by one, since  I will remove the 4 size-quantity columns names and replace them with a column each for size, quantity, and average price. See below for this newly rearranged data.
```{r}
eggs_longer <- eggs %>%
   pivot_longer(cols = contains("large"),
               names_to = c("size", "quantity"),
               names_sep = "_",
               values_to = "avg_price")

eggs_longer
```
## Conclusion
As I predicted, I now have four times as many rows (120 became 480), and one column fewer (6 became 5). Now I have a single observation per row and am ready to begin analysis. I may wish to `mutate` the cents into dollars depending on the nature of my analysis or if there are other tables to bring in at some point. The preview tibble below rounds to whole cents, but the data retains the precise decimals.
```{r}
eggs_USD <- mutate(eggs_longer, 
       avg_USD = avg_price / 100
       )%>%
  select(!contains ("price"))
eggs_USD
```