Code
library(tidyverse)
library(summarytools)
::opts_chunk$set(echo = TRUE) knitr
Sarah McAlpine
September 26, 2022
For this challenge, I will read in and pivot the eggs_tidy.csv data to prepare it for analysis.
After looking at the rendered data frame summary, I know that there are 6 columns of data and 120 rows, containing all 12 months of the year, and the years 2004-2013. The remaining 4 columns are large_half_dozen, large_dozen, extra_large_half_dozen, and extra_large_dozen, all with their values ranging from 126-290. By researching the source of this data, I know that the values correspond to the price in cents, meaning that egg prices range from $1.26 to $2.90 for large and extra large half-dozens and dozens of eggs. While I read in the data, I will also rename the columns to separate the size and quantity variables for my next step. This will not change the dimensions of the data.
Rows: 120 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): month
dbl (5): year, large_half_dozen, large_dozen, extra_large_half_dozen, extra_...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Data Frame Summary
eggs
Dimensions: 120 x 6
Duplicates: 0
-------------------------------------------------------------------------------------------------------------------
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
---- ------------------ -------------------------- --------------------- --------------------- ---------- ---------
1 month 1. April 10 ( 8.3%) I 120 0
[character] 2. August 10 ( 8.3%) I (100.0%) (0.0%)
3. December 10 ( 8.3%) I
4. February 10 ( 8.3%) I
5. January 10 ( 8.3%) I
6. July 10 ( 8.3%) I
7. June 10 ( 8.3%) I
8. March 10 ( 8.3%) I
9. May 10 ( 8.3%) I
10. November 10 ( 8.3%) I
[ 2 others ] 20 (16.7%) III
2 year Mean (sd) : 2008.5 (2.9) 2004 : 12 (10.0%) II 120 0
[numeric] min < med < max: 2005 : 12 (10.0%) II (100.0%) (0.0%)
2004 < 2008.5 < 2013 2006 : 12 (10.0%) II
IQR (CV) : 5 (0) 2007 : 12 (10.0%) II
2008 : 12 (10.0%) II
2009 : 12 (10.0%) II
2010 : 12 (10.0%) II
2011 : 12 (10.0%) II
2012 : 12 (10.0%) II
2013 : 12 (10.0%) II
3 large_halfdozen Mean (sd) : 155.2 (22.6) 126.00 : 1 ( 0.8%) 120 0
[numeric] min < med < max: 128.50 : 29 (24.2%) IIII (100.0%) (0.0%)
126 < 174.5 < 178 129.75 : 1 ( 0.8%)
IQR (CV) : 45.1 (0.1) 131.00 : 3 ( 2.5%)
131.12!: 1 ( 0.8%)
132.00 : 15 (12.5%) II
133.50 : 3 ( 2.5%)
173.25 : 6 ( 5.0%) I
174.50 : 47 (39.2%) IIIIIII
178.00 : 14 (11.7%) II
! rounded
4 large_dozen Mean (sd) : 254.2 (18.5) 12 distinct values : 120 0
[numeric] min < med < max: : (100.0%) (0.0%)
225 < 267.5 < 277.5 : :
IQR (CV) : 34.5 (0.1) : . : .
. : : : :
5 xlarge_halfdozen Mean (sd) : 164.2 (24.7) 132.00 : 1 ( 0.8%) 120 0
[numeric] min < med < max: 134.50 : 1 ( 0.8%) (100.0%) (0.0%)
132 < 185.5 < 188.1 135.50 : 28 (23.3%) IIII
IQR (CV) : 49.7 (0.2) 135.88!: 1 ( 0.8%)
137.00 : 6 ( 5.0%) I
138.12!: 1 ( 0.8%)
139.00 : 15 (12.5%) II
185.50 : 53 (44.2%) IIIIIIII
188.13 : 14 (11.7%) II
! rounded
6 xlarge_dozen Mean (sd) : 266.8 (22.8) 11 distinct values : 120 0
[numeric] min < med < max: . : (100.0%) (0.0%)
230 < 285.5 < 290 : :
IQR (CV) : 44 (0.1) : :
. : :
-------------------------------------------------------------------------------------------------------------------
month year large_halfdozen large_dozen
Length:120 Min. :2004 Min. :126.0 Min. :225.0
Class :character 1st Qu.:2006 1st Qu.:129.4 1st Qu.:233.5
Mode :character Median :2008 Median :174.5 Median :267.5
Mean :2008 Mean :155.2 Mean :254.2
3rd Qu.:2011 3rd Qu.:174.5 3rd Qu.:268.0
Max. :2013 Max. :178.0 Max. :277.5
xlarge_halfdozen xlarge_dozen
Min. :132.0 Min. :230.0
1st Qu.:135.8 1st Qu.:241.5
Median :185.5 Median :285.5
Mean :164.2 Mean :266.8
3rd Qu.:185.5 3rd Qu.:285.5
Max. :188.1 Max. :290.0
# A tibble: 6 × 6
month year large_halfdozen large_dozen xlarge_halfdozen xlarge_dozen
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 January 2004 126 230 132 230
2 February 2004 128. 226. 134. 230
3 March 2004 131 225 137 230
4 April 2004 131 225 137 234.
5 May 2004 131 225 137 236
6 June 2004 134. 231. 137 241
In considering this data set, I can expect one observation to include a month, a year, an egg size, and a quantity of eggs. Arranging the data in this way will allow analysis of price changes throughout the year and over time for both large and extra large eggs, as well as whether they are sold in dozens or half dozens.
I expect the resulting data set to be four times as long since each size-quantity pairing will appear in its own row (rather than the initial 4 columns after month and year). I expect the total number of columns to reduce by one, since I will remove the 4 size-quantity columns names and replace them with a column each for size, quantity, and average price. See below for this newly rearranged data.
# A tibble: 480 × 5
month year size quantity avg_price
<chr> <dbl> <chr> <chr> <dbl>
1 January 2004 large halfdozen 126
2 January 2004 large dozen 230
3 January 2004 xlarge halfdozen 132
4 January 2004 xlarge dozen 230
5 February 2004 large halfdozen 128.
6 February 2004 large dozen 226.
7 February 2004 xlarge halfdozen 134.
8 February 2004 xlarge dozen 230
9 March 2004 large halfdozen 131
10 March 2004 large dozen 225
# … with 470 more rows
As I predicted, I now have four times as many rows (120 became 480), and one column fewer (6 became 5). Now I have a single observation per row and am ready to begin analysis. I may wish to mutate
the cents into dollars depending on the nature of my analysis or if there are other tables to bring in at some point. The preview tibble below rounds to whole cents, but the data retains the precise decimals.
# A tibble: 480 × 5
month year size quantity avg_USD
<chr> <dbl> <chr> <chr> <dbl>
1 January 2004 large halfdozen 1.26
2 January 2004 large dozen 2.3
3 January 2004 xlarge halfdozen 1.32
4 January 2004 xlarge dozen 2.3
5 February 2004 large halfdozen 1.28
6 February 2004 large dozen 2.26
7 February 2004 xlarge halfdozen 1.34
8 February 2004 xlarge dozen 2.3
9 March 2004 large halfdozen 1.31
10 March 2004 large dozen 2.25
# … with 470 more rows
---
title: "Sarah McAlpine - Challenge 3"
author: "Sarah McAlpine"
desription: #Pivoting Data#"
date: "09/26/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_3
- sarahmcalpine
- eggs
- pivot_longer()
- rename()
- mutate()
---
```{r}
#| label: setup
#| warning: false
library(tidyverse)
library(summarytools)
knitr::opts_chunk$set(echo = TRUE)
```
## Pivoting "Tidy" Egg Data
For this challenge, I will read in and pivot the eggs_tidy.csv data to prepare it for analysis.
After looking at the rendered data frame summary, I know that there are 6 columns of data and 120 rows, containing all 12 months of the year, and the years 2004-2013. The remaining 4 columns are large_half_dozen, large_dozen, extra_large_half_dozen, and extra_large_dozen, all with their values ranging from 126-290. By researching the source of this data, I know that the values correspond to the price in cents, meaning that egg prices range from $1.26 to $2.90 for large and extra large half-dozens and dozens of eggs. While I read in the data, I will also rename the columns to separate the size and quantity variables for my next step. This will not change the dimensions of the data.
```{r}
# read in egg data and assign a name
eggs <- read_csv("_data/eggs_tidy.csv") %>%
rename("xlarge_halfdozen" = "extra_large_half_dozen",
"xlarge_dozen" = "extra_large_dozen",
"large_halfdozen" = "large_half_dozen")
print(dfSummary(eggs))
# quickly find mins, maxes, and ranges of data
summary(eggs)
# preview first several rows
head(eggs)
```
## Predict Dimensions of Tidy Data
In considering this data set, I can expect one observation to include a month, a year, an egg size, and a quantity of eggs. Arranging the data in this way will allow analysis of price changes throughout the year and over time for both large and extra large eggs, as well as whether they are sold in dozens or half dozens.
I expect the resulting data set to be four times as long since each size-quantity pairing will appear in its own row (rather than the initial 4 columns after month and year). I expect the total number of columns to reduce by one, since I will remove the 4 size-quantity columns names and replace them with a column each for size, quantity, and average price. See below for this newly rearranged data.
```{r}
eggs_longer <- eggs %>%
pivot_longer(cols = contains("large"),
names_to = c("size", "quantity"),
names_sep = "_",
values_to = "avg_price")
eggs_longer
```
## Conclusion
As I predicted, I now have four times as many rows (120 became 480), and one column fewer (6 became 5). Now I have a single observation per row and am ready to begin analysis. I may wish to `mutate` the cents into dollars depending on the nature of my analysis or if there are other tables to bring in at some point. The preview tibble below rounds to whole cents, but the data retains the precise decimals.
```{r}
eggs_USD <- mutate(eggs_longer,
avg_USD = avg_price / 100
)%>%
select(!contains ("price"))
eggs_USD
```