Final Project Gabriel Katz

final_Project_eggs_tidy
Egg Prices 2004-2013
Author

Gabriel Katz

Published

May 28, 2023

library(tidyverse)
library(dplyr)
library(ggplot2)
library(scales)
library(RColorBrewer)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Part 1. Introduction

Inflation is one of the most commonly discussed economic issues of the current era, especially in the wake of the Covid-19 pandemic and the Russian invasion of Ukraine. An average person may not be able to recite the dictionary definition of inflation, but one thing that everyone knows is that high inflation has resulted in an increase in egg prices. For this project, I will be using a dataset of egg prices collected over a 9 year period between 2004 and 2013. I will be using this dataset to look at how the price of eggs changed during that period, particularly during the 2008 financial crisis. As a novice data scientist, this datset was appealing to work with for two main reasons; its simplicity and cleanliness, and secondly, the potential it has to provide context within larger economic issues. With inflation currently impacting the US economy, it will be interesting to see how a different type of economic trend effected the price of eggs well over a decade ago. Understanding how egg prices fluctuated then could be illustrative of what is happening currently.

Research Question:

  • How did the price of eggs change between 2004 and 2013?

Part 2. Describe the data set

In order to analyze the data, I will be using RStudio to perform a series of functions to look at key aspects of the dataset. In order to do this, I first need to import, or read in the dataset into the program.

I will be using the read.csv function to bring the data into RStudio. The read.csv is the preferred method to do this because that specific function is part of the tidyverse package, an ecosystem of tools that allow data scientists to easily perform a variety of functions on the dataset in question.

# Read in the csv file from the appropriate folder and assign the object a name.
eggs <- read.csv("posts/_data/eggs_tidy.csv")
Error in file(file, "rt"): cannot open the connection

Use the dim function on the dataset to look at the numeric dimensions of the dataset. The resulting dimensions are 120 rows by 6 columns.

# Use the dim function to find the dimensions of the dataset.
dim(eggs)
Error in eval(expr, envir, enclos): object 'eggs' not found

Using the colnames function, I am able to see the names of each of the 6 columns.

# Use the colnames function to see what each column name is. 
colnames(eggs)
Error in is.data.frame(x): object 'eggs' not found

Using the heading function I can look at the first 6 rows of data below each column name.

# Use the head function to look at the first 6 rows.
head(eggs)
Error in head(eggs): object 'eggs' not found

At first glance, this dataset seems to be made up of egg prices broken down by months within years. Each case, or row, contains one month, one year, and the current price of 4 different egg product types: large half dozens, large dozens, extra large half dozens, and extra large dozens. The units of price are displayed in cents.

Before moving ahead with descriptive statistics, analysis, and plotting, I will first use the mutate function on a few parts of the dataset in order to allow for simpler analysis and to make them more understandable.

  • I am going to change each observation in the “month” column into a numerical value. My reason for doing this is to facilitate easier visualization of the dataset. It will allow for the use of a continuous scale when plotting the data.
  • I am going to change the prices in columns 3-6 from cents to dollars and cents. This will allow for the prices to be more readily understandable when looking at the summary statistics and visualizations.
# First, I will use the mutate function in a piped sequence to change each month into a 
# numerical value. Next, I am going to pipe the data again to recode the cents values 
# in columns 3 through 6 into dollars and cents. I will do this by simply dividing by 
# 100. I will also round each observation to 2 decimal places. 
eggs_mutated <- eggs%>%
  mutate(month = recode(month,
                        "January" = 1,
                        "February" = 2,
                        "March" = 3,
                        "April" = 4,
                        "May" = 5,
                        "June" = 6,
                        "July" = 7,
                        "August" = 8,
                        "September" = 9,
                        "October" = 10,
                        "November" = 11,
                        "December" = 12)) %>%
  mutate(across(3:6, ~ round(. / 100, 2)))
Error in mutate(., month = recode(month, January = 1, February = 2, March = 3, : object 'eggs' not found
head(eggs_mutated)
Error in head(eggs_mutated): object 'eggs_mutated' not found

After using the piped mutation sequence, I now have an adjusted dataset that I can produce descriptive statistics with.

Part 3. Summary Statistics and Dataset Description

In order to produce summary statistics, I am going to use the tidyverse function summarise along with the across argument to produce basic measurements of the data including the mean, median, standard deviation, minimum, and maximum for each of the 6 columns.

# By using summarise across, I can apply several functions to each observation 
# within each case. 
eggs_summarised <- eggs_mutated %>%
  summarise(across(1:6, list(mean = mean, median = median, sd = sd, min = min, max = max)))
Error in summarise(., across(1:6, list(mean = mean, median = median, sd = sd, : object 'eggs_mutated' not found
eggs_summarised
Error in eval(expr, envir, enclos): object 'eggs_summarised' not found

In the dataframe shown above, I applied five functions to each of the observations across each of the 6 columns from the mutated dataset.

By mutating the months from alphabetic into numerical values, I was able to produce numerical statistics concerning the month observations in each row. I can see that the minimum and maximum of each month is 1 and 12, indicating that the dataset contains observations that span the entire year. When looking at the standard deviations across each category, I find it interesting that the standard deviation of large_dozen is at ~.18 while the sd of all 3 other categories are between ~.22 and ~.25. It could be worth looking into why the standard deviation of large dozens is smaller than the rest of the egg product types in this dataset.

For my next set of summary statistics, I am going to focus specifically on the year 2008, when the financial crisis occurred. I am going to produce the same set of summary statistics as I did previously. I will extract cases from 2008 using the filter function.

# By using the filter function, I am pulling out only observations from 2008.
eggs_2008 <- eggs_mutated %>%
  filter(year == 2008) %>%
  summarise(across(3:6, list(mean = mean, 
                             median = median, 
                             sd = sd, 
                             min = min, 
                             max = max)))
Error in filter(., year == 2008): object 'eggs_mutated' not found
eggs_2008
Error in eval(expr, envir, enclos): object 'eggs_2008' not found

For the previous set of summary statistics, I only focused on the product type prices, and not the years or months. Given that I filtered out only cases from 2008, there was no need to produce summary statistics for the year and month columns. When looking at the summary statistics for the price columns, I can see that there was a fair amount of fluctuation among the price of each category of egg product in 2008. For example, there was a 46 cent difference between the minimum and maximum price for an extra large half dozen.

Part 4. Plan for Analysis + Data Wrangling

Based on what can be seen in the dataset so far, I am going to hone my original research question to focus in on a specific part of the data.

  • How did price changes during 2008 compare to price changes in other years?

In order to analyze this question, I will first pivot the data into a longer format. My goal in pivoting the dataset is to create a single column containing the egg product type and a corresponding column for the price. My expectation in pivoting the dataset is that I will go from 6 columns down to 4, as I will be combining each of the 4 egg product type columns into a single column, and then creating 1 new column containing only the price. In doing this pivot, I expect to go from 120 rows to 480 rows. I expect the 480 rows because I am taking 120 existing rows and multiplying it by the 4 egg product types.

# Pivot the data from wide to long.
eggs_longer <- eggs_mutated %>%
  pivot_longer(cols = c(large_half_dozen, large_dozen, extra_large_half_dozen, 
                        extra_large_dozen),
               names_to = "set_of_eggs",
               values_to = "price")
Error in pivot_longer(., cols = c(large_half_dozen, large_dozen, extra_large_half_dozen, : object 'eggs_mutated' not found
eggs_longer
Error in eval(expr, envir, enclos): object 'eggs_longer' not found

As expected, after pivoting this data into a longer format, I now have 4 columns of variables and 480 rows. The 2 new columns I created are set_of_eggs, referring to the egg product type, as well as the corresponding price. The purpose of pivoting the data was to get it into a form where each case now consists of only one unique price observation. Rather than having all 4 prices side by side, they are now broken down into 1 price corresponding to one type of egg product. This will make plotting more effective.

Part 5. Plotting and Analyzing the Data

In order to illustrate the comparative question from Part 4, I am going to create 4 different plots which show the following aspects of the data:

  • The price changes across each egg product type over all years measured in the dataset
  • The price changes in 2008 specifically

I will first create a point plot that shows the price of a large dozen eggs across all years included in the dataset.

# Create a point plot using ggplot2. I adjust the color scale to represent 
# prices, the points sized at 9 for optimal viewing, and the alpha value at 
# 0.3 to introduce opacity among overlapping points. 
# In the scale_y_continuous function I ensure that breaks are set to 10 cent 
# intervals and the labels indicate that these are dollar amounts. 
ggplot(eggs_mutated, aes(x = year, y = large_dozen, color = large_dozen)) +
  geom_point(size = 9, alpha = 0.3) +
  scale_x_continuous(breaks = unique(eggs_mutated$year)) +
  scale_y_continuous(labels = function(x) sprintf("$%.2f", x)) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(x = "Year", y = "Price of Large Dozen Eggs (USD)", title = 
         "Price of Large Dozen Eggs from 2004 - 2013")
Error in ggplot(eggs_mutated, aes(x = year, y = large_dozen, color = large_dozen)): object 'eggs_mutated' not found

It is clear when looking at this plot that the largest difference in price in a single year was in 2008. It is also interesting to see the gradual reduction in price in the years following the financial crisis. It is quite remarkable to see the large gap between data points in 2008. Either this resulted from non-nuanced data collection, or the sheer impact of the financial crisis and subsequent great recession. It is also interesting to see the smaller but still noticeable price fluctuation within the year of 2004.

For the next plot, I am going to create another point plot, however I am going to use the facet wrap function to create a plot for each egg product type. This will essentially produce the same plot as before, however now I will be able to see each of the egg product types side by side in order to see if they all follow similar price trends.

# Create a plot based on price for each egg product type and use facet_wrap to 
# display them each individually.
# Setting breaks on the y-axis using a sequence from $1.20 to the maximum price 
# in the data at increments of 0.20 cents.
# Adjust the angle of the x-axis text so they are displayed at an angle.
ggplot(eggs_longer, aes(x = year, y = price, color = price)) +
  geom_point(size = 6, alpha = 0.15) +
  scale_x_continuous(breaks = unique(eggs_mutated$year)) +
  scale_y_continuous(labels = function(x) sprintf("$%.2f", x),
                     breaks = seq(0, max(eggs_longer$price), by = 0.2)) +
  scale_color_gradient(low = "blue", high = "red") +
  facet_wrap(~ set_of_eggs, nrow = 2) +
  labs(title = "Price of Eggs by Category", x = "Date", y = "Price (USD)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Error in ggplot(eggs_longer, aes(x = year, y = price, color = price)): object 'eggs_longer' not found

This point plot visualizes the prices of each egg product type in relation to one another. What I find interesting is that the large dozen price declines after peaking in the year 2008, while the other 3 categories stay pretty much the same. This explains why the large dozen price standard deviation is smaller than the rest of the categories - more of the observations are closer in range to each other. Further research would be necessary to understand why the large dozen category seems to hold tighter to its price than the other categories. Still, as shown by this plot, each egg product type had a significant price change in 2008.

To further examine the price fluctuation during the financial crisis, I am going to now focus specifically on the year 2008. I am going to take my previously pivoted dataset, filter it so only data from 2008 is used, and then mutate it once more to combine the year and the date columns into one. This will allow me to plot a continuous line that will show the price increase clearly throughout the year.

# Filter 2008 from my pivoted dataset and mutate the year and month into a new column
eggs_2008_longer <- eggs_longer %>% 
  filter(year == 2008) %>%
  mutate(date = as.Date(paste(year, month, "01", sep = "-")))
Error in filter(., year == 2008): object 'eggs_longer' not found
# Use RColorBrewer to pick a distinctive palette for better viewing
line_colors <- brewer.pal(4, "Set1")

# create a smooth line plot using loess regression so it better illustrates the 
# price fluctuation
ggplot(eggs_2008_longer, aes(x = as.Date(date), y = price, color = set_of_eggs)) +
  geom_smooth(method = "loess", size = 1.5) +
  labs(title = "Egg Price Fluctuations - 2008", x = "Date", y = "Price (USD)") + 
  scale_y_continuous(labels = function(x) sprintf("$%.2f", x),
                     breaks = seq(0, max(eggs_2008_longer$price), by = 0.1)) +
  scale_color_manual(values = line_colors) +  # Set the color palette 
  # for the line colors produced with RColorBrewer
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +  
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 
Error in ggplot(eggs_2008_longer, aes(x = as.Date(date), y = price, color = set_of_eggs)): object 'eggs_2008_longer' not found

This smoothed line plot displays the price fluctuations during 2008 for each egg product type. Utilizing RColorBrewer, I picked a set of colors that contrast nicely so each egg product type is displayed clearly. This line plot shows in greater detail what was shown on the faceted point plot displayed earlier. Using the ggplot2 loess functionality, I was able to smooth and fit the line so that the price fluctuations are displayed clearly.

As a way to show how much fluctuation in price there was in 2008 in comparison to all other years, I am going to create a boxplot which shows relative price change for each year across the entire dataset.

# Create boxplots for each year with color based on egg type
ggplot(data = eggs_longer, aes(x = factor(year), 
                               y = price/10, 
                               fill = set_of_eggs)) +
  geom_boxplot(width = 1, outlier.shape = "triangle") +
  scale_fill_manual(values = c("large_half_dozen" = "red", 
                               "large_dozen" = "green",
                               "extra_large_half_dozen" = "blue", 
                               "extra_large_dozen" = "orange")) +   
  scale_y_continuous(labels = function(x) sprintf("$%.2f", x*10)) +
  labs(title = "Egg Price Changes by Year", x = "Year", y = "Price (USD)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Error in ggplot(data = eggs_longer, aes(x = factor(year), y = price/10, : object 'eggs_longer' not found

The boxplot function is effective at illustrating the price fluctuations because the height of the box represents the amount of change. The largest boxes are located around 2008, with the next largest boxes around 2004. Further research would be required to understand the underlying economic cause during that time period that caused the price to fluctuate the way it did. And again, it is interesting to see the large dozen price decrease post-2008, unlike the 3 other egg product types.

Part 6. Discussion

Based on the 4 plots shown, it is clear that the price fluctuation across all egg product types in 2008 was the largest out of any year in the dataset. Each type of plot shown indicated the price changes during 2008 were the largest among each year present in the dataset. The simplicity and cleanliness of this dataset enabled easy data wrangling to get the dataset into a format that allowed for effective plotting.

Additionally, it was evident from the plotting that each egg product type stayed relatively consistent in price in comparison to its counterparts. For instance, extra large eggs were more expensive overall than large eggs, as expected. The line plot displaying the price change in 2008 showed that while the overall price change was significant, each egg product type experienced roughly the same amount of relative change. The only slight inconsistency was the way the price in the large dozen category experienced a slight decrease in the years after 2008. It could be worth further researching this in a separate project, using a different dataset that had some type of supply or demand measurement to show why the price of large dozens decreased while other major categories did not.

Furthermore, it would be interesting to utilize a dataset that includes other variables, such as socioeconomic or political ones, to show how egg prices increased or decreased alongside other variables. This could shed more light on the impact of major trends on commodity prices in general.

Part 7. Conclusion

Based on the descriptive statistics and plots I have produced, I am able to provide an answer to the question I posed in Part 4: The price of eggs changed the greatest amount in 2008 out of any year between 2004 and 2013. While there were other years that exhibited minor price fluctuations, it was clear when analyzing this dataset that the changes in other years did not come close the amount of change in 2008.

Further research would be necessary to definitively state why this was the case. It could be interpreted that the price change in 2008 was the result of the financial crisis and subsequent recession. However, I would need to include other variables in my analysis to show a positive correlation that proves that egg prices are tied to some other economic factor.

It will be interesting to look at data on commodity prices from more recent years that spanned the pandemic and subsequent recovery. When price trends from multiple eras can be compared in relation to each other, it could strengthen the argument that certain types of commodities, including eggs, are strong indicators of prevalent economic trends.

Bibliography

DataNovia. The A-Z of RColorBrewer Palette. Retrieved May 25, 2023, from https://www.datanovia.com/en/blog/the-a-z-of-rcolorbrewer-palette/

Dataset: eggs_tidy.csv. Retrieved from https://github.com/DACSS/601_Spring_2023

R Core Team. (2022-06-23). R: A language and environment for statistical computing (Version 4.2.1) [Computer software]. The R Foundation for Statistical Computing. https://www.R-project.org/

The R Graph Gallery https://r-graph-gallery.com

Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. O’Reilly Media. https://r4ds.had.co.nz/index.html