Changes in the price of poultry between 2003-14
This research project asks: how has the price of poultry changed over time? (For simplicity, the term poultry refers to both chicken and egg products.) Our data set shows, at a detailed level, how prices compare each month for specific poultry products, but it does not show in broad strokes how prices have changed over time, nor does it reveal potentially valuable patterns in those changes. For the big picture, we will rely on plots and visual interpretation. For more granular insights into the nature of poultry prices over time, we will rely on statistical methods.
How has the price of poultry products changed over time?
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(tidyverse)
library(readxl)
organic <- read_excel(
path = "../Downloads/organiceggpoultry.xlsx",
skip = 4
)
head(organic)
# A tibble: 6 x 11
...1 `Extra Large \nDozen` `Extra Large 1/2 Doz~ `Large \nDozen`
<chr> <dbl> <dbl> <dbl>
1 Jan 2004 230 132 230
2 February 230 134. 226.
3 March 230 137 225
4 April 234. 137 225
5 May 236 137 225
6 June 241 137 231.
# ... with 7 more variables: Large
1/2 Doz. <dbl>, ...6 <lgl>,
# Whole <dbl>, B/S Breast <dbl>, Bone-in Breast <chr>,
# Whole Legs <dbl>, Thighs <chr>
# let's tidy the column names before turning to the values.
# start by removing line breaks
names(organic) <- gsub("\n", " ", names(organic))
# give the first column a name to make it mutable
# (we don't have to, but it makes things more legible)
colnames(organic)[1] <- "Date"
# remove typo from the third column name
colnames(organic)[3] <- "Extra Large 1/2 Doz."
organic <- organic %>%
# drop empty column
select(-6) %>%
# ok, now let's tidy the values...
# remove weird extra characters from Date
mutate(Date = ifelse(grepl("/", Date), gsub(".{3}$", "", Date), Date)) %>%
# separate Date into month and year variables
separate(Date, sep = " ", into = c("Month", "Year")) %>%
# expand the abbreviated month name (anchored so that "January" is left unchanged)
mutate(Month = str_replace(Month, "^Jan$", "January")) %>%
# fill in missing year values
fill(Year) %>%
# cast variables as numeric values (also replaces strings with na)
mutate(Year = as.numeric(Year)) %>%
mutate(across(3:11, as.numeric)) %>%
# then drop na values, which should be ok for our research
drop_na() %>%
# pivot product columns so that every row is one observation (month, year, product, price)
pivot_longer(-c(Month, Year), names_to = "Product", values_to = "Price")
head(organic)
# A tibble: 6 x 4
Month Year Product Price
<chr> <dbl> <chr> <dbl>
1 July 2004 Extra Large Dozen 241
2 July 2004 Extra Large 1/2 Doz. 137
3 July 2004 Large Dozen 234.
4 July 2004 Large 1/2 Doz. 134.
5 July 2004 Whole 217
6 July 2004 B/S Breast 642.
Now that our data set is tidy (each row represents a unique observation), we can begin to visualize our data.
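As a small illustration of the reshaping step, the sketch below applies pivot_longer() to a made-up two-row table. The `wide` object and its prices are hypothetical and are not taken from the data set; only the column layout mirrors the one above.

```r
library(tidyr)

# Hypothetical wide table: one row per month, one column per product.
wide <- data.frame(
  Month = c("July", "August"),
  Year = c(2004, 2004),
  Whole = c(200, 210),
  Thighs = c(190, 195)
)

# Pivot the product columns into Product/Price pairs,
# so that each row holds exactly one price observation.
long <- pivot_longer(
  wide,
  cols = -c(Month, Year),
  names_to = "Product",
  values_to = "Price"
)

print(long)
```

Two months and two products yield four rows, one per month and product combination.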
organic %>%
ggplot(aes(as.character(Year), Price)) +
geom_boxplot() +
stat_boxplot(geom = "errorbar") +
theme_minimal() +
xlab("Year") +
labs(
title = "Poultry prices changed slightly over time with large range and outlier values",
subtitle = "Data from 2003-14"
)
By presenting our data in a box plot, we can see roughly how the median price (and price range) of poultry has changed over time. We also see large, consistent outlier values, which may cloud or distort our interpretation of the data. For a clearer picture of these price changes and outlier values, we can create a new visual that facets the data by product.
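To make the components of a box plot concrete, the sketch below computes them by hand in base R. The values in `prices` are hypothetical, chosen so that one value sits far above the rest. It assumes the default convention used by geom_boxplot(), where points more than 1.5 × IQR beyond the quartiles are drawn as outliers.

```r
# Hypothetical prices (not taken from the data set).
prices <- c(185, 186, 190, 217, 225, 230, 234, 241, 642)

# Quartiles and interquartile range.
q <- quantile(prices, c(0.25, 0.5, 0.75))
iqr <- IQR(prices)

# Fences: points beyond 1.5 * IQR from the quartiles count as outliers.
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
outliers <- prices[prices < lower_fence | prices > upper_fence]

# The line inside the box is the median; the mean is shown for comparison.
median_price <- q[["50%"]]
mean_price <- mean(prices)
```

In this toy vector the single large value is flagged as an outlier and pulls the mean well above the median, which is one reason a box plot alone can be hard to read when a few products are much more expensive than the rest.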
organic %>%
group_by(Year, Product) %>%
# we'll just look at mean value
summarize(Mean = mean(Price, na.rm = TRUE)) %>%
ggplot(aes(Year, Mean)) +
geom_smooth(aes(color = Product), show.legend = FALSE) +
facet_wrap(vars(Product)) +
theme_minimal() +
ylab("Mean Price") +
labs(
title = "Poultry price may depend on its product type",
subtitle = "Data from 2003-14"
)
From this visual, we can see how price has changed over time for each product. We can also see that the large outlier and range values in our box plot may result from certain chicken products, namely B/S Breast. However, the remaining products are too close together to compare visually, so we will further classify our data set to make it easier to see meaningful changes in price over time.
Given that outlier values appear mostly among chicken products, we can classify our products by category, as either chicken or eggs. We can hypothesize that a product’s price depends significantly on its category. For the remainder of this research, we will test this hypothesis: first by visualizing the change in mean price for chicken and egg products grouped by category, then by testing the significance of a product’s category on its mean price. We will conclude our tests by checking the statistical correlation between egg and chicken prices.
To start, we must add to our data set a new variable that captures product category.
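A minimal sketch of how such a variable could be derived is shown below. It rests on an assumption that is not stated in the data set itself: that every egg product is sold by the dozen, so its name contains "Doz", and that everything else is a chicken product. The `products` table is a hypothetical stand-in that lists only the product names.

```r
library(dplyr)

# Product names as they appear in the tidied data set.
products <- tibble(
  Product = c(
    "Extra Large Dozen", "Extra Large 1/2 Doz.", "Large Dozen",
    "Large 1/2 Doz.", "Whole", "B/S Breast", "Bone-in Breast",
    "Whole Legs", "Thighs"
  )
)

# Assumption: names containing "Doz" are egg products; the rest are chicken.
products <- products %>%
  mutate(Category = if_else(grepl("Doz", Product), "Eggs", "Chicken"))

count(products, Category)
```

Under this assumption, the same mutate() call could be applied to `organic` directly to label each observation.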
Now we can plot the mean price of chicken and egg products separately over time.
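One way such a plot could be built is sketched below. The `toy` table and its prices are made up for illustration, and the column names `Year`, `Category`, and `Price` are assumed to match the categorized data set.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the categorized data set (made-up prices).
toy <- tibble(
  Year = rep(c(2004, 2005, 2006), each = 4),
  Category = rep(c("Chicken", "Chicken", "Eggs", "Eggs"), times = 3),
  Price = c(330, 334, 185, 187, 336, 338, 186, 188, 340, 344, 190, 194)
)

# Mean price per year and category.
yearly <- toy %>%
  group_by(Year, Category) %>%
  summarize(Mean = mean(Price), .groups = "drop")

# One line per category, so the two trends can be compared directly.
p <- ggplot(yearly, aes(Year, Mean, color = Category)) +
  geom_line() +
  theme_minimal() +
  ylab("Mean Price")

p
```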
According to the plot above, mean prices for both chicken and eggs increased between 2004 and 2013, although we cannot yet tell whether the two categories are correlated. From here, we could pursue numerous related questions about the nature of these price changes. (For example, we could look more closely at the jump in the price of eggs from 2006-09 or the dip in chicken price in 2010. We could also forecast future prices for each category and test the accuracy of our predictions.) For this research, however, we will focus on our newly created Category variable. Together, the plots above make clear that mean product price differs greatly by product category, so we will conclude by verifying the significance of this relationship. To do so, we will start by testing a null hypothesis using a t-test.
H0: The price of a poultry product does not depend significantly on its category.
HA: The price of a poultry product depends significantly on its category.
We want to know if there is a significant difference in the mean price between chicken and egg products. In other words, we must test a numeric dependent variable (Price) against a categorical independent variable (Category), which has only two levels (chicken/eggs).
# IV - product category (chicken/eggs)
# DV - product price
# t.test(DV ~ IV, data)
t.test(Price ~ Category, organic)
Welch Two Sample t-test
data: Price by Category
t = 16.749, df = 710.47, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Chicken and group Eggs is not equal to 0
95 percent confidence interval:
113.3091 143.3996
sample estimates:
mean in group Chicken mean in group Eggs
339.9471 211.5928
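Because the resulting p-value (< 2.2e-16) is well below 0.05, we can reject our null hypothesis that a product’s category has no significant impact on its price. The mean price is about 340 for chicken products and about 212 for egg products, a difference of roughly 128 (95 percent confidence interval: 113.3 to 143.4). It therefore appears that a product’s price does depend significantly on its category.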
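The Welch statistic reported by t.test() can be reproduced by hand. In the output above, the difference in means is 339.9471 - 211.5928 ≈ 128.35, so t = 16.749 implies a standard error of roughly 7.66. The sketch below carries out the same calculation on two made-up vectors; `chicken` and `eggs` here are hypothetical values, not the data set.

```r
# Hypothetical prices for two groups (not taken from the data set).
chicken <- c(331, 336, 340, 345, 352, 338)
eggs <- c(185, 186, 190, 205, 230, 241, 235)

# Per-group variance of the mean; Welch's test does not pool variances.
v1 <- var(chicken) / length(chicken)
v2 <- var(eggs) / length(eggs)

# Test statistic: difference in means divided by its standard error.
t_manual <- (mean(chicken) - mean(eggs)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation of the degrees of freedom.
df_manual <- (v1 + v2)^2 /
  (v1^2 / (length(chicken) - 1) + v2^2 / (length(eggs) - 1))

# Two-sided p-value from the t distribution.
p_manual <- 2 * pt(-abs(t_manual), df_manual)

# R's t.test() runs the Welch test by default (var.equal = FALSE).
result <- t.test(chicken, eggs)
```

The hand-computed statistic, degrees of freedom, and p-value agree with the corresponding fields of `result`, which also explains why the degrees of freedom in the output above (710.47) are not a whole number.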
We’ve just seen that a product’s category has a significant influence on its price. For even more granular insight, we can explore whether the price of eggs was correlated with the price of chicken from 2004-13. In other words, we may ask: does the price of eggs depend significantly on the price of chicken?
To answer this question, we must first widen our data set according to the average price of chicken and egg products.
organic <- organic %>%
group_by(Year, Month, Category) %>%
summarize(Mean = mean(Price, na.rm = TRUE)) %>%
pivot_wider(names_from = Category, values_from = Mean)
print(organic)
# A tibble: 114 x 4
# Groups: Year, Month [114]
Year Month Chicken Eggs
<dbl> <chr> <dbl> <dbl>
1 2004 August 331. 186.
2 2004 December 331. 185.
3 2004 July 331. 186.
4 2004 November 331. 185.
5 2004 October 331. 185.
6 2004 September 331. 185.
7 2005 April 336. 185.
8 2005 August 336. 185.
9 2005 December 336. 185.
10 2005 February 336. 185.
# ... with 104 more rows
Now that our data set contains separate variables for the mean price of chicken and egg products, we can check how strongly they are correlated. Because we are testing the relationship between two numeric variables (the mean prices of chicken and eggs), we will use a bivariate linear model.
# Chicken: mean price of chicken
# Eggs: mean price of eggs
# calculate the correlation coefficient
cor(organic$Chicken, organic$Eggs)
[1] 0.6719833
# fit a bivariate linear model of egg price on chicken price
summary(lm(organic$Eggs ~ organic$Chicken))
Call:
lm(formula = organic$Eggs ~ organic$Chicken)
Residuals:
Min 1Q Median 3Q Max
-18.015 -14.515 -6.948 14.297 26.359
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -634.4238 88.1128 -7.200 7.36e-11 ***
organic$Chicken 2.4887 0.2592 9.603 2.73e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.08 on 112 degrees of freedom
Multiple R-squared: 0.4516, Adjusted R-squared: 0.4467
F-statistic: 92.22 on 1 and 112 DF, p-value: 2.727e-16
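According to our model, there is a statistically significant positive correlation between the price of eggs and the price of chicken (r ≈ 0.67, p = 2.7e-16). The model explains about 45 percent of the variance in mean egg price (R-squared = 0.4516), and each one-unit increase in mean chicken price is associated with an increase of roughly 2.49 in mean egg price.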
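In a model with a single predictor, the correlation coefficient and the linear model carry the same information: R-squared is the square of r (here 0.6719833² ≈ 0.4516, matching the output above), and the test of the slope gives the same p-value as a test of the correlation. The sketch below checks both identities on made-up values; `chicken` and `eggs` are hypothetical vectors, not the data set.

```r
# Hypothetical monthly mean prices (made up for illustration).
chicken <- c(331, 333, 336, 338, 341, 345, 348, 352)
eggs <- c(185, 188, 186, 195, 205, 199, 221, 230)

# Pearson correlation and the one-predictor linear model.
r <- cor(chicken, eggs)
fit <- lm(eggs ~ chicken)
fit_summary <- summary(fit)

# With one predictor, R-squared equals the squared correlation.
r_squared <- fit_summary$r.squared

# The slope's p-value matches the p-value of the correlation test.
slope_p <- coef(fit_summary)["chicken", "Pr(>|t|)"]
cor_p <- cor.test(chicken, eggs)$p.value
```

Passing the columns through the `data` argument, as in lm(Eggs ~ Chicken, data = organic), would give the same fit with shorter coefficient labels.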
In this research, we set out to explore how the price of poultry products changed over time. Because of the large range in values and noticeable outlier values, we viewed price changes separately by product. While this more detailed view uncovered the potential sources of these large outlier values, we still could not see the “big picture” of changes in the price of poultry over time. To clarify our data set, we added a new variable that classifies our observations into two categories: chicken and eggs. By grouping our data by this new category, we could see more intuitively and accurately how the average price of poultry products changed over time. Finally, this research concluded by testing the significance of our classification schema using a t-test and a bivariate linear model. In the end, our statistical tests confirmed our assumption that the price of poultry depends significantly on its category.
This research was an excellent learning exercise, especially in narrative data storytelling. Intuitively, my process was guided by the assumption that prices would depend significantly on product category. I am happy with the order in which this paper explores the price of poultry over time: first as a box plot that also displays median, range, and outlier values; then separately by product, which uncovers a possible connection between a product’s category and its price; and finally as a simplified comparison between these product categories. While our research question asks only about change in the price of poultry overall, we can provide the reader with a more informative data story by presenting our research as a comparison between the change in price across chicken and egg products.
While this research ends by validating the statistical significance of a product’s category on its price, we could have pursued this topic in a number of other directions. For example, we could explore how price ranges changed over time (with particular attention to 2007-08) or when prices for chicken and eggs reached their max values. More interestingly, we could explore if the price of poultry depends significantly on the month of the year.
While I am happy with this narrative overall, there are ways that I could improve this research. The box and faceted line plots help illustrate early steps and observations before we classify products by category, but they are still a little difficult to interpret. Although this difficulty does lead us to create a clearer, more intuitive comparison between chicken and egg prices, the visuals by themselves could be made more accessible for naive readers.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Wheeler (2022, Jan. 25). Data Analytics and Computational Social Science: Data Science Fundamentals Final Paper. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomabwheelfinal-paper/
BibTeX citation
@misc{wheeler2022data,
  author = {Wheeler, Adam},
  title = {Data Analytics and Computational Social Science: Data Science Fundamentals Final Paper},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomabwheelfinal-paper/},
  year = {2022}
}