Data Science Fundamentals Final Paper

Changes in the price of poultry between 2003-14

Adam Wheeler
2022-01-23

Introduction

This research project asks: how has the price of poultry changed over time? (For simplicity, the term poultry refers to both chicken and egg products.) Our data set shows in detail how prices compare each month for specific poultry products, but on its own it does not show in broad strokes how prices have changed over time, nor does it reveal potentially valuable patterns in those changes. For the big picture, we will rely on plots and visual interpretation. For more granular insight into poultry prices over time, we will rely on statistical methods.

Research Question

How has the price of poultry products changed over time?

Variables and Units of Measurement

Data Load

knitr::opts_chunk$set(echo = TRUE)

# tidyverse already attaches dplyr, tidyr, stringr, and ggplot2
library(tidyverse)
library(readxl)

organic <- read_excel(
  path = "../Downloads/organiceggpoultry.xlsx",
  skip = 4
)

head(organic)
# A tibble: 6 x 11
  ...1     `Extra Large \nDozen` `Extra Large 1/2 Doz~ `Large \nDozen`
  <chr>                    <dbl>                 <dbl>           <dbl>
1 Jan 2004                  230                   132             230 
2 February                  230                   134.            226.
3 March                     230                   137             225 
4 April                     234.                  137             225 
5 May                       236                   137             225 
6 June                      241                   137             231.
# ... with 7 more variables: `Large \n1/2 Doz.` <dbl>, ...6 <lgl>,
#   Whole <dbl>, `B/S Breast` <dbl>, `Bone-in Breast` <chr>,
#   `Whole Legs` <dbl>, Thighs <chr>

Data Recode

# let's tidy the column names before turning to the values.
# start by replacing line breaks with spaces
names(organic) <- gsub("\n", " ", names(organic))
# give the first column a name so we can refer to it in mutate()
# (not strictly required, but it makes things more legible)
colnames(organic)[1] <- "Date"
# fix the typo in the third column name
colnames(organic)[3] <- "Extra Large 1/2 Doz."
organic <- organic %>%
  # drop empty column
  select(-6) %>%
  # ok, now let's tidy the values...
  # strip the trailing three characters from Date entries that contain a slash
  mutate(Date = ifelse(grepl("/", Date), gsub(".{3}$", "", Date), Date)) %>%
  # separate Date into month and year variables
  separate(Date, sep = " ", into = c("Month", "Year")) %>%
  # expand the abbreviated month name (anchored so an existing "January" is left alone)
  mutate(Month = str_replace(Month, "^Jan$", "January")) %>%
  # fill in missing year values
  fill(Year) %>%
  # cast variables as numeric values (non-numeric strings become NA)
  mutate(Year = as.numeric(Year)) %>%
  mutate(across(3:11, as.numeric)) %>%
  # then drop every month with a missing price, which should be ok for our research
  drop_na() %>%
  # pivot the product columns so that each row is one observation: one product in one month
  pivot_longer(-c(Month, Year), names_to = "Product", values_to = "Price")

head(organic)
# A tibble: 6 x 4
  Month  Year Product              Price
  <chr> <dbl> <chr>                <dbl>
1 July   2004 Extra Large  Dozen    241 
2 July   2004 Extra Large 1/2 Doz.  137 
3 July   2004 Large  Dozen          234.
4 July   2004 Large  1/2 Doz.       134.
5 July   2004 Whole                 217 
6 July   2004 B/S Breast            642.
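
The date cleanup above is easier to follow with concrete values. The sketch below replays the same steps (strip the trailing marker, split month from year, expand the abbreviation, fill the year downward) in base R on a small vector. The values in `raw` are hypothetical and chosen only for illustration; they are not rows copied from the spreadsheet.

```r
# Hypothetical raw Date values (assumed for illustration only)
raw <- c("Jan 2004", "February", "July /1", "Jan 2005", "February")

# 1. Drop the last three characters from entries that contain a slash
cleaned <- ifelse(grepl("/", raw), gsub(".{3}$", "", raw), raw)

# 2. Split into month and (possibly missing) year; "" becomes NA
month <- sub(" .*$", "", cleaned)
year  <- as.numeric(sub("^[^ ]+ ?", "", cleaned))

# 3. Expand the abbreviation; the anchored pattern leaves "January" untouched
month <- sub("^Jan$", "January", month)

# 4. Carry the last seen year forward, which is what tidyr::fill() does
last_seen <- cummax(ifelse(is.na(year), 0, seq_along(year)))
year <- year[last_seen]

print(data.frame(Month = month, Year = year))
```

Anchoring the pattern with `^` and `$` matters only if a cell already reads "January"; an unanchored replacement would turn it into "Januaryuary".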

Descriptive Statistics

Now that our data set is tidy (each row represents a unique observation), we can begin to visualize our data.

organic %>%
  ggplot(aes(as.character(Year), Price)) +
  geom_boxplot() +
  stat_boxplot(geom = "errorbar") +
  theme_minimal() +
  xlab("Year") +
  labs(
    title = "Poultry prices changed slightly over time with large range and outlier values",
    subtitle = "Data from 2003-14"
  )

Presenting our data in a box plot gives a rough sense of how the median price (and price range) of poultry has changed over time, but it also shows large, consistent outlier values, which may cloud or distort our interpretation of the data. For a clearer picture of these price changes and outlier values, we can create a new visual that facets the data by product.
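
As a rough guide to what each box summarizes, the sketch below uses base R's `boxplot.stats()` on a made-up price vector (the numbers are assumptions, not values from our data set). It returns the whisker ends, the two hinges, the median, and any points flagged as outliers. Note that ggplot2 computes its box edges as quartiles, so they can differ slightly from these hinges.

```r
# Hypothetical prices for a single year (assumed for illustration only)
prices <- c(130, 135, 140, 220, 225, 230, 235, 640)

s <- boxplot.stats(prices)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
print(s$stats)

# out: points more than 1.5 times the hinge spread beyond a hinge
print(s$out)
```

A single expensive product is enough to produce an outlier every year, which is the pattern the box plot above shows.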

organic %>%
  group_by(Year, Product) %>%
  # we'll just look at mean value
  summarize(Mean = mean(Price, na.rm = TRUE)) %>%
  ggplot(aes(Year, Mean)) +
  geom_smooth(aes(color = Product), show.legend = FALSE) +
  facet_wrap(vars(Product)) +
  theme_minimal() +
  ylab("Mean Price") +
  labs(
    title = "Poultry price may depend on its product type",
    subtitle = "Data from 2003-14"
  )

From this visual, we can see how price has changed over time for each product. We can also see that the large outlier and range values in our box plot may result from certain chicken products, namely B/S Breast. However, the remaining products sit too close together to compare visually, so we will further classify our data set to make meaningful changes in price over time easier to see.

Given that outlier values appear mostly among chicken products, we can classify our products according to their category as chicken or eggs. We can hypothesize that a product’s price depends significantly on its category. For the remainder of this research, we will test this hypothesis: first by visualizing the change in mean price for chicken and egg products as two groups, then by testing whether mean price differs significantly by category. We will conclude our tests by checking the statistical correlation between egg and chicken prices.

Grouped Comparison

To start, we must add to our data set a new variable that captures product category.

organic <- organic %>%
  # add a category variable
  mutate(
    Category = if_else(
      grepl("Doz", Product), "Eggs","Chicken"
    )
  ) %>%
  # convert Category from string to factor variable
  mutate(Category = factor(Category, c("Chicken", "Eggs")))
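
To check the classification rule, the sketch below applies the same `grepl("Doz", ...)` test to the nine product names that appear in the output above, using base R's `ifelse()` in place of `if_else()`. The `products` vector is typed in by hand here as an assumption about the final column names.

```r
# Product names as they appear after the recode step (entered by hand)
products <- c(
  "Extra Large  Dozen", "Extra Large 1/2 Doz.", "Large  Dozen", "Large  1/2 Doz.",
  "Whole", "B/S Breast", "Bone-in Breast", "Whole Legs", "Thighs"
)

# Anything sold by the dozen or half dozen is an egg product
category <- ifelse(grepl("Doz", products), "Eggs", "Chicken")
category <- factor(category, levels = c("Chicken", "Eggs"))

print(data.frame(Product = products, Category = category))
print(table(category))
```

Listing "Chicken" first in the factor levels also fixes the direction of any later comparison: differences are reported as Chicken minus Eggs.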

Now we can plot the mean price of chicken and egg products separately over time.

organic %>%
  group_by(Year, Category) %>%
  summarize(Mean = mean(Price, na.rm = TRUE)) %>%
  ggplot(aes(Year, Mean)) +
  geom_smooth(aes(color = Category)) +
  theme_minimal() +
  ylab("Mean Price") +
  labs(
    title = "Poultry price changed over time according to category",
    subtitle = "Data from 2003-14"
  )

According to the plot above, mean prices for both chicken and eggs increased between 2004 and 2013, although we cannot yet tell whether the two categories are correlated. From here, we could pursue numerous related questions about the nature of these price changes. (For example, we could look more closely at the jump in the price of eggs from 2006-09 or the dip in chicken price in 2010. We could also forecast future prices for each category and test the accuracy of our predictions.) For this research, however, we will focus on our newly created Category variable. Together, the plots above suggest that mean product price differs greatly by product category, so we will conclude by testing the significance of this relationship. To do so, we will start by testing a null hypothesis with a t-test.

Null Hypothesis

H0: The price of a poultry product does not depend significantly on its category.

HA: The price of a poultry product depends significantly on its category.

Significant Difference

We want to know if there is a significant difference in the mean price between chicken and egg products. In other words, we must test a numeric dependent variable (Price) against a categorical independent variable (Category), which has only two levels (chicken/eggs).

# IV - product category (chicken/eggs)
# DV - product price
# t.test(DV ~ IV, data)
t.test(Price ~ Category, organic)

    Welch Two Sample t-test

data:  Price by Category
t = 16.749, df = 710.47, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Chicken and group Eggs is not equal to 0
95 percent confidence interval:
 113.3091 143.3996
sample estimates:
mean in group Chicken    mean in group Eggs 
             339.9471              211.5928 

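Because our resulting p-value is well below 0.05, we can reject our null hypothesis that a product’s category has no significant impact on its price. The mean chicken price (339.95) exceeds the mean egg price (211.59), and the 95 percent confidence interval for the difference (113.31 to 143.40) lies well above zero. It appears that a product’s price does depend significantly on its category.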
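
To make the Welch test above less of a black box, here is a minimal sketch that recomputes the t statistic and degrees of freedom by hand and compares them with `t.test()`. The two vectors are simulated stand-ins (their means and spreads are assumptions chosen only to resemble the scale of the output above), and `welch()` is a hypothetical helper written for this example, not part of any package.

```r
set.seed(1)
# Simulated stand-ins for the two price groups (not the real data)
chicken <- rnorm(40, mean = 340, sd = 120)
eggs    <- rnorm(30, mean = 210, sd = 40)

# Hypothetical helper: Welch's t statistic and Welch-Satterthwaite df
welch <- function(x, y) {
  vx <- var(x) / length(x)   # squared standard error of mean(x)
  vy <- var(y) / length(y)   # squared standard error of mean(y)
  t  <- (mean(x) - mean(y)) / sqrt(vx + vy)
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  p  <- 2 * pt(-abs(t), df)  # two-sided p-value
  c(t = t, df = df, p.value = p)
}

manual  <- welch(chicken, eggs)
builtin <- t.test(chicken, eggs)  # var.equal = FALSE by default, i.e. Welch

print(manual)
print(all.equal(unname(manual["t"]),  unname(builtin$statistic)))
print(all.equal(unname(manual["df"]), unname(builtin$parameter)))
```

Because the test does not assume equal variances, the degrees of freedom are usually fractional, which is why the output above reports df = 710.47.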

Correlation Check

We’ve just seen that a product’s category has a significant influence on its price. For even more granular insight, we can explore whether the price of eggs and the price of chicken were correlated between 2004 and 2013. In other words, we may ask: does the price of eggs depend significantly on the price of chicken?

To answer this question, we must first widen our data set so that the average monthly prices of chicken and egg products sit in separate columns.

organic <- organic %>%
  group_by(Year, Month, Category) %>%
  summarize(Mean = mean(Price, na.rm = TRUE)) %>%
  pivot_wider(names_from = Category, values_from = Mean)
print(organic)
# A tibble: 114 x 4
# Groups:   Year, Month [114]
    Year Month     Chicken  Eggs
   <dbl> <chr>       <dbl> <dbl>
 1  2004 August       331.  186.
 2  2004 December     331.  185.
 3  2004 July         331.  186.
 4  2004 November     331.  185.
 5  2004 October      331.  185.
 6  2004 September    331.  185.
 7  2005 April        336.  185.
 8  2005 August       336.  185.
 9  2005 December     336.  185.
10  2005 February     336.  185.
# ... with 104 more rows
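
The months in the output above sort alphabetically because Month is a character column. That ordering does not affect the correlation or the regression, but it can make the table harder to read. As a sketch of the same widening step in base R with chronological ordering, assuming a tiny made-up `long` table in place of our data set:

```r
# Hypothetical long-format prices for two months of one year (illustration only)
long <- data.frame(
  Month    = rep(c("July", "August"), each = 4),
  Category = rep(c("Chicken", "Chicken", "Eggs", "Eggs"), times = 2),
  Price    = c(200, 600, 240, 130, 210, 610, 240, 132)
)

# Order months by the calendar rather than the alphabet
long$Month <- droplevels(factor(long$Month, levels = month.name))

# One row per month, one column per category, cells hold the mean price
wide <- with(long, tapply(Price, list(Month, Category), mean))
print(wide)
```

With several years of data, Year would need to be part of the grouping as well; it is left out here only because the toy table covers a single year.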

Now that our data set contains separate variables for the mean price of chicken and egg products, we can check for their correlation. Because we are testing the relationship between two numeric variables (mean prices of chicken and eggs), we will use a bivariate linear model.

# Chicken: mean price of chicken
# Eggs: mean price of eggs
# calculate the correlation coefficient
cor(organic$Chicken, organic$Eggs)
[1] 0.6719833
# estimate bivariate linear model
# lm(DV ~ IV)
summary(lm(organic$Eggs ~ organic$Chicken))

Call:
lm(formula = organic$Eggs ~ organic$Chicken)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.015 -14.515  -6.948  14.297  26.359 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -634.4238    88.1128  -7.200 7.36e-11 ***
organic$Chicken    2.4887     0.2592   9.603 2.73e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.08 on 112 degrees of freedom
Multiple R-squared:  0.4516,    Adjusted R-squared:  0.4467 
F-statistic: 92.22 on 1 and 112 DF,  p-value: 2.727e-16

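According to our model, there is a statistically significant positive correlation between the price of eggs and the price of chicken: the correlation coefficient is about 0.67, the slope on chicken price is 2.49 (p = 2.73e-16), and chicken price explains roughly 45 percent of the variance in egg price (R-squared = 0.4516).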
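
The correlation coefficient and the bivariate model above are two views of the same relationship: with one predictor, R-squared equals the squared correlation (0.6719833 squared is about 0.4516, matching the summary), and the slope equals the correlation scaled by the ratio of standard deviations. The sketch below checks both identities on simulated data; the vectors are assumptions generated for illustration, not our actual prices.

```r
set.seed(42)
# Simulated stand-ins for the monthly mean prices (not the real data)
chicken <- rnorm(114, mean = 340, sd = 6)
eggs    <- -630 + 2.5 * chicken + rnorm(114, sd = 16)

r   <- cor(chicken, eggs)
fit <- lm(eggs ~ chicken)

# Identity 1: R-squared of a one-predictor model is the squared correlation
print(all.equal(summary(fit)$r.squared, r^2))

# Identity 2: slope = r * sd(y) / sd(x)
slope_from_r <- r * sd(eggs) / sd(chicken)
print(all.equal(unname(coef(fit)["chicken"]), slope_from_r))
```

This is also why the p-value for the slope and the p-value of the overall F-test are identical in the summary above.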

Conclusion

In this research, we set out to explore how the price of poultry products changed over time. Because of the large range in values and noticeable outlier values, we viewed price changes separately by product. While this more detailed view uncovered the potential sources of these large outlier values, we still could not see the “big picture” of changes in the price of poultry over time. To clarify our data set, we added a new variable that classifies our observations into two categories: chicken and eggs. By grouping our data by this new category, we could see more intuitively and accurately how the average price of poultry products changed over time. Finally, this research concluded by testing the significance of our classification schema using a t-test and a bivariate linear model. In the end, our statistical tests supported our assumption that the price of poultry depends significantly on its category.

Reflection

This research was an excellent learning exercise, especially in narrative data storytelling. My process was guided by the intuition that prices would depend significantly on product category. I am happy with the order of operations with which this paper explores the price of poultry over time: first as a box plot that displays median, range, and outlier values; then separately by product, which uncovers a possible connection between a product’s category and its price; and finally as a simplified comparison between these product categories. While our research question asks only about change in the price of poultry overall, we can provide the reader with a more intelligent data story by presenting our research as a comparison between the change in price across chicken and egg products.

While this research ends by validating the statistical significance of a product’s category on its price, we could have pursued this topic in a number of other directions. For example, we could explore how price ranges changed over time (with particular attention to 2007-08) or when prices for chicken and eggs reached their max values. More interestingly, we could explore if the price of poultry depends significantly on the month of the year.

Although I am happy with this narrative overall, there are ways that I could improve this research. The box and faceted line plots help illustrate early observations before we classify products by category, but they are still a little difficult to interpret. That difficulty does lead us to create a clearer, more intuitive comparison between chicken and egg prices, yet the visuals by themselves could be made more accessible for readers new to the topic.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Wheeler (2022, Jan. 25). Data Analytics and Computational Social Science: Data Science Fundamentals Final Paper. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomabwheelfinal-paper/

BibTeX citation

@misc{wheeler2022data,
  author = {Wheeler, Adam},
  title = {Data Analytics and Computational Social Science: Data Science Fundamentals Final Paper},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomabwheelfinal-paper/},
  year = {2022}
}