Data Analytics and Computational Social Science: HW6

Adam Wheeler

Research question

Does the price of poultry/eggs depend on the month of the year?

NULL and alternate hypotheses

H₀: The price of poultry/eggs does not depend significantly on the month of year.

H_A: The month of the year may significantly influence the price of poultry/eggs.

Bibliography

SOURCE U.S. Department of Agriculture, Agricultural Marketing Service (AMS) Market News, Organic Poultry and Eggs (Weekly reports)
R for Data Science
ANOVA in R | Scribbr
ANOVA in R | R-bloggers

Steps

Load the data set
Tidy and recode the data set
Define variables and units of measurement
Compute and visualize descriptive statistics
Check for equal variance across categories
Test for significant difference across group means
Check for normality of residuals

Load the data set

knitr::opts_chunk$set(echo = TRUE)

library(dplyr)
library(tidyverse)
library(readxl)

organic <- read_excel(
  path = "../Downloads/organiceggpoultry.xlsx",
  # fortunately we know how much header content to skip
  skip = 4
)

head(organic)

# A tibble: 6 x 11
  ...1     `Extra Large \nDozen` `Extra Large 1/2 Doz~ `Large \nDozen`
  <chr>                    <dbl>                 <dbl>           <dbl>
1 Jan 2004                  230                   132             230 
2 February                  230                   134.            226.
3 March                     230                   137             225 
4 April                     234.                  137             225 
5 May                       236                   137             225 
6 June                      241                   137             231.
# ... with 7 more variables: Large 
1/2 Doz. <dbl>, ...6 <lgl>,
#   Whole <dbl>, B/S Breast <dbl>, Bone-in Breast <chr>,
#   Whole Legs <dbl>, Thighs <chr>

Tidy and recode the data

# given few variables in the data set, i decided
# to just split and use month as
# a factor variable for this project, largely as a practical recoding exercise.
# lets tidy the col names before turning to the values.
# start by removing line breaks
names(organic) <- gsub("\n", " ", names(organic))
# give the first column a name to make it mutable
# (we don't have to but makes things more legible)
colnames(organic)[1] = "Date"
# remove typo from the third column name
colnames(organic)[3] = "Extra Large 1/2 Doz."
organic <- organic %>%
  # drop empty column
  select(-6) %>%
  # ok now lets tidy the values...
  # remove weird extra characters from Date
  mutate(Date = ifelse(grepl("/", Date), gsub(".{3}$", "", Date), Date)) %>%
  # separate Date into month and year variables
  separate(Date, sep = " ", into = c("Month", "Year")) %>%
  # fix abbreviated month name
  mutate(Month = str_replace(Month, "Jan", "January")) %>%
  # convert Month from string to factor variable
  # looks like r already has month.name levels that we cane use, so we don't need to define them again
  mutate(Month = factor(Month, month.name)) %>%
  # fill in missing year values
  fill(Year) %>%
  # cast variables as numeric values (also replaces strings with na)
  mutate(Year = as.numeric(Year)) %>%
  mutate(across(3:11, as.numeric)) %>%
  # then drop na values, which should be ok for our research
  drop_na() %>%
  # pivot descriptive columns to make every row a unique observation of unique variables
  pivot_longer(-c(Month, Year)) %>%
  rename(Product = name) %>%
  rename(Price = value) %>%
  # add a category column
  mutate(
    Category = if_else(
      grepl("Doz", Product), "Eggs","Chicken"
    )
  ) %>%
  # convert Category from string to factor variable
  mutate(Category = factor(Category, c("Chicken", "Eggs"))) %>%
  # we don't need Year or Product anymore
  select(-c(Year, Product)) %>%
  arrange(Month)

head(organic)

# A tibble: 6 x 3
  Month   Price Category
  <fct>   <dbl> <fct>   
1 January  241  Eggs    
2 January  136. Eggs    
3 January  234. Eggs    
4 January  128. Eggs    
5 January  217  Chicken 
6 January  644  Chicken

Define variables and units of measurement

Month: Calendar month of year
Category:
- Chicken: USDA Certified Organic young chicken
  - Extra Large Dozen
  - Extra Large 1/2 Dozen
  - Large Dozen
  - Large 1/2 Dozen
- Eggs: USDA Certified Organic eggs
  - Whole
  - B/S/ Breast
  - Bone-in Breast
  - Whole Legs
  - Thighs
Price, Mean Price:
- Chicken: Price cents per pound Delivered to First Receivers
- Eggs: Price per Carton Delivered to First Receivers

Compute and visualize descriptive statistics

We want to know if there is a significant difference in the mean between 12 groups. It’s helpful to start by visualizing these mean values.

# create a new variable for a plottable data set
organic.plot <- organic %>%
  # abbreviate month names for legibility in plot
  mutate(Month = case_when(
    Month == "January" ~ "01",
    Month == "February" ~ "02",
    Month == "March" ~ "03",
    Month == "April" ~ "04",
    Month == "May" ~ "05",
    Month == "June" ~ "06",
    Month == "July" ~ "07",
    Month == "August" ~ "08",
    Month == "September" ~ "09",
    Month == "October" ~ "10",
    Month == "November" ~ "11",
    Month == "December" ~ "12",
    TRUE ~ "0"
  ))
# create a bar plot of Mean price, using "identity" statistic
organic.plot %>%
  group_by(Month, Category) %>%
  summarize(Mean = mean(Price, na.rm = TRUE)) %>%
  ggplot(aes(Month, Mean, fill = Category)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(x = Month, ymin = Mean - sd(Mean), ymax = Mean + sd(Mean), width = 0.5)) +
  # initially i added a "free_y" scales param here
#   but chose to remove it because it makes
# the mean price look equal at a glance for chicken and eggs
  facet_wrap(vars(Category)) +
  theme_minimal() +
  ylab("Mean Price") +
  labs(
    title = "Mean Price is similar across calendar Month",
    subtitle = "Data from 2003 to 2014"
  )

At a glance, there does not appear to be much difference in mean price by month of year. However, we can perform a more rigorous test of our NULL hypothesis using ANOVA (ANalysis Of VAriance). We’ve chosen anova() for our hypothesis test because we’re treating Month as a factor variable, meaning we are comparing a numerical dependent variable (Price) against a categorical dependent variable (Month), which has a total of 12 possible levels. However, anova() carries three key assumptions: * Independence of variables * Normalcy of residuals (difference between an observed Price value and its corresponding group - Month - mean) * ** Equality of variances**

For this project, we assume the independence of variables. Instead, we will explore the following two assumptions using visual plots.

Check for equal variance across groups

organic.plot %>%
  ggplot(aes(Month, Price, fill = Category, ymin = length(Price) - sd(Price), ymax = length(Price) + sd(Price))) +
  stat_boxplot(geom = "errorbar") +
  geom_boxplot() +
  ylab("Mean Price") +
  labs(
    title = "Price variance is similar across calendar Month but differs across Category",
    subtitle = "Data from 2003 to 2014"
  ) +
  facet_wrap(vars(Category)) +
  theme_minimal()

Viewing chicken and eggs separately, products appear to have similar price dispersion across calendar months. However, there is little question that descriptive statistics differ drastically across product category (chicken or eggs).

Test for significant difference across group means

From the visualization above, we know that the Category variable may influence the results to our NULL hypothesis test. That said, we also know that we have approximately equal variance across our analysis groups (Month), as long as we view chicken and eggs separately. For this reason, we will add Category as a blocking variable (” + Category”) to the linear model that anova() takes as its param.

# anova() takes as param linear model lm()
model <- lm(Price ~ Month + Category, organic)
anova(model)

Analysis of Variance Table

Response: Price
            Df   Sum Sq Mean Sq  F value Pr(>F)    
Month       11      543      49   0.0027      1    
Category     1  4173625 4173625 231.5171 <2e-16 ***
Residuals 1013 18261640   18027                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# inspect the results
summary(model)


Call:
lm(formula = Price ~ Month + Category, data = organic)

Residuals:
    Min      1Q  Median      3Q     Max 
-140.75 -117.68  -26.29   51.68  365.18 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     338.5727    15.3823  22.011   <2e-16 ***
MonthFebruary     1.1111    21.0978   0.053    0.958    
MonthMarch        1.1481    21.0978   0.054    0.957    
MonthApril        1.0802    21.0978   0.051    0.959    
MonthMay          0.9846    21.0978   0.047    0.963    
MonthJune         3.4537    21.0978   0.164    0.870    
MonthJuly         1.6724    20.5636   0.081    0.935    
MonthAugust       1.4807    20.5636   0.072    0.943    
MonthSeptember    1.3265    20.5636   0.065    0.949    
MonthOctober      1.3029    20.5636   0.063    0.949    
MonthNovember     1.4433    20.5636   0.070    0.944    
MonthDecember     1.4433    20.5636   0.070    0.944    
CategoryEggs   -128.3543     8.4357 -15.216   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 134.3 on 1013 degrees of freedom
Multiple R-squared:  0.186, Adjusted R-squared:  0.1764 
F-statistic:  19.3 on 12 and 1013 DF,  p-value: < 2.2e-16

For our purposes, we are concerned with Category row results. Our degrees of freedom (Df) corresponds with our expectations (12 months - 1). A small F value says that any variation in Price is most likely due to chance instead of the independent variable (Month). A Pr(>F) value well above 0.05 tells us that our F value likely would have occurred if our NULL hypothesis were true. In other words, this data do not reject our NULL hypothesis, which is that the price of poultry/eggs does not depend significantly on the month of the year.

Check for normality of residuals

Finally, we will test the last assumption for the anova() model above: the normality of the model’s residuals. (Residuals refer to the difference between an actual observed Price value and the mean of its group (Month).) We can visually check the distribution/normalcy of these residuals using a histogram, which takes as its x aesthetic the residuals column of the model variable created above.

# source for these steps to plotting residuals: https://www.r-bloggers.com/2020/10/anova-in-r/
hist(
  model$residuals,
  main = "Model residuals skew right",
  xlab = "Residuals",
  col = "white"
  )

We can see above that residuals skew to the right instead of following a more normal distribution. This project does not go on to correct for this concern, partly because our findings did not reject our NULL hypothesis, and partly because I just have more to learn.

I am happy with the overall narrative I was able to achieve here, but it still needs work. Before the final submission, I’ll add an introduction and conclusion, and will clean up the existing text (including adding some rich formatting for clarity). I could also add additional geom objects to visuals for emphasis.

Also, and most importantly, I’m mindful that this is very much a learning exercise. Namely, I can see why using calendar month just as a factor variable is problematic. My focus was more homed on working out the process and narrative.

Comment on this article Share:

HW6