Homework 6
Does the price of poultry/eggs depend on the month of the year?
H0: The price of poultry/eggs does not depend significantly on the month of year.
HA: The month of the year may significantly influence the price of poultry/eggs.
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(tidyverse)
library(readxl)
organic <- read_excel(
path = "../Downloads/organiceggpoultry.xlsx",
# fortunately we know how much header content to skip
skip = 4
)
head(organic)
# A tibble: 6 x 11
...1 `Extra Large \nDozen` `Extra Large 1/2 Doz~ `Large \nDozen`
<chr> <dbl> <dbl> <dbl>
1 Jan 2004 230 132 230
2 February 230 134. 226.
3 March 230 137 225
4 April 234. 137 225
5 May 236 137 225
6 June 241 137 231.
# ... with 7 more variables: Large
1/2 Doz. <dbl>, ...6 <lgl>,
# Whole <dbl>, B/S Breast <dbl>, Bone-in Breast <chr>,
# Whole Legs <dbl>, Thighs <chr>
# given few variables in the data set, i decided
# to just split and use month as
# a factor variable for this project, largely as a practical recoding exercise.
# lets tidy the col names before turning to the values.
# start by removing line breaks
names(organic) <- gsub("\n", " ", names(organic))
# give the first column a name to make it mutable
# (we don't have to but makes things more legible)
colnames(organic)[1] = "Date"
# remove typo from the third column name
colnames(organic)[3] = "Extra Large 1/2 Doz."
organic <- organic %>%
# drop empty column
select(-6) %>%
# ok now lets tidy the values...
# remove weird extra characters from Date
mutate(Date = ifelse(grepl("/", Date), gsub(".{3}$", "", Date), Date)) %>%
# separate Date into month and year variables
separate(Date, sep = " ", into = c("Month", "Year")) %>%
# fix abbreviated month name
mutate(Month = str_replace(Month, "Jan", "January")) %>%
# convert Month from string to factor variable
# looks like r already has month.name levels that we cane use, so we don't need to define them again
mutate(Month = factor(Month, month.name)) %>%
# fill in missing year values
fill(Year) %>%
# cast variables as numeric values (also replaces strings with na)
mutate(Year = as.numeric(Year)) %>%
mutate(across(3:11, as.numeric)) %>%
# then drop na values, which should be ok for our research
drop_na() %>%
# pivot descriptive columns to make every row a unique observation of unique variables
pivot_longer(-c(Month, Year)) %>%
rename(Product = name) %>%
rename(Price = value) %>%
# add a category column
mutate(
Category = if_else(
grepl("Doz", Product), "Eggs","Chicken"
)
) %>%
# convert Category from string to factor variable
mutate(Category = factor(Category, c("Chicken", "Eggs"))) %>%
# we don't need Year or Product anymore
select(-c(Year, Product)) %>%
arrange(Month)
head(organic)
# A tibble: 6 x 3
Month Price Category
<fct> <dbl> <fct>
1 January 241 Eggs
2 January 136. Eggs
3 January 234. Eggs
4 January 128. Eggs
5 January 217 Chicken
6 January 644 Chicken
We want to know if there is a significant difference in the mean between 12 groups. It’s helpful to start by visualizing these mean values.
# create a new variable for a plottable data set
organic.plot <- organic %>%
# abbreviate month names for legibility in plot
mutate(Month = case_when(
Month == "January" ~ "01",
Month == "February" ~ "02",
Month == "March" ~ "03",
Month == "April" ~ "04",
Month == "May" ~ "05",
Month == "June" ~ "06",
Month == "July" ~ "07",
Month == "August" ~ "08",
Month == "September" ~ "09",
Month == "October" ~ "10",
Month == "November" ~ "11",
Month == "December" ~ "12",
TRUE ~ "0"
))
# create a bar plot of Mean price, using "identity" statistic
organic.plot %>%
group_by(Month, Category) %>%
summarize(Mean = mean(Price, na.rm = TRUE)) %>%
ggplot(aes(Month, Mean, fill = Category)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(x = Month, ymin = Mean - sd(Mean), ymax = Mean + sd(Mean), width = 0.5)) +
# initially i added a "free_y" scales param here
# but chose to remove it because it makes
# the mean price look equal at a glance for chicken and eggs
facet_wrap(vars(Category)) +
theme_minimal() +
ylab("Mean Price") +
labs(
title = "Mean Price is similar across calendar Month",
subtitle = "Data from 2003 to 2014"
)
At a glance, there does not appear to be much difference in mean price by month of year. However, we can perform a more rigorous test of our NULL hypothesis using ANOVA (ANalysis Of VAriance). We’ve chosen anova() for our hypothesis test because we’re treating Month as a factor variable, meaning we are comparing a numerical dependent variable (Price) against a categorical dependent variable (Month), which has a total of 12 possible levels. However, anova() carries three key assumptions: * Independence of variables * Normalcy of residuals (difference between an observed Price value and its corresponding group - Month - mean) * ** Equality of variances**
For this project, we assume the independence of variables. Instead, we will explore the following two assumptions using visual plots.
organic.plot %>%
ggplot(aes(Month, Price, fill = Category, ymin = length(Price) - sd(Price), ymax = length(Price) + sd(Price))) +
stat_boxplot(geom = "errorbar") +
geom_boxplot() +
ylab("Mean Price") +
labs(
title = "Price variance is similar across calendar Month but differs across Category",
subtitle = "Data from 2003 to 2014"
) +
facet_wrap(vars(Category)) +
theme_minimal()
Viewing chicken and eggs separately, products appear to have similar price dispersion across calendar months. However, there is little question that descriptive statistics differ drastically across product category (chicken or eggs).
From the visualization above, we know that the Category variable may influence the results to our NULL hypothesis test. That said, we also know that we have approximately equal variance across our analysis groups (Month), as long as we view chicken and eggs separately. For this reason, we will add Category as a blocking variable (” + Category”) to the linear model that anova() takes as its param.
# anova() takes as param linear model lm()
model <- lm(Price ~ Month + Category, organic)
anova(model)
Analysis of Variance Table
Response: Price
Df Sum Sq Mean Sq F value Pr(>F)
Month 11 543 49 0.0027 1
Category 1 4173625 4173625 231.5171 <2e-16 ***
Residuals 1013 18261640 18027
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# inspect the results
summary(model)
Call:
lm(formula = Price ~ Month + Category, data = organic)
Residuals:
Min 1Q Median 3Q Max
-140.75 -117.68 -26.29 51.68 365.18
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 338.5727 15.3823 22.011 <2e-16 ***
MonthFebruary 1.1111 21.0978 0.053 0.958
MonthMarch 1.1481 21.0978 0.054 0.957
MonthApril 1.0802 21.0978 0.051 0.959
MonthMay 0.9846 21.0978 0.047 0.963
MonthJune 3.4537 21.0978 0.164 0.870
MonthJuly 1.6724 20.5636 0.081 0.935
MonthAugust 1.4807 20.5636 0.072 0.943
MonthSeptember 1.3265 20.5636 0.065 0.949
MonthOctober 1.3029 20.5636 0.063 0.949
MonthNovember 1.4433 20.5636 0.070 0.944
MonthDecember 1.4433 20.5636 0.070 0.944
CategoryEggs -128.3543 8.4357 -15.216 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 134.3 on 1013 degrees of freedom
Multiple R-squared: 0.186, Adjusted R-squared: 0.1764
F-statistic: 19.3 on 12 and 1013 DF, p-value: < 2.2e-16
For our purposes, we are concerned with Category row results. Our degrees of freedom (Df) corresponds with our expectations (12 months - 1). A small F value says that any variation in Price is most likely due to chance instead of the independent variable (Month). A Pr(>F) value well above 0.05 tells us that our F value likely would have occurred if our NULL hypothesis were true. In other words, this data do not reject our NULL hypothesis, which is that the price of poultry/eggs does not depend significantly on the month of the year.
Finally, we will test the last assumption for the anova() model above: the normality of the model’s residuals. (Residuals refer to the difference between an actual observed Price value and the mean of its group (Month).) We can visually check the distribution/normalcy of these residuals using a histogram, which takes as its x aesthetic the residuals column of the model variable created above.
# source for these steps to plotting residuals: https://www.r-bloggers.com/2020/10/anova-in-r/
hist(
model$residuals,
main = "Model residuals skew right",
xlab = "Residuals",
col = "white"
)
We can see above that residuals skew to the right instead of following a more normal distribution. This project does not go on to correct for this concern, partly because our findings did not reject our NULL hypothesis, and partly because I just have more to learn.
I am happy with the overall narrative I was able to achieve here, but it still needs work. Before the final submission, I’ll add an introduction and conclusion, and will clean up the existing text (including adding some rich formatting for clarity). I could also add additional geom objects to visuals for emphasis.
Also, and most importantly, I’m mindful that this is very much a learning exercise. Namely, I can see why using calendar month just as a factor variable is problematic. My focus was more homed on working out the process and narrative.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Wheeler (2022, Jan. 20). Data Analytics and Computational Social Science: HW6. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomabwheelhw6/
BibTeX citation
@misc{wheeler2022hw6, author = {Wheeler, Adam}, title = {Data Analytics and Computational Social Science: HW6}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomabwheelhw6/}, year = {2022} }