Data Analytics and Computational Social Science: HW4

Adam Wheeler

Read in the data set

knitr::opts_chunk$set(echo = TRUE)

library(dplyr)
library(tidyverse)
library(readxl)

organic <- read_excel(
  path = "../Downloads/organiceggpoultry.xlsx",
  skip = 4
)

Tidy the data set

# remove line breaks in column names
names(organic) <- gsub("\n", " ", names(organic))
# give the first column a usable name so it can be referenced below
colnames(organic)[1] <- "Date"
# remove typo from column name
colnames(organic)[3] <- "Extra Large 1/2 Doz."

organic <- organic %>%
  # drop empty column
  select(-6) %>%
  # remove extra characters from Date
  mutate(Date = ifelse(grepl("/", Date), gsub(".{3}$", "", Date), Date)) %>%
  # separate Date into month and year variables
  separate(Date, sep = " ", into = c("Month", "Year")) %>%
  # fix abbreviated month name (anchored so a full "January" is left untouched)
  mutate(Month = str_replace(Month, "^Jan$", "January")) %>%
  # fill in missing year values
  fill(Year) %>%
  # convert Year and the price columns to numeric (non-numeric strings become NA)
  mutate(Year = as.numeric(Year)) %>%
  mutate(across(3:11, as.numeric))

head(organic)

# A tibble: 6 x 11
  Month     Year `Extra Large  Dozen` `Extra Large 1/2~ `Large  Dozen`
  <chr>    <dbl>                <dbl>             <dbl>          <dbl>
1 January   2004                 230               132            230 
2 February  2004                 230               134.           226.
3 March     2004                 230               137            225 
4 April     2004                 234.              137            225 
5 May       2004                 236               137            225 
6 June      2004                 241               137            231.
# ... with 6 more variables: Large  1/2 Doz. <dbl>, Whole <dbl>,
#   B/S Breast <dbl>, Bone-in Breast <dbl>, Whole Legs <dbl>,
#   Thighs <dbl>
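The date-cleaning steps above can be sketched on a small toy table. This is only an illustration: the `toy` data frame and its values are hypothetical stand-ins for the raw spreadsheet column, and `fill = "right"` and `suppressWarnings()` are additions that silence the warnings for rows without a year and for non-numeric prices.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical toy rows (made up for illustration): a year appears only on
# January rows, one label carries a footnote marker, one price is text
toy <- tibble(
  Date  = c("Jan 2004", "February", "March", "Jan 2005", "February /1"),
  Price = c("230", "230", "too few", "241", "240.5")
)

tidy_toy <- toy %>%
  # drop the trailing three characters (" /1") from footnoted labels
  mutate(Date = ifelse(grepl("/", Date), gsub(".{3}$", "", Date), Date)) %>%
  # split into Month and Year; rows without a year get NA
  separate(Date, into = c("Month", "Year"), sep = " ", fill = "right") %>%
  # expand the abbreviation only when the whole value is "Jan"
  mutate(Month = str_replace(Month, "^Jan$", "January")) %>%
  # carry the last seen year down to the following months
  fill(Year) %>%
  # convert to numeric; text such as "too few" becomes NA
  mutate(Year = as.numeric(Year), Price = suppressWarnings(as.numeric(Price)))

tidy_toy
```

The key step is `fill(Year)`, which relies on the rows being in chronological order so that each month inherits the year from the most recent January row.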

Compute descriptive statistics

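all_stats <- organic %>%
  # summarize_each() and funs() are deprecated; across() is the current equivalent
  summarize(across(
    -c(Month, Year),
    list(mean = ~ mean(.x, na.rm = TRUE), median = ~ median(.x, na.rm = TRUE), sd = ~ sd(.x, na.rm = TRUE))
  ))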

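yearly_stats <- organic %>%
  select(-Month) %>%
  group_by(Year) %>%
  # mean of every price column per year (across() replaces the deprecated funs())
  summarize(across(everything(), ~ mean(.x, na.rm = TRUE))) %>%
  # one row per year and product
  pivot_longer(cols = -Year, names_to = "Product", values_to = "Mean Price") %>%
  # egg products are the ones sold by the dozen
  mutate(
    Category = if_else(
      grepl("Doz", Product), "Eggs", "Chicken"
    )
  )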

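monthly_stats <- organic %>%
  select(-Year) %>%
  group_by(Month) %>%
  # mean of every price column per calendar month, pooled over all years
  summarize(across(everything(), ~ mean(.x, na.rm = TRUE))) %>%
  # one row per month and product
  pivot_longer(cols = -Month, names_to = "Product", values_to = "Mean Price") %>%
  # egg products are the ones sold by the dozen
  mutate(
    Category = if_else(
      grepl("Doz", Product), "Eggs", "Chicken"
    )
  )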

head(yearly_stats)

# A tibble: 6 x 4
   Year Product              `Mean Price` Category
  <dbl> <chr>                       <dbl> <chr>   
1  2004 Extra Large  Dozen           237. Eggs    
2  2004 Extra Large 1/2 Doz.         136. Eggs    
3  2004 Large  Dozen                 230. Eggs    
4  2004 Large  1/2 Doz.              130. Eggs    
5  2004 Whole                        212. Chicken 
6  2004 B/S Breast                   643. Chicken
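The overall statistics are computed as a single wide row, which is hard to read. As a rough sketch, that row could be reshaped into one row per product. The `toy` data and its values are hypothetical, and `names_sep = "_"` assumes that no product name contains an underscore.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy prices (made up for illustration), with missing values
toy <- tibble(
  Year  = c(2004, 2004, 2005, 2005),
  Eggs  = c(230, NA, 240, 250),
  Whole = c(200, 210, 220, NA)
)

# One wide row of statistics: Eggs_mean, Eggs_median, Eggs_sd, Whole_mean, ...
wide_stats <- toy %>%
  summarize(across(-Year, list(
    mean   = ~ mean(.x, na.rm = TRUE),
    median = ~ median(.x, na.rm = TRUE),
    sd     = ~ sd(.x, na.rm = TRUE)
  )))

# Reshape to one row per product and one column per statistic
tidy_stats <- wide_stats %>%
  pivot_longer(
    everything(),
    names_to = c("Product", "Statistic"),
    names_sep = "_"
  ) %>%
  pivot_wider(names_from = Statistic, values_from = value)

tidy_stats
```

Naming the functions in the list is what produces the `_mean`, `_median`, and `_sd` suffixes that the reshaping step then splits on.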

Create and explain two visualizations

ggplot(data = yearly_stats, mapping = aes(
      x = `Year`,
      y = `Mean Price`,
      color = Category
    )) +
  ggtitle("Mean Product Price by Year") +
  # geom_point() +
  geom_smooth()

In this visualization, I am plotting the mean price of chicken and egg products for the years 2004 to 2013. I am plotting this data in response to the question: How has the price of chicken and eggs changed over time? From this visualization, I can conclude that the average price of both chicken and egg products has increased over time.

Limitations

This visualization does not account for individual product prices. The shaded range around each smoothed line is helpful but may clutter the plot or confuse a viewer unfamiliar with it. To improve it, I could remove the plot background and simplify the design to reduce clutter.
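One possible way to address these limitations is sketched here: draw one line per product, split eggs and chicken into separate panels, and use a lighter theme. The `toy_yearly` data frame is a hypothetical stand-in for `yearly_stats` with made-up values, and the choice of `geom_line()`, `facet_wrap()`, and `theme_minimal()` is only one option among several.

```r
library(ggplot2)

# Hypothetical stand-in for yearly_stats (values are made up for illustration)
toy_yearly <- data.frame(
  Year = rep(2004:2007, times = 4),
  Product = rep(c("Large Dozen", "Large 1/2 Doz.", "Whole", "Thighs"), each = 4),
  `Mean Price` = c(230, 235, 240, 250, 130, 133, 136, 140,
                   200, 205, 212, 220, 215, 216, 218, 222),
  check.names = FALSE
)
toy_yearly$Category <- ifelse(grepl("Doz", toy_yearly$Product), "Eggs", "Chicken")

p <- ggplot(toy_yearly, aes(x = Year, y = `Mean Price`, color = Product)) +
  # one line per product instead of a smoothed category average
  geom_line() +
  # separate panels so each category gets its own price axis
  facet_wrap(~ Category, scales = "free_y") +
  ggtitle("Mean Product Price by Year") +
  # lighter theme without the grey panel background
  theme_minimal()

p
```

Plotting individual products trades the single summary trend for more detail, so it suits a small number of products per panel.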

eggs <- filter(monthly_stats, Category == "Eggs")
chicken <- filter(monthly_stats, Category == "Chicken")
ggplot(data = eggs, mapping = aes(x = Month, y = `Mean Price`)) +
  geom_boxplot() +
  ggtitle("Monthly Egg Product Prices") +
  coord_flip()

ggplot(data = chicken, mapping = aes(x = Month, y = `Mean Price`)) +
  geom_boxplot() +
  ggtitle("Monthly Chicken Product Prices") +
  coord_flip()

In this visualization, I am plotting the mean price of chicken and egg products for each month of the year, averaged over 2004 to 2013. Each box summarizes the spread of mean prices across products within a month. I am plotting this data in response to the question: How do product prices compare with each other per month? From this visualization, I can conclude that product prices were highest in July. I can also see large outlier values in the mean price of chicken products.

Limitations

This visualization is admittedly cumbersome. Egg and chicken products should appear either on one plot or side by side. The outliers make it difficult to interpret the actual or relative prices of the chicken products. To improve it, I may reconsider whether a boxplot is the best tool for exploring my research question.
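A side-by-side version could look roughly like the following sketch. The `toy_monthly` data frame is a hypothetical stand-in for `monthly_stats` with made-up values. Two things differ from the plots above: `Month` becomes a factor ordered by the built-in `month.name` so months appear in calendar order rather than alphabetically, and the categories are shown as facets with independent price axes so the higher chicken prices do not compress the egg panel.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for monthly_stats (values are made up for illustration)
toy_monthly <- tibble(
  Month = rep(c("January", "February", "March"), times = 4),
  Product = rep(c("Large Dozen", "Large 1/2 Doz.", "Whole", "B/S Breast"), each = 3),
  `Mean Price` = c(260, 262, 261, 150, 151, 152,
                   220, 221, 223, 650, 655, 660)
) %>%
  mutate(
    Category = if_else(grepl("Doz", Product), "Eggs", "Chicken"),
    # order months by the calendar instead of alphabetically
    Month = factor(Month, levels = month.name)
  )

# Putting Month on the y axis gives horizontal boxes without coord_flip()
p <- ggplot(toy_monthly, aes(x = `Mean Price`, y = Month)) +
  geom_boxplot() +
  # one panel per category, each with its own price axis
  facet_wrap(~ Category, scales = "free_x") +
  ggtitle("Monthly Product Prices")

p
```

Mapping the discrete variable to `y` relies on the automatic orientation detection in ggplot2 3.3.0 and later. On older versions, the original `coord_flip()` approach would still be needed.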
