Challenge 3: Tidying Data

challenge3
jocelyn_lutes
organic_egg_poultry
tidyverse
ggplot2
Tidying the Organic Eggs & Poultry Dataset
Author

Jocelyn Lutes

Published

June 7, 2023

Code
library(tidyverse)
library(readxl)

knitr::opts_chunk$set(echo = TRUE)

Import Data

To begin, we will read in the Organic Eggs and Poultry data that is stored in the Data sheet of organiceggpoultry.xls. From inspecting the raw file, we see that there are four additional heading rows that are not needed to view our data in tabular form. Therefore, we begin reading the data at line 5.

A sample of the data is shown below.

Code
df_raw <- read_excel('_data/organiceggpoultry.xls', sheet = 'Data', skip = 4)
head(df_raw)

Description of Dataset

This dataset is from the U.S. Department of Agriculture, and it contains data for the prices that “first receivers” (i.e., people who purchase the product directly from the egg or poultry seller) paid for egg and poultry products. The data covers the years 2004-2013.

The raw data contains 120 rows, each of which represents the price recorded for a specific month and year. There are 9 columns that represent the price for different egg and poultry products.

Prior to being able to use this data to conduct analyses, there are several changes that should be made to “tidy” the data:

  1. Remove Non-informative Columns: The formatting of the raw data resulted in a completely null column. We should remove this from the data.

  2. Separate Year and Months: The raw data is formatted so that the year appears in the January row, and February-December of that same year are unlabeled. We should ensure that every row has a value in the Year and Month columns.

  3. Convert “Too Few” to NA: There are a few columns where the presence of “Too Few” causes the price to be encoded as a character data type. To ensure that this column has a numeric data type, we should replace all “Too Few” instances with NA and cast the column to numeric.

  4. Pivot Longer to Create Product and Price Columns: The raw data is formatted so that the price of each product is given its own column. Because each of these columns represents the same thing, it would be better to move the names of the columns into a Product column and the values (prices) into a Price column.

Expected Results: If we tidy our data correctly, we should have four columns: Year, Month, Product and Price and 1080 rows (10 years * 12 months * 9 products).

Tidying the Data

1. Remove Non-informative Columns

We begin by removing column ...6 from the data.

Code
df_tidy <- df_raw %>% 
  select(-`...6`)

2. Separate the Year and Months into Separate Columns

Create Month and Year columns

Next, we extract the month and year from the first column. This breaks the date column into Month and Year columns.

Code
df_tidy <- df_tidy %>%
  rename(date = `...1`) %>%
  # use a regular expression to extract months as words (alpha characters only)
  # and years as digits (4-digit numerics only)
  mutate(
    Month = stringr::str_extract(date, "\\b[a-zA-Z]+\\b"),
    Year = as.numeric(stringr::str_extract(date, "\\b[0-9]{4}\\b"))
  )

df_tidy %>%
  select(date, Month, Year) %>%
  head()
Code
# remove unneeded date column
df_tidy <- df_tidy %>%
  select(-date)

Fill in the null years

Next, because the year is only included in the January column, we must fill in the NA years.

Code
# Manually assign the years - not sure if there is an easier way to do this!
df_tidy$Year[2:12] <- 2004
df_tidy$Year[14:24] <- 2005
df_tidy$Year[26:36] <- 2006
df_tidy$Year[38:48] <- 2007
df_tidy$Year[50:60] <- 2008
df_tidy$Year[61:72] <- 2009
df_tidy$Year[74:84] <- 2010
df_tidy$Year[86:96] <- 2011
df_tidy$Year[98:108] <- 2012
df_tidy$Year[110:nrow(df_tidy)] <- 2013

product_cols <- df_tidy %>%
  select(-Month, -Year) %>%
  colnames()

df_tidy <- df_tidy %>%
  select(Year, Month, all_of(product_cols))

df_tidy

We can also confirm that we have 12 months per year.

Code
df_tidy %>% group_by(Year) %>% tally()

Convert “Too Few” to NA

Next, to ensure that all prices are of numeric type, we replace the “too few” string with NA and use as.numeric to convert the data to the numeric type.

Code
# first get the columns that are character type
# remove columns that are expected to be character
char_cols <- df_tidy %>% 
  select_if(is.character) %>% 
  select(-Month) %>% 
  colnames()

# use across + anonymous function to replace 'too few'; also set existing numbers to numeric type
df_tidy <- df_tidy %>%
  mutate(across(all_of(char_cols), ~ifelse(. == 'too few', NA, as.numeric(.))))

Pivot Longer

Finally, we pivot our data so that each of the product names becomes a value in the Product column and each corresponding price becomes a value in the Price column.

Code
# get the columns that we need to pivot by removing Month and Year
cols_to_pivot <- df_tidy %>%
  select(-Month, -Year) %>%
  colnames()

# Pivot the data
df_tidy <- df_tidy %>%
  pivot_longer(
    cols = cols_to_pivot, 
    names_to = 'Product', 
    values_to = 'Price'
    )

# Rearrange cols
df_tidy <- df_tidy %>%
  select(Year, Month, Product, Price)

Our final data frame, which has a shape of 1080 rows and 4 columns (as expected!), is shown below. This data is now tidy because each column represents a single variable and each row represents the price of a single product for a given year and month.

Code
df_tidy