Challenge 3: Tidying Data

challenge3

jocelyn_lutes

organic_egg_poultry

tidyverse

ggplot2

Tidying the Organic Eggs & Poultry Dataset

Author

Jocelyn Lutes

Published

June 7, 2023

Code

library(tidyverse)
library(readxl)

knitr::opts_chunk$set(echo = TRUE)

Import Data

To begin, we will read in the Organic Eggs and Poultry data that is stored in the Data sheet of organiceggpoultry.xls. From inspecting the raw file, we see that there are four additional heading rows that are not needed to view our data in tabular form. Therefore, we begin reading the data at line 5.

A sample of the data is shown below.

Code

df_raw <- read_excel('_data/organiceggpoultry.xls', sheet = 'Data', skip = 4)
head(df_raw)

Description of Dataset

This dataset is from the U.S. Department of Agriculture, and it contains data for the prices that “first receivers” (i.e., people who purchase the product directly from the egg or poultry seller) paid for egg and poultry products. The data covers the years 2004-2013.

The raw data contains 120 rows, each of which represents the price recorded for a specific month and year. There are 9 columns that represent the price for different egg and poultry products.

Prior to being able to use this data to conduct analyses, there are several changes that should be made to “tidy” the data:

Remove Non-informative Columns: The formatting of the raw data resulted in a completely null column. We should remove this from the data.
Separate Year and Months: The raw data is formatted so that the year appears in the January row, and February-December of that same year are unlabeled. We should ensure that every row has a value in the Year and Month columns.
Convert “Too Few” to NA: There are a few columns where the presence of “Too Few” causes the price to be encoded as a character data type. To ensure that this column has a numeric data type, we should replace all “Too Few” instances with NA and cast the column to numeric.
Pivot Longer to Create Product and Price Columns: The raw data is formatted so that the price of each product is given its own column. Because each of these columns represents the same thing, it would be better to move the names of the columns into a Product column and the values (prices) into a Price column.

Expected Results: If we tidy our data correctly, we should have four columns: Year, Month, Product and Price and 1080 rows (10 years * 12 months * 9 products).

Tidying the Data

1. Remove Non-informative Columns

We begin by removing column ...6 from the data.

Code

df_tidy <- df_raw %>% 
  select(-`...6`)

2. Separate the Year and Months into Separate Columns

Create `Month` and `Year` columns

Next, we extract the month and year from the first column. This breaks the date column into Month and Year columns.

Code

df_tidy <- df_tidy %>%
  rename(date = `...1`) %>%
  # use a regular expression to extract months as words (alpha characters only)
  # and years as digits (4-digit numerics only)
  mutate(
    Month = stringr::str_extract(date, "\\b[a-zA-Z]+\\b"),
    Year = as.numeric(stringr::str_extract(date, "\\b[0-9]{4}\\b"))
  )

df_tidy %>%
  select(date, Month, Year) %>%
  head()

Code

# remove unneeded date column
df_tidy <- df_tidy %>%
  select(-date)

Fill in the null years

Next, because the year is only included in the January column, we must fill in the NA years.

Code

# Manually assign the years - not sure if there is an easier way to do this!
df_tidy$Year[2:12] <- 2004
df_tidy$Year[14:24] <- 2005
df_tidy$Year[26:36] <- 2006
df_tidy$Year[38:48] <- 2007
df_tidy$Year[50:60] <- 2008
df_tidy$Year[61:72] <- 2009
df_tidy$Year[74:84] <- 2010
df_tidy$Year[86:96] <- 2011
df_tidy$Year[98:108] <- 2012
df_tidy$Year[110:nrow(df_tidy)] <- 2013

product_cols <- df_tidy %>%
  select(-Month, -Year) %>%
  colnames()

df_tidy <- df_tidy %>%
  select(Year, Month, all_of(product_cols))

df_tidy

We can also confirm that we have 12 months per year.

Code

df_tidy %>% group_by(Year) %>% tally()

Convert “Too Few” to `NA`

Next, to ensure that all prices are of numeric type, we replace the “too few” string with NA and use as.numeric to convert the data to the numeric type.

Code

# first get the columns that are character type
# remove columns that are expected to be character
char_cols <- df_tidy %>% 
  select_if(is.character) %>% 
  select(-Month) %>% 
  colnames()

# use across + anonymous function to replace 'too few'; also set existing numbers to numeric type
df_tidy <- df_tidy %>%
  mutate(across(all_of(char_cols), ~ifelse(. == 'too few', NA, as.numeric(.))))

Pivot Longer

Finally, we pivot our data so that each of the product names becomes a value in the Product column and each corresponding price becomes a value in the Price column.

Code

# get the columns that we need to pivot by removing Month and Year
cols_to_pivot <- df_tidy %>%
  select(-Month, -Year) %>%
  colnames()

# Pivot the data
df_tidy <- df_tidy %>%
  pivot_longer(
    cols = cols_to_pivot, 
    names_to = 'Product', 
    values_to = 'Price'
    )

# Rearrange cols
df_tidy <- df_tidy %>%
  select(Year, Month, Product, Price)

Our final data frame, which has a shape of 1080 rows and 4 columns (as expected!), is shown below. This data is now tidy because each column represents a single variable and each row represents the price of a single product for a given year and month.

Code

df_tidy

--- title: "Challenge 3: Tidying Data" author: "Jocelyn Lutes" description: "Tidying the Organic Eggs & Poultry Dataset" date: "06/08/2023" format: html: df-print: paged toc: true code-fold: true code-copy: true code-tools: true categories: - challenge3 - jocelyn_lutes - organic_egg_poultry - tidyverse - readxl --- ```{r message = FALSE} #| label: setup #| warning: false library(tidyverse) library(readxl) knitr::opts_chunk$set(echo = TRUE) ``` ## Import Data To begin, we will read in the Organic Eggs and Poultry data that is stored in the `Data` sheet of `organiceggpoultry.xls`. From inspecting the raw file, we see that there are four additional heading rows that are not needed to view our data in tabular form. Therefore, we begin reading the data at line 5. A sample of the data is shown below. ``` {r message = FALSE} df_raw <- read_excel('_data/organiceggpoultry.xls', sheet = 'Data', skip = 4) head(df_raw) ``` ## Description of Dataset This dataset is from the U.S. Department of Agriculture, and it contains data for the prices that "first receivers" (i.e., people who purchase the product directly from the egg or poultry seller) paid for egg and poultry products. The data covers the years 2004-2013. The raw data contains `r nrow(df_raw)` rows, each of which represents the price recorded for a specific month and year. There are `r ncol(df_raw) - 2` columns that represent the price for different egg and poultry products. Prior to being able to use this data to conduct analyses, there are several changes that should be made to "tidy" the data: 1. **Remove Non-informative Columns**: The formatting of the raw data resulted in a completely null column. We should remove this from the data. 2. **Separate Year and Months**: The raw data is formatted so that the year appears in the January row, and February-December of that same year are unlabeled. We should ensure that every row has a value in the Year and Month columns. 3. **Convert "Too Few" to `NA`**: There are a few columns where the presence of "Too Few" causes the price to be encoded as a `character` data type. To ensure that this column has a `numeric` data type, we should replace all "Too Few" instances with `NA` and cast the column to `numeric`. 4. **Pivot Longer to Create Product and Price Columns**: The raw data is formatted so that the price of each product is given its own column. Because each of these columns represents the same thing, it would be better to move the names of the columns into a `Product` column and the values (prices) into a `Price` column. **Expected Results:** If we tidy our data correctly, we should have four columns: `Year`, `Month`, `Product` and `Price` and 1080 rows (10 years * 12 months * 9 products). ## Tidying the Data ### 1. Remove Non-informative Columns We begin by removing column `...6` from the data. ``` {r} df_tidy <- df_raw %>% select(-`...6`) ``` ### 2. Separate the Year and Months into Separate Columns #### *Create `Month` and `Year` columns* Next, we extract the month and year from the first column. This breaks the date column into `Month` and `Year` columns. ``` {r} df_tidy <- df_tidy %>% rename(date = `...1`) %>% # use a regular expression to extract months as words (alpha characters only) # and years as digits (4-digit numerics only) mutate( Month = stringr::str_extract(date, "\\b[a-zA-Z]+\\b"), Year = as.numeric(stringr::str_extract(date, "\\b[0-9]{4}\\b")) ) df_tidy %>% select(date, Month, Year) %>% head() # remove unneeded date column df_tidy <- df_tidy %>% select(-date) ``` #### *Fill in the null years* Next, because the year is only included in the January column, we must fill in the `NA` years. ``` {r} # Manually assign the years - not sure if there is an easier way to do this! df_tidy$Year[2:12] <- 2004 df_tidy$Year[14:24] <- 2005 df_tidy$Year[26:36] <- 2006 df_tidy$Year[38:48] <- 2007 df_tidy$Year[50:60] <- 2008 df_tidy$Year[61:72] <- 2009 df_tidy$Year[74:84] <- 2010 df_tidy$Year[86:96] <- 2011 df_tidy$Year[98:108] <- 2012 df_tidy$Year[110:nrow(df_tidy)] <- 2013 product_cols <- df_tidy %>% select(-Month, -Year) %>% colnames() df_tidy <- df_tidy %>% select(Year, Month, all_of(product_cols)) df_tidy ``` We can also confirm that we have 12 months per year. ```{r} df_tidy %>% group_by(Year) %>% tally() ``` #### *Convert "Too Few" to `NA`* Next, to ensure that all prices are of `numeric` type, we replace the "too few" string with NA and use `as.numeric` to convert the data to the `numeric` type. ``` {r warning = FALSE} # first get the columns that are character type # remove columns that are expected to be character char_cols <- df_tidy %>% select_if(is.character) %>% select(-Month) %>% colnames() # use across + anonymous function to replace 'too few'; also set existing numbers to numeric type df_tidy <- df_tidy %>% mutate(across(all_of(char_cols), ~ifelse(. == 'too few', NA, as.numeric(.)))) ``` #### *Pivot Longer* Finally, we pivot our data so that each of the product names becomes a value in the `Product` column and each corresponding price becomes a value in the `Price` column. ``` {r message = FALSE, warning = FALSE} # get the columns that we need to pivot by removing Month and Year cols_to_pivot <- df_tidy %>% select(-Month, -Year) %>% colnames() # Pivot the data df_tidy <- df_tidy %>% pivot_longer( cols = cols_to_pivot, names_to = 'Product', values_to = 'Price' ) # Rearrange cols df_tidy <- df_tidy %>% select(Year, Month, Product, Price) ``` Our final data frame, which has a shape of `r nrow(df_tidy)` rows and `r ncol(df_tidy)` columns (as expected!), is shown below. This data is now tidy because each column represents a single variable and each row represents the price of a single product for a given year and month. ``` {r} df_tidy ```

Import Data

Description of Dataset

Tidying the Data

1. Remove Non-informative Columns

2. Separate the Year and Months into Separate Columns

Create Month and Year columns

Fill in the null years

Convert “Too Few” to NA

Pivot Longer

Create `Month` and `Year` columns

Convert “Too Few” to `NA`