challenge_4
eggs
More data wrangling: pivoting
Author

Tim Shores

Published

March 18, 2023

Code
#| label: setup
#| warning: false
#| message: false

my_packages <- c("tidyverse", "readxl", "knitr", "zoo") # create vector of packages
invisible(lapply(my_packages, require, character.only = TRUE)) # load multiple packages


knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Challenge 4 includes four tasks:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations

Task 1) Read in, Tidy, and Describe the Data

I chose to read in the organic egg file, organiceggpoultry.xls. I use the same process as in my Challenge 3 submission. I left out the details in this post, and show only the first 18 of 1,080 lines of the final data frame.

I also use mutate and case_match to convert price strings to 0. I could also remove these rows but I think it’s better to leave them and document that 0 means there were too few cases to make an observation.

Code
listEggDFs <- map(set_names(excel_sheets("../posts/_data/organiceggpoultry.xls")),
read_xls, path = "../posts/_data/organiceggpoultry.xls") # read in all sheets as dataframes in a list

df_eggPoultryData <- map(names(listEggDFs), ~assign(.x, listEggDFs[[.x]], envir = .GlobalEnv))[[1]] # assign the first df in the list to a separate df

productname <- c(word(df_eggPoultryData[2,1], -1),word(df_eggPoultryData[2,7], -1)) # Store last word from header values in list variables for use after pivot.

colnames(df_eggPoultryData) <- df_eggPoultryData[4, ] # Assign values from the 4th populated row to column names
colnames(df_eggPoultryData)[1] <- "month" # source data has no name in date column

df_eggPoultryData <- df_eggPoultryData %>% 
  slice(-(1:4)) %>% # Slice off the first four rows, which do not have table data.
  select(!`NA`) %>% # Select all columns but the empty one in the middle
  separate_wider_delim(month, delim = " ", names = c("month", "year"), too_few = "align_start", too_many = "merge") %>% 
  mutate(month = replace(month, month == 'Jan', 'January')) %>%
  mutate(year = replace(year, year == '/1', NA)) %>%
  fill(year) %>% # tidy the month column and separate year into its own column. 
  pivot_longer(cols = -c(month,year), names_to = "item", values_to = "price") %>% # to pivot the egg and chicken product column headings into row values under a new `item` column, with their values listed in a `price` column
  mutate(item = str_remove_all(item, '\\n')) %>% # tidy up item values and make labels consistent
  mutate(item = str_remove_all(item, '1/2 Dozen')) %>%
  mutate(item = str_replace_all(item, 'Doz\\.','Dozen')) %>%
  mutate(product = case_when( # assign the egg and chicken product variables that I recorded earlier according to each egg and chicken item
    str_detect(item, regex("^[EL]")) ~ productname[[1]],
    str_detect(item, regex("^[BTW]")) ~ productname[[2]]
    )) %>% 
  mutate(price = case_match(price, "too few" ~ NA, .default = as.double(price))) %>% # make column all double and NA
  print(n=18) # print a sample of the table
# A tibble: 1,080 × 5
   month    year  item                  price product
   <chr>    <chr> <chr>                 <dbl> <chr>  
 1 January  2004  Extra Large Dozen      230  Eggs   
 2 January  2004  Extra Large 1/2 Dozen  132  Eggs   
 3 January  2004  Large Dozen            230  Eggs   
 4 January  2004  Large 1/2 Dozen        126  Eggs   
 5 January  2004  Whole                  198. Chicken
 6 January  2004  B/S Breast             646. Chicken
 7 January  2004  Bone-in Breast          NA  Chicken
 8 January  2004  Whole Legs             194. Chicken
 9 January  2004  Thighs                  NA  Chicken
10 February 2004  Extra Large Dozen      230  Eggs   
11 February 2004  Extra Large 1/2 Dozen  134. Eggs   
12 February 2004  Large Dozen            226. Eggs   
13 February 2004  Large 1/2 Dozen        128. Eggs   
14 February 2004  Whole                  198. Chicken
15 February 2004  B/S Breast             642. Chicken
16 February 2004  Bone-in Breast          NA  Chicken
17 February 2004  Whole Legs             194. Chicken
18 February 2004  Thighs                 203  Chicken
# … with 1,062 more rows

Briefly describe the data

The data describes the price of several types of USDA certified organic egg and chicken products from the years 2004 to 2013. The price was paid by ‘first receivers’, described in a footnote as “those entities that purchases the processed product from the poultry or egg company, such as a retailer, distributor, or manufacturer.” The cited data sources are the U.S. Department of Agriculture, Agricultural Marketing Service (AMS) Market News, Organic Poultry and Eggs (Weekly reports).

The data includes 5 variables (date, 4 egg carton items and 5 chicken items, and an empty column) with prices for 3 items for each month of 10 years in 1076 rows.

Task 2) Identify variables that need to be mutated

In the Tidying code block above, I pivot the source data columns to a single item column. I then use several mutate functions to make the data more consistent and apply labels from other metadata taken from the original header.

I also use mutate to create a month and year column, but these variables are still character data types.

The lubridate package wants to use dates or datetimes. It can do calculations with periods such as months, but the output is still dates or datetimes rather than months. I could pick the 1st of each month to work around this limitation, but there’s another option …

The zoo package yearmon class can return a year and month combo to represent dates on in monthly units. I like that! I decided to use zoo instead of lubridate for this specific case.

Code
df_eggPoultryData$monthYear <- as.yearmon(paste(df_eggPoultryData$year, df_eggPoultryData$month), "%Y %B") # turn combine month and year strings into a yearmon data type column
df_eggPoultryData <- subset(df_eggPoultryData, select = -c(month, year)) %>%
  relocate(monthYear) %>% # we no longer need the month and year string columns
  print(n=18)
# A tibble: 1,080 × 4
   monthYear item                  price product
   <yearmon> <chr>                 <dbl> <chr>  
 1 Jan 2004  Extra Large Dozen      230  Eggs   
 2 Jan 2004  Extra Large 1/2 Dozen  132  Eggs   
 3 Jan 2004  Large Dozen            230  Eggs   
 4 Jan 2004  Large 1/2 Dozen        126  Eggs   
 5 Jan 2004  Whole                  198. Chicken
 6 Jan 2004  B/S Breast             646. Chicken
 7 Jan 2004  Bone-in Breast          NA  Chicken
 8 Jan 2004  Whole Legs             194. Chicken
 9 Jan 2004  Thighs                  NA  Chicken
10 Feb 2004  Extra Large Dozen      230  Eggs   
11 Feb 2004  Extra Large 1/2 Dozen  134. Eggs   
12 Feb 2004  Large Dozen            226. Eggs   
13 Feb 2004  Large 1/2 Dozen        128. Eggs   
14 Feb 2004  Whole                  198. Chicken
15 Feb 2004  B/S Breast             642. Chicken
16 Feb 2004  Bone-in Breast          NA  Chicken
17 Feb 2004  Whole Legs             194. Chicken
18 Feb 2004  Thighs                 203  Chicken
# … with 1,062 more rows

Now I can summarize price information by item. There are 9 items, so this table has only 9 rows. (I also updated the find_mode function from the version I used in Challenge 2 – it now omits NA values, which makes the calculation of mode consistent with how I’m using other summary calculations.)

Code
#define function to calculate mode
find_mode <- function(x) {
  u <- unique(x[!is.na(x)]) # unique list as an index, without NA
  tab <- tabulate(match(x[!is.na(x)], u))  # count how many times each index member occurs
  u[tab == max(tab)] #  the max occurrence is the mode
  mean(u) # return mean in case the data is multimodal
}

df_eggPoultryData %>% 
  group_by(item) %>% 
  summarise(
    meanPrice = mean(price, na.rm = TRUE), 
    modePrice = find_mode(price), 
    minPrice = fivenum(price, na.rm = TRUE)[1], 
    lowHingePrice = fivenum(price, na.rm = TRUE)[2], 
    medianPrice = median(price, na.rm = TRUE), 
    upHungePrice = fivenum(price, na.rm = TRUE)[4], 
    maxPrice = fivenum(price, na.rm = TRUE)[5], 
    count = n()
    )
# A tibble: 9 × 9
  item             meanP…¹ modeP…² minPr…³ lowHi…⁴ media…⁵ upHun…⁶ maxPr…⁷ count
  <chr>              <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <int>
1 B/S Breast          655.    657.    638.    645.    646.    646.    704.   120
2 Bone-in Breast      390.    390.    390.    390.    390.    390.    390.   120
3 Extra Large 1/2…    164.    147.    132     136.    186.    186.    188.   120
4 Extra Large Doz…    267.    252.    230     242.    286.    286.    290    120
5 Large 1/2 Dozen     155.    144.    126     129.    174.    174.    178    120
6 Large Dozen         254.    248.    225     234.    268.    268     278.   120
7 Thighs              218.    211.    200.    215     222     222     222    120
8 Whole               231.    222.    198.    220.    235     238.    248    120
9 Whole Legs          203.    200.    194.    204.    204.    204.    204.   120
# … with abbreviated variable names ¹​meanPrice, ²​modePrice, ³​minPrice,
#   ⁴​lowHingePrice, ⁵​medianPrice, ⁶​upHungePrice, ⁷​maxPrice

Or by month and item (this time using lubridate!). I expect this table to show 108 rows = 9 items * 12 months.

Code
df_eggPoultryData %>% 
  group_by(month = month(monthYear, label = TRUE), item) %>% 
  summarise(
    meanPrice = mean(price, na.rm = TRUE), 
    modePrice = find_mode(price), 
    minPrice = fivenum(price, na.rm = TRUE)[1], 
    lowHingePrice = fivenum(price, na.rm = TRUE)[2], 
    medianPrice = median(price, na.rm = TRUE), 
    upHungePrice = fivenum(price, na.rm = TRUE)[4], 
    maxPrice = fivenum(price, na.rm = TRUE)[5], 
    count = n()
    )
# A tibble: 108 × 10
# Groups:   month [12]
   month item      meanP…¹ modeP…² minPr…³ lowHi…⁴ media…⁵ upHun…⁶ maxPr…⁷ count
   <ord> <chr>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <int>
 1 Jan   B/S Brea…    650.    658.    638.    644     646.    646.    704.    10
 2 Jan   Bone-in …    390.    390.    390.    390.    390.    390.    390.    10
 3 Jan   Extra La…    161.    156.    132     136.    162.    186.    188.    10
 4 Jan   Extra La…    263.    256.    230     241     265.    286.    290     10
 5 Jan   Large 1/…    152.    148.    126     128.    153.    174.    178     10
 6 Jan   Large Do…    252.    253.    230     234.    252.    268.    278.    10
 7 Jan   Thighs       219.    217.    213     215     222     222     222     10
 8 Jan   Whole        228.    226.    198.    217     228.    238.    248     10
 9 Jan   Whole Le…    202.    198.    194.    204.    204.    204.    204.    10
10 Feb   B/S Brea…    656.    666.    638.    646.    646.    646.    704.    10
# … with 98 more rows, and abbreviated variable names ¹​meanPrice, ²​modePrice,
#   ³​minPrice, ⁴​lowHingePrice, ⁵​medianPrice, ⁶​upHungePrice, ⁷​maxPrice

Or by product. One row each for eggs and chicken.

Code
df_eggPoultryData %>% 
  group_by(product) %>% 
  summarise(
    meanPrice = mean(price, na.rm = TRUE), 
    modePrice = find_mode(price), 
    minPrice = fivenum(price, na.rm = TRUE)[1], 
    lowHingePrice = fivenum(price, na.rm = TRUE)[2], 
    medianPrice = median(price, na.rm = TRUE), 
    upHungePrice = fivenum(price, na.rm = TRUE)[4], 
    maxPrice = fivenum(price, na.rm = TRUE)[5], 
    count = n()
    )
# A tibble: 2 × 9
  product meanPrice modePrice minPrice lowHingeP…¹ media…² upHun…³ maxPr…⁴ count
  <chr>       <dbl>     <dbl>    <dbl>       <dbl>   <dbl>   <dbl>   <dbl> <int>
1 Chicken      339.      331.     194.        215     235     390.    704.   600
2 Eggs         210.      204.     126         174.    207.    268.    290    480
# … with abbreviated variable names ¹​lowHingePrice, ²​medianPrice,
#   ³​upHungePrice, ⁴​maxPrice

Or by year and product. Ten years of eggs and chicken, in 20 rows.

Code
df_eggPoultryData %>% 
  group_by(year = year(monthYear), product) %>% 
  summarise(
    meanPrice = mean(price, na.rm = TRUE), 
    modePrice = find_mode(price), 
    minPrice = fivenum(price, na.rm = TRUE)[1], 
    lowHingePrice = fivenum(price, na.rm = TRUE)[2], 
    medianPrice = median(price, na.rm = TRUE), 
    upHungePrice = fivenum(price, na.rm = TRUE)[4], 
    maxPrice = fivenum(price, na.rm = TRUE)[5], 
    count = n()
    )
# A tibble: 20 × 10
# Groups:   year [10]
    year product meanPrice modeP…¹ minPr…² lowHi…³ media…⁴ upHun…⁵ maxPr…⁶ count
   <dbl> <chr>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <int>
 1  2004 Chicken      325.    299.    194.    200.    212     390.    646.    60
 2  2004 Eggs         183.    177.    126     134.    181     234.    241     48
 3  2005 Chicken      336.    362.    204.    217     222     390.    646.    60
 4  2005 Eggs         185.    185.    128.    132     184.    237.    241     48
 5  2006 Chicken      336.    303.    204.    220.    222     390.    646.    60
 6  2006 Eggs         185.    204.    128.    132     184.    237.    242.    48
 7  2007 Chicken      336.    336.    204.    220.    222     390.    646.    60
 8  2007 Eggs         188.    187.    128.    134.    186.    239.    245     48
 9  2008 Chicken      340.    322.    204.    220.    248     390.    646.    60
10  2008 Eggs         213.    210.    132     174.    211.    278.    286.    48
11  2009 Chicken      342.    342.    204.    222     248     390.    646.    60
12  2009 Eggs         230.    239.    174.    180     228.    282.    286.    48
13  2010 Chicken      339.    355.    204.    215     235     390.    646.    60
14  2010 Eggs         228.    242.    174.    180     226.    278.    286.    48
15  2011 Chicken      336.    336.    204.    215     235     390.    638.    60
16  2011 Eggs         229.    237.    174.    180     226.    278.    286.    48
17  2012 Chicken      349.    393.    204.    215     238.    390.    704.    60
18  2012 Eggs         229.    226.    173.    182.    228.    276.    290     48
19  2013 Chicken      350.    350.    204.    216.    238.    390.    704.    60
20  2013 Eggs         231.    231.    178     183.    228.    279.    290     48
# … with abbreviated variable names ¹​modePrice, ²​minPrice, ³​lowHingePrice,
#   ⁴​medianPrice, ⁵​upHungePrice, ⁶​maxPrice