Challenge 3: Create a tidy organiceggpoultry dataset

challenge_3
organiceggpoultry
laurenzichittella
Tidy Data: Pivoting
Author

Lauren Zichittella

Published

March 5, 2023

Code
library(tidyverse)
library(readxl)
library(dplyr) 
library(stringr)
library(tidyr)
library(kableExtra)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

This data represents cost of eggs versus poultry spanning January 2004- December 13. Costs are broken down different for eggs versus chickens. - Egg cost is categorized to differentiate cost of large versus x large items by units of 6 or 12 eggs. - Poultry cost is characteirzed by whole versus parts represented and whether part includes skin and/or bone when applicable.

In a tidy version of this data, each observation would represent a single year, month, product (egg versus chicken), type of product (L, XL, Whole Chicken, etc) unit of product (1, 6, 0r 12), other detail (skin on, bone in, etc), and cost. Processing the data to this form would transition the data from being wide & short, that is having many columns and few observations), to being long and lean, few columns & many more observations.

Step 1: Read in data, characterize identify steps to clean

My goal with this chunk is to import a version of “data” that is as near as possible to the raw data but with enough cleaning to allow for evaluation necessary to refine to a “pretidy version” that will be coded in subsequent separate chunk.

Methods 1) Import data, skip header records and specify col name to allow easy analysis 2) utilize simple functions to evaluate distribution of values to answer following questions and inform next steps: - Is the colType appropriate for what is represented? This data looks to represent cost but “chix” columns imported as text - Are values of yearMonth distinct? First looks indicate rows where the value includes “Jan” also include the year. Would expect remaining values to be standard month names

Wishlist/To do - evaluate other sheets in xlsx to confirm they sourced figures from data

Code
orig_organiceggpoultry <- read_xls(  "_data/organiceggpoultry.xls"
                                     , sheet = "Data"
                                     , col_names =  FALSE)

orig_organiceggpoultry <- read_xls(  "_data/organiceggpoultry.xls"
                                     , sheet = "Data"
                                     , col_names = c("yearmonth", "eggcost_xl_12", "eggcost_xl_6", "eggcost_l_12", "eggcost_l_6", "delete", "chix_whole", "chix_boneskin_breast", "chix_bone_breast", "chix_whole_leg", "chix_thigh" ) 
                                     , skip = 5)

orig_organiceggpoultry
# A tibble: 120 × 11
   yearmonth eggcost_xl…¹ eggco…² eggco…³ eggco…⁴ delete chix_…⁵ chix_…⁶ chix_…⁷
   <chr>            <dbl>   <dbl>   <dbl>   <dbl> <lgl>    <dbl>   <dbl> <chr>  
 1 Jan 2004          230     132     230     126  NA        198.    646. too few
 2 February          230     134.    226.    128. NA        198.    642. too few
 3 March             230     137     225     131  NA        209     642. too few
 4 April             234.    137     225     131  NA        212     642. too few
 5 May               236     137     225     131  NA        214.    642. too few
 6 June              241     137     231.    134. NA        216.    641  too few
 7 July              241     137     234.    134. NA        217     642. 390.5  
 8 August            241     137     234.    134. NA        217     642. 390.5  
 9 September         241     136.    234.    130. NA        217     642. 390.5  
10 October           241     136.    234.    128. NA        217     642. 390.5  
# … with 110 more rows, 2 more variables: chix_whole_leg <dbl>,
#   chix_thigh <chr>, and abbreviated variable names ¹​eggcost_xl_12,
#   ²​eggcost_xl_6, ³​eggcost_l_12, ⁴​eggcost_l_6, ⁵​chix_whole,
#   ⁶​chix_boneskin_breast, ⁷​chix_bone_breast
Code
orig_organiceggpoultry%>%
    select(yearmonth)%>%
    n_distinct
[1] 22
Code
orig_organiceggpoultry%>%
    select(yearmonth)%>%
    distinct
# A tibble: 22 × 1
   yearmonth
   <chr>    
 1 Jan 2004 
 2 February 
 3 March    
 4 April    
 5 May      
 6 June     
 7 July     
 8 August   
 9 September
10 October  
# … with 12 more rows
Code
print(summarytools::dfSummary(orig_organiceggpoultry,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

orig_organiceggpoultry

Dimensions: 120 x 11
Duplicates: 8
Variable Stats / Values Freqs (% of Valid) Graph Missing
yearmonth [character]
1. April
2. August
3. December
4. July
5. June
6. March
7. May
8. November
9. October
10. September
[ 12 others ]
10 ( 8.3% )
10 ( 8.3% )
10 ( 8.3% )
10 ( 8.3% )
10 ( 8.3% )
10 ( 8.3% )
10 ( 8.3% )
10 ( 8.3% )
10 ( 8.3% )
10 ( 8.3% )
20 ( 16.7% )
0 (0.0%)
eggcost_xl_12 [numeric]
Mean (sd) : 266.8 (22.8)
min ≤ med ≤ max:
230 ≤ 285.5 ≤ 290
IQR (CV) : 44 (0.1)
11 distinct values 0 (0.0%)
eggcost_xl_6 [numeric]
Mean (sd) : 164.2 (24.7)
min ≤ med ≤ max:
132 ≤ 185.5 ≤ 188.1
IQR (CV) : 49.7 (0.2)
132.00   : 1 ( 0.8% )
134.50   : 1 ( 0.8% )
135.50   : 28 ( 23.3% )
135.88 ! : 1 ( 0.8% )
137.00   : 6 ( 5.0% )
138.12 ! : 1 ( 0.8% )
139.00   : 15 ( 12.5% )
185.50   : 53 ( 44.2% )
188.13   : 14 ( 11.7% )
! rounded
0 (0.0%)
eggcost_l_12 [numeric]
Mean (sd) : 254.2 (18.5)
min ≤ med ≤ max:
225 ≤ 267.5 ≤ 277.5
IQR (CV) : 34.5 (0.1)
12 distinct values 0 (0.0%)
eggcost_l_6 [numeric]
Mean (sd) : 155.2 (22.6)
min ≤ med ≤ max:
126 ≤ 174.5 ≤ 178
IQR (CV) : 45.1 (0.1)
126.00   : 1 ( 0.8% )
128.50   : 29 ( 24.2% )
129.75   : 1 ( 0.8% )
131.00   : 3 ( 2.5% )
131.12 ! : 1 ( 0.8% )
132.00   : 15 ( 12.5% )
133.50   : 3 ( 2.5% )
173.25   : 6 ( 5.0% )
174.50   : 47 ( 39.2% )
178.00   : 14 ( 11.7% )
! rounded
0 (0.0%)
delete [logical]
All NA's
120 (100.0%)
chix_whole [numeric]
Mean (sd) : 230.5 (12.5)
min ≤ med ≤ max:
197.5 ≤ 235 ≤ 248
IQR (CV) : 18 (0.1)
12 distinct values 0 (0.0%)
chix_boneskin_breast [numeric]
Mean (sd) : 654.9 (23.3)
min ≤ med ≤ max:
637.5 ≤ 645.5 ≤ 703.8
IQR (CV) : 0.4 (0)
637.50 : 17 ( 14.2% )
641.00 : 1 ( 0.8% )
642.30 : 1 ( 0.8% )
642.50 : 10 ( 8.3% )
644.00 : 1 ( 0.8% )
645.50 : 67 ( 55.8% )
700.00 : 9 ( 7.5% )
703.75 : 14 ( 11.7% )
0 (0.0%)
chix_bone_breast [character]
1. 390.5
2. too few
114 ( 95.0% )
6 ( 5.0% )
0 (0.0%)
chix_whole_leg [numeric]
Mean (sd) : 203.1 (2)
min ≤ med ≤ max:
193.5 ≤ 203.5 ≤ 203.5
IQR (CV) : 0 (0)
193.50   : 5 ( 4.2% )
201.88 ! : 1 ( 0.8% )
203.50   : 114 ( 95.0% )
! rounded
0 (0.0%)
chix_thigh [character]
1. 199.5
2. 200.375
3. 203
4. 213
5. 215
6. 216.25
7. 219.19999999999999
8. 222
9. too few
6 ( 5.0% )
1 ( 0.8% )
4 ( 3.3% )
1 ( 0.8% )
23 ( 19.2% )
17 ( 14.2% )
1 ( 0.8% )
66 ( 55.0% )
1 ( 0.8% )
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-04-25

Step 2: Clean data better, identify steps to tidy

My goal with this chunk is produce a cleanest possible version of the data that will be the source for tidying

The end result of this step is to have a dataframe that will allow for transposing where each row represents: year of observation Month of observation Unit of measure (equivalent to variable name transposed i.e.eggcost_xl_12, eggcost_xl_6, etc) value of observations/that is the cost of the item represented by row of data

From this dataset, further tidying can be done to create a variables indicating: whether value represents egg versus chicken type of measure (something like L egg, XL egg, Part of chicken) unit of measure (in order to differentiate between counts of 6 versus 12 eggs, chicken parts with skin on/bone in etc )

Methods
1) Revise read_xls to specify column type as numeric for all variables with the exception of year month - note there is not a lot of variation in values for “chix” values but capturing in numeric will allow for summarizing results, keep things consistent across variables, make manipulation downstream easier (I think) 2) rename variables representing cost to more easily parsed tokens where token 1 = product type (egg or chix) and token2 - “unit” (XLn where n = count, Whole, BSbreast = bone-in skin-on breast, etc) 2) Clean up year month field. I’m thinking this will be most easily accomplished by create two new variables, year & month. Separating out year part might be nice for summarizing data for analysis

Wishlist/To do
- Find a better way to assign year. I had tried case_when to parse number for january and it gave annoying warnings. My plan was to get a value and use it to assign to subsequent via lag but no time to test out the best way to do that this time me around. I recognize every single way my current method is problematic but it works for this particular case well enough (maybe)

Code
pretidy_organiceggpoultry <- read_excel(  "_data/organiceggpoultry.xls"
                                     , sheet = "Data"
                                     , skip = 5
                                     , col_names = c("yearmonth", "egg_xl12", "egg_xl6", "egg_l12", "egg_l6", "delete", "chix_whole", "chix_bsbreast", "chix_bbreast", "chix_leg", "chix_thigh" ) 
                                     , col_types = c("text", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric"))

pretidy_organiceggpoultry <- select(pretidy_organiceggpoultry, !contains("delete"))
is.data.frame(pretidy_organiceggpoultry)
[1] TRUE
Code
pretidy_organiceggpoultry <-
  pretidy_organiceggpoultry %>% 
  mutate(
    month = case_when(
      grepl("JAN", yearmonth,   ignore.case=TRUE) ~ "january",
      grepl("FEB", yearmonth,   ignore.case=TRUE) ~ "february",  
      grepl("MARCH", yearmonth, ignore.case=TRUE) ~ "march",
      grepl("APRIL", yearmonth, ignore.case=TRUE) ~ "april",
      grepl("MAY", yearmonth,   ignore.case=TRUE) ~ "may",
      grepl("JUNE", yearmonth,  ignore.case=TRUE) ~ "june",
      grepl("JULY", yearmonth,  ignore.case=TRUE) ~ "july",
      grepl("AUG", yearmonth,   ignore.case=TRUE) ~ "august",
      grepl("SEPT", yearmonth,  ignore.case=TRUE) ~ "september",
      grepl("OCT", yearmonth,   ignore.case=TRUE) ~ "october",
      grepl("NOV", yearmonth,   ignore.case=TRUE) ~ "november",
      grepl("DEC", yearmonth,   ignore.case=TRUE) ~ "december",
      TRUE ~ "ERROR" 
      ) )
  
pretidy_organiceggpoultry%>%
    select(month)%>%
    distinct
# A tibble: 12 × 1
   month    
   <chr>    
 1 january  
 2 february 
 3 march    
 4 april    
 5 may      
 6 june     
 7 july     
 8 august   
 9 september
10 october  
11 november 
12 december 
Code
pretidy_organiceggpoultry <-
  pretidy_organiceggpoultry %>% 
  mutate(year = floor(row_number()/12) + 2004)

pretidy_organiceggpoultry%>%
    select(year)%>%
    distinct
# A tibble: 11 × 1
    year
   <dbl>
 1  2004
 2  2005
 3  2006
 4  2007
 5  2008
 6  2009
 7  2010
 8  2011
 9  2012
10  2013
11  2014
Code
is.data.frame(pretidy_organiceggpoultry)
[1] TRUE
Code
pretidy_organiceggpoultry <- select(pretidy_organiceggpoultry, !contains("yearmonth"))

head(pretidy_organiceggpoultry)
# A tibble: 6 × 11
  egg_xl12 egg_xl6 egg_l12 egg_l6 chix_w…¹ chix_…² chix_…³ chix_…⁴ chix_…⁵ month
     <dbl>   <dbl>   <dbl>  <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>
1     230     132     230    126      198.    646.      NA    194.     NA  janu…
2     230     134.    226.   128.     198.    642.      NA    194.    203  febr…
3     230     137     225    131      209     642.      NA    194.    203  march
4     234.    137     225    131      212     642.      NA    194.    203  april
5     236     137     225    131      214.    642.      NA    194.    203  may  
6     241     137     231.   134.     216.    641       NA    202.    200. june 
# … with 1 more variable: year <dbl>, and abbreviated variable names
#   ¹​chix_whole, ²​chix_bsbreast, ³​chix_bbreast, ⁴​chix_leg, ⁵​chix_thigh

Step 3: Create tidy version of dataset

The goal of this chunk will be to pivot data from wide to long form. This will result in each observation representing a case or cost by year, month, and type of product.

Methods 1) Utilize pivot longer to transpose data to long form, retain values of month & year, creating new variable ‘product’ named for the value variable translated, and cost
2) Try out the tidy version by creating a simple summary of cost by year

Wishlist/To do
- Parse product type to create separate field for defining “source” as egg or chicken and unit. Should be pretty easy to do as these values are represented by single undercore-delimited token - Introduce step earlier on in that evaluates that each row in the pretidy dataset is distinct to validate transposition to long form resulted in expected dataset

Code
tidy_organiceggpoultry <- 
  pretidy_organiceggpoultry %>%
  pivot_longer(
    cols = c("egg_xl12", "egg_xl6", "egg_l12", "egg_l6", "chix_whole", "chix_bsbreast", "chix_bbreast", "chix_leg", "chix_thigh" ), 
    names_to = "product",
    values_to = "cost",
    values_drop_na = FALSE    
  )

head(tidy_organiceggpoultry)
# A tibble: 6 × 4
  month    year product        cost
  <chr>   <dbl> <chr>         <dbl>
1 january  2004 egg_xl12       230 
2 january  2004 egg_xl6        132 
3 january  2004 egg_l12        230 
4 january  2004 egg_l6         126 
5 january  2004 chix_whole     198.
6 january  2004 chix_bsbreast  646.
Code
tab1<- tidy_organiceggpoultry%>%
  group_by(product)%>%
     summarise(
               min  = min(cost,    na.rm=TRUE), 
               med  = median(cost, na.rm=TRUE), 
               max  = max(cost,    na.rm=TRUE),                
               mean = mean(cost,   na.rm=TRUE),  
               sd   = sd(cost,     na.rm=TRUE), )

knitr::kable(tab1,
             digits=2,
             col.names = c("Product Type", "Min", "Median", "Max", "Mean", "Standard Deviation"))%>%
  kableExtra::kable_styling(htmltable_class = "lightable-minimal")%>%
  kableExtra::add_header_above(c(" " = 1, "Cost Statistics" = 5))
Cost Statistics
Product Type Min Median Max Mean Standard Deviation
chix_bbreast 390.5 390.5 390.50 390.50 0.00
chix_bsbreast 637.5 645.5 703.75 654.92 23.33
chix_leg 193.5 203.5 203.50 203.07 2.01
chix_thigh 199.5 222.0 222.00 217.77 6.24
chix_whole 197.5 235.0 248.00 230.53 12.52
egg_l12 225.0 267.5 277.50 254.20 18.55
egg_l6 126.0 174.5 178.00 155.17 22.59
egg_xl12 230.0 285.5 290.00 266.80 22.80
egg_xl6 132.0 185.5 188.13 164.22 24.68