Challenge 4 Poultry and Eggs

challenge_4

eggs

More data wrangling: pivoting

Author

Keith Martin

Published

August 18, 2022

Code

library(tidyverse)
library(googlesheets4)
library(readxl)
library(lubridate)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

For this challenge, I re-used the data set that I used in challenge three. In that challenge I had previously tidied the data and conducted a basic sanity check to ensure the date was entered correctly.

Read in data

First I loaded in the Organice Poultry Data set.

Code

### Loading in the libraries of the excel files
excel_sheets("_data/organiceggpoultry.xls")

[1] "Data"                            "Organic egg prices, 2004-13"    
[3] "Organic poultry prices, 2004-13"

Briefly describe the data

This data set is tracking the prices of various quantities of eggs and chicken cuts from the years 2004 to 2013.

Tidy Data (as needed)

In this challenge, due to outside time restraints, i re-used the data that I made tidy last week. However, this this challenge, I leveraged the lubridate functions to show a better picture of the date.

Code

raw_poulty<-read_excel("_data/organiceggpoultry.xls", 
                         sheet = "Data",
                          range =cell_limits(c(6,2),c(NA,12)),
                          col_names = c("date", "XL_Dozen", "XL_1/2_Doz.", "L_Dozen", "L_1/2_Doz", "Remove", "Ckn_Whole", "Ckn_BS_Breast", "Ckn_Bone_Breast", "Ckn_Whole_legs", "Ckn_Thighs"),
                      
                                        
                        )
raw_poulty %>%
  select(-c(Remove))

# A tibble: 120 × 10
   date  XL_Do…¹ XL_1/…² L_Dozen L_1/2…³ Ckn_W…⁴ Ckn_B…⁵ Ckn_B…⁶ Ckn_W…⁷ Ckn_T…⁸
   <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>     <dbl> <chr>  
 1 Jan …    230     132     230     126     198.    646. too few    194. too few
 2 Febr…    230     134.    226.    128.    198.    642. too few    194. 203    
 3 March    230     137     225     131     209     642. too few    194. 203    
 4 April    234.    137     225     131     212     642. too few    194. 203    
 5 May      236     137     225     131     214.    642. too few    194. 203    
 6 June     241     137     231.    134.    216.    641  too few    202. 200.375
 7 July     241     137     234.    134.    217     642. 390.5      204. 199.5  
 8 Augu…    241     137     234.    134.    217     642. 390.5      204. 199.5  
 9 Sept…    241     136.    234.    130.    217     642. 390.5      204. 199.5  
10 Octo…    241     136.    234.    128.    217     642. 390.5      204. 199.5  
# … with 110 more rows, and abbreviated variable names ¹XL_Dozen,
#   ²`XL_1/2_Doz.`, ³`L_1/2_Doz`, ⁴Ckn_Whole, ⁵Ckn_BS_Breast, ⁶Ckn_Bone_Breast,
#   ⁷Ckn_Whole_legs, ⁸Ckn_Thighs

Identify variables that need to be mutated

In this data, there we’re a few things that needed to be cleaned up. There was a typo or a “/1” in the data that needed to be mutated and removed. Additionally, I separated the month and date into two separate columns.

Code

### remove that /1 from the February date
raw_poulty_clean <-raw_poulty %>% 
  mutate(date = str_remove(date, " /1"))
### Separate the month and the year, fill the years in for the rest of the months
raw_poulty_clean<-raw_poulty_clean %>%
  separate(date, into=c("month", "year"), sep =" ")%>%
  fill(year)
raw_poulty_clean

# A tibble: 120 × 12
   month    year  XL_Do…¹ XL_1/…² L_Dozen L_1/2…³ Remove Ckn_W…⁴ Ckn_B…⁵ Ckn_B…⁶
   <chr>    <chr>   <dbl>   <dbl>   <dbl>   <dbl> <lgl>    <dbl>   <dbl> <chr>  
 1 Jan      2004     230     132     230     126  NA        198.    646. too few
 2 February 2004     230     134.    226.    128. NA        198.    642. too few
 3 March    2004     230     137     225     131  NA        209     642. too few
 4 April    2004     234.    137     225     131  NA        212     642. too few
 5 May      2004     236     137     225     131  NA        214.    642. too few
 6 June     2004     241     137     231.    134. NA        216.    641  too few
 7 July     2004     241     137     234.    134. NA        217     642. 390.5  
 8 August   2004     241     137     234.    134. NA        217     642. 390.5  
 9 Septemb… 2004     241     136.    234.    130. NA        217     642. 390.5  
10 October  2004     241     136.    234.    128. NA        217     642. 390.5  
# … with 110 more rows, 2 more variables: Ckn_Whole_legs <dbl>,
#   Ckn_Thighs <chr>, and abbreviated variable names ¹XL_Dozen, ²`XL_1/2_Doz.`,
#   ³`L_1/2_Doz`, ⁴Ckn_Whole, ⁵Ckn_BS_Breast, ⁶Ckn_Bone_Breast

Now I decided to use lubridate for month and year since this data did not include any data that indicated days. I had it added to its own row. It does not add much in terms of story to this data set, but it is more efficent and visually appealing.

Code

raw_poulty_clean<-raw_poulty_clean%>%
  mutate(date = str_c(month, year, sep = " "), ###combine month and year into date column
         date = my(date) ### lubridate will automatically make them into date format
         )
  
  select(raw_poulty_clean, month, year, date)

# A tibble: 120 × 3
   month     year  date      
   <chr>     <chr> <date>    
 1 Jan       2004  2004-01-01
 2 February  2004  2004-02-01
 3 March     2004  2004-03-01
 4 April     2004  2004-04-01
 5 May       2004  2004-05-01
 6 June      2004  2004-06-01
 7 July      2004  2004-07-01
 8 August    2004  2004-08-01
 9 September 2004  2004-09-01
10 October   2004  2004-10-01
# … with 110 more rows

Before I could do my sanity check on the eggs and the chicken, I needed to pivot longer. Before I could do that, I needed to recode the “too few” entries into 0. However, changing them to “0” was not enough I needed to re-code them as integers as well.

Code

ckn_edited<- raw_poulty_clean %>%
  mutate(Ckn_Bone_Breast = recode(Ckn_Bone_Breast, `too few` = "0"),
      Ckn_Thighs = recode(Ckn_Thighs, `too few`="0"))
ckn_edited

# A tibble: 120 × 13
   month    year  XL_Do…¹ XL_1/…² L_Dozen L_1/2…³ Remove Ckn_W…⁴ Ckn_B…⁵ Ckn_B…⁶
   <chr>    <chr>   <dbl>   <dbl>   <dbl>   <dbl> <lgl>    <dbl>   <dbl> <chr>  
 1 Jan      2004     230     132     230     126  NA        198.    646. 0      
 2 February 2004     230     134.    226.    128. NA        198.    642. 0      
 3 March    2004     230     137     225     131  NA        209     642. 0      
 4 April    2004     234.    137     225     131  NA        212     642. 0      
 5 May      2004     236     137     225     131  NA        214.    642. 0      
 6 June     2004     241     137     231.    134. NA        216.    641  0      
 7 July     2004     241     137     234.    134. NA        217     642. 390.5  
 8 August   2004     241     137     234.    134. NA        217     642. 390.5  
 9 Septemb… 2004     241     136.    234.    130. NA        217     642. 390.5  
10 October  2004     241     136.    234.    128. NA        217     642. 390.5  
# … with 110 more rows, 3 more variables: Ckn_Whole_legs <dbl>,
#   Ckn_Thighs <chr>, date <date>, and abbreviated variable names ¹XL_Dozen,
#   ²`XL_1/2_Doz.`, ³`L_1/2_Doz`, ⁴Ckn_Whole, ⁵Ckn_BS_Breast, ⁶Ckn_Bone_Breast

Below I recoded the 0s into integers rather than characters.

Code

ckn_edited$Ckn_Bone_Breast<-as.numeric(ckn_edited$Ckn_Bone_Breast)
ckn_edited$Ckn_Thighs<-as.numeric(ckn_edited$Ckn_Thighs)
ckn_edited

# A tibble: 120 × 13
   month    year  XL_Do…¹ XL_1/…² L_Dozen L_1/2…³ Remove Ckn_W…⁴ Ckn_B…⁵ Ckn_B…⁶
   <chr>    <chr>   <dbl>   <dbl>   <dbl>   <dbl> <lgl>    <dbl>   <dbl>   <dbl>
 1 Jan      2004     230     132     230     126  NA        198.    646.      0 
 2 February 2004     230     134.    226.    128. NA        198.    642.      0 
 3 March    2004     230     137     225     131  NA        209     642.      0 
 4 April    2004     234.    137     225     131  NA        212     642.      0 
 5 May      2004     236     137     225     131  NA        214.    642.      0 
 6 June     2004     241     137     231.    134. NA        216.    641       0 
 7 July     2004     241     137     234.    134. NA        217     642.    390.
 8 August   2004     241     137     234.    134. NA        217     642.    390.
 9 Septemb… 2004     241     136.    234.    130. NA        217     642.    390.
10 October  2004     241     136.    234.    128. NA        217     642.    390.
# … with 110 more rows, 3 more variables: Ckn_Whole_legs <dbl>,
#   Ckn_Thighs <dbl>, date <date>, and abbreviated variable names ¹XL_Dozen,
#   ²`XL_1/2_Doz.`, ³`L_1/2_Doz`, ⁴Ckn_Whole, ⁵Ckn_BS_Breast, ⁶Ckn_Bone_Breast

I then pivoted longer to include the previous egg/chicken types as rows rather than columns so it was easier to do a sanity check.

Code

eggs_longer<- pivot_longer(ckn_edited,  cols=c("XL_Dozen", "XL_1/2_Doz.", "L_Dozen", "L_1/2_Doz", "Ckn_Whole", "Ckn_BS_Breast", "Ckn_Bone_Breast", "Ckn_Whole_legs", "Ckn_Thighs"), 
               names_to = "eggType/cknType",
               values_to = "avgPrice"
  )
eggs_longer

# A tibble: 1,080 × 6
   month    year  Remove date       `eggType/cknType` avgPrice
   <chr>    <chr> <lgl>  <date>     <chr>                <dbl>
 1 Jan      2004  NA     2004-01-01 XL_Dozen              230 
 2 Jan      2004  NA     2004-01-01 XL_1/2_Doz.           132 
 3 Jan      2004  NA     2004-01-01 L_Dozen               230 
 4 Jan      2004  NA     2004-01-01 L_1/2_Doz             126 
 5 Jan      2004  NA     2004-01-01 Ckn_Whole             198.
 6 Jan      2004  NA     2004-01-01 Ckn_BS_Breast         646.
 7 Jan      2004  NA     2004-01-01 Ckn_Bone_Breast         0 
 8 Jan      2004  NA     2004-01-01 Ckn_Whole_legs        194.
 9 Jan      2004  NA     2004-01-01 Ckn_Thighs              0 
10 February 2004  NA     2004-02-01 XL_Dozen              230 
# … with 1,070 more rows

Below I included the basic summary statistics like median, mode, min, max and standard deviation to show that the lubridate function and the pivot longer function worked as intended.

Code

eggs_longer%>%
  group_by(date)%>%
  summarise (
      sd_year = sd (avgPrice, na.rm=TRUE),
      max_year = max(avgPrice, na.rm = TRUE),
      min_year = min(avgPrice, na.rm = TRUE),
      avg_year = mean (avgPrice, na.rm = TRUE),
      med_year = median(avgPrice, na.rm = TRUE)
        )

# A tibble: 120 × 6
   date       sd_year max_year min_year avg_year med_year
   <date>       <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
 1 2004-01-01    190.     646.       0      195.     194.
 2 2004-02-01    175.     642.       0      217.     198.
 3 2004-03-01    174.     642.       0      219      203 
 4 2004-04-01    174.     642.       0      220.     203 
 5 2004-05-01    174.     642.       0      220.     203 
 6 2004-06-01    174.     641        0      222.     202.
 7 2004-07-01    160.     642.     134.     266.     217 
 8 2004-08-01    160.     642.     134.     266.     217 
 9 2004-09-01    160.     642.     130.     266.     217 
10 2004-10-01    160.     642.     128.     266.     217 
# … with 110 more rows