challenge_3
eggs_tidy
Tidy Data: Pivoting
Author

Shoshana Buck

Published

August 17, 2022

Code
library(tidyverse)
library(readxl)
library(lubridate)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

Code
eggs<-read_excel("_data/organiceggpoultry.xls",
                      sheet="Data",
                      range =cell_limits(c(6,2),c(NA,6)),
                      col_names = c("date", "xlarge_dozen",
                               "xlarge_halfdozen", "large_dozen",
                               "large_halfdozen")
     )
new_eggs<-eggs %>%  
  mutate(date = str_remove(date, "/1")
  )
head(new_eggs)
# A tibble: 6 × 5
  date     xlarge_dozen xlarge_halfdozen large_dozen large_halfdozen
  <chr>           <dbl>            <dbl>       <dbl>           <dbl>
1 Jan 2004         230              132         230             126 
2 February         230              134.        226.            128.
3 March            230              137         225             131 
4 April            234.             137         225             131 
5 May              236              137         225             131 
6 June             241              137         231.            134.

Briefly describe the data

The U.S Department of Agriculture conducted monthly reports about the amount, size, and price of organic eggs starting in 2004 and ending in 2013. There are 120 observations, which are the prices of the organic eggs and 6 variables (month, year, xlarge_dozen, xlarge_halfdozen, large_dozen, and large_halfdozen), within this data set. To better understand the data set it is best to tidy the data first and then use the function pivot_longer() which collapses multiple columns into one column and flips the columns to rows.

Code
new_eggs<-new_eggs%>%
  separate(date, into=c("month", "year"), sep=" ")%>%
  fill(year)
new_eggs
# A tibble: 120 × 6
   month     year  xlarge_dozen xlarge_halfdozen large_dozen large_halfdozen
   <chr>     <chr>        <dbl>            <dbl>       <dbl>           <dbl>
 1 Jan       2004          230              132         230             126 
 2 February  2004          230              134.        226.            128.
 3 March     2004          230              137         225             131 
 4 April     2004          234.             137         225             131 
 5 May       2004          236              137         225             131 
 6 June      2004          241              137         231.            134.
 7 July      2004          241              137         234.            134.
 8 August    2004          241              137         234.            134.
 9 September 2004          241              136.        234.            130.
10 October   2004          241              136.        234.            128.
# … with 110 more rows
# ℹ Use `print(n = ...)` to see more rows

Pivot the Data

The date set has n= 120 rows and k= 6 columns, which means that when using the function pivot_longer() the k-6 values are going to pivot to the rows using the ‘names_to’ variable and the current values will move using the ‘values_to’ variable. The data set is complex because the 120 cases have a month-year combination, whereas the 6 columns have four different prices based off the size of the organic eggs.

The dimension of the data can be calculated using the formula n*(k-2). In this case, k=6 so k-2 = 4 and n = 120. So, it can be expected that the new_eggs_long data should have 480 rows.

Code
new_eggs_long<-new_eggs %>% 
  pivot_longer(cols= contains("large"),
names_to = "eggtype",
values_to = "avgprice" 
)