Example Code for Pivot Longer

example code data cleaning

I’m sharing some example code for pivot_longer using the eggs data. Enjoy!

Meredith Rolfe true
08-17-2021

This is example code for using the pivot functions in R. Several of the government data sources include tabular data that really need to be pivoted into a dataset in which a “case” is some combination of the grouping variables (the rows and columns in the table) alongside the appropriate statistical value(s) in the table (e.g., counts or average costs). Lets start with the easy to read in eggs_tidy.csv just so we can focus on the pivoting function.

eggs<-read_csv("../../_data/eggs_tidy.csv", show_col_types = FALSE)
eggs
# A tibble: 120 × 6
   month      year large_half_dozen large_dozen extra_large_half_dozen
   <chr>     <dbl>            <dbl>       <dbl>                  <dbl>
 1 January    2004             126         230                    132 
 2 February   2004             128.        226.                   134.
 3 March      2004             131         225                    137 
 4 April      2004             131         225                    137 
 5 May        2004             131         225                    137 
 6 June       2004             134.        231.                   137 
 7 July       2004             134.        234.                   137 
 8 August     2004             134.        234.                   137 
 9 September  2004             130.        234.                   136.
10 October    2004             128.        234.                   136.
# … with 110 more rows, and 1 more variable: extra_large_dozen <dbl>

Looking at the data, we can see that each case consists of a year-month combination (e.g., January 2004), while the values are the average price (in cents) of four different types of eggs (e.g., large_half_dozen, large_dozen, etc) But really, wouldn’t it possibly make more sense to consider the case as a year-month-type combination, with a single price value for each case?

Pivot Longer - One New Category Variable

To do this (and make our data easier to graph and analyze), we can pivot longer - changing our data from 120 rows with 6 variables (2 grouping and 4 values) to 480 rows of 4 variables (with 3 grouping variables and a single price value).

eggs%>%
  pivot_longer(cols=large_half_dozen:extra_large_dozen, 
               names_to = "eggType",
               values_to = "avgPrice"
  )
# A tibble: 480 × 4
   month     year eggType                avgPrice
   <chr>    <dbl> <chr>                     <dbl>
 1 January   2004 large_half_dozen           126 
 2 January   2004 large_dozen                230 
 3 January   2004 extra_large_half_dozen     132 
 4 January   2004 extra_large_dozen          230 
 5 February  2004 large_half_dozen           128.
 6 February  2004 large_dozen                226.
 7 February  2004 extra_large_half_dozen     134.
 8 February  2004 extra_large_dozen          230 
 9 March     2004 large_half_dozen           131 
10 March     2004 large_dozen                225 
# … with 470 more rows

Well, that was super easy. But wait, what if you are interested in egg size - you want to know how much more expensive extra-large eggs are compared to large eggs. Right now, that will be annoying, as you will have to keep sorting out the egg quantity - whether the price is for a half_dozen or a dozen eggs. Wouldn’t it be nice if we didn’t have a long egg type column with both size and quantity squashed into a single categorical variable? It would be so useful to have a new dataset with 4 grouping variables (year, month, size, and quantity) and the same value (price).

Pivot Longer - Two New Category Variables

So, once again we want to use pivot longer, but we will be adding two new category variables (for a total of 4) and this will cut the number of rows in half (to 240). But how in the world can we let R know what we want it to do?? Thankfully, someone named the egg types (column-names) pretty systematically, but how can use this to our advantage? Working with patterns in the names_sep option of the pivot functions makes it pretty easy (well, except our variable names have more than one underscore, so we have to sort of hack this part by also using mutate on the resulting category labels.)

eggs%>%
  pivot_longer(cols=large_half_dozen:extra_large_dozen,
               names_to = c("size", "quantity"),
               names_sep="arge_",
               values_to = "price"
  ) %>%
  mutate(size = case_when(
    size == "l" ~ "Large",
    size == "extra_l" ~ "Extra Large"
  ))
# A tibble: 480 × 5
   month     year size        quantity   price
   <chr>    <dbl> <chr>       <chr>      <dbl>
 1 January   2004 Large       half_dozen  126 
 2 January   2004 Large       dozen       230 
 3 January   2004 Extra Large half_dozen  132 
 4 January   2004 Extra Large dozen       230 
 5 February  2004 Large       half_dozen  128.
 6 February  2004 Large       dozen       226.
 7 February  2004 Extra Large half_dozen  134.
 8 February  2004 Extra Large dozen       230 
 9 March     2004 Large       half_dozen  131 
10 March     2004 Large       dozen       225 
# … with 470 more rows

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Rolfe (2021, Aug. 17). DACSS 601 August 2021: Example Code for Pivot Longer. Retrieved from https://mrolfe.github.io/DACSS601August2021/posts/2021-08-17-example-code-for-pivot-longer/

BibTeX citation

@misc{rolfe2021example,
  author = {Rolfe, Meredith},
  title = {DACSS 601 August 2021: Example Code for Pivot Longer},
  url = {https://mrolfe.github.io/DACSS601August2021/posts/2021-08-17-example-code-for-pivot-longer/},
  year = {2021}
}