Challenge 3

challenge_3

Tidy data: pivoting

Author

Miranda Manka

Published

August 18, 2022

Code

library(tidyverse)
library(readxl)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
identify what needs to be done to tidy the current data
anticipate the shape of pivoted data
pivot the data into tidy format using pivot_longer

Read in data

Code

organic_eggs = read_excel("_data/organiceggpoultry.xls",
  sheet = "Data", 
  range = "B6:F125", 
  col_names = c("date", "extralarge_dozen", "extralarge_halfdozen", 
     "large_dozen", "large_halfdozen"))

organic_eggs = organic_eggs %>% 
  mutate(date = str_remove(date, " /1"))

organic_eggs = organic_eggs %>% 
  separate(date, into = c("month", "year"), sep = " ") %>% 
  fill(year)

dim(organic_eggs)

[1] 120   6

Code

head(organic_eggs)

# A tibble: 6 × 6
  month    year  extralarge_dozen extralarge_halfdozen large_dozen large_halfd…¹
  <chr>    <chr>            <dbl>                <dbl>       <dbl>         <dbl>
1 Jan      2004              230                  132         230           126 
2 February 2004              230                  134.        226.          128.
3 March    2004              230                  137         225           131 
4 April    2004              234.                 137         225           131 
5 May      2004              236                  137         225           131 
6 June     2004              241                  137         231.          134.
# … with abbreviated variable name ¹large_halfdozen

Briefly describe the data

After reading in the data and doing some cleaning, the dataset is ready to move on to the next step. This is a dataset about organic egg costs over time for different sizes and quantities of eggs (both large and extra large, dozen and half dozen eggs, for each month from 2004 to 2013), with 120 rows and 6 columns. The 6 columns are month, year, extralarge_dozen, extralarge_halfdozen, large_dozen, and large_halfdozen. As explained in class, these names make it easier to pivot data in the future (think about what the final pivoted data should look like and name accordingly). This data needs to be pivoted because right now each row does not represent a unique observation (for example Jan 2004 has prices for 4 different types of eggs).

Find current data dimensions & Anticipate the end result

This dataset has n = 120 rows and k = 6 variables (the dimensions were found in the code above). Since 2 of the variables are used to identify a case (month and year), k - 2 = 6 - 2 = 4 variables will be pivoted into a longer format. Therefore, there should be n * (k - 2) = 120 * (6 - 2) = 120 * 4 = 480 rows in the pivoted dataframe.

Pivot the data & Describe the final dimensions

Code

organic_eggs = organic_eggs %>% 
  pivot_longer(cols = contains("dozen"), names_to = c("size", "quantity"), 
     names_sep="_", values_to = "price")

dim(organic_eggs)

[1] 480   5

As expected, the final dataset has 480 rows. Each row (case) is a unique observation that contains the month, year, size, quantity, and price. The data has been successfully pivoted so that further analysis can take place. The data is now tidy because each row now contains a single observation/case (month and year information, ttype and quantity, and price) instead of what the dataset started at (date and 4 different columns for the different size and quantities). The naming conventions and thinking about what the final dataset will be, helped to pivot smoothly to better rows and columns.