Code
library(tidyverse)
library(readxl)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Miranda Manka
August 18, 2022
Today’s challenge is to:
pivot_longer
organic_eggs = read_excel("_data/organiceggpoultry.xls",
sheet = "Data",
range = "B6:F125",
col_names = c("date", "extralarge_dozen", "extralarge_halfdozen",
"large_dozen", "large_halfdozen"))
organic_eggs = organic_eggs %>%
mutate(date = str_remove(date, " /1"))
organic_eggs = organic_eggs %>%
separate(date, into = c("month", "year"), sep = " ") %>%
fill(year)
dim(organic_eggs)
[1] 120 6
# A tibble: 6 × 6
month year extralarge_dozen extralarge_halfdozen large_dozen large_halfd…¹
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Jan 2004 230 132 230 126
2 February 2004 230 134. 226. 128.
3 March 2004 230 137 225 131
4 April 2004 234. 137 225 131
5 May 2004 236 137 225 131
6 June 2004 241 137 231. 134.
# … with abbreviated variable name ¹large_halfdozen
After reading in the data and doing some cleaning, the dataset is ready to move on to the next step. This is a dataset about organic egg costs over time for different sizes and quantities of eggs (both large and extra large, dozen and half dozen eggs, for each month from 2004 to 2013), with 120 rows and 6 columns. The 6 columns are month, year, extralarge_dozen, extralarge_halfdozen, large_dozen, and large_halfdozen. As explained in class, these names make it easier to pivot data in the future (think about what the final pivoted data should look like and name accordingly). This data needs to be pivoted because right now each row does not represent a unique observation (for example Jan 2004 has prices for 4 different types of eggs).
This dataset has n = 120 rows and k = 6 variables (the dimensions were found in the code above). Since 2 of the variables are used to identify a case (month and year), k - 2 = 6 - 2 = 4 variables will be pivoted into a longer format. Therefore, there should be n * (k - 2) = 120 * (6 - 2) = 120 * 4 = 480 rows in the pivoted dataframe.
[1] 480 5
As expected, the final dataset has 480 rows. Each row (case) is a unique observation that contains the month, year, size, quantity, and price. The data has been successfully pivoted so that further analysis can take place. The data is now tidy because each row now contains a single observation/case (month and year information, ttype and quantity, and price) instead of what the dataset started at (date and 4 different columns for the different size and quantities). The naming conventions and thinking about what the final dataset will be, helped to pivot smoothly to better rows and columns.
---
title: "Challenge 3"
author: "Miranda Manka"
description: "Tidy data: pivoting"
date: "08/18/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_3
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(readxl)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2. identify what needs to be done to tidy the current data
3. anticipate the shape of pivoted data
4. pivot the data into tidy format using `pivot_longer`
## Read in data
```{r}
organic_eggs = read_excel("_data/organiceggpoultry.xls",
sheet = "Data",
range = "B6:F125",
col_names = c("date", "extralarge_dozen", "extralarge_halfdozen",
"large_dozen", "large_halfdozen"))
organic_eggs = organic_eggs %>%
mutate(date = str_remove(date, " /1"))
organic_eggs = organic_eggs %>%
separate(date, into = c("month", "year"), sep = " ") %>%
fill(year)
dim(organic_eggs)
head(organic_eggs)
```
### Briefly describe the data
After reading in the data and doing some cleaning, the dataset is ready to move on to the next step. This is a dataset about organic egg costs over time for different sizes and quantities of eggs (both large and extra large, dozen and half dozen eggs, for each month from 2004 to 2013), with 120 rows and 6 columns. The 6 columns are month, year, extralarge_dozen, extralarge_halfdozen, large_dozen, and large_halfdozen. As explained in class, these names make it easier to pivot data in the future (think about what the final pivoted data should look like and name accordingly). This data needs to be pivoted because right now each row does not represent a unique observation (for example Jan 2004 has prices for 4 different types of eggs).
## Find current data dimensions & Anticipate the end result
This dataset has n = 120 rows and k = 6 variables (the dimensions were found in the code above). Since 2 of the variables are used to identify a case (month and year), k - 2 = 6 - 2 = 4 variables will be pivoted into a longer format. Therefore, there should be n * (k - 2) = 120 * (6 - 2) = 120 * 4 = 480 rows in the pivoted dataframe.
### Pivot the data & Describe the final dimensions
```{r}
organic_eggs = organic_eggs %>%
pivot_longer(cols = contains("dozen"), names_to = c("size", "quantity"),
names_sep="_", values_to = "price")
dim(organic_eggs)
```
As expected, the final dataset has 480 rows. Each row (case) is a unique observation that contains the month, year, size, quantity, and price. The data has been successfully pivoted so that further analysis can take place. The data is now tidy because each row now contains a single observation/case (month and year information, ttype and quantity, and price) instead of what the dataset started at (date and 4 different columns for the different size and quantities). The naming conventions and thinking about what the final dataset will be, helped to pivot smoothly to better rows and columns.