Code
library(tidyverse)
library(readxl)
library(summarytools)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Ryan O’Donnell
August 17, 2022
Today’s challenge is to:
pivot_longer
Read in one (or more) of the following datasets, using the correct R package and command.
egg <- read_xls("_data/organiceggpoultry.xls",
skip = 5,
col_names = c("month_year", "egg-xl_12", "egg-xl_6", "egg-lg_12", "egg-lg_6", "delete", "chx-whole", "chx-bnl_breast", "chx-bone_breast", "chx-legs", "chx-thighs"),
na = c("too few"))
eggs <- egg %>%
mutate(delete = NULL) %>%
separate(col = month_year,
into = c("month", "year")) %>%
fill(year)
summary(eggs)
month year egg-xl_12 egg-xl_6
Length:120 Length:120 Min. :230.0 Min. :132.0
Class :character Class :character 1st Qu.:241.5 1st Qu.:135.8
Mode :character Mode :character Median :285.5 Median :185.5
Mean :266.8 Mean :164.2
3rd Qu.:285.5 3rd Qu.:185.5
Max. :290.0 Max. :188.1
egg-lg_12 egg-lg_6 chx-whole chx-bnl_breast
Min. :225.0 Min. :126.0 Min. :197.5 Min. :637.5
1st Qu.:233.5 1st Qu.:129.4 1st Qu.:220.5 1st Qu.:645.1
Median :267.5 Median :174.5 Median :235.0 Median :645.5
Mean :254.2 Mean :155.2 Mean :230.5 Mean :654.9
3rd Qu.:268.0 3rd Qu.:174.5 3rd Qu.:238.5 3rd Qu.:645.5
Max. :277.5 Max. :178.0 Max. :248.0 Max. :703.8
chx-bone_breast chx-legs chx-thighs
Min. :390.5 Min. :193.5 Min. :199.5
1st Qu.:390.5 1st Qu.:203.5 1st Qu.:215.0
Median :390.5 Median :203.5 Median :222.0
Mean :390.5 Mean :203.1 Mean :217.8
3rd Qu.:390.5 3rd Qu.:203.5 3rd Qu.:222.0
Max. :390.5 Max. :203.5 Max. :222.0
NA's :6 NA's :1
# A tibble: 6 × 11
month year egg-x…¹ egg-x…² egg-l…³ egg-l…⁴ chx-w…⁵ chx-b…⁶ chx-b…⁷ chx-l…⁸
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Jan 2004 230 132 230 126 198. 646. NA 194.
2 February 2004 230 134. 226. 128. 198. 642. NA 194.
3 March 2004 230 137 225 131 209 642. NA 194.
4 April 2004 234. 137 225 131 212 642. NA 194.
5 May 2004 236 137 225 131 214. 642. NA 194.
6 June 2004 241 137 231. 134. 216. 641 NA 202.
# … with 1 more variable: `chx-thighs` <dbl>, and abbreviated variable names
# ¹`egg-xl_12`, ²`egg-xl_6`, ³`egg-lg_12`, ⁴`egg-lg_6`, ⁵`chx-whole`,
# ⁶`chx-bnl_breast`, ⁷`chx-bone_breast`, ⁸`chx-legs`
Describe the data, and be sure to comment on why you are planning to pivot it to make it “tidy”
organicpoultry.xls is an Excel file that contains 3 sheets. The first is a compilation of the data collected on the second and third sheet: organic egg prices 2004-2013, and organic poultry prices 2004-2013, respectively. All three sheets have the data presented visually which makes it messy. The first is the tidiest and contains the most information, so that is the sheet I will be working with. The column names were stored across multiple rows and so I renamed them upon import and deleted the empty column between the egg prices and the chicken prices. I also set the NA variable which was stored in this data as “too few.”
I also had to split the Month_Year column into two and fill the year down so that there is a year in every row. If the unique case is the “price”, I will have to pivot the data to be much longer with the new variables being the product. Since there are two major categories of project, I will split this into columns, product category (egg, chicken) and product type (size eggs, dozen or half dozen, parts of the chicken).
The first step in pivoting the data is to try to come up with a concrete vision of what the end product should look like - that way you will know whether or not your pivoting was successful.*
One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.*
Suppose you have a dataset with \(n\) rows and \(k\) variables. In our example, 3 of the variables are used to identify a case, so you will be pivoting \(k-3\) variables into a longer format where the \(k-3\) variable names will move into the names_to
variable and the current values in each of those columns will move into the values_to
variable. Therefore, we would expect \(n * (k-3)\) rows in the pivoted dataframe!*
Lets see if this works with a simple example.
# A tibble: 6 × 5
country year trade outgoing incoming
<chr> <dbl> <chr> <dbl> <dbl>
1 Mexico 1980 NAFTA 610. -135.
2 USA 1990 NAFTA 422. 1068.
3 France 1980 EU 1453. 330.
4 Mexico 1990 NAFTA 715. 416.
5 USA 1980 NAFTA 402. 1045.
6 France 1990 EU 1354. 60.5
[1] 6
[1] 5
[1] 12
[1] 5
Our simple example has \(n = 6\) rows and \(k - 3 = 2\) variables being pivoted, so we expect a new dataframe to have \(n * 2 = 12\) rows x \(3 + 2 = 5\) columns.
Document your work here.
[1] 120
[1] 11
[1] 1200
[1] 3
Any additional comments?
Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.
# A tibble: 12 × 5
country year trade trade_direction trade_value
<chr> <dbl> <chr> <chr> <dbl>
1 Mexico 1980 NAFTA outgoing 610.
2 Mexico 1980 NAFTA incoming -135.
3 USA 1990 NAFTA outgoing 422.
4 USA 1990 NAFTA incoming 1068.
5 France 1980 EU outgoing 1453.
6 France 1980 EU incoming 330.
7 Mexico 1990 NAFTA outgoing 715.
8 Mexico 1990 NAFTA incoming 416.
9 USA 1980 NAFTA outgoing 402.
10 USA 1980 NAFTA incoming 1045.
11 France 1990 EU outgoing 1354.
12 France 1990 EU incoming 60.5
Yes, once it is pivoted long, our resulting data are \(12x5\) - exactly what we expected!
Document your work here. What will a new “case” be once you have pivoted the data? How does it meet requirements for tidy data?
A new case will be the price. It will be described by the month, year, product category, and product type. This is tidy because you will easily be able to analyze the price and then group by the different categories and time periods.
eggs_pivoted <- pivot_longer(eggs,
col = c(`egg-xl_12`, `egg-xl_6`, `egg-lg_12`, `egg-lg_6`, `chx-whole`, `chx-bnl_breast`, `chx-bone_breast`, `chx-legs`, `chx-thighs`),
names_to = "product_category-product_type",
values_to = "price") %>%
separate(col = `product_category-product_type`,
into = c("product_category", "product_type"),
sep = "-")
eggs_pivoted
# A tibble: 1,080 × 5
month year product_category product_type price
<chr> <chr> <chr> <chr> <dbl>
1 Jan 2004 egg xl_12 230
2 Jan 2004 egg xl_6 132
3 Jan 2004 egg lg_12 230
4 Jan 2004 egg lg_6 126
5 Jan 2004 chx whole 198.
6 Jan 2004 chx bnl_breast 646.
7 Jan 2004 chx bone_breast NA
8 Jan 2004 chx legs 194.
9 Jan 2004 chx thighs NA
10 February 2004 egg xl_12 230
# … with 1,070 more rows
Any additional comments?
The new table has 1080 rows, as expected!
---
title: "Challenge 3 Solution"
author: "Ryan O'Donnell"
desription: "Tidy Data: Pivoting"
date: "08/17/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_3
- eggs
- ryan_odonnell
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(readxl)
library(summarytools)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2. identify what needs to be done to tidy the current data
3. anticipate the shape of pivoted data
4. pivot the data into tidy format using `pivot_longer`
## Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- animal_weights.csv ⭐
- eggs_tidy.csv ⭐⭐ or organiceggpoultry.xls ⭐⭐⭐
- australian_marriage\*.xls ⭐⭐⭐
- USA Households\*.xlsx ⭐⭐⭐⭐
- sce_labor_chart_data_public.xlsx 🌟🌟🌟🌟🌟
```{r}
egg <- read_xls("_data/organiceggpoultry.xls",
skip = 5,
col_names = c("month_year", "egg-xl_12", "egg-xl_6", "egg-lg_12", "egg-lg_6", "delete", "chx-whole", "chx-bnl_breast", "chx-bone_breast", "chx-legs", "chx-thighs"),
na = c("too few"))
eggs <- egg %>%
mutate(delete = NULL) %>%
separate(col = month_year,
into = c("month", "year")) %>%
fill(year)
summary(eggs)
head(eggs)
```
### Briefly describe the data
*Describe the data, and be sure to comment on why you are planning to pivot it to make it "tidy"*
**organicpoultry.xls** is an Excel file that contains 3 sheets. The first is a compilation of the data collected on the second and third sheet: organic egg prices 2004-2013, and organic poultry prices 2004-2013, respectively. All three sheets have the data presented visually which makes it messy. The first is the tidiest and contains the most information, so that is the sheet I will be working with.
The column names were stored across multiple rows and so I renamed them upon import and deleted the empty column between the egg prices and the chicken prices. I also set the NA variable which was stored in this data as "too few."
I also had to split the Month_Year column into two and fill the year down so that there is a year in every row. If the unique case is the "price", I will have to pivot the data to be much longer with the new variables being the product. Since there are two major categories of project, I will split this into columns, product category (egg, chicken) and product type (size eggs, dozen or half dozen, parts of the chicken).
## Anticipate the End Result
The first step in pivoting the data is to try to come up with a concrete vision of what the end product should look like - that way you will know whether or not your pivoting was successful.*
One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.*
Suppose you have a dataset with $n$ rows and $k$ variables. In our example, 3 of the variables are used to identify a case, so you will be pivoting $k-3$ variables into a longer format where the $k-3$ variable names will move into the `names_to` variable and the current values in each of those columns will move into the `values_to` variable. Therefore, we would expect $n * (k-3)$ rows in the pivoted dataframe!*
### Example: find current and future data dimensions
Lets see if this works with a simple example.
```{r}
#| tbl-cap: Example
df<-tibble(country = rep(c("Mexico", "USA", "France"),2),
year = rep(c(1980,1990), 3),
trade = rep(c("NAFTA", "NAFTA", "EU"),2),
outgoing = rnorm(6, mean=1000, sd=500),
incoming = rlogis(6, location=1000,
scale = 400))
df
#existing rows/cases
nrow(df)
#existing columns/cases
ncol(df)
#expected rows/cases
nrow(df) * (ncol(df)-3)
# expected columns
3 + 2
```
Our simple example has $n = 6$ rows and $k - 3 = 2$ variables being pivoted, so we expect a new dataframe to have $n * 2 = 12$ rows x $3 + 2 = 5$ columns.
### Challenge: Describe the final dimensions
Document your work here.
```{r}
#existing rows/cases
nrow(eggs)
#existing columns/cases
ncol(eggs)
#expected rows/cases
nrow(eggs) * (ncol(eggs)-1)
# expected columns
1 + 2
```
*Any additional comments?*
## Pivot the Data
Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a "sanity" check.
### Example
```{r}
#| tbl-cap: Pivoted Example
df<-pivot_longer(df, col = c(outgoing, incoming),
names_to="trade_direction",
values_to = "trade_value")
df
```
Yes, once it is pivoted long, our resulting data are $12x5$ - exactly what we expected!
### Challenge: Pivot the Chosen Data
*Document your work here. What will a new "case" be once you have pivoted the data? How does it meet requirements for tidy data?*
A new case will be the price. It will be described by the month, year, product category, and product type. This is tidy because you will easily be able to analyze the price and then group by the different categories and time periods.
```{r}
eggs_pivoted <- pivot_longer(eggs,
col = c(`egg-xl_12`, `egg-xl_6`, `egg-lg_12`, `egg-lg_6`, `chx-whole`, `chx-bnl_breast`, `chx-bone_breast`, `chx-legs`, `chx-thighs`),
names_to = "product_category-product_type",
values_to = "price") %>%
separate(col = `product_category-product_type`,
into = c("product_category", "product_type"),
sep = "-")
eggs_pivoted
```
*Any additional comments?*
The new table has 1080 rows, as expected!