Code
library(tidyverse)
library(summarytools)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Nayan Jani
August 17, 2022
Today’s challenge is to:
pivot_longer
Read in one (or more) of the following datasets, using the correct R package and command.
# A tibble: 120 × 6
month year large_half_dozen large_dozen extra_large_half_dozen extra_l…¹
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 January 2004 126 230 132 230
2 February 2004 128. 226. 134. 230
3 March 2004 131 225 137 230
4 April 2004 131 225 137 234.
5 May 2004 131 225 137 236
6 June 2004 134. 231. 137 241
7 July 2004 134. 234. 137 241
8 August 2004 134. 234. 137 241
9 September 2004 130. 234. 136. 241
10 October 2004 128. 234. 136. 241
# … with 110 more rows, and abbreviated variable name ¹extra_large_dozen
# ℹ Use `print(n = ...)` to see more rows
Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
month [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
year [numeric] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
large_half_dozen [numeric] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
large_dozen [numeric] |
|
12 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
extra_large_half_dozen [numeric] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
extra_large_dozen [numeric] |
|
11 distinct values | 0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-28
This dataset is comprised of 120 rows with 6 variables. One variable is a character (month) and the rest are doubles. Based off the summary statistics of the egg variables, I can see that the variance of sales for each dozen of eggs are all high meaning that most cases are not near there average. ## Anticipate the End Result
The column names large_half_dozen, large_dozen, extra_large_half_dozen and extra_large_dozen represent values of the type variable, the values in the columns represent values of the sales variable, and each row represents multiple observations, not one.
Lets see if this works with a simple example.
# A tibble: 6 × 5
country year trade outgoing incoming
<chr> <dbl> <chr> <dbl> <dbl>
1 Mexico 1980 NAFTA 1256. 369.
2 USA 1990 NAFTA 998. 934.
3 France 1980 EU 2122. -678.
4 Mexico 1990 NAFTA 2074. 1980.
5 USA 1980 NAFTA 1136. 602.
6 France 1990 EU 1423. 1876.
[1] 6
[1] 5
[1] 12
[1] 5
Or simple example has \(n = 6\) rows and \(k - 3 = 2\) variables being pivoted, so we expect a new dataframe to have \(n * 2 = 12\) rows x \(3 + 2 = 5\) columns.
Document your work here.
What needs to changed is that the column names that include the values for dozens needs to become its own columns called type and the corresponding values for that type will be stored in a column names sales
Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.
# A tibble: 12 × 5
country year trade trade_direction trade_value
<chr> <dbl> <chr> <chr> <dbl>
1 Mexico 1980 NAFTA outgoing 1256.
2 Mexico 1980 NAFTA incoming 369.
3 USA 1990 NAFTA outgoing 998.
4 USA 1990 NAFTA incoming 934.
5 France 1980 EU outgoing 2122.
6 France 1980 EU incoming -678.
7 Mexico 1990 NAFTA outgoing 2074.
8 Mexico 1990 NAFTA incoming 1980.
9 USA 1980 NAFTA outgoing 1136.
10 USA 1980 NAFTA incoming 602.
11 France 1990 EU outgoing 1423.
12 France 1990 EU incoming 1876.
Yes, once it is pivoted long, our resulting data are \(12x5\) - exactly what we expected!
Document your work here. What will a new “case” be once you have pivoted the data? How does it meet requirements for tidy data?
# A tibble: 480 × 4
month year type sales
<chr> <dbl> <chr> <dbl>
1 January 2004 large_half_dozen 126
2 January 2004 large_dozen 230
3 January 2004 extra_large_half_dozen 132
4 January 2004 extra_large_dozen 230
5 February 2004 large_half_dozen 128.
6 February 2004 large_dozen 226.
7 February 2004 extra_large_half_dozen 134.
8 February 2004 extra_large_dozen 230
9 March 2004 large_half_dozen 131
10 March 2004 large_dozen 225
# … with 470 more rows
# ℹ Use `print(n = ...)` to see more rows
Now this dataset is in tidy form because each row is now one observation. I used pivot_longer because we needed more rows so that each observation could be individual in the dataset. When you go to read the dataset U can see each case only has one value attached to it.
---
title: "Challenge 3 Instructions"
author: "Nayan Jani"
desription: "Tidy Data: Pivoting"
date: "08/17/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_3
- Pivot
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(summarytools)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2. identify what needs to be done to tidy the current data
3. anticipate the shape of pivoted data
4. pivot the data into tidy format using `pivot_longer`
## Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- animal_weights.csv ⭐
- eggs_tidy.csv ⭐⭐ or organicpoultry.xls ⭐⭐⭐
- australian_marriage\*.xlsx ⭐⭐⭐
- USA Households\*.xlsx ⭐⭐⭐⭐
- sce_labor_chart_data_public.csv 🌟🌟🌟🌟🌟
```{r}
eggs <-read_csv("_data/eggs_tidy.csv",
show_col_types = FALSE)
eggs
print(dfSummary(eggs, varnumbers = FALSE,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.70,
valid.col = FALSE),
method = 'render',
table.classes = 'table-condensed')
```
### Briefly describe the data
This dataset is comprised of `r nrow(eggs)` rows with 6 variables. One variable is a character (month) and the rest
are doubles. Based off the summary statistics of the egg variables, I can see that the variance of sales for each dozen of eggs are all high meaning that most cases are not near there average.
## Anticipate the End Result
The column names large_half_dozen, large_dozen, extra_large_half_dozen and extra_large_dozen represent values of the type variable, the values in the columns represent values of the sales variable, and each row represents multiple observations, not one.
### Example: find current and future data dimensions
Lets see if this works with a simple example.
```{r}
#| tbl-cap: Example
df<-tibble(country = rep(c("Mexico", "USA", "France"),2),
year = rep(c(1980,1990), 3),
trade = rep(c("NAFTA", "NAFTA", "EU"),2),
outgoing = rnorm(6, mean=1000, sd=500),
incoming = rlogis(6, location=1000,
scale = 400))
df
#existing rows/cases
nrow(df)
#existing columns/cases
ncol(df)
#expected rows/cases
nrow(df) * (ncol(df)-3)
# expected columns
3 + 2
```
Or simple example has $n = 6$ rows and $k - 3 = 2$ variables being pivoted, so we expect a new dataframe to have $n * 2 = 12$ rows x $3 + 2 = 5$ columns.
### Challenge: Describe the final dimensions
Document your work here.
```{r}
nrow(eggs)
ncol(eggs)
```
What needs to changed is that the column names that include the values for dozens needs to become its own columns called *type* and the corresponding values for that type will be stored in a column names *sales*
## Pivot the Data
Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a "sanity" check.
### Example
```{r}
#| tbl-cap: Pivoted Example
df<-pivot_longer(df, col = c(outgoing, incoming),
names_to="trade_direction",
values_to = "trade_value")
df
```
Yes, once it is pivoted long, our resulting data are $12x5$ - exactly what we expected!
### Challenge: Pivot the Chosen Data
Document your work here. What will a new "case" be once you have pivoted the data? How does it meet requirements for tidy data?
```{r}
eggs <- pivot_longer(eggs, col = c(large_half_dozen, large_dozen, extra_large_half_dozen, extra_large_dozen),
names_to="type",
values_to = "sales")
eggs
```
Now this dataset is in tidy form because each row is now one observation. I used pivot_longer because we needed more rows so that each observation could be individual in the dataset. When you go to read the dataset U can see each case only has one value attached to it.