DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 3

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Challenge Overview
  • Step 1 - Read in data
    • Step 2 - Briefly describe the data
  • Step 3 - Anticipate the End Result
    • Challenge: Describe the final dimensions
  • Step 4- Pivot the Data
    • Challenge: Pivot the Chosen Data

Challenge 3

  • Show All Code
  • Hide All Code

  • View Source
challenge_3
eggs
darron_bunt
Author

Darron Bunt

Published

October 16, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. identify what needs to be done to tidy the current data
  3. anticipate the shape of pivoted data
  4. pivot the data into tidy format using pivot_longer

Step 1 - Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • eggs_tidy.csv ⭐⭐
Code
eggs <- read_csv("_data/eggs_tidy.csv")

Step 2 - Briefly describe the data

Describe the data, and be sure to comment on why you are planning to pivot it to make it “tidy”

The dataset examines the monthly price (in cents) of large and extra large eggs, sold by either the half dozen or by the dozen, between 2004-2013.

For data to be tidy, it must meet three criteria: 1. each variable must have its own column. 2. each observation must have its own row. 3. each value must have its own cell.

The month and the year are good to go - they are their own variable, and in their own columns.

In the remaining four columns, however, there are two egg size variables (large and extra large) two egg quantity variables (half dozen and dozen), as well as a price variable.

Accordingly, to tidy this data, I want to pivot the data such that I will have five columns instead of six. Two of these columns will be the same as the original dataset - month and year - while three will be new - egg size, egg quantity, and egg price.

Step 3 - Anticipate the End Result

The first step in pivoting the data is to try to come up with a concrete vision of what the end product should look like - that way you will know whether or not your pivoting was successful.

When we consider the full dataset, there are 480 potential egg values - 120 month/year combinations, multiplied by four variables (half/full dozen, large/extra large size).

If I want to pivot to five variables - month/year, egg size, egg quantity, and price - there are four variables that are going to dictate the final size of the dataset (egg size (2) and egg quantity (2)). 120 multiplied by 4 is 480, so I’m expecting to have a pivoted dataset that is 480 rows long and has the five identified columns.

Challenge: Describe the final dimensions

Document your work here.

There are two different variables contained within each of the four egg-related columns - the size (large/extra large) and the quantity (dozen/half dozen). I’m going to need to create TWO columns from the values in each of the original columns.

I can use the names_sep option in pivot functions to help me do this. But if my sep is going to be _ I’m going to need some new column names first (because three of the four original columns actually contain two _ each).

Code
eggs_new <- eggs %>%
  rename("Xlarge_halfdozen" = "extra_large_half_dozen",
         "Xlarge_dozen" = "extra_large_dozen",
         "large_halfdozen" = "large_half_dozen"
         )
eggs_new
# A tibble: 120 × 6
   month      year large_halfdozen large_dozen Xlarge_halfdozen Xlarge_dozen
   <chr>     <dbl>           <dbl>       <dbl>            <dbl>        <dbl>
 1 January    2004            126         230              132          230 
 2 February   2004            128.        226.             134.         230 
 3 March      2004            131         225              137          230 
 4 April      2004            131         225              137          234.
 5 May        2004            131         225              137          236 
 6 June       2004            134.        231.             137          241 
 7 July       2004            134.        234.             137          241 
 8 August     2004            134.        234.             137          241 
 9 September  2004            130.        234.             136.         241 
10 October    2004            128.        234.             136.         241 
# … with 110 more rows

Ok, now I have column names that I can use to pivot_longer and create the tibble I’m looking for.

Step 4- Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.

Challenge: Pivot the Chosen Data

Document your work here. What will a new “case” be once you have pivoted the data? How does it meet requirements for tidy data?

Code
eggs_by_type <- eggs_new %>%
  pivot_longer(cols = contains("large"),
               names_to = c("Size", "Quantity"),
               names_sep= "_",
               values_to = "AvgPrice"
               )
eggs_by_type
# A tibble: 480 × 5
   month     year Size   Quantity  AvgPrice
   <chr>    <dbl> <chr>  <chr>        <dbl>
 1 January   2004 large  halfdozen     126 
 2 January   2004 large  dozen         230 
 3 January   2004 Xlarge halfdozen     132 
 4 January   2004 Xlarge dozen         230 
 5 February  2004 large  halfdozen     128.
 6 February  2004 large  dozen         226.
 7 February  2004 Xlarge halfdozen     134.
 8 February  2004 Xlarge dozen         230 
 9 March     2004 large  halfdozen     131 
10 March     2004 large  dozen         225 
# … with 470 more rows

This meets the case for tidy data because: 1. each variable has its own column (month, year, size, quantity, price) 2. Each observation has its own row 3. Every value has its own cell

Boom. Did the thing.

Source Code
---
title: "Challenge 3"
author: "Darron Bunt"
desription: "Tidy Data: Pivoting"
date: "10/16/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_3
  -  eggs
  - darron_bunt
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

Today's challenge is to:

1.  read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2.  identify what needs to be done to tidy the current data
3.  anticipate the shape of pivoted data
4.  pivot the data into tidy format using `pivot_longer`

## Step 1 - Read in data

**Read in one (or more) of the following datasets, using the correct R package and command.**

-   eggs_tidy.csv ⭐⭐ 

```{r}
eggs <- read_csv("_data/eggs_tidy.csv")
```

### Step 2 - Briefly describe the data

**Describe the data, and be sure to comment on why you are planning to pivot it to make it "tidy"**

The dataset examines the monthly price (in cents) of large and extra large eggs, sold by either the half dozen or by the dozen, between 2004-2013.

For data to be tidy, it must meet three criteria:
1. each variable must have its own column.
2. each observation must have its own row.
3. each value must have its own cell.

The month and the year are good to go - they are their own variable, and in their own columns. 

In the remaining four columns, however, there are two egg size variables (large and extra large) two egg quantity variables (half dozen and dozen), as well as a price variable. 

Accordingly, to tidy this data, I want to pivot the data such that I will have five columns instead of six. Two of these columns will be the same as the original dataset - month and year - while three will be new - egg size, egg quantity, and egg price.

## Step 3 - Anticipate the End Result

**The first step in pivoting the data is to try to come up with a concrete vision of what the end product *should* look like - that way you will know whether or not your pivoting was successful.**

When we consider the full dataset, there are 480 potential egg values - 120 month/year combinations, multiplied by four variables (half/full dozen, large/extra large size).

If I want to pivot to five variables - month/year, egg size, egg quantity, and price - there are four variables that are going to dictate the final size of the dataset (egg size (2) and egg quantity (2)). 120 multiplied by 4 is 480, so I'm expecting to have a pivoted dataset that is 480 rows long and has the five identified columns. 


### Challenge: Describe the final dimensions

Document your work here.

There are two different variables contained within each of the four egg-related columns - the size (large/extra large) and the quantity (dozen/half dozen). I'm going to need to create TWO columns from the values in each of the original columns.

I can use the names_sep option in pivot functions to help me do this. But if my sep is going to be _ I'm going to need some new column names first (because three of the four original columns actually contain two _ each).

```{r}
eggs_new <- eggs %>%
  rename("Xlarge_halfdozen" = "extra_large_half_dozen",
         "Xlarge_dozen" = "extra_large_dozen",
         "large_halfdozen" = "large_half_dozen"
         )
eggs_new
```
Ok, now I have column names that I can use to pivot_longer and create the tibble I'm looking for.

## Step 4- Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a "sanity" check.

### Challenge: Pivot the Chosen Data

Document your work here. What will a new "case" be once you have pivoted the data? How does it meet requirements for tidy data?

```{r}
eggs_by_type <- eggs_new %>%
  pivot_longer(cols = contains("large"),
               names_to = c("Size", "Quantity"),
               names_sep= "_",
               values_to = "AvgPrice"
               )
eggs_by_type

```
This meets the case for tidy data because:
1. each variable has its own column (month, year, size, quantity, price)
2. Each observation has its own row 
3. Every value has its own cell

Boom. Did the thing.