DACSS 601: Data Science Fundamentals - FALL 2022

Challenge 3

Categories: challenge_3, animal_weights, eggs, australian_marriage, usa_households, sce_labor

Author: Jack Sniezek

Published: December 1, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set and describe it using both words and supporting information (e.g., tables)
  2. identify what needs to be done to tidy the current data
  3. anticipate the shape of pivoted data
  4. pivot the data into tidy format using pivot_longer

Read in data

  • eggs_tidy.csv ⭐⭐
Code
# Read the data and shorten the column names so each has the form size_quantity
eggs <- read_csv("_data/eggs_tidy.csv") %>%
  rename(xlarge_halfdozen = extra_large_half_dozen,
         xlarge_dozen = extra_large_dozen,
         large_halfdozen = large_half_dozen)
eggs
# A tibble: 120 × 6
   month      year large_halfdozen large_dozen xlarge_halfdozen xlarge_dozen
   <chr>     <dbl>           <dbl>       <dbl>            <dbl>        <dbl>
 1 January    2004            126         230              132          230 
 2 February   2004            128.        226.             134.         230 
 3 March      2004            131         225              137          230 
 4 April      2004            131         225              137          234.
 5 May        2004            131         225              137          236 
 6 June       2004            134.        231.             137          241 
 7 July       2004            134.        234.             137          241 
 8 August     2004            134.        234.             137          241 
 9 September  2004            130.        234.             136.         241 
10 October    2004            128.        234.             136.         241 
# … with 110 more rows
Code
summary(eggs)
    month                year      large_halfdozen  large_dozen   
 Length:120         Min.   :2004   Min.   :126.0   Min.   :225.0  
 Class :character   1st Qu.:2006   1st Qu.:129.4   1st Qu.:233.5  
 Mode  :character   Median :2008   Median :174.5   Median :267.5  
                    Mean   :2008   Mean   :155.2   Mean   :254.2  
                    3rd Qu.:2011   3rd Qu.:174.5   3rd Qu.:268.0  
                    Max.   :2013   Max.   :178.0   Max.   :277.5  
 xlarge_halfdozen  xlarge_dozen  
 Min.   :132.0    Min.   :230.0  
 1st Qu.:135.8    1st Qu.:241.5  
 Median :185.5    Median :285.5  
 Mean   :164.2    Mean   :266.8  
 3rd Qu.:185.5    3rd Qu.:285.5  
 Max.   :188.1    Max.   :290.0  

Briefly describe the data

The eggs dataset has 120 rows, one for each month from 2004 through 2013, and 6 columns: month, year, and the average price for each of 4 combinations of egg size (large, extra large) and quantity (half dozen, dozen).

When reading in the data, I also renamed the columns so that each price column name has exactly one underscore, separating size from quantity (for example, xlarge_halfdozen). This lets me split the names cleanly when I pivot.
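To illustrate why the rename matters, here is a small sketch that counts how many pieces each column name produces when split on an underscore. The vector names `original` and `renamed` are my own labels for illustration. The column names themselves come from the `rename()` call above.

```r
# Column names before renaming, as listed in the rename() call
original <- c("large_half_dozen", "large_dozen",
              "extra_large_half_dozen", "extra_large_dozen")

# Column names after renaming
renamed <- c("large_halfdozen", "large_dozen",
             "xlarge_halfdozen", "xlarge_dozen")

# Count the pieces each name yields when split on "_"
pieces_original <- lengths(strsplit(original, "_", fixed = TRUE))
pieces_renamed  <- lengths(strsplit(renamed, "_", fixed = TRUE))

pieces_original  # 3 2 4 3: inconsistent, so a two-way split on "_" would not line up
pieces_renamed   # 2 2 2 2: every name splits into exactly size and quantity
```

With the original names, splitting on "_" would not give a consistent size/quantity pair for every column. After the rename, every name splits into exactly two pieces.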

Anticipate the End Result

Right now the data has 6 columns: 2 identify the observation (month and year) and 4 hold price values. To make the data easier to work with, I want a single column of values (price) plus two new columns for the size and quantity of eggs. The new data frame will therefore contain month, year, size, quantity, and price. I also anticipate 480 rows, since all the price values will be stacked into one column (120 rows × 4 price columns).
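The expected shape can also be worked out as a short calculation. This is only a sketch of the arithmetic: the variable names are hypothetical, and the starting dimensions are copied from the tibble header printed above (120 × 6).

```r
# Dimensions of the wide data, taken from the printed tibble header
n_rows_wide  <- 120
n_cols_wide  <- 6
n_price_cols <- 4  # the four size/quantity price columns being pivoted
n_name_cols  <- 2  # "size" and "quantity", built from the old column names

# Each wide row contributes one long row per price column
expected_rows <- n_rows_wide * n_price_cols

# Keep the non-price columns, add the name columns, plus one "price" column
expected_cols <- (n_cols_wide - n_price_cols) + n_name_cols + 1

c(rows = expected_rows, cols = expected_cols)
```

In the actual analysis, `nrow(eggs)` and `ncol(eggs)` could replace the hard-coded 120 and 6.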

Pivot the Data

Code
eggs_longer <- eggs %>%
   pivot_longer(cols = contains("large"),
               names_to = c("size", "quantity"),
               names_sep = "_",
               values_to = "price")

eggs_longer
# A tibble: 480 × 5
   month     year size   quantity  price
   <chr>    <dbl> <chr>  <chr>     <dbl>
 1 January   2004 large  halfdozen  126 
 2 January   2004 large  dozen      230 
 3 January   2004 xlarge halfdozen  132 
 4 January   2004 xlarge dozen      230 
 5 February  2004 large  halfdozen  128.
 6 February  2004 large  dozen      226.
 7 February  2004 xlarge halfdozen  134.
 8 February  2004 xlarge dozen      230 
 9 March     2004 large  halfdozen  131 
10 March     2004 large  dozen      225 
# … with 470 more rows

The result matches my prediction: 480 rows and 5 columns. Each row is now a single price for one month, year, size, and quantity, with all the price values in one column.
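As an extra sanity check, the pivot can be tested on a tiny sample and then reversed. The sketch below assumes a hypothetical stand-in tibble, `eggs_sample`, built from the January and March 2004 rows printed earlier. It is not the full dataset, and the round trip with `pivot_wider()` is my addition rather than part of the challenge.

```r
library(tibble)
library(tidyr)

# Hypothetical stand-in for `eggs`: two rows copied from the printed output
eggs_sample <- tibble(
  month = c("January", "March"),
  year = c(2004, 2004),
  large_halfdozen = c(126, 131),
  large_dozen = c(230, 225),
  xlarge_halfdozen = c(132, 137),
  xlarge_dozen = c(230, 230)
)

# Same pivot as in the post
sample_longer <- pivot_longer(
  eggs_sample,
  cols = contains("large"),
  names_to = c("size", "quantity"),
  names_sep = "_",
  values_to = "price"
)

# Shape check: 2 rows x 4 price columns = 8 rows, and 5 columns
stopifnot(nrow(sample_longer) == nrow(eggs_sample) * 4,
          ncol(sample_longer) == 5)

# Round trip: widening again should reproduce the original sample
sample_wider <- pivot_wider(
  sample_longer,
  names_from = c(size, quantity),
  values_from = price,
  names_sep = "_"
)
stopifnot(isTRUE(all.equal(as.data.frame(sample_wider),
                           as.data.frame(eggs_sample))))
```

If both `stopifnot()` calls pass silently, the pivot only reshaped the data and did not drop or alter any values.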
