DACSS 601: Data Science Fundamentals - FALL 2022

Challenge 3

challenge_3
shelton
eggs
Author

Dane Shelton

Published

October 5, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge 3 Tasks:

  1. Read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc.)
  2. Identify what needs to be done to tidy the current data
  3. Anticipate the shape of pivoted data
  4. Pivot the data into tidy format using pivot_longer

Task 1.) Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • eggs_tidy.csv ⭐⭐
  • organiceggpoultry.xls ⭐⭐⭐
Rows: 120
Columns: 6
$ month                  <chr> "January", "February", "March", "April", "May",…
$ year                   <dbl> 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004,…
$ large_half_dozen       <dbl> 126.00, 128.50, 131.00, 131.00, 131.00, 133.50,…
$ large_dozen            <dbl> 230.000, 226.250, 225.000, 225.000, 225.000, 23…
$ extra_large_half_dozen <dbl> 132.000, 134.500, 137.000, 137.000, 137.000, 13…
$ extra_large_dozen      <dbl> 230.0, 230.0, 230.0, 234.5, 236.0, 241.0, 241.0…
# A tibble: 120 × 2
   month      year
   <chr>     <dbl>
 1 January    2004
 2 February   2004
 3 March      2004
 4 April      2004
 5 May        2004
 6 June       2004
 7 July       2004
 8 August     2004
 9 September  2004
10 October    2004
# … with 110 more rows

Description of Eggs_Tidy

We can see that eggs_tidy records the price of eggs, by size and quantity, for each month of the 10-year period from January 2004 to December 2013. The prices were originally listed in cents, but we transformed the columns to show the prices in dollars.

Unfortunately, each row currently holds more than one observation (four prices per month), so we’ll need to use pivot_longer to tidy the data so that one row represents a single observation. The variables that identify a single observation in the final data set are month, year, size (large or extra large), and quantity (half dozen or dozen).

Task 2.) Anticipate the End Result

Pivoting Steps

Currently, only \(2\) of our \(6\) variables identify a case: month and year. This means we will pivot the remaining \(6 - 2 = 4\) columns longer: large_half_dozen, large_dozen, extra_large_half_dozen, and extra_large_dozen. We will split each column name into a size and a quantity using the names_sep argument, and the values will go into a price column. Because the original names contain more than one underscore, the columns are first renamed so that a single underscore separates size from quantity.

Task 3.) Calculate Final Dimensions

First, let’s take a look at the current dimensions of eggs_tidy:

Code
dim(eggs_tidy)
[1] 120   6

Our current dimensions are 120 rows by 6 columns. As mentioned, we are pivoting 4 of the 6 columns into 3 new columns, which should result in \(120 \times (6-2) = 480\) rows and \(2 + 3 = 5\) columns: month, year, size, quantity, and price.

Task 4.) Pivot the Data

Error in eval(expr, envir, enclos): object 'final_eggs' not found
Error in eval(expr, envir, enclos): object 'final_eggs' not found

Description of Pivoted Data

Now, after pivoting eggs_tidy into our final dataset final_eggs, we can see that the dimensions match our prediction (480 x 5), and each row identifies a single case: a month and year, a size, a quantity, and a price. Each column represents a variable, each row a case, and each cell a value - our data is tidy!

Source Code
---
title: "Challenge 3"
author: "Dane Shelton"
description: "Tidy Data: Pivoting"
date: "10/05/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_3
  - shelton
  - eggs
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge 3 Tasks: 


1.  Read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc.)
2.  Identify what needs to be done to tidy the current data
3.  Anticipate the shape of pivoted data
4.  Pivot the data into tidy format using `pivot_longer`

## Task 1.) Read in data

Read in one (or more) of the following datasets, using the correct R package and command.


-   eggs_tidy.csv ⭐⭐
-   organiceggpoultry.xls ⭐⭐⭐

```{r}
#| label: Read In and Description
#| echo: false
#| output: true

# Read the eggs data (prices are stored in cents)
eggs_tidy <- readr::read_csv("_data/eggs_tidy.csv")
glimpse(eggs_tidy)

# Optional check for missing values:
# which(is.na(eggs_tidy))

# Convert the four price columns (columns 3 to 6) from cents to dollars.
# `~ .x / 100` is a purrr-style lambda applied to each selected column.
eggs_tidy <- eggs_tidy %>%
  mutate(across(3:6, ~ .x / 100))

# List the month/year combinations that identify each row
eggs_tidy %>%
  distinct(month, year)

```

### Description of Eggs_Tidy

We can see that `eggs_tidy` records the price of eggs, by size and quantity, for each month of the 10-year period from January 2004 to December 2013. The prices were originally listed in cents, but we transformed the columns to show the prices in dollars.

Unfortunately, each row currently holds more than one observation (four prices per month), so we'll need to use `pivot_longer` to tidy the data so that one row represents a single observation. The variables that identify a single observation in the final data set are month, year, size (large or extra large), and quantity (half dozen or dozen).


## Task 2.) Anticipate the End Result 

### Pivoting Steps

Currently, only $2$ of our $6$ variables identify a case: month and year. This means we will pivot the remaining $6 - 2 = 4$ columns longer: `large_half_dozen`, `large_dozen`, `extra_large_half_dozen`, and `extra_large_dozen`. We will split each column name into a size and a quantity using the `names_sep` argument, and the values will go into a `price` column. Because the original names contain more than one underscore, the columns are first renamed so that a single underscore separates size from quantity.
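
As a side note, renaming is not the only way to handle the extra underscores. The sketch below is an alternative, not the approach used in this post: it assumes a small stand-in tibble (`toy_eggs`, a hypothetical name, built from the first two months of the data in dollars) and uses `names_pattern` with a regular expression to split the original column names directly.

```{r}
library(tidyverse)

# Hypothetical stand-in for eggs_tidy: January and February 2004 only
toy_eggs <- tibble(
  month = c("January", "February"),
  year = c(2004, 2004),
  large_half_dozen = c(1.26, 1.285),
  large_dozen = c(2.30, 2.2625),
  extra_large_half_dozen = c(1.32, 1.345),
  extra_large_dozen = c(2.30, 2.30)
)

# The first capture group becomes `size`, the second becomes `quantity`
toy_long <- toy_eggs %>%
  pivot_longer(
    cols = large_half_dozen:extra_large_dozen,
    names_to = c("size", "quantity"),
    names_pattern = "^(large|extra_large)_(half_dozen|dozen)$",
    values_to = "price"
  )

toy_long
```

With this pattern the `size` values stay as `large` and `extra_large`, and the `quantity` values as `half_dozen` and `dozen`, so no renaming step is needed before the pivot.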



## Task 3.) Calculate Final Dimensions

First, let's take a look at the current dimensions of `eggs_tidy`:

```{r}
#| output: true
#| label: Dimensions of Current

dim(eggs_tidy)


```


Our current dimensions are 120 rows by 6 columns. As mentioned, we are pivoting 4 of the 6 columns into 3 new columns, which should result in $120 \times (6-2) = 480$ rows and $2 + 3 = 5$ columns: month, year, size, quantity, and price.
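
The same arithmetic can be written out in R as a quick sanity check. This is a minimal sketch that hard-codes the dimensions reported above; the variable names are illustrative and not part of the original analysis.

```{r}
# Dimensions of eggs_tidy as reported by dim()
n_rows <- 120
n_cols <- 6

# month and year identify a case; the remaining columns are pivoted
n_id_cols <- 2
n_pivoted <- n_cols - n_id_cols

# The pivot creates three new columns: size, quantity, and price
n_new_cols <- 3

expected_rows <- n_rows * n_pivoted      # 120 * 4 = 480
expected_cols <- n_id_cols + n_new_cols  # 2 + 3 = 5

c(rows = expected_rows, cols = expected_cols)
```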


## Task 4.) Pivot the Data


```{r}
#| label: Pivoting
#| output: true
#| echo: false

# Attempt 1 (kept for reference): names_sep = "_" does not split the original
# names cleanly into two pieces, because they contain two or three underscores.
# eggstrial <- eggs_tidy %>%
#   pivot_longer(cols = c(large_half_dozen, extra_large_half_dozen, large_dozen, extra_large_dozen),
#                names_to = c("size", "quantity"), names_sep = "_", values_to = "price")

# Attempt 2: rename the price columns so that a single underscore separates
# size from quantity, then pivot them into size, quantity, and price.
final_eggs <- eggs_tidy %>%
  rename(Large_Half = large_half_dozen, Large_Dozen = large_dozen,
         XL_Half = extra_large_half_dozen, XL_Dozen = extra_large_dozen) %>%
  pivot_longer(cols = c(Large_Half, Large_Dozen, XL_Half, XL_Dozen),
               names_to = c("size", "quantity"), names_sep = "_", values_to = "price")

final_eggs

dim(final_eggs)

```
### Description of Pivoted Data

Now, after pivoting `eggs_tidy` into our final dataset `final_eggs`, we can see that the dimensions match our prediction (480 x 5), and each row identifies a single case: a month and year, a size, a quantity, and a price. Each column represents a variable, each row a case, and each cell a value - our data is tidy!
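
One way to check the "one row per case" claim is to count how often each combination of identifying variables appears. The sketch below assumes a small stand-in for the pivoted data (`toy_final`, a hypothetical name, covering January and February 2004 only); the same `count()` call could be applied to `final_eggs`.

```{r}
library(tidyverse)

# Hypothetical stand-in for final_eggs: two months, two sizes, two quantities
toy_final <- expand_grid(
  month = c("January", "February"),
  year = 2004,
  size = c("Large", "XL"),
  quantity = c("Half", "Dozen")
) %>%
  mutate(price = c(1.26, 2.30, 1.32, 2.30, 1.285, 2.2625, 1.345, 2.30))

# Any combination that appears more than once would mean a row is not a unique case
duplicate_cases <- toy_final %>%
  count(month, year, size, quantity) %>%
  filter(n > 1)

nrow(duplicate_cases) == 0
```

If the last line returns `TRUE`, every combination of month, year, size, and quantity appears exactly once in the stand-in data.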