DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 3 with R

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Pivot the Data

Challenge 3 with R

  • Show All Code
  • Hide All Code

  • View Source
challenge_3
eggs
kristin_abijaoude
Author

Kristin Abijaoude

Published

September 25, 2022

Code
library(tidyverse)
library(dplyr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Today, I will be tidying and pivoting eggs_tidy.csv.

Code
eggs<-read_csv("_data/eggs_tidy.csv")
eggs
# A tibble: 120 × 6
   month      year large_half_dozen large_dozen extra_large_half_dozen extra_l…¹
   <chr>     <dbl>            <dbl>       <dbl>                  <dbl>     <dbl>
 1 January    2004             126         230                    132       230 
 2 February   2004             128.        226.                   134.      230 
 3 March      2004             131         225                    137       230 
 4 April      2004             131         225                    137       234.
 5 May        2004             131         225                    137       236 
 6 June       2004             134.        231.                   137       241 
 7 July       2004             134.        234.                   137       241 
 8 August     2004             134.        234.                   137       241 
 9 September  2004             130.        234.                   136.      241 
10 October    2004             128.        234.                   136.      241 
# … with 110 more rows, and abbreviated variable name ¹​extra_large_dozen
Code
head(eggs)
# A tibble: 6 × 6
  month     year large_half_dozen large_dozen extra_large_half_dozen extra_lar…¹
  <chr>    <dbl>            <dbl>       <dbl>                  <dbl>       <dbl>
1 January   2004             126         230                    132         230 
2 February  2004             128.        226.                   134.        230 
3 March     2004             131         225                    137         230 
4 April     2004             131         225                    137         234.
5 May       2004             131         225                    137         236 
6 June      2004             134.        231.                   137         241 
# … with abbreviated variable name ¹​extra_large_dozen

Here, we get the first 6 rows of the dataset eggs_tidy.csv. From what I see, this dataset records the average price of eggs sold per carton in from 2004 to 2013.

Code
is_tibble(eggs)
[1] TRUE

Okay, what a mess. This is difficult to read and interpret. Let’s make it tidy first!

Code
as_tibble(eggs)
# A tibble: 120 × 6
   month      year large_half_dozen large_dozen extra_large_half_dozen extra_l…¹
   <chr>     <dbl>            <dbl>       <dbl>                  <dbl>     <dbl>
 1 January    2004             126         230                    132       230 
 2 February   2004             128.        226.                   134.      230 
 3 March      2004             131         225                    137       230 
 4 April      2004             131         225                    137       234.
 5 May        2004             131         225                    137       236 
 6 June       2004             134.        231.                   137       241 
 7 July       2004             134.        234.                   137       241 
 8 August     2004             134.        234.                   137       241 
 9 September  2004             130.        234.                   136.      241 
10 October    2004             128.        234.                   136.      241 
# … with 110 more rows, and abbreviated variable name ¹​extra_large_dozen

This is much better to read than the original format, but work still needs to be done.

Code
eggs %>%
  gather("large_half_dozen", "large_dozen", "extra_large_half_dozen", "extra_large_dozen", key = "Eggs Sold", value = "Price sold")
# A tibble: 480 × 4
   month      year `Eggs Sold`      `Price sold`
   <chr>     <dbl> <chr>                   <dbl>
 1 January    2004 large_half_dozen         126 
 2 February   2004 large_half_dozen         128.
 3 March      2004 large_half_dozen         131 
 4 April      2004 large_half_dozen         131 
 5 May        2004 large_half_dozen         131 
 6 June       2004 large_half_dozen         134.
 7 July       2004 large_half_dozen         134.
 8 August     2004 large_half_dozen         134.
 9 September  2004 large_half_dozen         130.
10 October    2004 large_half_dozen         128.
# … with 470 more rows
Code
eggs_tibble<-tibble(eggs)

#| label: Existing rows
nrow(eggs_tibble)
[1] 120
Code
#| Existing columns 
ncol(eggs_tibble)
[1] 6
Code
#| expected rows/cases
nrow(eggs_tibble) * (ncol(eggs_tibble)-2)
[1] 480
Code
#| expected columns
4
[1] 4

From there, I can expect the new amount of rows and columns for the eggs dataset.

Pivot the Data

Code
eggs_new <- eggs %>%
  mutate("Large Half Dozen per Cart Sold" = large_half_dozen / 100,
         "Large Dozen per Cart Sold" = large_dozen / 100,
         "Extra Large Half Dozen per Cart Sold" = extra_large_half_dozen / 100,
         "Extra Large Dozen per Cart Sold" = extra_large_dozen / 100)

#| label: Replace Old Columns with New Ones
eggs_new1 <- select(eggs_new,-c(large_half_dozen, large_dozen, extra_large_half_dozen, extra_large_dozen))
eggs_new1
# A tibble: 120 × 6
   month      year `Large Half Dozen per Cart Sold` Large Doze…¹ Extra…² Extra…³
   <chr>     <dbl>                            <dbl>        <dbl>   <dbl>   <dbl>
 1 January    2004                             1.26         2.3     1.32    2.3 
 2 February   2004                             1.28         2.26    1.34    2.3 
 3 March      2004                             1.31         2.25    1.37    2.3 
 4 April      2004                             1.31         2.25    1.37    2.35
 5 May        2004                             1.31         2.25    1.37    2.36
 6 June       2004                             1.34         2.31    1.37    2.41
 7 July       2004                             1.34         2.34    1.37    2.41
 8 August     2004                             1.34         2.34    1.37    2.41
 9 September  2004                             1.30         2.34    1.36    2.41
10 October    2004                             1.28         2.34    1.36    2.41
# … with 110 more rows, and abbreviated variable names
#   ¹​`Large Dozen per Cart Sold`, ²​`Extra Large Half Dozen per Cart Sold`,
#   ³​`Extra Large Dozen per Cart Sold`

I calculated the price of one cart sold in dollars with the mutate() command and removed the old columns to make room with the new ones.

Code
eggs_pivot <- eggs_new1%>%
  pivot_longer(cols=contains("large"),
               names_to = "size",
               values_to = "price"
  )
eggs_pivot
# A tibble: 480 × 4
   month     year size                                 price
   <chr>    <dbl> <chr>                                <dbl>
 1 January   2004 Large Half Dozen per Cart Sold        1.26
 2 January   2004 Large Dozen per Cart Sold             2.3 
 3 January   2004 Extra Large Half Dozen per Cart Sold  1.32
 4 January   2004 Extra Large Dozen per Cart Sold       2.3 
 5 February  2004 Large Half Dozen per Cart Sold        1.28
 6 February  2004 Large Dozen per Cart Sold             2.26
 7 February  2004 Extra Large Half Dozen per Cart Sold  1.34
 8 February  2004 Extra Large Dozen per Cart Sold       2.3 
 9 March     2004 Large Half Dozen per Cart Sold        1.31
10 March     2004 Large Dozen per Cart Sold             2.25
# … with 470 more rows

Okay, that’s (sort of) better! As you can see, the larger the eggs are, the more expensive they will be. The same concept applies to quantity, obviously, but you get more bang for your buck.

Source Code
---
title: "Challenge 3 with R"
author: "Kristin Abijaoude"
desription: "Tidy Data: Pivoting"
date: "09/25/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_3
  - eggs
  - kristin_abijaoude
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)
library(dplyr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

Today, I will be tidying and pivoting eggs_tidy.csv. 

```{r}
#| label: Reading the Dataset
eggs<-read_csv("_data/eggs_tidy.csv")
eggs
```

```{r}
#| label: Return head of data frame
head(eggs)
```

Here, we get the first 6 rows of the dataset eggs_tidy.csv. From what I see, this dataset records the average price of eggs sold per carton in from 2004 to 2013.

```{r}
#| label:  Let's create a table, shall we?
is_tibble(eggs)
```

Okay, what a mess. This is difficult to read and interpret. Let's make it tidy first!

```{r}
as_tibble(eggs)
```
This is much better to read than the original format, but work still needs to be done. 

```{r}
eggs %>%
  gather("large_half_dozen", "large_dozen", "extra_large_half_dozen", "extra_large_dozen", key = "Eggs Sold", value = "Price sold")
```

```{r}
#| label: tibbling
eggs_tibble<-tibble(eggs)

#| label: Existing rows
nrow(eggs_tibble)

#| Existing columns 
ncol(eggs_tibble)

#| expected rows/cases
nrow(eggs_tibble) * (ncol(eggs_tibble)-2)

#| expected columns
4
```
From there, I can expect the new amount of rows and columns for the eggs dataset. 

## Pivot the Data

```{r}
#| label: Tidying up Columns
eggs_new <- eggs %>%
  mutate("Large Half Dozen per Cart Sold" = large_half_dozen / 100,
         "Large Dozen per Cart Sold" = large_dozen / 100,
         "Extra Large Half Dozen per Cart Sold" = extra_large_half_dozen / 100,
         "Extra Large Dozen per Cart Sold" = extra_large_dozen / 100)

#| label: Replace Old Columns with New Ones
eggs_new1 <- select(eggs_new,-c(large_half_dozen, large_dozen, extra_large_half_dozen, extra_large_dozen))
eggs_new1
```
I calculated the price of one cart sold in dollars with the mutate() command and removed the old columns to make room with the new ones. 

```{r}
eggs_pivot <- eggs_new1%>%
  pivot_longer(cols=contains("large"),
               names_to = "size",
               values_to = "price"
  )
eggs_pivot
```

Okay, that's (sort of) better! As you can see, the larger the eggs are, the more expensive they will be. The same concept applies to quantity, obviously, but you get more bang for your buck.