DACSS 601: Data Science Fundamentals - FALL 2022

Challenge-3

challenge_3
eggs
Author

Said Arslan

Published

October 1, 2022

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
Warning: package 'ggplot2' was built under R version 4.2.2
Warning: package 'stringr' was built under R version 4.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

Code
eggs <- read.csv("_data/eggs_tidy.csv")

Briefly describe the data

Code
dim(eggs)
[1] 120   6
Code
head(eggs)
     month year large_half_dozen large_dozen extra_large_half_dozen
1  January 2004            126.0     230.000                  132.0
2 February 2004            128.5     226.250                  134.5
3    March 2004            131.0     225.000                  137.0
4    April 2004            131.0     225.000                  137.0
5      May 2004            131.0     225.000                  137.0
6     June 2004            133.5     231.375                  137.0
  extra_large_dozen
1             230.0
2             230.0
3             230.0
4             234.5
5             236.0
6             241.0
Code
tail(eggs)
        month year large_half_dozen large_dozen extra_large_half_dozen
115      July 2013              178       267.5                 188.13
116    August 2013              178       267.5                 188.13
117 September 2013              178       267.5                 188.13
118   October 2013              178       267.5                 188.13
119  November 2013              178       267.5                 188.13
120  December 2013              178       267.5                 188.13
    extra_large_dozen
115               290
116               290
117               290
118               290
119               290
120               290
Code
sample_n(eggs, 10)
       month year large_half_dozen large_dozen extra_large_half_dozen
1   November 2005           128.50       233.5                  135.5
2    January 2012           174.50       267.5                  185.5
3   December 2009           174.50       271.5                  185.5
4  September 2005           128.50       233.5                  135.5
5     August 2011           174.50       270.0                  185.5
6      March 2006           128.50       233.5                  135.5
7   December 2006           128.50       233.5                  135.5
8       July 2004           133.50       233.5                  137.0
9    October 2012           173.25       267.5                  185.5
10  February 2006           128.50       233.5                  135.5
   extra_large_dozen
1            241.000
2            285.500
3            285.500
4            241.000
5            285.500
6            241.375
7            241.500
8            241.000
9            288.500
10           241.000
Code
unique(eggs$year)
 [1] 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013

The dataset contains monthly prices (in cents) of 4 types of boxes of eggs, from January 2004 to December 2013.

The column names large_half_dozen, large_dozen, extra_large_half_dozen and extra_large_dozen are not really variable names: each one encodes the type of box (egg size and number of eggs), and the values under these 4 columns are prices. We should therefore reorganize these columns so that each row represents a single observation. The tidied version has four columns: month, year, type and price.

Anticipate the End Result

Originally our dataset has 120 rows and 6 columns. I will pivot the 4 price columns, so the remaining 2 columns (month and year) identify each observation. The longer format should therefore have 480 (120 * 4) rows and 4 columns: the 2 identifiers plus type and price.
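The arithmetic above can be sketched in a few lines of base R. Since `_data/eggs_tidy.csv` may not be at hand, the sketch uses a hypothetical stand-in, `eggs_demo`, built from the first two rows printed by `head(eggs)`; the helper names (`pivot_cols`, `n_pivot` and so on) are illustrative and not part of the original post.

```r
# Stand-in for `eggs`: only the first two rows shown by head(eggs) above.
eggs_demo <- data.frame(
  month = c("January", "February"),
  year = c(2004L, 2004L),
  large_half_dozen = c(126.0, 128.5),
  large_dozen = c(230.00, 226.25),
  extra_large_half_dozen = c(132.0, 134.5),
  extra_large_dozen = c(230.0, 230.0)
)

# Columns that will be pivoted: every name ending in "dozen".
pivot_cols <- grep("dozen$", names(eggs_demo), value = TRUE)
n_pivot <- length(pivot_cols)

# Each original row becomes one row per pivoted column.
expected_rows <- nrow(eggs_demo) * n_pivot

# The pivoted columns collapse into two new ones (type and price).
expected_cols <- ncol(eggs_demo) - n_pivot + 2

# The same formula applied to the full data set of 120 rows.
full_expected_rows <- 120 * n_pivot

print(c(rows = expected_rows, cols = expected_cols, full_rows = full_expected_rows))
```

With the real `eggs` data frame in place of the stand-in, `expected_rows` should come out as 480 and `expected_cols` as 4.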

Pivot the Data

Code
eggs.pivoted <- pivot_longer(eggs,
                             cols = ends_with("dozen"),
                             names_to = "type",
                             values_to = "price")

head(eggs.pivoted)
# A tibble: 6 × 4
  month     year type                   price
  <chr>    <int> <chr>                  <dbl>
1 January   2004 large_half_dozen        126 
2 January   2004 large_dozen             230 
3 January   2004 extra_large_half_dozen  132 
4 January   2004 extra_large_dozen       230 
5 February  2004 large_half_dozen        128.
6 February  2004 large_dozen             226.
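Because each column name encodes both the egg size and the number of eggs, one possible variant is to split the name into two columns during the pivot. The following is only a sketch of that idea, not the approach used in this post: the column names `size` and `quantity`, the regular expression, and the stand-in `eggs_demo` (the first two rows printed by `head(eggs)`) are all my own assumptions.

```r
library(tidyr)

# Stand-in for `eggs`: only the first two rows shown by head(eggs) above.
eggs_demo <- data.frame(
  month = c("January", "February"),
  year = c(2004L, 2004L),
  large_half_dozen = c(126.0, 128.5),
  large_dozen = c(230.00, 226.25),
  extra_large_half_dozen = c(132.0, 134.5),
  extra_large_dozen = c(230.0, 230.0)
)

# names_pattern has one capture group per entry of names_to:
# the first group captures the egg size, the second the box quantity.
eggs_split <- pivot_longer(
  eggs_demo,
  cols = ends_with("dozen"),
  names_to = c("size", "quantity"),
  names_pattern = "^(extra_large|large)_(half_dozen|dozen)$",
  values_to = "price"
)

print(eggs_split)
```

This yields one more column than the single `type` version, which may make it easier to compare sizes within the same box quantity.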

It would also be better to put year in the first column and month in the second, because year is the main time identifier of an observation, while the values of month repeat every year.

Code
eggs.pivoted <- eggs.pivoted[, c("year", "month", "type", "price")]

head(eggs.pivoted)
# A tibble: 6 × 4
   year month    type                   price
  <int> <chr>    <chr>                  <dbl>
1  2004 January  large_half_dozen        126 
2  2004 January  large_dozen             230 
3  2004 January  extra_large_half_dozen  132 
4  2004 January  extra_large_dozen       230 
5  2004 February large_half_dozen        128.
6  2004 February large_dozen             226.
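The same reordering can also be expressed with `dplyr::relocate()`, and storing month as a factor ordered by the built-in `month.name` constant lets the rows sort chronologically rather than alphabetically. This is a minimal sketch on a hypothetical stand-in, `demo_long`, holding four rows in the pivoted layout; it is not the code used above.

```r
library(dplyr)

# Stand-in for the pivoted data: values taken from the head() output above,
# deliberately listed with February first to show the sorting.
demo_long <- tibble(
  month = c("February", "February", "January", "January"),
  year = c(2004L, 2004L, 2004L, 2004L),
  type = c("large_half_dozen", "large_dozen", "large_half_dozen", "large_dozen"),
  price = c(128.5, 226.25, 126, 230)
)

demo_ordered <- demo_long %>%
  # Move year in front of month without listing every column.
  relocate(year, .before = month) %>%
  # Calendar order instead of alphabetical order.
  mutate(month = factor(month, levels = month.name)) %>%
  arrange(year, month)

print(demo_ordered)
```

Unlike indexing with a full vector of column names, `relocate()` keeps working if more columns are added later.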

To check whether my calculation of the new dataset's dimensions is correct, let me look at the number of rows and columns of the pivoted data frame.

Code
cat("Number of rows: \n")
Number of rows: 
Code
nrow(eggs.pivoted)
[1] 480
Code
cat("Number of columns: \n")
Number of columns: 
Code
ncol(eggs.pivoted)
[1] 4

In the new dataset, each row shows the price of a specific type of box of eggs in a given month of a given year. Using this data frame, we can now do further price analysis by grouping on type.
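As a rough illustration of such a grouped analysis, the sketch below computes summary prices per type. It runs on a hypothetical stand-in, `eggs_demo`, built from the first two rows printed by `head(eggs)`, so the numbers it produces describe only those two months; the summary column names are my own choice.

```r
library(dplyr)
library(tidyr)

# Stand-in for `eggs`: only the first two rows shown by head(eggs) above.
eggs_demo <- data.frame(
  month = c("January", "February"),
  year = c(2004L, 2004L),
  large_half_dozen = c(126.0, 128.5),
  large_dozen = c(230.00, 226.25),
  extra_large_half_dozen = c(132.0, 134.5),
  extra_large_dozen = c(230.0, 230.0)
)

# Same pivot as in the post, applied to the stand-in.
eggs_demo_long <- pivot_longer(
  eggs_demo,
  cols = ends_with("dozen"),
  names_to = "type",
  values_to = "price"
)

# One summary row per type of box.
price_by_type <- eggs_demo_long %>%
  group_by(type) %>%
  summarise(
    mean_price = mean(price),
    min_price = min(price),
    max_price = max(price)
  )

print(price_by_type)
```

Replacing `eggs_demo_long` with `eggs.pivoted` would give the same summary over all 120 months; adding `year` to `group_by()` would break it down per year.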
