DACSS 601: Data Science Fundamentals - FALL 2022

Challenge 3

Author

Shriya Sehgal

Published

November 17, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. identify what needs to be done to tidy the current data
  3. anticipate the shape of pivoted data
  4. pivot the data into tidy format using pivot_longer

Read in data

Code
dataset <- read_csv("_data/eggs_tidy.csv")
dataset
# A tibble: 120 × 6
   month      year large_half_dozen large_dozen extra_large_half_dozen extra_l…¹
   <chr>     <dbl>            <dbl>       <dbl>                  <dbl>     <dbl>
 1 January    2004             126         230                    132       230 
 2 February   2004             128.        226.                   134.      230 
 3 March      2004             131         225                    137       230 
 4 April      2004             131         225                    137       234.
 5 May        2004             131         225                    137       236 
 6 June       2004             134.        231.                   137       241 
 7 July       2004             134.        234.                   137       241 
 8 August     2004             134.        234.                   137       241 
 9 September  2004             130.        234.                   136.      241 
10 October    2004             128.        234.                   136.      241 
# … with 110 more rows, and abbreviated variable name ¹​extra_large_dozen

Briefly describe the data

Code
summary(dataset)
Code
head(dataset)

There are 120 rows and 6 columns, containing data for each month from 2004 to 2013. The first 2 columns are the month and year, and the remaining 4 give the average price for each combination of egg size and quantity. The average prices range from roughly 120 to 290 cents. The columns are as follows: a. large_half_dozen b. extra_large_half_dozen c. large_dozen d. extra_large_dozen

Anticipate the End Result

The first step in pivoting the data is to try to come up with a concrete vision of what the end product should look like - that way you will know whether or not your pivoting was successful.

One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.

Suppose you have a dataset with \(n\) rows and \(k\) variables. In our example, 2 of the variables (month and year) are used to identify a case, so you will be pivoting \(k-2\) variables into a longer format where the \(k-2\) variable names will move into the names_to variables (here split into size and quantity) and the current values in each of those columns will move into the values_to variable. Therefore, we would expect \(n * (k-2)\) rows in the pivoted dataframe!

Code
#existing rows/cases
nrow(dataset)
[1] 120
Code
#existing columns/cases
ncol(dataset)
[1] 6
Code
#expected rows/cases
nrow(dataset) * (ncol(dataset)-2)
[1] 480
Code
# expected columns (2 identifiers + size + quantity + cost)
3 + 2
[1] 5

Challenge: Describe the final dimensions

It can be seen that the dataset has 120 rows and 6 columns. After pivoting, the columns will be month, year, size of the egg (large/extra large), quantity of the eggs (half dozen/dozen), and average price, which makes it much easier to observe the changes from 2004 to 2013 throughout the year by egg size as well as by quantity sold.

The resulting dataset will be 4 times longer than the original, since each of the 4 size-quantity price columns becomes its own row. A new column containing the average price of the eggs is added, so the total number of columns decreases by 1: the resulting columns are month, year, size, quantity, and average price.

Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.

Example

Code
dataset_long <- dataset %>%
  pivot_longer(cols = contains("large"),
               names_to = c("size", "quantity"),
               names_pattern = "(large|extra_large)_(half_dozen|dozen)",
               values_to = "cost"
  )
dataset_long
# A tibble: 480 × 5
   month     year size        quantity    cost
   <chr>    <dbl> <chr>       <chr>      <dbl>
 1 January   2004 large       half_dozen  126 
 2 January   2004 large       dozen       230 
 3 January   2004 extra_large half_dozen  132 
 4 January   2004 extra_large dozen       230 
 5 February  2004 large       half_dozen  128.
 6 February  2004 large       dozen       226.
 7 February  2004 extra_large half_dozen  134.
 8 February  2004 extra_large dozen       230 
 9 March     2004 large       half_dozen  131 
10 March     2004 large       dozen       225 
# … with 470 more rows
Code
#existing rows/cases after the pivot
nrow(dataset_long)
[1] 480
Code
#existing columns/cases after the pivot
ncol(dataset_long)
[1] 5

It can be observed that the data is 4 times longer than the original dataset (120 to 480 rows) and the number of columns is reduced by 1. Pivoting puts a single observation in each row, which makes the data easier to understand and to work with in future analysis.

Source Code
---
title: "Challenge 3"
author: "Shriya Sehgal"
description: "Tidy Data: Pivoting"
date: "11/17/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_3
  - animal_weights
  - eggs
  - australian_marriage
  - usa_households
  - sce_labor
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

Today's challenge is to:

1.  read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2.  identify what needs to be done to tidy the current data
3.  anticipate the shape of pivoted data
4.  pivot the data into tidy format using `pivot_longer`

## Read in data


```{r}
dataset <- read_csv("_data/eggs_tidy.csv")
dataset

```

### Briefly describe the data

```{r}
summary(dataset)
```

```{r}
head(dataset)
```


There are 120 rows and 6 columns, containing data for each month from 2004 to 2013. The first 2 columns are the month and year, and the remaining 4 give the average price for each combination of egg size and quantity. The average prices range from roughly 120 to 290 cents. The columns are as follows.
a. large_half_dozen
b. extra_large_half_dozen
c. large_dozen
d. extra_large_dozen


## Anticipate the End Result

The first step in pivoting the data is to try to come up with a concrete vision of what the end product *should* look like - that way you will know whether or not your pivoting was successful.

One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.

Suppose you have a dataset with $n$ rows and $k$ variables. In our example, 2 of the variables (month and year) are used to identify a case, so you will be pivoting $k-2$ variables into a longer format where the $k-2$ variable names will move into the `names_to` variables (here split into size and quantity) and the current values in each of those columns will move into the `values_to` variable. Therefore, we would expect $n * (k-2)$ rows in the pivoted dataframe!



```{r}
#existing rows/cases
nrow(dataset)

#existing columns/cases
ncol(dataset)

#expected rows/cases
nrow(dataset) * (ncol(dataset)-2)

# expected columns (2 identifiers + size + quantity + cost)
3 + 2
```

### Challenge: Describe the final dimensions



It can be seen that the dataset has 120 rows and 6 columns. After pivoting, the columns will be month, year, size of the egg (large/extra large), quantity of the eggs (half dozen/dozen), and average price, which makes it much easier to observe the changes from 2004 to 2013 throughout the year by egg size as well as by quantity sold.


The resulting dataset will be 4 times longer than the original, since each of the 4 size-quantity price columns becomes its own row. A new column containing the average price of the eggs is added, so the total number of columns decreases by 1: the resulting columns are month, year, size, quantity, and average price.
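
Before pivoting the full data, the dimension arithmetic can be sanity-checked on a tiny hand-built version of the table. The sketch below is a toy example (not part of the original analysis) that copies only the January and March 2004 rows printed above:

```{r}
library(tidyverse)

# Two rows copied from the data printed above
toy <- tibble(
  month = c("January", "March"),
  year  = c(2004, 2004),
  large_half_dozen       = c(126, 131),
  large_dozen            = c(230, 225),
  extra_large_half_dozen = c(132, 137),
  extra_large_dozen      = c(230, 230)
)

toy_long <- toy %>%
  pivot_longer(cols = contains("large"),
               names_to = c("size", "quantity"),
               names_pattern = "(large|extra_large)_(half_dozen|dozen)",
               values_to = "cost")

# 2 rows * (6 - 2) pivoted columns = 8 rows; month, year, size, quantity, cost = 5 columns
dim(toy_long)
```

The result, `8 5`, matches the expected `nrow * (ncol - 2)` rows and 5 columns computed earlier.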

## Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a "sanity" check.

### Example

```{r}
#| tbl-cap: Pivoted Example

dataset_long <- dataset %>%
  pivot_longer(cols = contains("large"),
               names_to = c("size", "quantity"),
               names_pattern = "(large|extra_large)_(half_dozen|dozen)",
               values_to = "cost"
  )
dataset_long
dataset_long
```

```{r}
#existing rows/cases after the pivot
nrow(dataset_long)
#existing columns/cases after the pivot
ncol(dataset_long)
```


It can be observed that the data is 4 times longer than the original dataset (120 to 480 rows) and the number of columns is reduced by 1. Pivoting puts a single observation in each row, which makes the data easier to understand and to work with in future analysis.
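
As a quick illustration of that payoff, long-format data makes grouped summaries a one-liner. The sketch below is a hand-built slice of the pivoted table (January and March 2004 values printed earlier, not the author's original code) averaged by size and quantity; the same `group_by()`/`summarise()` pipeline would apply unchanged to the full `dataset_long`:

```{r}
library(tidyverse)

# A hand-built slice of the pivoted data (January and March 2004 values)
slice_long <- tribble(
  ~month,    ~year, ~size,         ~quantity,    ~cost,
  "January",  2004, "large",       "half_dozen", 126,
  "January",  2004, "large",       "dozen",      230,
  "January",  2004, "extra_large", "half_dozen", 132,
  "January",  2004, "extra_large", "dozen",      230,
  "March",    2004, "large",       "half_dozen", 131,
  "March",    2004, "large",       "dozen",      225,
  "March",    2004, "extra_large", "half_dozen", 137,
  "March",    2004, "extra_large", "dozen",      230
)

# One observation per row means a grouped summary needs no reshaping first
slice_long %>%
  group_by(size, quantity) %>%
  summarise(mean_cost = mean(cost), .groups = "drop")
```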