DACSS 601: Data Science Fundamentals - FALL 2022
Challenge 3

Author

Neeharika Karanam

Published

December 2, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. identify what needs to be done to tidy the current data
  3. anticipate the shape of pivoted data
  4. pivot the data into tidy format using pivot_longer
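Before turning to the challenge data, here is a minimal sketch of what pivot_longer does, using a made-up two-row tibble (the toy data and column names are purely illustrative):

```r
library(tidyverse)

# Made-up toy data: one id column plus two measurement columns
toy <- tibble(id = c(1, 2), a = c(10, 20), b = c(30, 40))

# Pivot the two measurement columns into name/value pairs
toy_long <- pivot_longer(toy, cols = c(a, b),
                         names_to = "key", values_to = "value")

toy_long
```

Each of the 2 original rows contributes one row per pivoted column, so the result has 2 × 2 = 4 rows and 3 columns (id, key, value).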

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • animal_weights.csv ⭐
  • eggs_tidy.csv ⭐⭐ or organiceggpoultry.xls ⭐⭐⭐
  • australian_marriage*.xls ⭐⭐⭐
  • USA Households*.xlsx ⭐⭐⭐⭐
  • sce_labor_chart_data_public.xlsx 🌟🌟🌟🌟🌟

I have chosen the eggs_tidy.csv dataset for my analysis, and I will pivot it as required.

Code
#Read the eggs dataset and then print the data.
eggs_dataset <- read_csv("_data/eggs_tidy.csv")

eggs_dataset
# A tibble: 120 × 6
   month      year large_half_dozen large_dozen extra_large_half_dozen extra_l…¹
   <chr>     <dbl>            <dbl>       <dbl>                  <dbl>     <dbl>
 1 January    2004             126         230                    132       230 
 2 February   2004             128.        226.                   134.      230 
 3 March      2004             131         225                    137       230 
 4 April      2004             131         225                    137       234.
 5 May        2004             131         225                    137       236 
 6 June       2004             134.        231.                   137       241 
 7 July       2004             134.        234.                   137       241 
 8 August     2004             134.        234.                   137       241 
 9 September  2004             130.        234.                   136.      241 
10 October    2004             128.        234.                   136.      241 
# … with 110 more rows, and abbreviated variable name ¹​extra_large_dozen

Briefly describe the data

Now, let us describe the dataset and explore it briefly.

Code
summary(eggs_dataset)
    month                year      large_half_dozen  large_dozen   
 Length:120         Min.   :2004   Min.   :126.0    Min.   :225.0  
 Class :character   1st Qu.:2006   1st Qu.:129.4    1st Qu.:233.5  
 Mode  :character   Median :2008   Median :174.5    Median :267.5  
                    Mean   :2008   Mean   :155.2    Mean   :254.2  
                    3rd Qu.:2011   3rd Qu.:174.5    3rd Qu.:268.0  
                    Max.   :2013   Max.   :178.0    Max.   :277.5  
 extra_large_half_dozen extra_large_dozen
 Min.   :132.0          Min.   :230.0    
 1st Qu.:135.8          1st Qu.:241.5    
 Median :185.5          Median :285.5    
 Mean   :164.2          Mean   :266.8    
 3rd Qu.:185.5          3rd Qu.:285.5    
 Max.   :188.1          Max.   :290.0    
Code
head(eggs_dataset)
# A tibble: 6 × 6
  month     year large_half_dozen large_dozen extra_large_half_dozen extra_lar…¹
  <chr>    <dbl>            <dbl>       <dbl>                  <dbl>       <dbl>
1 January   2004             126         230                    132         230 
2 February  2004             128.        226.                   134.        230 
3 March     2004             131         225                    137         230 
4 April     2004             131         225                    137         234.
5 May       2004             131         225                    137         236 
6 June      2004             134.        231.                   137         241 
# … with abbreviated variable name ¹​extra_large_dozen

From the summary we can see that the dataset has 120 rows and 6 columns, covering every month of the years 2004-2013 (12 months/year over 10 years). The first two columns give us the month and the year, while the remaining 4 columns give us the average price for each combination of egg size and quantity. The column names combine the size and the quantity: large_half_dozen, large_dozen, extra_large_half_dozen, extra_large_dozen. I have observed that the average prices range from 126 to 290 cents.
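As a self-contained cross-check on that arithmetic (this sketch builds the claimed month-year grid from scratch rather than using the data file): 10 years of 12 months each should give exactly 120 rows.

```r
library(tidyverse)

# The claimed coverage: 12 months for each of the 10 years 2004-2013
coverage <- expand_grid(year = 2004:2013, month = month.name)

nrow(coverage)  # 120, matching the 120 rows of eggs_dataset
```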

Anticipate the End Result

The first step in pivoting the data is to try to come up with a concrete vision of what the end product should look like - that way you will know whether or not your pivoting was successful.

One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.

Suppose you have a dataset with \(n\) rows and \(k\) variables. In our example, 2 of the variables (month and year) are used to identify a case, so you will be pivoting \(k-2\) variables into a longer format: the \(k-2\) variable names will move into the names_to variable(s) and the current values in each of those columns will move into the values_to variable. Therefore, we would expect \(n * (k-2)\) rows in the pivoted dataframe!

Code
#existing rows/cases in the eggs dataset
nrow(eggs_dataset)
[1] 120
Code
#existing columns/variables in the eggs dataset
ncol(eggs_dataset)
[1] 6
Code
#expected rows after pivoting: the 2 identifier columns stay, the other 4 pivot
nrow(eggs_dataset) * (ncol(eggs_dataset)-2)
[1] 480
Code
# expected columns after pivoting: 2 identifiers + size, quantity, and cost
2 + 3
[1] 5

Description of the final dimensions

The dataset has 120 rows and 6 columns, and I expect the pivoted dataset to record the month, the year, the size of the eggs, and the quantity of the eggs. Arranging the data this way will make it much easier to observe the changes within each year, as well as across the 2004-2013 range. It will also help us understand the differences between large and extra-large eggs, and whether they were sold by the dozen or half-dozen.

After pivoting, I expect the resulting dataset to be 4 times longer than it is at the moment, because each size-quantity pairing will get its own row instead of its own column. I also expect the total number of columns to decrease by 1, since the 4 size-quantity columns will be replaced by three new columns: size, quantity, and average price (alongside month and year).
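One wrinkle worth noting before pivoting: the quantity labels themselves contain an underscore (half_dozen), so naively splitting the column names on every "_" is ambiguous. A small sketch with stringr (the name vector is copied from the dataset's columns) shows a regex that splits them as intended:

```r
library(stringr)

# The four size-quantity column names from the eggs dataset
col_names <- c("large_half_dozen", "large_dozen",
               "extra_large_half_dozen", "extra_large_dozen")

# Explicit alternatives pin down where the size ends and the quantity begins
parts <- str_match(col_names, "(large|extra_large)_(half_dozen|dozen)")

parts[, 2]  # size:     "large" "large" "extra_large" "extra_large"
parts[, 3]  # quantity: "half_dozen" "dozen" "half_dozen" "dozen"
```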

Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above.

Code
eggs_longer <- eggs_dataset %>%
  pivot_longer(cols = contains("large"),
               names_to = c("size", "quantity"),
               names_pattern = "(large|extra_large)_(half_dozen|dozen)",
               values_to = "cost"
  )

eggs_longer
# A tibble: 480 × 5
   month     year size        quantity    cost
   <chr>    <dbl> <chr>       <chr>      <dbl>
 1 January   2004 large       half_dozen  126 
 2 January   2004 large       dozen       230 
 3 January   2004 extra_large half_dozen  132 
 4 January   2004 extra_large dozen       230 
 5 February  2004 large       half_dozen  128.
 6 February  2004 large       dozen       226.
 7 February  2004 extra_large half_dozen  134.
 8 February  2004 extra_large dozen       230 
 9 March     2004 large       half_dozen  131 
10 March     2004 large       dozen       225 
# … with 470 more rows
Code
#rows/cases after the pivot
nrow(eggs_longer)
[1] 480
Code
#columns/variables after the pivot
ncol(eggs_longer)
[1] 5

Conclusion

As anticipated, the pivoted data is 4 times longer than the original (120 -> 480 rows) and has one fewer column (6 -> 5). Each row now holds a single observation, which makes the data easier to understand and to work with in future analysis. Note that names_pattern is used rather than a plain names_sep="_", because the quantity labels themselves contain underscores and would otherwise be split incorrectly. Finally, I would like to mutate the cost of the eggs from cents to dollars for better readability; the table below gives the cost of the eggs in dollars.

Code
#Mutate the cost of the eggs from cents to dollars.
eggs_USD <- eggs_longer %>%
  mutate(avg_USD = cost / 100)

eggs_USD
# A tibble: 480 × 6
   month     year size        quantity    cost avg_USD
   <chr>    <dbl> <chr>       <chr>      <dbl>   <dbl>
 1 January   2004 large       half_dozen  126     1.26
 2 January   2004 large       dozen       230     2.3 
 3 January   2004 extra_large half_dozen  132     1.32
 4 January   2004 extra_large dozen       230     2.3 
 5 February  2004 large       half_dozen  128.    1.28
 6 February  2004 large       dozen       226.    2.26
 7 February  2004 extra_large half_dozen  134.    1.34
 8 February  2004 extra_large dozen       230     2.3 
 9 March     2004 large       half_dozen  131     1.31
10 March     2004 large       dozen       225     2.25
# … with 470 more rows