challenge_3
Harsha Kanaka Eswar Gudipudi
egg_tidy.csv
Tidy Data: Pivoting
Author

Harsha Kanaka Eswar Gudipudi

Published

May 15, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. identify what needs to be done to tidy the current data
  3. anticipate the shape of pivoted data
  4. pivot the data into tidy format using pivot_longer

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • animal_weights.csv ⭐
  • eggs_tidy.csv ⭐⭐ or organiceggpoultry.xls ⭐⭐⭐
  • australian_marriage*.xls ⭐⭐⭐
  • USA Households*.xlsx ⭐⭐⭐⭐
  • sce_labor_chart_data_public.xlsx 🌟🌟🌟🌟🌟
Code
library(readr)
#Read eggs_tidy csv data
df <- read_csv("_data/eggs_tidy.csv",show_col_types = FALSE)
head(df)
# A tibble: 6 × 6
  month     year large_half_dozen large_dozen extra_large_half_dozen
  <chr>    <dbl>            <dbl>       <dbl>                  <dbl>
1 January   2004             126         230                    132 
2 February  2004             128.        226.                   134.
3 March     2004             131         225                    137 
4 April     2004             131         225                    137 
5 May       2004             131         225                    137 
6 June      2004             134.        231.                   137 
# ℹ 1 more variable: extra_large_dozen <dbl>

Briefly describe the data

The summary below is showing the data for the month, year, and prices of different types of egg cartons over a period of 10 years from 2004 to 2013. The data is summarized based on the different egg carton sizes, which include large_half_dozen, large_dozen, extra_large_half_dozen, and extra_large_dozen.The data shows that the average price for the large_half_dozen egg cartons is $155.2, while the average price for the large_dozen egg cartons is $254.2. The average price for extra_large_half_dozen egg cartons is $164.2, and the average price for extra_large_dozen egg cartons is $266.8.

Code
summary(df)
    month                year      large_half_dozen  large_dozen   
 Length:120         Min.   :2004   Min.   :126.0    Min.   :225.0  
 Class :character   1st Qu.:2006   1st Qu.:129.4    1st Qu.:233.5  
 Mode  :character   Median :2008   Median :174.5    Median :267.5  
                    Mean   :2008   Mean   :155.2    Mean   :254.2  
                    3rd Qu.:2011   3rd Qu.:174.5    3rd Qu.:268.0  
                    Max.   :2013   Max.   :178.0    Max.   :277.5  
 extra_large_half_dozen extra_large_dozen
 Min.   :132.0          Min.   :230.0    
 1st Qu.:135.8          1st Qu.:241.5    
 Median :185.5          Median :285.5    
 Mean   :164.2          Mean   :266.8    
 3rd Qu.:185.5          3rd Qu.:285.5    
 Max.   :188.1          Max.   :290.0    
Code
dim(df)
[1] 120   6
Code
cat(paste("no of rows in df = ", nrow(df)))
no of rows in df =  120

Anticipate the End Result

Here the four columns “large_half_dozen, large_dozen, extra_large_half_dozen, extra_large_dozen” which can be made into two columns based on the size and quantity. And the corresponding values for the eggs should be displayed in the new column as price.

Challenge: Describe the final dimensions

Here we are piovting 4 columns and the no of rows of df is 120 before pivoting. So the final no of rows would be equal to nrows(df)4 = 1204 = 480 and the no of columns after pivoting = 3+2 = 5. therefore the final dimension is 480*5.

Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.

Example

Code
df_long <- df %>% 
  pivot_longer(
    cols = -c(month, year),
    names_to = c("size", "quantity"),
    names_pattern = "(.+)_(.+)",
    values_to = "price"
  )

# View the pivoted data frame
df_long
# A tibble: 480 × 5
   month     year size             quantity price
   <chr>    <dbl> <chr>            <chr>    <dbl>
 1 January   2004 large_half       dozen     126 
 2 January   2004 large            dozen     230 
 3 January   2004 extra_large_half dozen     132 
 4 January   2004 extra_large      dozen     230 
 5 February  2004 large_half       dozen     128.
 6 February  2004 large            dozen     226.
 7 February  2004 extra_large_half dozen     134.
 8 February  2004 extra_large      dozen     230 
 9 March     2004 large_half       dozen     131 
10 March     2004 large            dozen     225 
# ℹ 470 more rows
Code
print("size: ")
[1] "size: "
Code
print(dim(df_long))
[1] 480   5

Yes, once it is pivoted long, our resulting data are \(480x5\) - exactly what we expected!

Challenge: Pivot the Chosen Data

Document your work here. What will a new “case” be once you have pivoted the data? How does it meet requirements for tidy data?

Once pivoted, each row in the new data frame represents a unique combination of month, year, size, quantity and price. The value column contains the corresponding value for each combination. This meets the requirements for tidy data because each row represents a single observation and each variable is in a separate column. Additionally, the column names are informative and easy to understand.

Any additional comments?