challenge_3
animal_weights
eggs
australian_marriage
usa_households
sce_labor
Author

Kris(tine)_Smole

Published

March 8, 2023

library(tidyverse)



knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

# Reposted 3/26/2023 to correct the missing date in the YAML. Originally posted 3/8/2023.

Dataset Used

eggs_tidy<-read_csv("_data/eggs_tidy.csv")
select(eggs_tidy, month, year, large_half_dozen, large_dozen, extra_large_half_dozen, extra_large_dozen)
# A tibble: 120 × 6
   month      year large_half_dozen large_dozen extra_large_half_dozen extra_l…¹
   <chr>     <dbl>            <dbl>       <dbl>                  <dbl>     <dbl>
 1 January    2004             126         230                    132       230 
 2 February   2004             128.        226.                   134.      230 
 3 March      2004             131         225                    137       230 
 4 April      2004             131         225                    137       234.
 5 May        2004             131         225                    137       236 
 6 June       2004             134.        231.                   137       241 
 7 July       2004             134.        234.                   137       241 
 8 August     2004             134.        234.                   137       241 
 9 September  2004             130.        234.                   136.      241 
10 October    2004             128.        234.                   136.      241 
# … with 110 more rows, and abbreviated variable name ¹​extra_large_dozen

Description of Eggs_Tidy data

The eggs_tidy dataset contains 120 rows and 6 columns.

The rows contain ten years of monthly data, from January 2004 through December 2013. Each of the four carton-type columns holds the average volume of that type of egg carton sold by various producers in a fictitious region within a fictitious country. We know the values are averages because full and half-dozen cartons of eggs are not sold in partial units. For example, February 2004 has a value of 128.5 half-dozen cartons of large eggs.

Some of the columns represent observations rather than variables: their names (the carton types) are really values of a variable.

We know from our reading (i.e., R for Data Science) the following rules regarding “tidy” data.

“There are three interrelated rules which make a dataset tidy: 1. Each variable must have its own column. 2. Each observation must have its own row. 3. Each value must have its own cell.”

Rules 1 and 2 say that each variable must have its own column and each observation must have its own row. This dataset has four variables: (1) month, (2) year, (3) carton type, and (4) units. Under rule 2, the original layout is not tidy, because four of its columns (large half dozen, large dozen, extra large half dozen, and extra large dozen) represent observations. After tidying, they will no longer appear as individual columns.

Regarding rule 2, each month and year has observations spread across the columns large_half_dozen, large_dozen, extra_large_half_dozen, and extra_large_dozen. These observations must become rows under a single carton type column, since each observation within a month and year must have its own row.

Rule 3, that each value must have its own cell, already holds for the original layout, although the values must be pivoted to line up with their corresponding variables.

How Eggs_Tidy will be made “Tidy”

To tidy this data, we will pivot the dataset so that it has more rows and fewer columns. The columns to be pivoted are the observations of carton types (large half dozen, large dozen, extra large half dozen, and extra large dozen), which will become rows within the corresponding month and year of observation. We will use the tidyverse function pivot_longer to accomplish this pivot.
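The column selection can be written in more than one way. The sketch below uses a two-row toy table with the same column layout as eggs_tidy (the values are placeholders, and the name carton_type is chosen here only for illustration) to suggest that selecting by suffix, by exclusion, or by explicit name should all yield the same pivoted result.

```r
library(tidyr)

# Toy table with the same column layout as eggs_tidy.
# The numbers are illustrative placeholders, not real data.
toy <- tibble(
  month = c("January", "February"),
  year = c(2004, 2004),
  large_half_dozen = c(1, 2),
  large_dozen = c(3, 4),
  extra_large_half_dozen = c(5, 6),
  extra_large_dozen = c(7, 8)
)

# 1. Select the columns to pivot by their shared suffix.
by_suffix <- pivot_longer(toy,
  cols = ends_with("dozen"),
  names_to = "carton_type", values_to = "units"
)

# 2. Select everything except the identifier columns.
by_exclusion <- pivot_longer(toy,
  cols = !c(month, year),
  names_to = "carton_type", values_to = "units"
)

# 3. List the four columns explicitly.
by_name <- pivot_longer(toy,
  cols = c(large_half_dozen, large_dozen,
           extra_large_half_dozen, extra_large_dozen),
  names_to = "carton_type", values_to = "units"
)

# Compare the three results.
all.equal(by_suffix, by_exclusion)
all.equal(by_suffix, by_name)
```

Selecting by suffix is the shortest form, but it relies on every carton column ending in "dozen" and no identifier column doing so; excluding month and year states the intent more directly.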

Estimating the Shape of the New Pivoted DataFrame

In the original depiction of the dataset, we have 120 rows of data in 6 columns.

Two of the columns (month and year) identify each case, so we will pivot the remaining \(k-2\) columns into a longer format: the \(k-2\) column names will move into the names_to variable and the current values in each of those columns will move into the values_to variable (both as defined in the pivot_longer function). Therefore, we would expect \(n * (k-2)\) rows in the pivoted dataframe!

In our case \(n = 120\) and \(k = 6\). Two of the variables identify a case, so \(k - 2 = 4\) variables are pivoted into the longer format, and we expect \(n * (k-2)\) rows in the pivoted dataframe.

The eggs_tidy example has \(n = 120\) rows and \(k - 2 = 4\) columns being pivoted, so we expect the new dataframe to have \(n * 4 = 480\) rows and \(2 + 2 = 4\) columns: the two identifier columns (month and year) plus the new names_to and values_to columns.
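The estimate can also be written out as a few lines of R. This is only a sketch of the arithmetic; the variable names (n, k, id_cols, expected_rows, expected_cols) are chosen here for illustration.

```r
n <- 120      # rows in the original eggs_tidy
k <- 6        # columns in the original eggs_tidy
id_cols <- 2  # month and year are not pivoted

# Each original row yields one new row per pivoted column.
expected_rows <- n * (k - id_cols)

# The identifier columns remain, plus the names_to and values_to columns.
expected_cols <- id_cols + 2

c(rows = expected_rows, cols = expected_cols)
```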

Eggs_Tidy After Using Pivot_Longer Function

eggs_tidy %>%
  pivot_longer(
    cols = ends_with("dozen"),
    names_to = "carton type",
    values_to = "units")
# A tibble: 480 × 4
   month     year `carton type`          units
   <chr>    <dbl> <chr>                  <dbl>
 1 January   2004 large_half_dozen        126 
 2 January   2004 large_dozen             230 
 3 January   2004 extra_large_half_dozen  132 
 4 January   2004 extra_large_dozen       230 
 5 February  2004 large_half_dozen        128.
 6 February  2004 large_dozen             226.
 7 February  2004 extra_large_half_dozen  134.
 8 February  2004 extra_large_dozen       230 
 9 March     2004 large_half_dozen        131 
10 March     2004 large_dozen             225 
# … with 470 more rows

Conclusion

We see that the pivoted dataset does, indeed, have 480 rows and 4 columns, matching our estimate.
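Rather than reading the dimensions off the printed tibble, the comparison could be made in code. The sketch below builds a synthetic stand-in with the same shape as eggs_tidy (120 rows and 6 columns, with every value a placeholder rather than the real data) and checks the pivoted shape against the estimate. The names toy_eggs, toy_long, and carton_type are assumptions made for this illustration.

```r
library(tidyr)

# Synthetic stand-in shaped like eggs_tidy: 10 years x 12 months = 120 rows.
# All carton values are placeholders, NOT the real data.
toy_eggs <- expand_grid(year = 2004:2013, month = month.name)
toy_eggs$large_half_dozen <- 0
toy_eggs$large_dozen <- 0
toy_eggs$extra_large_half_dozen <- 0
toy_eggs$extra_large_dozen <- 0

toy_long <- pivot_longer(toy_eggs,
  cols = ends_with("dozen"),
  names_to = "carton_type", values_to = "units"
)

# Estimated shape: n * (k - 2) rows and 2 + 2 columns.
expected <- c(nrow(toy_eggs) * (ncol(toy_eggs) - 2), 2 + 2)

# Stops with an error if the actual shape differs from the estimate.
stopifnot(all(dim(toy_long) == expected))
dim(toy_long)
```

With the real data, the same check would apply to eggs_tidy and its pivoted result in place of toy_eggs and toy_long.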