DACSS 601: Data Science Fundamentals - FALL 2022

Challenge 3

Author

Shriya Sehgal

Published

November 17, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. identify what needs to be done to tidy the current data
  3. anticipate the shape of pivoted data
  4. pivot the data into tidy format using pivot_longer

Read in data

Code
dataset <- read_csv("_data/eggs_tidy.csv")
dataset
# A tibble: 120 × 6
   month      year large_half_dozen large_dozen extra_large_half_dozen extra_l…¹
   <chr>     <dbl>            <dbl>       <dbl>                  <dbl>     <dbl>
 1 January    2004             126         230                    132       230 
 2 February   2004             128.        226.                   134.      230 
 3 March      2004             131         225                    137       230 
 4 April      2004             131         225                    137       234.
 5 May        2004             131         225                    137       236 
 6 June       2004             134.        231.                   137       241 
 7 July       2004             134.        234.                   137       241 
 8 August     2004             134.        234.                   137       241 
 9 September  2004             130.        234.                   136.      241 
10 October    2004             128.        234.                   136.      241 
# … with 110 more rows, and abbreviated variable name ¹​extra_large_dozen

Briefly describe the data

Code
summary(dataset)
Code
head(dataset)

There are 120 rows and 6 columns, containing data for each month from 2004 to 2013. The first 2 columns are the month and year, and the remaining 4 give the average price for each combination of egg size and quantity. The average prices range from roughly 120 to 290 cents. The columns are as follows: a. large_half_dozen b. extra_large_half_dozen c. large_dozen d. extra_large_dozen

Anticipate the End Result

The first step in pivoting the data is to try to come up with a concrete vision of what the end product should look like - that way you will know whether or not your pivoting was successful.

One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.

Suppose you have a dataset with \(n\) rows and \(k\) variables. In our example, 2 of the variables (month and year) are used to identify a case, so you will be pivoting \(k-2\) variables into a longer format where the \(k-2\) variable names will move into the names_to variables (here split into size and quantity) and the current values in each of those columns will move into the values_to variable. Therefore, we would expect \(n * (k-2)\) rows in the pivoted dataframe!

Code
#existing rows/cases
nrow(dataset)
[1] 120
Code
#existing columns/cases
ncol(dataset)
[1] 6
Code
#expected rows/cases
nrow(dataset) * (ncol(dataset)-2)
[1] 480
Code
# expected columns (2 identifiers + size + quantity + cost)
3 + 2
[1] 5

Challenge: Describe the final dimensions

It can be seen that the dataset has 120 rows and 6 columns. After pivoting, the columns will be month, year, size of the egg (large/extra large), quantity of the eggs (half dozen/dozen), and average price, which makes it much easier to observe the changes from 2004 to 2013 throughout the year by egg size as well as by quantity sold.

The resulting dataset will be 4 times longer than the original, since each of the 4 size-quantity price columns becomes its own row. A new column containing the average price of the eggs is added, so the total number of columns decreases by 1: the resulting columns are month, year, size, quantity, and average price.

Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.

Example

Code
dataset_long <- dataset %>%
  pivot_longer(cols = contains("large"),
               names_to = c("size", "quantity"),
               names_pattern = "(large|extra_large)_(half_dozen|dozen)",
               values_to = "cost"
  )
dataset_long
# A tibble: 480 × 5
   month     year size        quantity    cost
   <chr>    <dbl> <chr>       <chr>      <dbl>
 1 January   2004 large       half_dozen  126 
 2 January   2004 large       dozen       230 
 3 January   2004 extra_large half_dozen  132 
 4 January   2004 extra_large dozen       230 
 5 February  2004 large       half_dozen  128.
 6 February  2004 large       dozen       226.
 7 February  2004 extra_large half_dozen  134.
 8 February  2004 extra_large dozen       230 
 9 March     2004 large       half_dozen  131 
10 March     2004 large       dozen       225 
# … with 470 more rows
Code
#existing rows/cases after the pivot
nrow(dataset_long)
[1] 480
Code
#existing columns/cases after the pivot
ncol(dataset_long)
[1] 5

It can be observed that the data is 4 times longer than the original dataset (120 to 480 rows) and the number of columns is reduced by 1. Pivoting puts a single observation in each row, which makes the data easier to understand and to work with in future analysis.

Source Code
---
title: "Challenge 3"
author: "Shriya Sehgal"
description: "Tidy Data: Pivoting"
date: "11/17/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_3
  - animal_weights
  - eggs
  - australian_marriage
  - usa_households
  - sce_labor
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

Today's challenge is to:

1.  read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2.  identify what needs to be done to tidy the current data
3.  anticipate the shape of pivoted data
4.  pivot the data into tidy format using `pivot_longer`

## Read in data


```{r}
dataset <- read_csv("_data/eggs_tidy.csv")
dataset

```

### Briefly describe the data

```{r}
summary(dataset)
```

```{r}
head(dataset)
```


There are 120 rows and 6 columns, containing data for each month from 2004 to 2013. The first 2 columns are the month and year, and the remaining 4 give the average price for each combination of egg size and quantity. The average prices range from roughly 120 to 290 cents. The columns are as follows.
a. large_half_dozen
b. extra_large_half_dozen
c. large_dozen
d. extra_large_dozen


## Anticipate the End Result

The first step in pivoting the data is to try to come up with a concrete vision of what the end product *should* look like - that way you will know whether or not your pivoting was successful.

One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.

Suppose you have a dataset with $n$ rows and $k$ variables. In our example, 2 of the variables (month and year) are used to identify a case, so you will be pivoting $k-2$ variables into a longer format where the $k-2$ variable names will move into the `names_to` variables (here split into size and quantity) and the current values in each of those columns will move into the `values_to` variable. Therefore, we would expect $n * (k-2)$ rows in the pivoted dataframe!



```{r}
#existing rows/cases
nrow(dataset)

#existing columns/cases
ncol(dataset)

#expected rows/cases
nrow(dataset) * (ncol(dataset)-2)

# expected columns (2 identifiers + size + quantity + cost)
3 + 2
```

### Challenge: Describe the final dimensions



It can be seen that the dataset has 120 rows and 6 columns. After pivoting, the columns will be month, year, size of the egg (large/extra large), quantity of the eggs (half dozen/dozen), and average price, which makes it much easier to observe the changes from 2004 to 2013 throughout the year by egg size as well as by quantity sold.


The resulting dataset will be 4 times longer than the original, since each of the 4 size-quantity price columns becomes its own row. A new column containing the average price of the eggs is added, so the total number of columns decreases by 1: the resulting columns are month, year, size, quantity, and average price.
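
Before pivoting the full data, the dimension arithmetic can be sanity-checked on a tiny hand-built version of the table. The sketch below is a toy example (not part of the original analysis) that copies only the January and March 2004 rows printed above:

```{r}
library(tidyverse)

# Two rows copied from the data printed above
toy <- tibble(
  month = c("January", "March"),
  year  = c(2004, 2004),
  large_half_dozen       = c(126, 131),
  large_dozen            = c(230, 225),
  extra_large_half_dozen = c(132, 137),
  extra_large_dozen      = c(230, 230)
)

toy_long <- toy %>%
  pivot_longer(cols = contains("large"),
               names_to = c("size", "quantity"),
               names_pattern = "(large|extra_large)_(half_dozen|dozen)",
               values_to = "cost")

# 2 rows * (6 - 2) pivoted columns = 8 rows; month, year, size, quantity, cost = 5 columns
dim(toy_long)
```

The result, `8 5`, matches the expected `nrow * (ncol - 2)` rows and 5 columns computed earlier.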

## Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a "sanity" check.

### Example

```{r}
#| tbl-cap: Pivoted Example

dataset_long <- dataset %>%
  pivot_longer(cols = contains("large"),
               names_to = c("size", "quantity"),
               names_pattern = "(large|extra_large)_(half_dozen|dozen)",
               values_to = "cost"
  )
dataset_long
dataset_long
```

```{r}
#existing rows/cases after the pivot
nrow(dataset_long)
#existing columns/cases after the pivot
ncol(dataset_long)
```


It can be observed that the data is 4 times longer than the original dataset (120 to 480 rows) and the number of columns is reduced by 1. Pivoting puts a single observation in each row, which makes the data easier to understand and to work with in future analysis.
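
As a quick illustration of that payoff, long-format data makes grouped summaries a one-liner. The sketch below is a hand-built slice of the pivoted table (January and March 2004 values printed earlier, not the author's original code) averaged by size and quantity; the same `group_by()`/`summarise()` pipeline would apply unchanged to the full `dataset_long`:

```{r}
library(tidyverse)

# A hand-built slice of the pivoted data (January and March 2004 values)
slice_long <- tribble(
  ~month,    ~year, ~size,         ~quantity,    ~cost,
  "January",  2004, "large",       "half_dozen", 126,
  "January",  2004, "large",       "dozen",      230,
  "January",  2004, "extra_large", "half_dozen", 132,
  "January",  2004, "extra_large", "dozen",      230,
  "March",    2004, "large",       "half_dozen", 131,
  "March",    2004, "large",       "dozen",      225,
  "March",    2004, "extra_large", "half_dozen", 137,
  "March",    2004, "extra_large", "dozen",      230
)

# One observation per row means a grouped summary needs no reshaping first
slice_long %>%
  group_by(size, quantity) %>%
  summarise(mean_cost = mean(cost), .groups = "drop")
```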