Challenge 3 Solution

challenge_3

animal_weights

eggs

australian_marriage

usa_households

sce_labor

Tidy Data: Pivoting

Author

Shreya Varma

Published

May 30, 2023

Code

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
identify what needs to be done to tidy the current data
anticipate the shape of pivoted data
pivot the data into tidy format using pivot_longer

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

animal_weights.csv ⭐
eggs_tidy.csv ⭐⭐ or organiceggpoultry.xls ⭐⭐⭐
australian_marriage*.xls ⭐⭐⭐
USA Households*.xlsx ⭐⭐⭐⭐
sce_labor_chart_data_public.xlsx 🌟🌟🌟🌟🌟

Code

eggs_tidy <- read_csv("_data/eggs_tidy.csv")
eggs_tidy

# A tibble: 120 × 6
   month      year large_half_dozen large_dozen extra_large_half_dozen
   <chr>     <dbl>            <dbl>       <dbl>                  <dbl>
 1 January    2004             126         230                    132 
 2 February   2004             128.        226.                   134.
 3 March      2004             131         225                    137 
 4 April      2004             131         225                    137 
 5 May        2004             131         225                    137 
 6 June       2004             134.        231.                   137 
 7 July       2004             134.        234.                   137 
 8 August     2004             134.        234.                   137 
 9 September  2004             130.        234.                   136.
10 October    2004             128.        234.                   136.
# ℹ 110 more rows
# ℹ 1 more variable: extra_large_dozen <dbl>

Code

glimpse(eggs_tidy)

Rows: 120
Columns: 6
$ month                  <chr> "January", "February", "March", "April", "May",…
$ year                   <dbl> 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004,…
$ large_half_dozen       <dbl> 126.00, 128.50, 131.00, 131.00, 131.00, 133.50,…
$ large_dozen            <dbl> 230.000, 226.250, 225.000, 225.000, 225.000, 23…
$ extra_large_half_dozen <dbl> 132.000, 134.500, 137.000, 137.000, 137.000, 13…
$ extra_large_dozen      <dbl> 230.0, 230.0, 230.0, 234.5, 236.0, 241.0, 241.0…

Code

summary(eggs_tidy)

    month                year      large_half_dozen  large_dozen   
 Length:120         Min.   :2004   Min.   :126.0    Min.   :225.0  
 Class :character   1st Qu.:2006   1st Qu.:129.4    1st Qu.:233.5  
 Mode  :character   Median :2008   Median :174.5    Median :267.5  
                    Mean   :2008   Mean   :155.2    Mean   :254.2  
                    3rd Qu.:2011   3rd Qu.:174.5    3rd Qu.:268.0  
                    Max.   :2013   Max.   :178.0    Max.   :277.5  
 extra_large_half_dozen extra_large_dozen
 Min.   :132.0          Min.   :230.0    
 1st Qu.:135.8          1st Qu.:241.5    
 Median :185.5          Median :285.5    
 Mean   :164.2          Mean   :266.8    
 3rd Qu.:185.5          3rd Qu.:285.5    
 Max.   :188.1          Max.   :290.0

Briefly describe the data

Describe the data, and be sure to comment on why you are planning to pivot it to make it “tidy”

This dataset contains the data of type of eggs sold every month from 2004 to 2013. We can use pivot longer to reduce the number of columns and increase the number of rows. Thus the type of eggs can turn into columns and the egg value can become one column. Thus will result in 4 columns of dataset only and tidy it.

Anticipate the End Result

There are 120 rows and 6 columns in this dataset. Out of these 4 columns represent type of eggs which can become one column and its value can go into value column. Thus we should expect 120*4 = 480 rows in the pivoted dataframe.

Challenge: Describe the final dimensions

Code

df<-eggs_tidy
df

# A tibble: 120 × 6
   month      year large_half_dozen large_dozen extra_large_half_dozen
   <chr>     <dbl>            <dbl>       <dbl>                  <dbl>
 1 January    2004             126         230                    132 
 2 February   2004             128.        226.                   134.
 3 March      2004             131         225                    137 
 4 April      2004             131         225                    137 
 5 May        2004             131         225                    137 
 6 June       2004             134.        231.                   137 
 7 July       2004             134.        234.                   137 
 8 August     2004             134.        234.                   137 
 9 September  2004             130.        234.                   136.
10 October    2004             128.        234.                   136.
# ℹ 110 more rows
# ℹ 1 more variable: extra_large_dozen <dbl>

Code

#existing rows/cases
nrow(df)

[1] 120

Code

#existing columns/cases
ncol(df)

[1] 6

Code

#expected rows/cases
nrow(df) * 4

[1] 480

Code

# expected columns 
ncol(df) -4 + 2

[1] 4

Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.

Challenge: Pivot the Chosen Data

Code

eggs_tidy_pivoted <- pivot_longer(eggs_tidy, col = large_half_dozen:extra_large_dozen, names_to = "eggs_type", values_to = "number")

eggs_tidy_pivoted

# A tibble: 480 × 4
   month     year eggs_type              number
   <chr>    <dbl> <chr>                   <dbl>
 1 January   2004 large_half_dozen         126 
 2 January   2004 large_dozen              230 
 3 January   2004 extra_large_half_dozen   132 
 4 January   2004 extra_large_dozen        230 
 5 February  2004 large_half_dozen         128.
 6 February  2004 large_dozen              226.
 7 February  2004 extra_large_half_dozen   134.
 8 February  2004 extra_large_dozen        230 
 9 March     2004 large_half_dozen         131 
10 March     2004 large_dozen              225 
# ℹ 470 more rows

Thus we have pivoted the data to get 480 rows.