Challenge 3 with R

challenge_3

eggs

kristin_abijaoude

Author

Kristin Abijaoude

Published

September 25, 2022

Code

library(tidyverse)
library(dplyr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Today, I will be tidying and pivoting eggs_tidy.csv.

Code

eggs<-read_csv("_data/eggs_tidy.csv")
eggs

# A tibble: 120 × 6
   month      year large_half_dozen large_dozen extra_large_half_dozen extra_l…¹
   <chr>     <dbl>            <dbl>       <dbl>                  <dbl>     <dbl>
 1 January    2004             126         230                    132       230 
 2 February   2004             128.        226.                   134.      230 
 3 March      2004             131         225                    137       230 
 4 April      2004             131         225                    137       234.
 5 May        2004             131         225                    137       236 
 6 June       2004             134.        231.                   137       241 
 7 July       2004             134.        234.                   137       241 
 8 August     2004             134.        234.                   137       241 
 9 September  2004             130.        234.                   136.      241 
10 October    2004             128.        234.                   136.      241 
# … with 110 more rows, and abbreviated variable name ¹extra_large_dozen

Code

head(eggs)

# A tibble: 6 × 6
  month     year large_half_dozen large_dozen extra_large_half_dozen extra_lar…¹
  <chr>    <dbl>            <dbl>       <dbl>                  <dbl>       <dbl>
1 January   2004             126         230                    132         230 
2 February  2004             128.        226.                   134.        230 
3 March     2004             131         225                    137         230 
4 April     2004             131         225                    137         234.
5 May       2004             131         225                    137         236 
6 June      2004             134.        231.                   137         241 
# … with abbreviated variable name ¹extra_large_dozen

Here, we get the first 6 rows of the dataset eggs_tidy.csv. From what I see, this dataset records the average price of eggs sold per carton in from 2004 to 2013.

Code

is_tibble(eggs)

[1] TRUE

Okay, what a mess. This is difficult to read and interpret. Let’s make it tidy first!

Code

as_tibble(eggs)

# A tibble: 120 × 6
   month      year large_half_dozen large_dozen extra_large_half_dozen extra_l…¹
   <chr>     <dbl>            <dbl>       <dbl>                  <dbl>     <dbl>
 1 January    2004             126         230                    132       230 
 2 February   2004             128.        226.                   134.      230 
 3 March      2004             131         225                    137       230 
 4 April      2004             131         225                    137       234.
 5 May        2004             131         225                    137       236 
 6 June       2004             134.        231.                   137       241 
 7 July       2004             134.        234.                   137       241 
 8 August     2004             134.        234.                   137       241 
 9 September  2004             130.        234.                   136.      241 
10 October    2004             128.        234.                   136.      241 
# … with 110 more rows, and abbreviated variable name ¹extra_large_dozen

This is much better to read than the original format, but work still needs to be done.

Code

eggs %>%
  gather("large_half_dozen", "large_dozen", "extra_large_half_dozen", "extra_large_dozen", key = "Eggs Sold", value = "Price sold")

# A tibble: 480 × 4
   month      year `Eggs Sold`      `Price sold`
   <chr>     <dbl> <chr>                   <dbl>
 1 January    2004 large_half_dozen         126 
 2 February   2004 large_half_dozen         128.
 3 March      2004 large_half_dozen         131 
 4 April      2004 large_half_dozen         131 
 5 May        2004 large_half_dozen         131 
 6 June       2004 large_half_dozen         134.
 7 July       2004 large_half_dozen         134.
 8 August     2004 large_half_dozen         134.
 9 September  2004 large_half_dozen         130.
10 October    2004 large_half_dozen         128.
# … with 470 more rows

Code

eggs_tibble<-tibble(eggs)

#| label: Existing rows
nrow(eggs_tibble)

[1] 120

Code

#| Existing columns 
ncol(eggs_tibble)

[1] 6

Code

#| expected rows/cases
nrow(eggs_tibble) * (ncol(eggs_tibble)-2)

[1] 480

Code

#| expected columns
4

[1] 4

From there, I can expect the new amount of rows and columns for the eggs dataset.

Pivot the Data

Code

eggs_new <- eggs %>%
  mutate("Large Half Dozen per Cart Sold" = large_half_dozen / 100,
         "Large Dozen per Cart Sold" = large_dozen / 100,
         "Extra Large Half Dozen per Cart Sold" = extra_large_half_dozen / 100,
         "Extra Large Dozen per Cart Sold" = extra_large_dozen / 100)

#| label: Replace Old Columns with New Ones
eggs_new1 <- select(eggs_new,-c(large_half_dozen, large_dozen, extra_large_half_dozen, extra_large_dozen))
eggs_new1

# A tibble: 120 × 6
   month      year `Large Half Dozen per Cart Sold` Large Doze…¹ Extra…² Extra…³
   <chr>     <dbl>                            <dbl>        <dbl>   <dbl>   <dbl>
 1 January    2004                             1.26         2.3     1.32    2.3 
 2 February   2004                             1.28         2.26    1.34    2.3 
 3 March      2004                             1.31         2.25    1.37    2.3 
 4 April      2004                             1.31         2.25    1.37    2.35
 5 May        2004                             1.31         2.25    1.37    2.36
 6 June       2004                             1.34         2.31    1.37    2.41
 7 July       2004                             1.34         2.34    1.37    2.41
 8 August     2004                             1.34         2.34    1.37    2.41
 9 September  2004                             1.30         2.34    1.36    2.41
10 October    2004                             1.28         2.34    1.36    2.41
# … with 110 more rows, and abbreviated variable names
#   ¹`Large Dozen per Cart Sold`, ²`Extra Large Half Dozen per Cart Sold`,
#   ³`Extra Large Dozen per Cart Sold`

I calculated the price of one cart sold in dollars with the mutate() command and removed the old columns to make room with the new ones.

Code

eggs_pivot <- eggs_new1%>%
  pivot_longer(cols=contains("large"),
               names_to = "size",
               values_to = "price"
  )
eggs_pivot

# A tibble: 480 × 4
   month     year size                                 price
   <chr>    <dbl> <chr>                                <dbl>
 1 January   2004 Large Half Dozen per Cart Sold        1.26
 2 January   2004 Large Dozen per Cart Sold             2.3 
 3 January   2004 Extra Large Half Dozen per Cart Sold  1.32
 4 January   2004 Extra Large Dozen per Cart Sold       2.3 
 5 February  2004 Large Half Dozen per Cart Sold        1.28
 6 February  2004 Large Dozen per Cart Sold             2.26
 7 February  2004 Extra Large Half Dozen per Cart Sold  1.34
 8 February  2004 Extra Large Dozen per Cart Sold       2.3 
 9 March     2004 Large Half Dozen per Cart Sold        1.31
10 March     2004 Large Dozen per Cart Sold             2.25
# … with 470 more rows

Okay, that’s (sort of) better! As you can see, the larger the eggs are, the more expensive they will be. The same concept applies to quantity, obviously, but you get more bang for your buck.