challenge_3
Aritra Basu
Eggs
Tidy Data: Pivoting
Author

Aritra Basu

Published

March 20, 2023

Code
library(tidyverse)
library(dplyr)
library(descr)
library(readr)
library(readxl)
library(summarytools)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Reading in data, and cleaning it.

I am working on the first sheet of organiceggpoultry. The first task is to clean the data. I have removed a column of NAs, and then used mutate, separate and fill to separate the month and years. I have then used select to consider the variables that are associated with eggs and not with chickens.

Code
egg_poultry <- read_excel("_data/organiceggpoultry.xls", skip=4)
subset <- subset( egg_poultry, select = -c(6) )


df1 <- subset %>% separate(...1, sep = " ", into=c('month', 'year'), remove = FALSE)
df2 <- mutate_all(df1, funs(replace(., .=='/1', NA)))
df3 <- mutate_all(df2, funs(replace(., .=='Jan', 'January')))

df4 <- df3 %>% fill(year)
str(df4)
tibble [120 × 12] (S3: tbl_df/tbl/data.frame)
 $ ...1                           : chr [1:120] "Jan 2004" "February" "March" "April" ...
 $ month                          : chr [1:120] "January" "February" "March" "April" ...
 $ year                           : chr [1:120] "2004" "2004" "2004" "2004" ...
 $ Extra Large 
Dozen            : chr [1:120] "230" "230" "230" "234.5" ...
 $ Extra Large 1/2 Doz.
1/2 Dozen: chr [1:120] "132" "134.5" "137" "137" ...
 $ Large 
Dozen                  : chr [1:120] "230" "226.25" "225" "225" ...
 $ Large 
1/2 Doz.               : chr [1:120] "126" "128.5" "131" "131" ...
 $ Whole                          : chr [1:120] "197.5" "197.5" "209" "212" ...
 $ B/S Breast                     : chr [1:120] "645.5" "642.5" "642.5" "642.5" ...
 $ Bone-in Breast                 : chr [1:120] "too few" "too few" "too few" "too few" ...
 $ Whole Legs                     : chr [1:120] "193.5" "193.5" "193.5" "193.5" ...
 $ Thighs                         : chr [1:120] "too few" "203" "203" "203" ...

Now, I change some of the variables from character to numeric.

Code
df5<-select(df4, "month", "year", "Extra Large 
Dozen", "Extra Large 1/2 Doz.
1/2 Dozen", "Large 
Dozen", "Large 
1/2 Doz.")
colnames(df5) <- c('month','year','Extra Large Dozen', 'Extra Large Half Dozen', 'Large Dozen', 'Large Half Dozen')
df6<-df5 %>%
  mutate_at('year', as.numeric)%>%
  mutate_at('Extra Large Dozen', as.numeric)%>%
  mutate_at('Extra Large Half Dozen', as.numeric)%>%
  mutate_at('Large Dozen', as.numeric)%>%
  mutate_at('Large Half Dozen', as.numeric)
  str(df6)
tibble [120 × 6] (S3: tbl_df/tbl/data.frame)
 $ month                 : chr [1:120] "January" "February" "March" "April" ...
 $ year                  : num [1:120] 2004 2004 2004 2004 2004 ...
 $ Extra Large Dozen     : num [1:120] 230 230 230 234 236 ...
 $ Extra Large Half Dozen: num [1:120] 132 134 137 137 137 ...
 $ Large Dozen           : num [1:120] 230 226 225 225 225 ...
 $ Large Half Dozen      : num [1:120] 126 128 131 131 131 ...
Code
  clean_eggs<-df6

Describing the data

Code
print(summarytools::dfSummary(clean_eggs,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

clean_eggs

Dimensions: 120 x 6
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
month [character]
1. April
2. August
3. December
4. February
5. January
6. July
7. June
8. March
9. May
10. November
[ 2 others ]
10 ( 8.3% )
10 ( 8.3% )
10 ( 8.3% )
10 ( 8.3% )
10 ( 8.3% )
10 ( 8.3% )
10 ( 8.3% )
10 ( 8.3% )
10 ( 8.3% )
10 ( 8.3% )
20 ( 16.7% )
0 (0.0%)
year [numeric]
Mean (sd) : 2008.5 (2.9)
min ≤ med ≤ max:
2004 ≤ 2008.5 ≤ 2013
IQR (CV) : 5 (0)
2004 : 12 ( 10.0% )
2005 : 12 ( 10.0% )
2006 : 12 ( 10.0% )
2007 : 12 ( 10.0% )
2008 : 12 ( 10.0% )
2009 : 12 ( 10.0% )
2010 : 12 ( 10.0% )
2011 : 12 ( 10.0% )
2012 : 12 ( 10.0% )
2013 : 12 ( 10.0% )
0 (0.0%)
Extra Large Dozen [numeric]
Mean (sd) : 266.8 (22.8)
min ≤ med ≤ max:
230 ≤ 285.5 ≤ 290
IQR (CV) : 44 (0.1)
11 distinct values 0 (0.0%)
Extra Large Half Dozen [numeric]
Mean (sd) : 164.2 (24.7)
min ≤ med ≤ max:
132 ≤ 185.5 ≤ 188.1
IQR (CV) : 49.7 (0.2)
132.00   : 1 ( 0.8% )
134.50   : 1 ( 0.8% )
135.50   : 28 ( 23.3% )
135.88 ! : 1 ( 0.8% )
137.00   : 6 ( 5.0% )
138.12 ! : 1 ( 0.8% )
139.00   : 15 ( 12.5% )
185.50   : 53 ( 44.2% )
188.13   : 14 ( 11.7% )
! rounded
0 (0.0%)
Large Dozen [numeric]
Mean (sd) : 254.2 (18.5)
min ≤ med ≤ max:
225 ≤ 267.5 ≤ 277.5
IQR (CV) : 34.5 (0.1)
12 distinct values 0 (0.0%)
Large Half Dozen [numeric]
Mean (sd) : 155.2 (22.6)
min ≤ med ≤ max:
126 ≤ 174.5 ≤ 178
IQR (CV) : 45.1 (0.1)
126.00   : 1 ( 0.8% )
128.50   : 29 ( 24.2% )
129.75   : 1 ( 0.8% )
131.00   : 3 ( 2.5% )
131.12 ! : 1 ( 0.8% )
132.00   : 15 ( 12.5% )
133.50   : 3 ( 2.5% )
173.25   : 6 ( 5.0% )
174.50   : 47 ( 39.2% )
178.00   : 14 ( 11.7% )
! rounded
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-04-02

It can be seen that the dimension is 120 x 6.

The observations are in a month year format. The values are prices per cartons for different types of packaging for a particular month between 2004 and 2013 (inclusive). In the original file, the years, as well as the values were stored as characters, and I have converted them to numeric.

The way the data is presented is fairly intuitive. But this could become cumbersome if we had a few more columns. However, the data is already tidy

Describe the data, and be sure to comment on why you are planning to pivot it to make it “tidy”

Anticipating the end results

Challenge: Describing the final dimensions

Documenting my work here.

Code
#existing rows/cases
nrow(clean_eggs)
[1] 120
Code
#existing columns/cases
ncol(clean_eggs)
[1] 6
Code
#expected rows/cases
nrow(clean_eggs) * (ncol(clean_eggs)-4)
[1] 240
Code
# expected columns 
4
[1] 4

Pivoting the Data

Challenge: Pivoting the Chosen Data

The new case again is a month year unit. However, the data is now in a long format instead of wide. It satisfies the requirement of tidy data.

Code
clean_eggs<-pivot_longer(clean_eggs, col = c("Extra Large Dozen", "Extra Large Half Dozen", "Large Dozen", "Large Half Dozen"),
                 names_to="Type of packaging",
                 values_to = "Price of Cartons")
clean_eggs
# A tibble: 480 × 4
   month     year `Type of packaging`    `Price of Cartons`
   <chr>    <dbl> <chr>                               <dbl>
 1 January   2004 Extra Large Dozen                    230 
 2 January   2004 Extra Large Half Dozen               132 
 3 January   2004 Large Dozen                          230 
 4 January   2004 Large Half Dozen                     126 
 5 February  2004 Extra Large Dozen                    230 
 6 February  2004 Extra Large Half Dozen               134.
 7 February  2004 Large Dozen                          226.
 8 February  2004 Large Half Dozen                     128.
 9 March     2004 Extra Large Dozen                    230 
10 March     2004 Extra Large Half Dozen               137 
# … with 470 more rows