Tidying and Pivoting: Challenge 3

challenge_3

Dirichi Umunna

eggs

Tidy Data: Pivoting

Author

Dirichi Umunna

Published

March 17, 2023

Code

library(tidyverse)
library(readxl)
library(magrittr)
library(tidyr)
library(dplyr)
library(stringr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

In this blog post, we will explore how to work with a given data set, starting from reading the data to tidying it using pivot_longer. The data set contains information on organic egg and poultry costs in the United States, with data points representing a range of values. We will describe the data set using words and provide additional supporting information as needed, such as tables, to provide insight into the contents of the data.

Code

#read in the dataset. we will be excluding unnecessary columns

cleanpoultry <- read_excel("_data/organiceggpoultry.xls", sheet = "Data", skip = 5, col_names = c("Date", "Xtral_12", "Xtral_6", "Large_12", "Large_6", "Delete", "Whole", "Bsbreast", "Boneinbreast", "Wholelegs", "Thighs")) %>%
select(-c(Delete))%>%
  drop_na()

#view data set
head(cleanpoultry)

# A tibble: 6 × 10
  Date     Xtral_12 Xtral_6 Large…¹ Large_6 Whole Bsbre…² Bonei…³ Whole…⁴ Thighs
  <chr>       <dbl>   <dbl>   <dbl>   <dbl> <dbl>   <dbl> <chr>     <dbl> <chr> 
1 Jan 2004     230     132     230     126   198.    646. too few    194. too f…
2 February     230     134.    226.    128.  198.    642. too few    194. 203   
3 March        230     137     225     131   209     642. too few    194. 203   
4 April        234.    137     225     131   212     642. too few    194. 203   
5 May          236     137     225     131   214.    642. too few    194. 203   
6 June         241     137     231.    134.  216.    641  too few    202. 200.3…
# … with abbreviated variable names ¹Large_12, ²Bsbreast, ³Boneinbreast,
#   ⁴Wholelegs

Briefly describe the data

The data set is focused on the monthly prices paid for USDA certified organic eggs and poultry, specifically chicken, between the years 2004 and 2013. It contains information on four different types of eggs and five different types of chicken, including their prices per carton and price cents per pound. The prices were set at the delivery to first receivers. The data set is sourced from the U.S. Department of Agriculture. Pivoting this data set to make it “tidy” is often required to make it easier to analyze and visualize the data.

Anticipate the End Result

We will attempt to visualize the end result of our pivoting activity by anticipating our expected number of columns and rows. First we will further clean dataset in anticipation.

Code

#separate the date into month and year
cleanpoultry <- separate (cleanpoultry, Date, into = c("Month", "Year", sep = " "))

#convert year to integer
cleanpoultry$Year <- as.integer(cleanpoultry$Year)

# fill in missing values in Year column
cleanpoultry <- fill(cleanpoultry, Year)

#remove unnecessary column
cleanpoultry <- cleanpoultry[, -which(names(cleanpoultry) == " ")]

## make all datatypes the same
cleanpoultry <- cleanpoultry %>%
  mutate(Boneinbreast = as.numeric(Boneinbreast),
         Thighs = as.numeric(Thighs))

cleanpoultry

# A tibble: 120 × 11
   Month      Year Xtral…¹ Xtral_6 Large…² Large_6 Whole Bsbre…³ Bonei…⁴ Whole…⁵
   <chr>     <int>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>   <dbl>   <dbl>   <dbl>
 1 Jan        2004    230     132     230     126   198.    646.     NA     194.
 2 February   2004    230     134.    226.    128.  198.    642.     NA     194.
 3 March      2004    230     137     225     131   209     642.     NA     194.
 4 April      2004    234.    137     225     131   212     642.     NA     194.
 5 May        2004    236     137     225     131   214.    642.     NA     194.
 6 June       2004    241     137     231.    134.  216.    641      NA     202.
 7 July       2004    241     137     234.    134.  217     642.    390.    204.
 8 August     2004    241     137     234.    134.  217     642.    390.    204.
 9 September  2004    241     136.    234.    130.  217     642.    390.    204.
10 October    2004    241     136.    234.    128.  217     642.    390.    204.
# … with 110 more rows, 1 more variable: Thighs <dbl>, and abbreviated variable
#   names ¹Xtral_12, ²Large_12, ³Bsbreast, ⁴Boneinbreast, ⁵Wholelegs

Code

# view data
colnames(cleanpoultry)

 [1] "Month"        "Year"         "Xtral_12"     "Xtral_6"      "Large_12"    
 [6] "Large_6"      "Whole"        "Bsbreast"     "Boneinbreast" "Wholelegs"   
[11] "Thighs"

Code

str(cleanpoultry)

tibble [120 × 11] (S3: tbl_df/tbl/data.frame)
 $ Month       : chr [1:120] "Jan" "February" "March" "April" ...
 $ Year        : int [1:120] 2004 2004 2004 2004 2004 2004 2004 2004 2004 2004 ...
 $ Xtral_12    : num [1:120] 230 230 230 234 236 ...
 $ Xtral_6     : num [1:120] 132 134 137 137 137 ...
 $ Large_12    : num [1:120] 230 226 225 225 225 ...
 $ Large_6     : num [1:120] 126 128 131 131 131 ...
 $ Whole       : num [1:120] 198 198 209 212 214 ...
 $ Bsbreast    : num [1:120] 646 642 642 642 642 ...
 $ Boneinbreast: num [1:120] NA NA NA NA NA ...
 $ Wholelegs   : num [1:120] 194 194 194 194 194 ...
 $ Thighs      : num [1:120] NA 203 203 203 203 ...

Code

#existing rows/cases
nrow(cleanpoultry)

[1] 120

Code

#existing columns/cases
ncol(cleanpoultry)

[1] 11

Code

#expected rows/cases
nrow(cleanpoultry) * (ncol(cleanpoultry)-2)

[1] 1080

Code

# expected columns 
ncol(cleanpoultry) - 9 + 2

[1] 4

Code

cleanpoultry

# A tibble: 120 × 11
   Month      Year Xtral…¹ Xtral_6 Large…² Large_6 Whole Bsbre…³ Bonei…⁴ Whole…⁵
   <chr>     <int>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>   <dbl>   <dbl>   <dbl>
 1 Jan        2004    230     132     230     126   198.    646.     NA     194.
 2 February   2004    230     134.    226.    128.  198.    642.     NA     194.
 3 March      2004    230     137     225     131   209     642.     NA     194.
 4 April      2004    234.    137     225     131   212     642.     NA     194.
 5 May        2004    236     137     225     131   214.    642.     NA     194.
 6 June       2004    241     137     231.    134.  216.    641      NA     202.
 7 July       2004    241     137     234.    134.  217     642.    390.    204.
 8 August     2004    241     137     234.    134.  217     642.    390.    204.
 9 September  2004    241     136.    234.    130.  217     642.    390.    204.
10 October    2004    241     136.    234.    128.  217     642.    390.    204.
# … with 110 more rows, 1 more variable: Thighs <dbl>, and abbreviated variable
#   names ¹Xtral_12, ²Large_12, ³Bsbreast, ⁴Boneinbreast, ⁵Wholelegs

Pivot the Data

Code

#now we pivot the data

cleanpoultry <-pivot_longer(cleanpoultry, col = c("Xtral_12", "Xtral_6", "Large_12", "Large_6", "Whole", "Bsbreast", "Boneinbreast", "Wholelegs", "Thighs"),
                 names_to="Eggs & Poultry Type",
                 values_to = "Cost")
cleanpoultry

# A tibble: 1,080 × 4
   Month     Year `Eggs & Poultry Type`  Cost
   <chr>    <int> <chr>                 <dbl>
 1 Jan       2004 Xtral_12               230 
 2 Jan       2004 Xtral_6                132 
 3 Jan       2004 Large_12               230 
 4 Jan       2004 Large_6                126 
 5 Jan       2004 Whole                  198.
 6 Jan       2004 Bsbreast               646.
 7 Jan       2004 Boneinbreast            NA 
 8 Jan       2004 Wholelegs              194.
 9 Jan       2004 Thighs                  NA 
10 February  2004 Xtral_12               230 
# … with 1,070 more rows

Conclusion

In this current exercise, we successfully cleaned a data set and transformed it into a tidy format using pivot longer function. The resulting data set now has 1080 cases and 4 columns, like we anticipated. This makes it easy to analyze and manipulate the data. The process of cleaning and tidying the data set has made it more organized and readily usable for further analysis.