Challenge 3

challenge_3

eggs

Tidy Data: Pivoting

Author

Pooja Shah

Published

April 27, 2023

Code

library(tidyverse)
library(readxl)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
identify what needs to be done to tidy the current data
anticipate the shape of pivoted data
pivot the data into tidy format using pivot_longer

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

animal_weights.csv ⭐
eggs_tidy.csv ⭐⭐ or organiceggpoultry.xls ⭐⭐⭐
australian_marriage*.xls ⭐⭐⭐
USA Households*.xlsx ⭐⭐⭐⭐
sce_labor_chart_data_public.xlsx 🌟🌟🌟🌟🌟

Code

#reading and printing unclean data
eggs = read_excel("_data/organiceggpoultry.xls", sheet = "Data", range = "B5:F125", col_names = c("date", "xlarge_dozen",
                               "xlarge_halfdozen", "large_dozen",
                               "large_halfdozen"))
eggs

# A tibble: 121 × 5
   date      xlarge_dozen          xlarge_halfdozen  large_dozen large_halfdozen
   <chr>     <chr>                 <chr>             <chr>       <chr>          
 1 <NA>      "Extra Large \nDozen" "Extra Large 1/2… "Large \nD… "Large \n1/2 D…
 2 Jan 2004  "230"                 "132"             "230"       "126"          
 3 February  "230"                 "134.5"           "226.25"    "128.5"        
 4 March     "230"                 "137"             "225"       "131"          
 5 April     "234.5"               "137"             "225"       "131"          
 6 May       "236"                 "137"             "225"       "131"          
 7 June      "241"                 "137"             "231.375"   "133.5"        
 8 July      "241"                 "137"             "233.5"     "133.5"        
 9 August    "241"                 "137"             "233.5"     "133.5"        
10 September "241"                 "135.875"         "233.5"     "129.75"       
# ℹ 111 more rows

Briefly describe the data

Describe the data, and be sure to comment on why you are planning to pivot it to make it “tidy” I am working here with the “Data” sheet of the “organiceggpoultry” excel file. The data is in the form of multiple columns for types of eggs. The rows are prices of these eggs for each month.

Anticipate the End Result

The first step in pivoting the data is to try to come up with a concrete vision of what the end product should look like - that way you will know whether or not your pivoting was successful.

One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.

The tidy data should be of the form of columns for type of poultry, year and then the price for 12 months.

Find current and future data dimensions

Code

#existing rows
nrow(eggs)

[1] 121

Code

#existing columns
ncol(eggs)

[1] 5

Code

#expected rows
row = ((nrow(eggs) - 1)/12) * (ncol(eggs) - 1)
row

[1] 40

Code

# expected columns 
col = 12 + 2
col

[1] 14

Challenge: Describe the final dimensions

Code

print(paste("The final dimensions should be ", row, "x", col))

[1] "The final dimensions should be  40 x 14"

Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.

Example

Code

eggs <- eggs %>% drop_na

eggs%>%
  count(date)

# A tibble: 22 × 2
   date            n
   <chr>       <int>
 1 April          10
 2 August         10
 3 December       10
 4 February        8
 5 February /1     2
 6 Jan 2004        1
 7 Jan 2005        1
 8 Jan 2006        1
 9 Jan 2007        1
10 Jan 2008        1
# ℹ 12 more rows

Code

eggs<-eggs%>%
  mutate(date = str_remove(date, " /1"))

eggs%>%
  count(date)

# A tibble: 21 × 2
   date         n
   <chr>    <int>
 1 April       10
 2 August      10
 3 December    10
 4 February    10
 5 Jan 2004     1
 6 Jan 2005     1
 7 Jan 2006     1
 8 Jan 2007     1
 9 Jan 2008     1
10 Jan 2009     1
# ℹ 11 more rows

Code

eggs<-eggs%>%
  separate(date, into=c("month", "year"), sep=" ")%>%
  fill(year)

eggs

# A tibble: 120 × 6
   month     year  xlarge_dozen xlarge_halfdozen large_dozen large_halfdozen
   <chr>     <chr> <chr>        <chr>            <chr>       <chr>          
 1 Jan       2004  230          132              230         126            
 2 February  2004  230          134.5            226.25      128.5          
 3 March     2004  230          137              225         131            
 4 April     2004  234.5        137              225         131            
 5 May       2004  236          137              225         131            
 6 June      2004  241          137              231.375     133.5          
 7 July      2004  241          137              233.5       133.5          
 8 August    2004  241          137              233.5       133.5          
 9 September 2004  241          135.875          233.5       129.75         
10 October   2004  241          135.5            233.5       128.5          
# ℹ 110 more rows

Challenge: Pivot the Chosen Data

Document your work here. What will a new “case” be once you have pivoted the data? How does it meet requirements for tidy data? The year is a seperate entity now. The data is divided into cols for the month. The row data is year and type of egg.

Code

eggs <- eggs %>%
  pivot_longer(cols=contains("large"), 
               names_to = "eggType",
               values_to = "avgPrice")

eggs <- eggs %>%
  pivot_wider(names_from = month,
               values_from = avgPrice)
eggs

# A tibble: 40 × 14
   year  eggType   Jan   February March April May   June  July  August September
   <chr> <chr>     <chr> <chr>    <chr> <chr> <chr> <chr> <chr> <chr>  <chr>    
 1 2004  xlarge_d… 230   230      230   234.5 236   241   241   241    241      
 2 2004  xlarge_h… 132   134.5    137   137   137   137   137   137    135.875  
 3 2004  large_do… 230   226.25   225   225   225   231.… 233.5 233.5  233.5    
 4 2004  large_ha… 126   128.5    131   131   131   133.5 133.5 133.5  129.75   
 5 2005  xlarge_d… 241   241      241   241   241   241   241   241    241      
 6 2005  xlarge_h… 135.5 135.5    135.5 135.5 135.5 135.5 135.5 135.5  135.5    
 7 2005  large_do… 233.5 233.5    233.5 233.5 233.5 233.5 233.5 233.5  233.5    
 8 2005  large_ha… 128.5 128.5    128.5 128.5 128.5 128.5 128.5 128.5  128.5    
 9 2006  xlarge_d… 241   241      241.… 241.5 241.5 241.5 241.5 241.5  241.5    
10 2006  xlarge_h… 135.5 135.5    135.5 135.5 135.5 135.5 135.5 135.5  135.5    
# ℹ 30 more rows
# ℹ 3 more variables: October <chr>, November <chr>, December <chr>