DACSS 601 Fall 2021: HW3 Data Wrangling

Eliza Geeslin

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(readxl)
library(stringr)
library(rmarkdown)
library(knitr)
library(formattable)
library(vctrs)

Step 1: Read in the Data

The first step is getting everything up and running. * This dataset has a few tabs, so even though I am using the first tab (the default) I want to ensure we are using the correctly named tab every time. * There are actually two data sets (eggs and pounltry). I am going to be honest - I did not totally realize this until halfway through the wrangingling, so I decided to try to tidy the dataset into one that could be filtered down to either chicken or eggs. Not totally sure it worked :).

organic_egg_poultry <- read_excel("../../_data/organiceggpoultry.xls", sheet = "Data") # assign dataset to variable

#head(organic_egg_poultry)

colnames(organic_egg_poultry) # colnames() returns column names

 [1] "(Certified Organic denotes products grown and processed according to USDA's national organic standards and certified by USDA-accredited State and private certification organizations.)"
 [2] "...2"                                                                                                                                                                                   
 [3] "...3"                                                                                                                                                                                   
 [4] "...4"                                                                                                                                                                                   
 [5] "...5"                                                                                                                                                                                   
 [6] "...6"                                                                                                                                                                                   
 [7] "...7"                                                                                                                                                                                   
 [8] "...8"                                                                                                                                                                                   
 [9] "...9"                                                                                                                                                                                   
[10] "...10"                                                                                                                                                                                  
[11] "...11"

Step 2: Tidy the Data

I am going to do a few things to tidy the data: * Get rid of extra columns and rows * Adjust column names (for future pivoting) * Adjust data types

First, the column names are off, so I am going to skip 4 rows to get the actual column names.

organic_egg_poultry <- read_excel("../../_data/organiceggpoultry.xls", sheet = "Data", skip = 4)

paged_table(organic_egg_poultry)

Some of the column names are a little warped, so I am going to rename them: * I will indicate the ones to be deleted (“del”) * I am trying to name them in a way that will help later on - I think I might have to break out some of these columns because it seems like they hold multiple variables (Produce, Amount, ProduceType). This is where I realized there are really two distinct data sets, but c’est la vie!

organic_egg_poultry <- read_excel("../../_data/organiceggpoultry.xls", sheet = "Data", skip = 5, col_names = c("Month", "ExtraLargeEggs_Dozen_Eggs", "ExtraLargeEggs_HalfDozen_Eggs", "LargeEggs_Dozen_Eggs", "LargeEggs_HalfDozen_Eggs", "del", "WholeChicken_Pound_Chicken", "BonelessChickenBreast_Pound_Chicken", "BoneChickenBreast_Pound_Chicken", "WholeChickenLegs_Pound_Chicken", "ChickenThighs_Pound_Chicken"))

paged_table(organic_egg_poultry)

Next, I noticed something weird with the month “February” in the dataset that will mess up the column splitting/filling (the next two steps), so I went ahead fixed that.

organic_egg_poultry2 <- organic_egg_poultry
organic_egg_poultry2[organic_egg_poultry2 == "February /1"] <- "February"

paged_table(organic_egg_poultry2)

Then, I split Month and Year.

organic_egg_poultry3 <- organic_egg_poultry2 %>%
separate(col = 1, into=c('Month', 'Year'), sep = ' ')

paged_table(organic_egg_poultry3)

As we can see, the year doesn’t fill up all the rows, so I used fill() to complete the rows.

organic_egg_poultry3 <- organic_egg_poultry3 %>%
  fill(Year)

paged_table(organic_egg_poultry3)

Then I got rid of the extra column in between the Eggs and Chicken datasets (because I am still working with them together).

organic_egg_poultry3 <- organic_egg_poultry3 %>%
  select(!starts_with("del"))

paged_table(organic_egg_poultry3)

Step 3: Wrangle the Data

Before I pivot this data, I need to do something about the “too few” string in the chicken part of the dateset (I know this will cause an issue because I didn’t do anything initially and I got an error: “Can't combineExtraLargeEggs_Dozen_Eggs<double> andBoneChickenBreast_Pound_Chicken<character>”).

I think the best bet here is to mutate “too few” into N/A? (although I am not totally sure if that will mess up the integrity of the data?).

organic_egg_poultry4 <- organic_egg_poultry3
organic_egg_poultry4[organic_egg_poultry4 == "too few"] <- NA

paged_table(organic_egg_poultry4)

I also have to make sure everything is the right data type.

organic_egg_poultry4 <- organic_egg_poultry4 %>%
  mutate(BoneChickenBreast_Pound_Chicken = as.double(BonelessChickenBreast_Pound_Chicken)) %>%
  mutate(ChickenThighs_Pound_Chicken = as.double(ChickenThighs_Pound_Chicken))

paged_table(organic_egg_poultry4)

At last I can pivot longer the data. This is where the column naming comes in handy!

#pivot_long
organic_egg_poultry5 <- organic_egg_poultry4 %>%
  pivot_longer(
    cols = !Month & !Year, 
    names_to = c("Produce", "Amount", "ProduceType"), 
    names_sep = "_",
    values_to = "Price"
  )

paged_table(organic_egg_poultry5)

At this point the data is probably ready to arrange, filter, and summarize. I am going to arrange it by Produce Type (Eggs or Chicken).

organic_egg_poultry5 <- organic_egg_poultry5
arrange(organic_egg_poultry5, ProduceType, desc(Price))

# A tibble: 1,080 x 6
   Month    Year  Produce               Amount ProduceType Price
   <chr>    <chr> <chr>                 <chr>  <chr>       <dbl>
 1 November 2012  BonelessChickenBreast Pound  Chicken      704.
 2 November 2012  BoneChickenBreast     Pound  Chicken      704.
 3 December 2012  BonelessChickenBreast Pound  Chicken      704.
 4 December 2012  BoneChickenBreast     Pound  Chicken      704.
 5 Jan      2013  BonelessChickenBreast Pound  Chicken      704.
 6 Jan      2013  BoneChickenBreast     Pound  Chicken      704.
 7 February 2013  BonelessChickenBreast Pound  Chicken      704.
 8 February 2013  BoneChickenBreast     Pound  Chicken      704.
 9 March    2013  BonelessChickenBreast Pound  Chicken      704.
10 March    2013  BoneChickenBreast     Pound  Chicken      704.
# ... with 1,070 more rows

Something that is a little annoying about this is that I wanted to arrange it so that the “Eggs” section was at the top, but it seems to only do alphabetical.

Next, I am going to filter the data so we are only working with the Eggs data and then summarize.

organic_egg_poultry6 <- organic_egg_poultry5
filter(organic_egg_poultry6, ProduceType == "Eggs") %>%
  group_by(Produce, Amount) %>%
  summarise(mean = mean(Price), max = max(Price), min = min(Price))

# A tibble: 4 x 5
# Groups:   Produce [2]
  Produce        Amount     mean   max   min
  <chr>          <chr>     <dbl> <dbl> <dbl>
1 ExtraLargeEggs Dozen      267.  290    230
2 ExtraLargeEggs HalfDozen  164.  188.   132
3 LargeEggs      Dozen      254.  278.   225
4 LargeEggs      HalfDozen  155.  178    126

Distill is a publication format for scientific and technical writing, native to the web.

Learn more about using Distill at https://rstudio.github.io/distill.

Comment on this article Share:

HW3 Data Wrangling

Step 1: Read in the Data

Step 2: Tidy the Data

Step 3: Wrangle the Data

Reuse

Citation