Challenge 4 Submission

challenge_4
abc_poll
eggs
fed_rates
hotel_bookings
debt
More data wrangling: pivoting
Author

Matt Zambetti

Published

May 30, 2023

Code
library(tidyverse)
library(readxl)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations

Read in data

The Data Set I chose to read in was the organic poultry data:

  • organiceggpoultry.xls⭐⭐
Code
org_poultry_data <- read_excel("_data/organiceggpoultry.xls")

head(org_poultry_data)
# A tibble: 6 × 11
  (Certified Organic den…¹ ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10
  <chr>                    <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> <chr>
1 <NA>                      <NA>  <NA>  <NA>  <NA> NA    <NA>  <NA>  <NA>  <NA> 
2 USDA Certified Organic …  <NA>  <NA>  <NA>  <NA> NA    USDA… <NA>  <NA>  <NA> 
3 Price per Carton Delive…  <NA>  <NA>  <NA>  <NA> NA    Pric… <NA>  <NA>  <NA> 
4 <NA>                     "Ext… "Ext… "Lar… "Lar… NA    Whole B/S … Bone… Whol…
5 Jan 2004                 "230" "132" "230" "126" NA    197.5 645.5 too … 193.5
6 February                 "230" "134… "226… "128… NA    197.5 642.5 too … 193.5
# ℹ abbreviated name:
#   ¹​`(Certified Organic denotes products grown and processed according to USDA's national organic standards and certified by USDA-accredited State and private certification organizations.)`
# ℹ 1 more variable: ...11 <chr>

Briefly describe the data

This data has various parts of the chicken (ranging from whole, to breast) with their month, year and price.

Tidy Data (as needed)

We needed to do a few things to tidy the data. The first was removing the first three rows, these all contained data that was not necessary.

The second was adding the column headers. Interestingly enough, these were actually in the fourth row, so we moved it up. In addition, some of the names had an extra newline character, so we removed those with gsub.

Finally, the “Extra Large 1/2 Dozen” has some weird formatting so we manually edited this column name.

Code
org_poultry_data <- org_poultry_data[-c(1,2,3),]

# Changing the column names
colnames(org_poultry_data) <- 
  gsub('\n','',org_poultry_data[1,])

colnames(org_poultry_data)[1] = 'Month'

org_poultry_data <- org_poultry_data[-c(1),]

# This one is weird
colnames(org_poultry_data)[3] <- "Extra Large 1/2 Dozen"

head(org_poultry_data)
# A tibble: 6 × 11
  Month    `Extra Large Dozen` `Extra Large 1/2 Dozen` `Large Dozen`
  <chr>    <chr>               <chr>                   <chr>        
1 Jan 2004 230                 132                     230          
2 February 230                 134.5                   226.25       
3 March    230                 137                     225          
4 April    234.5               137                     225          
5 May      236                 137                     225          
6 June     241                 137                     231.375      
# ℹ 7 more variables: `Large 1/2 Doz.` <chr>, `NA` <lgl>, Whole <chr>,
#   `B/S Breast` <chr>, `Bone-in Breast` <chr>, `Whole Legs` <chr>,
#   Thighs <chr>

Identify variables that need to be mutated

The first set of variables that need to be mutated are the dates. The issue here is that only January of every year has a year attached to it. So the issue is expanding the year to a new column so every month is paired with a year.

Secondly, in the “Bone-in Breast” column, there were some string values that were labeled “too few”, so we had to replace those with the value 0.

Finally, all the columns aside from the month, needed to be converted into ints from chrs, which we were able to do as seen below.

Code
# Add the year
org_poultry_data <- add_column(org_poultry_data, year = 0, .before=1)

year <- 2003
for (idx in 1:120) {
  if (idx %% 12 == 1) {
    year <- year + 1
    org_poultry_data[idx, 2] <- "January"
  }
  org_poultry_data[idx, 1] <- year
}

# Replacing non-numbers to 0
org_poultry_data[9] <- replace(org_poultry_data$`Bone-in Breast`, 
                            org_poultry_data$`Bone-in Breast`=='too few','0')

# Change chr to int

org_poultry_data$'Extra Large Dozen' <- as.numeric(org_poultry_data$'Extra Large Dozen')
org_poultry_data$'Extra Large 1/2 Dozen' <- as.numeric(org_poultry_data$'Extra Large 1/2 Dozen')
org_poultry_data$'Large 1/2 Doz.' <- as.numeric(org_poultry_data$'Large 1/2 Doz.')
org_poultry_data$'Whole' <- as.numeric(org_poultry_data$'Whole')
org_poultry_data$'B/S Breast' <- as.numeric(org_poultry_data$'B/S Breast')
org_poultry_data$'Large Dozen' <- as.numeric(org_poultry_data$'Large Dozen')
org_poultry_data$'Bone-in Breast' <- as.numeric(org_poultry_data$'Bone-in Breast')
org_poultry_data$'Whole Legs' <- as.numeric(org_poultry_data$'Whole Legs')
org_poultry_data$'Thighs' <- as.numeric(org_poultry_data$'Thighs')

# Removing column 7 (NA)
org_poultry_data <- org_poultry_data[-c(7)]


# Printing the final data
org_poultry_data
# A tibble: 120 × 11
    year Month     `Extra Large Dozen` `Extra Large 1/2 Dozen` `Large Dozen`
   <dbl> <chr>                   <dbl>                   <dbl>         <dbl>
 1  2004 January                  230                     132           230 
 2  2004 February                 230                     134.          226.
 3  2004 March                    230                     137           225 
 4  2004 April                    234.                    137           225 
 5  2004 May                      236                     137           225 
 6  2004 June                     241                     137           231.
 7  2004 July                     241                     137           234.
 8  2004 August                   241                     137           234.
 9  2004 September                241                     136.          234.
10  2004 October                  241                     136.          234.
# ℹ 110 more rows
# ℹ 6 more variables: `Large 1/2 Doz.` <dbl>, Whole <dbl>, `B/S Breast` <dbl>,
#   `Bone-in Breast` <dbl>, `Whole Legs` <dbl>, Thighs <dbl>