Code
library(tidyverse)
library(readxl)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Matt Zambetti
May 30, 2023
Today’s challenge is to:
The Data Set I chose to read in was the organic poultry data:
# A tibble: 6 × 11
(Certified Organic den…¹ ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10
<chr> <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> <chr>
1 <NA> <NA> <NA> <NA> <NA> NA <NA> <NA> <NA> <NA>
2 USDA Certified Organic … <NA> <NA> <NA> <NA> NA USDA… <NA> <NA> <NA>
3 Price per Carton Delive… <NA> <NA> <NA> <NA> NA Pric… <NA> <NA> <NA>
4 <NA> "Ext… "Ext… "Lar… "Lar… NA Whole B/S … Bone… Whol…
5 Jan 2004 "230" "132" "230" "126" NA 197.5 645.5 too … 193.5
6 February "230" "134… "226… "128… NA 197.5 642.5 too … 193.5
# ℹ abbreviated name:
# ¹`(Certified Organic denotes products grown and processed according to USDA's national organic standards and certified by USDA-accredited State and private certification organizations.)`
# ℹ 1 more variable: ...11 <chr>
This data has various parts of the chicken (ranging from whole, to breast) with their month, year and price.
We needed to do a few things to tidy the data. The first was removing the first three rows, these all contained data that was not necessary.
The second was adding the column headers. Interestingly enough, these were actually in the fourth row, so we moved it up. In addition, some of the names had an extra newline character, so we removed those with gsub.
Finally, the “Extra Large 1/2 Dozen” has some weird formatting so we manually edited this column name.
org_poultry_data <- org_poultry_data[-c(1,2,3),]
# Changing the column names
colnames(org_poultry_data) <-
gsub('\n','',org_poultry_data[1,])
colnames(org_poultry_data)[1] = 'Month'
org_poultry_data <- org_poultry_data[-c(1),]
# This one is weird
colnames(org_poultry_data)[3] <- "Extra Large 1/2 Dozen"
head(org_poultry_data)
# A tibble: 6 × 11
Month `Extra Large Dozen` `Extra Large 1/2 Dozen` `Large Dozen`
<chr> <chr> <chr> <chr>
1 Jan 2004 230 132 230
2 February 230 134.5 226.25
3 March 230 137 225
4 April 234.5 137 225
5 May 236 137 225
6 June 241 137 231.375
# ℹ 7 more variables: `Large 1/2 Doz.` <chr>, `NA` <lgl>, Whole <chr>,
# `B/S Breast` <chr>, `Bone-in Breast` <chr>, `Whole Legs` <chr>,
# Thighs <chr>
The first set of variables that need to be mutated are the dates. The issue here is that only January of every year has a year attached to it. So the issue is expanding the year to a new column so every month is paired with a year.
Secondly, in the “Bone-in Breast” column, there were some string values that were labeled “too few”, so we had to replace those with the value 0.
Finally, all the columns aside from the month, needed to be converted into ints from chrs, which we were able to do as seen below.
# Add the year
org_poultry_data <- add_column(org_poultry_data, year = 0, .before=1)
year <- 2003
for (idx in 1:120) {
if (idx %% 12 == 1) {
year <- year + 1
org_poultry_data[idx, 2] <- "January"
}
org_poultry_data[idx, 1] <- year
}
# Replacing non-numbers to 0
org_poultry_data[9] <- replace(org_poultry_data$`Bone-in Breast`,
org_poultry_data$`Bone-in Breast`=='too few','0')
# Change chr to int
org_poultry_data$'Extra Large Dozen' <- as.numeric(org_poultry_data$'Extra Large Dozen')
org_poultry_data$'Extra Large 1/2 Dozen' <- as.numeric(org_poultry_data$'Extra Large 1/2 Dozen')
org_poultry_data$'Large 1/2 Doz.' <- as.numeric(org_poultry_data$'Large 1/2 Doz.')
org_poultry_data$'Whole' <- as.numeric(org_poultry_data$'Whole')
org_poultry_data$'B/S Breast' <- as.numeric(org_poultry_data$'B/S Breast')
org_poultry_data$'Large Dozen' <- as.numeric(org_poultry_data$'Large Dozen')
org_poultry_data$'Bone-in Breast' <- as.numeric(org_poultry_data$'Bone-in Breast')
org_poultry_data$'Whole Legs' <- as.numeric(org_poultry_data$'Whole Legs')
org_poultry_data$'Thighs' <- as.numeric(org_poultry_data$'Thighs')
# Removing column 7 (NA)
org_poultry_data <- org_poultry_data[-c(7)]
# Printing the final data
org_poultry_data
# A tibble: 120 × 11
year Month `Extra Large Dozen` `Extra Large 1/2 Dozen` `Large Dozen`
<dbl> <chr> <dbl> <dbl> <dbl>
1 2004 January 230 132 230
2 2004 February 230 134. 226.
3 2004 March 230 137 225
4 2004 April 234. 137 225
5 2004 May 236 137 225
6 2004 June 241 137 231.
7 2004 July 241 137 234.
8 2004 August 241 137 234.
9 2004 September 241 136. 234.
10 2004 October 241 136. 234.
# ℹ 110 more rows
# ℹ 6 more variables: `Large 1/2 Doz.` <dbl>, Whole <dbl>, `B/S Breast` <dbl>,
# `Bone-in Breast` <dbl>, `Whole Legs` <dbl>, Thighs <dbl>
---
title: "Challenge 4 Submission"
author: "Matt Zambetti"
description: "More data wrangling: pivoting"
date: "5/30/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_4
- abc_poll
- eggs
- fed_rates
- hotel_bookings
- debt
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(readxl)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1) read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2) tidy data (as needed, including sanity checks)
3) identify variables that need to be mutated
4) mutate variables and sanity check all mutations
## Read in data
The Data Set I chose to read in was the organic poultry data:
- organiceggpoultry.xls⭐⭐
```{r}
org_poultry_data <- read_excel("_data/organiceggpoultry.xls")
head(org_poultry_data)
```
### Briefly describe the data
This data has various parts of the chicken (ranging from whole, to breast) with their month, year and price.
## Tidy Data (as needed)
We needed to do a few things to tidy the data. The first was removing the first three rows, these all contained data that was not necessary.
The second was adding the column headers. Interestingly enough, these were actually in the fourth row, so we moved it up. In addition, some of the names had an extra newline character, so we removed those with gsub.
Finally, the "Extra Large 1/2 Dozen" has some weird formatting so we manually edited this column name.
```{r}
org_poultry_data <- org_poultry_data[-c(1,2,3),]
# Changing the column names
colnames(org_poultry_data) <-
gsub('\n','',org_poultry_data[1,])
colnames(org_poultry_data)[1] = 'Month'
org_poultry_data <- org_poultry_data[-c(1),]
# This one is weird
colnames(org_poultry_data)[3] <- "Extra Large 1/2 Dozen"
head(org_poultry_data)
```
## Identify variables that need to be mutated
The first set of variables that need to be mutated are the dates. The issue here is that only January of every year has a year attached to it. So the issue is expanding the year to a new column so every month is paired with a year.
Secondly, in the "Bone-in Breast" column, there were some string values that were labeled "too few", so we had to replace those with the value 0.
Finally, all the columns aside from the month, needed to be converted into ints from chrs, which we were able to do as seen below.
```{r}
# Add the year
org_poultry_data <- add_column(org_poultry_data, year = 0, .before=1)
year <- 2003
for (idx in 1:120) {
if (idx %% 12 == 1) {
year <- year + 1
org_poultry_data[idx, 2] <- "January"
}
org_poultry_data[idx, 1] <- year
}
# Replacing non-numbers to 0
org_poultry_data[9] <- replace(org_poultry_data$`Bone-in Breast`,
org_poultry_data$`Bone-in Breast`=='too few','0')
# Change chr to int
org_poultry_data$'Extra Large Dozen' <- as.numeric(org_poultry_data$'Extra Large Dozen')
org_poultry_data$'Extra Large 1/2 Dozen' <- as.numeric(org_poultry_data$'Extra Large 1/2 Dozen')
org_poultry_data$'Large 1/2 Doz.' <- as.numeric(org_poultry_data$'Large 1/2 Doz.')
org_poultry_data$'Whole' <- as.numeric(org_poultry_data$'Whole')
org_poultry_data$'B/S Breast' <- as.numeric(org_poultry_data$'B/S Breast')
org_poultry_data$'Large Dozen' <- as.numeric(org_poultry_data$'Large Dozen')
org_poultry_data$'Bone-in Breast' <- as.numeric(org_poultry_data$'Bone-in Breast')
org_poultry_data$'Whole Legs' <- as.numeric(org_poultry_data$'Whole Legs')
org_poultry_data$'Thighs' <- as.numeric(org_poultry_data$'Thighs')
# Removing column 7 (NA)
org_poultry_data <- org_poultry_data[-c(7)]
# Printing the final data
org_poultry_data
```