Code
library(tidyverse)
library(dplyr)
library(descr)
library(readr)
library(readxl)
library(summarytools)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Aritra Basu
March 20, 2023
I am working on the first sheet of organiceggpoultry. The first task is to clean the data. I have removed a column of NAs, and then used mutate, separate and fill to separate the month and years. I have then used select to consider the variables that are associated with eggs and not with chickens.
egg_poultry <- read_excel("_data/organiceggpoultry.xls", skip=4)
subset <- subset( egg_poultry, select = -c(6) )
df1 <- subset %>% separate(...1, sep = " ", into=c('month', 'year'), remove = FALSE)
df2 <- mutate_all(df1, funs(replace(., .=='/1', NA)))
df3 <- mutate_all(df2, funs(replace(., .=='Jan', 'January')))
df4 <- df3 %>% fill(year)
str(df4)
tibble [120 × 12] (S3: tbl_df/tbl/data.frame)
$ ...1 : chr [1:120] "Jan 2004" "February" "March" "April" ...
$ month : chr [1:120] "January" "February" "March" "April" ...
$ year : chr [1:120] "2004" "2004" "2004" "2004" ...
$ Extra Large
Dozen : chr [1:120] "230" "230" "230" "234.5" ...
$ Extra Large 1/2 Doz.
1/2 Dozen: chr [1:120] "132" "134.5" "137" "137" ...
$ Large
Dozen : chr [1:120] "230" "226.25" "225" "225" ...
$ Large
1/2 Doz. : chr [1:120] "126" "128.5" "131" "131" ...
$ Whole : chr [1:120] "197.5" "197.5" "209" "212" ...
$ B/S Breast : chr [1:120] "645.5" "642.5" "642.5" "642.5" ...
$ Bone-in Breast : chr [1:120] "too few" "too few" "too few" "too few" ...
$ Whole Legs : chr [1:120] "193.5" "193.5" "193.5" "193.5" ...
$ Thighs : chr [1:120] "too few" "203" "203" "203" ...
Now, I change some of the variables from character to numeric.
df5<-select(df4, "month", "year", "Extra Large
Dozen", "Extra Large 1/2 Doz.
1/2 Dozen", "Large
Dozen", "Large
1/2 Doz.")
colnames(df5) <- c('month','year','Extra Large Dozen', 'Extra Large Half Dozen', 'Large Dozen', 'Large Half Dozen')
df6<-df5 %>%
mutate_at('year', as.numeric)%>%
mutate_at('Extra Large Dozen', as.numeric)%>%
mutate_at('Extra Large Half Dozen', as.numeric)%>%
mutate_at('Large Dozen', as.numeric)%>%
mutate_at('Large Half Dozen', as.numeric)
str(df6)
tibble [120 × 6] (S3: tbl_df/tbl/data.frame)
$ month : chr [1:120] "January" "February" "March" "April" ...
$ year : num [1:120] 2004 2004 2004 2004 2004 ...
$ Extra Large Dozen : num [1:120] 230 230 230 234 236 ...
$ Extra Large Half Dozen: num [1:120] 132 134 137 137 137 ...
$ Large Dozen : num [1:120] 230 226 225 225 225 ...
$ Large Half Dozen : num [1:120] 126 128 131 131 131 ...
Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
month [character] |
|
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
year [numeric] |
|
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Extra Large Dozen [numeric] |
|
11 distinct values | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Extra Large Half Dozen [numeric] |
|
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Large Dozen [numeric] |
|
12 distinct values | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Large Half Dozen [numeric] |
|
|
0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.2.2)
2023-04-02
It can be seen that the dimension is 120 x 6.
The observations are in a month year format. The values are prices per cartons for different types of packaging for a particular month between 2004 and 2013 (inclusive). In the original file, the years, as well as the values were stored as characters, and I have converted them to numeric.
The way the data is presented is fairly intuitive. But this could become cumbersome if we had a few more columns. However, the data is already tidy
Describe the data, and be sure to comment on why you are planning to pivot it to make it “tidy”
Documenting my work here.
The new case again is a month year unit. However, the data is now in a long format instead of wide. It satisfies the requirement of tidy data.
# A tibble: 480 × 4
month year `Type of packaging` `Price of Cartons`
<chr> <dbl> <chr> <dbl>
1 January 2004 Extra Large Dozen 230
2 January 2004 Extra Large Half Dozen 132
3 January 2004 Large Dozen 230
4 January 2004 Large Half Dozen 126
5 February 2004 Extra Large Dozen 230
6 February 2004 Extra Large Half Dozen 134.
7 February 2004 Large Dozen 226.
8 February 2004 Large Half Dozen 128.
9 March 2004 Extra Large Dozen 230
10 March 2004 Extra Large Half Dozen 137
# … with 470 more rows
---
title: "Challenge 3"
author: "Aritra Basu"
description: "Tidy Data: Pivoting"
date: "03/20/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_3
- Aritra Basu
- Eggs
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(dplyr)
library(descr)
library(readr)
library(readxl)
library(summarytools)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Reading in data, and cleaning it.
I am working on the first sheet of organiceggpoultry. The first task is to clean the data. I have removed a column of NAs, and then used mutate, separate and fill to separate the month and years. I have then used select to consider the variables that are associated with eggs and not with chickens.
```{r}
egg_poultry <- read_excel("_data/organiceggpoultry.xls", skip=4)
subset <- subset( egg_poultry, select = -c(6) )
df1 <- subset %>% separate(...1, sep = " ", into=c('month', 'year'), remove = FALSE)
df2 <- mutate_all(df1, funs(replace(., .=='/1', NA)))
df3 <- mutate_all(df2, funs(replace(., .=='Jan', 'January')))
df4 <- df3 %>% fill(year)
str(df4)
```
Now, I change some of the variables from character to numeric.
```{r}
df5<-select(df4, "month", "year", "Extra Large
Dozen", "Extra Large 1/2 Doz.
1/2 Dozen", "Large
Dozen", "Large
1/2 Doz.")
colnames(df5) <- c('month','year','Extra Large Dozen', 'Extra Large Half Dozen', 'Large Dozen', 'Large Half Dozen')
df6<-df5 %>%
mutate_at('year', as.numeric)%>%
mutate_at('Extra Large Dozen', as.numeric)%>%
mutate_at('Extra Large Half Dozen', as.numeric)%>%
mutate_at('Large Dozen', as.numeric)%>%
mutate_at('Large Half Dozen', as.numeric)
str(df6)
clean_eggs<-df6
```
### Describing the data
```{r}
print(summarytools::dfSummary(clean_eggs,
varnumbers = FALSE,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.70,
valid.col = FALSE),
method = 'render',
table.classes = 'table-condensed')
```
It can be seen that the dimension is 120 x 6.
The observations are in a month year format. The values are prices per cartons for different types of packaging for a particular month between 2004 and 2013 (inclusive). In the original file, the years, as well as the values were stored as characters, and I have converted them to numeric.
The way the data is presented is fairly intuitive. But this could become cumbersome if we had a few more columns. However, the data is already tidy
Describe the data, and be sure to comment on why you are planning to pivot it to make it "tidy"
## Anticipating the end results
### Challenge: Describing the final dimensions
Documenting my work here.
```{r}
#existing rows/cases
nrow(clean_eggs)
#existing columns/cases
ncol(clean_eggs)
#expected rows/cases
nrow(clean_eggs) * (ncol(clean_eggs)-4)
# expected columns
4
```
## Pivoting the Data
### Challenge: Pivoting the Chosen Data
The new case again is a month year unit. However, the data is now in a long format instead of wide. It satisfies the requirement of tidy data.
```{r}
clean_eggs<-pivot_longer(clean_eggs, col = c("Extra Large Dozen", "Extra Large Half Dozen", "Large Dozen", "Large Half Dozen"),
names_to="Type of packaging",
values_to = "Price of Cartons")
clean_eggs
```