Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Harsha Kanaka Eswar Gudipudi
May 15, 2023
Today’s challenge is to:
pivot_longer
Read in one (or more) of the following datasets, using the correct R package and command.
# A tibble: 6 × 6
month year large_half_dozen large_dozen extra_large_half_dozen
<chr> <dbl> <dbl> <dbl> <dbl>
1 January 2004 126 230 132
2 February 2004 128. 226. 134.
3 March 2004 131 225 137
4 April 2004 131 225 137
5 May 2004 131 225 137
6 June 2004 134. 231. 137
# ℹ 1 more variable: extra_large_dozen <dbl>
The summary below is showing the data for the month, year, and prices of different types of egg cartons over a period of 10 years from 2004 to 2013. The data is summarized based on the different egg carton sizes, which include large_half_dozen, large_dozen, extra_large_half_dozen, and extra_large_dozen.The data shows that the average price for the large_half_dozen egg cartons is $155.2, while the average price for the large_dozen egg cartons is $254.2. The average price for extra_large_half_dozen egg cartons is $164.2, and the average price for extra_large_dozen egg cartons is $266.8.
month year large_half_dozen large_dozen
Length:120 Min. :2004 Min. :126.0 Min. :225.0
Class :character 1st Qu.:2006 1st Qu.:129.4 1st Qu.:233.5
Mode :character Median :2008 Median :174.5 Median :267.5
Mean :2008 Mean :155.2 Mean :254.2
3rd Qu.:2011 3rd Qu.:174.5 3rd Qu.:268.0
Max. :2013 Max. :178.0 Max. :277.5
extra_large_half_dozen extra_large_dozen
Min. :132.0 Min. :230.0
1st Qu.:135.8 1st Qu.:241.5
Median :185.5 Median :285.5
Mean :164.2 Mean :266.8
3rd Qu.:185.5 3rd Qu.:285.5
Max. :188.1 Max. :290.0
[1] 120 6
no of rows in df = 120
Here the four columns “large_half_dozen, large_dozen, extra_large_half_dozen, extra_large_dozen” which can be made into two columns based on the size and quantity. And the corresponding values for the eggs should be displayed in the new column as price.
Here we are piovting 4 columns and the no of rows of df is 120 before pivoting. So the final no of rows would be equal to nrows(df)4 = 1204 = 480 and the no of columns after pivoting = 3+2 = 5. therefore the final dimension is 480*5.
Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.
# A tibble: 480 × 5
month year size quantity price
<chr> <dbl> <chr> <chr> <dbl>
1 January 2004 large_half dozen 126
2 January 2004 large dozen 230
3 January 2004 extra_large_half dozen 132
4 January 2004 extra_large dozen 230
5 February 2004 large_half dozen 128.
6 February 2004 large dozen 226.
7 February 2004 extra_large_half dozen 134.
8 February 2004 extra_large dozen 230
9 March 2004 large_half dozen 131
10 March 2004 large dozen 225
# ℹ 470 more rows
[1] "size: "
[1] 480 5
Yes, once it is pivoted long, our resulting data are \(480x5\) - exactly what we expected!
Document your work here. What will a new “case” be once you have pivoted the data? How does it meet requirements for tidy data?
Once pivoted, each row in the new data frame represents a unique combination of month, year, size, quantity and price. The value column contains the corresponding value for each combination. This meets the requirements for tidy data because each row represents a single observation and each variable is in a separate column. Additionally, the column names are informative and easy to understand.
Any additional comments?
---
title: "Challenge 3"
author: "Harsha Kanaka Eswar Gudipudi"
description: "Tidy Data: Pivoting"
date: "05/15/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_3
- Harsha Kanaka Eswar Gudipudi
- egg_tidy.csv
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2. identify what needs to be done to tidy the current data
3. anticipate the shape of pivoted data
4. pivot the data into tidy format using `pivot_longer`
## Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- animal_weights.csv ⭐
- eggs_tidy.csv ⭐⭐ or organiceggpoultry.xls ⭐⭐⭐
- australian_marriage\*.xls ⭐⭐⭐
- USA Households\*.xlsx ⭐⭐⭐⭐
- sce_labor_chart_data_public.xlsx 🌟🌟🌟🌟🌟
```{r}
library(readr)
#Read eggs_tidy csv data
df <- read_csv("_data/eggs_tidy.csv",show_col_types = FALSE)
head(df)
```
### Briefly describe the data
The summary below is showing the data for the month, year, and prices of different types of egg cartons over a period of 10 years from 2004 to 2013. The data is summarized based on the different egg carton sizes, which include large_half_dozen, large_dozen, extra_large_half_dozen, and extra_large_dozen.The data shows that the average price for the large_half_dozen egg cartons is \$155.2, while the average price for the large_dozen egg cartons is \$254.2. The average price for extra_large_half_dozen egg cartons is \$164.2, and the average price for extra_large_dozen egg cartons is \$266.8.
```{r}
#| label: summary
summary(df)
dim(df)
cat(paste("no of rows in df = ", nrow(df)))
```
## Anticipate the End Result
Here the four columns "large_half_dozen, large_dozen, extra_large_half_dozen, extra_large_dozen" which can be made into two columns based on the size and quantity. And the corresponding values for the eggs should be displayed in the new column as price.
### Challenge: Describe the final dimensions
Here we are piovting 4 columns and the no of rows of df is 120 before pivoting. So the final no of rows would be equal to nrows(df)*4 = 120*4 = 480 and the no of columns after pivoting = 3+2 = 5. therefore the final dimension is 480*5.
## Pivot the Data
Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a "sanity" check.
### Example
```{r}
df_long <- df %>%
pivot_longer(
cols = -c(month, year),
names_to = c("size", "quantity"),
names_pattern = "(.+)_(.+)",
values_to = "price"
)
# View the pivoted data frame
df_long
print("size: ")
print(dim(df_long))
```
Yes, once it is pivoted long, our resulting data are $480x5$ - exactly what we expected!
### Challenge: Pivot the Chosen Data
Document your work here. What will a new "case" be once you have pivoted the data? How does it meet requirements for tidy data?
Once pivoted, each row in the new data frame represents a unique combination of month, year, size, quantity and price. The value column contains the corresponding value for each combination. This meets the requirements for tidy data because each row represents a single observation and each variable is in a separate column. Additionally, the column names are informative and easy to understand.
```{r}
```
Any additional comments?