Code
library(lubridate)
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Jaswanth Reddy Kommuru
May 9, 2023
Today’s challenge is to:
Read in one (or more) of the following datasets, using the correct R package and command.
# A tibble: 6 × 4
Product Year Month Price_Dollar
<chr> <dbl> <chr> <dbl>
1 Whole 2013 January 2.38
2 Whole 2013 February 2.38
3 Whole 2013 March 2.38
4 Whole 2013 April 2.38
5 Whole 2013 May 2.38
6 Whole 2013 June 2.38
tibble [600 × 4] (S3: tbl_df/tbl/data.frame)
$ Product : chr [1:600] "Whole" "Whole" "Whole" "Whole" ...
$ Year : num [1:600] 2013 2013 2013 2013 2013 ...
$ Month : chr [1:600] "January" "February" "March" "April" ...
$ Price_Dollar: num [1:600] 2.38 2.38 2.38 2.38 2.38 ...
# A tibble: 5 × 1
Product
<chr>
1 Whole
2 B/S Breast
3 Bone-in Breast
4 Whole Legs
5 Thighs
# A tibble: 10 × 1
Year
<dbl>
1 2013
2 2012
3 2011
4 2010
5 2009
6 2008
7 2007
8 2006
9 2005
10 2004
Product Year Month Price_Dollar
Length:600 Min. :2004 Length:600 Min. :1.935
Class :character 1st Qu.:2006 Class :character 1st Qu.:2.150
Mode :character Median :2008 Mode :character Median :2.350
Mean :2008 Mean :3.390
3rd Qu.:2011 3rd Qu.:3.905
Max. :2013 Max. :7.037
NA's :7
The dataset comprises 4 columns and 600 rows spanning from 2004 to 2013. It includes various poultry products such as Whole, B/S Breast, Bone-in Breast, Whole Legs, and Thighs. The price range for these products extends from a minimum of 1.935 to 7.037. Upon examining the summary, it becomes evident that certain price values are missing, specifically those associated with product Bone-in Breast and Thighs. Upon further investigation, it was observed that a majority of the missing values are concentrated within these two product categories.
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
The provided dataset is already in a tidy format. While there is the possibility of removing the rows with missing data, but still I will preserve these rows. Despite the presence of missing values, these rows still contain valuable information in other columns, making them important for a comprehensive analysis.
Any additional comments?
Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Document your work here.
Making two modifications to the dataset would enhance the analysis. Firstly, since the prices are currently discrete values, it would be advantageous to introduce price ranges, allowing for categorization into high, low, and medium price categories. This would provide a more meaningful representation of the data. Secondly, I suggest combining the variables “year” and “month” into a single column to facilitate direct analysis using the merged time information. This would streamline the analysis process and enable more comprehensive insights.
# A tibble: 600 × 6
Product Year Month Price_Dollar Price_Range Date
<chr> <dbl> <chr> <dbl> <chr> <date>
1 Whole 2013 January 2.38 low 2013-01-01
2 Whole 2013 February 2.38 low 2013-02-01
3 Whole 2013 March 2.38 low 2013-03-01
4 Whole 2013 April 2.38 low 2013-04-01
5 Whole 2013 May 2.38 low 2013-05-01
6 Whole 2013 June 2.38 low 2013-06-01
7 Whole 2013 July 2.38 low 2013-07-01
8 Whole 2013 August 2.38 low 2013-08-01
9 Whole 2013 September 2.38 low 2013-09-01
10 Whole 2013 October 2.38 low 2013-10-01
# ℹ 590 more rows
# A tibble: 4 × 1
Price_Range
<chr>
1 low
2 high
3 medium
4 <NA>
Any additional comments?
---
title: "Challenge 4"
author: "Jaswanth Reddy Kommuru"
description: "More data wrangling: pivoting"
date: "05/09/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_4
- Jaswanth Reddy Kommuru
- poultry_tidy
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(lubridate)
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1) read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2) tidy data (as needed, including sanity checks)
3) identify variables that need to be mutated
4) mutate variables and sanity check all mutations
## Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- abc_poll.csv ⭐
- poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
- FedFundsRate.csv⭐⭐⭐
- hotel_bookings.csv⭐⭐⭐⭐
- debt_in_trillions.xlsx ⭐⭐⭐⭐⭐
```{r}
poultrydata <-readxl::read_excel("~/Documents/601/601_Spring_2023/posts/_data/poultry_tidy.xlsx")
```
```{r}
head(poultrydata)
```
```{r}
dim(poultrydata)
```
```{r}
str(poultrydata)
```
```{r}
poultrydata%>%
select("Product") %>%
distinct(.)
```
```{r}
poultrydata%>%
select("Year") %>%
distinct(.)
```
```{r}
colnames(poultrydata)
```
```{r}
nrow(poultrydata)
```
```{r}
summary(poultrydata)
```
### Briefly describe the data
The dataset comprises 4 columns and 600 rows spanning from 2004 to 2013. It includes various poultry products such as Whole, B/S Breast, Bone-in Breast, Whole Legs, and Thighs. The price range for these products extends from a minimum of 1.935 to 7.037. Upon examining the summary, it becomes evident that certain price values are missing, specifically those associated with product Bone-in Breast and Thighs. Upon further investigation, it was observed that a majority of the missing values are concentrated within these two product categories.
## Tidy Data (as needed)
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
The provided dataset is already in a tidy format. While there is the possibility of removing the rows with missing data, but still I will preserve these rows. Despite the presence of missing values, these rows still contain valuable information in other columns, making them important for a comprehensive analysis.
Any additional comments?
## Identify variables that need to be mutated
Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Document your work here.
Making two modifications to the dataset would enhance the analysis. Firstly, since the prices are currently discrete values, it would be advantageous to introduce price ranges, allowing for categorization into high, low, and medium price categories. This would provide a more meaningful representation of the data. Secondly, I suggest combining the variables "year" and "month" into a single column to facilitate direct analysis using the merged time information. This would streamline the analysis process and enable more comprehensive insights.
```{r}
poultrydata_mutated <- poultrydata %>%
mutate(Price_Range = ifelse(Price_Dollar < 3, 'low', ifelse(Price_Dollar < 5, 'medium', 'high')))
poultrydata_mutated <- poultrydata_mutated %>%
mutate(Date = as.Date(paste(Year, Month, '01'), format = '%Y %B %d'))
poultrydata_mutated
```
```{r}
poultrydata_mutated%>%
select(`Price_Range`) %>%
n_distinct(.)
```
```{r}
poultrydata_mutated%>%
select(`Price_Range`) %>%
distinct(.)
```
Any additional comments?