Code
library(tidyverse)
library(lubridate)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Prachiti Parkar
August 18, 2022
Today’s challenge is to:
Read in one (or more) of the following datasets, using the correct R package and command.
# A tibble: 600 × 4
Product Year Month Price_Dollar
<chr> <dbl> <chr> <dbl>
1 Whole 2013 January 2.38
2 Whole 2013 February 2.38
3 Whole 2013 March 2.38
4 Whole 2013 April 2.38
5 Whole 2013 May 2.38
6 Whole 2013 June 2.38
7 Whole 2013 July 2.38
8 Whole 2013 August 2.38
9 Whole 2013 September 2.38
10 Whole 2013 October 2.38
# … with 590 more rows
Product Year Month Price_Dollar
Length:600 Min. :2004 Length:600 Min. :1.935
Class :character 1st Qu.:2006 Class :character 1st Qu.:2.150
Mode :character Median :2008 Mode :character Median :2.350
Mean :2008 Mean :3.390
3rd Qu.:2011 3rd Qu.:3.905
Max. :2013 Max. :7.037
NA's :7
# A tibble: 7 × 4
Product Year Month Price_Dollar
<chr> <dbl> <chr> <dbl>
1 Bone-in Breast 2004 January NA
2 Bone-in Breast 2004 February NA
3 Bone-in Breast 2004 March NA
4 Bone-in Breast 2004 April NA
5 Bone-in Breast 2004 May NA
6 Bone-in Breast 2004 June NA
7 Thighs 2004 January NA
[1] "Whole" "B/S Breast" "Bone-in Breast" "Whole Legs"
[5] "Thighs"
The data consists of 4 columns and 600 rows from the year 2004 to 2013. Specifically it contains the products, Whole, B/S Breast, Bone-in Breast, Whole legs and Thighs. The price ranges from minimum 1.935 to 7.037. We can see from summary that some of the values of prices are missing, particularly 7. I tried to dive deeper and understood that most of these are of product Bone-in breast and thighs.
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
Data is already tidy. We can choose to remove NA data but I believe since it has other column information, those rows are important too.
Any additional comments?
Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Document your work here.
According to me, the year and months need to be joined so that we can sort them and also, we need to understand if price is low, medium or high instead of checking the exact price, we can just look at the variable.
[1] "low" "high" "med" NA
# A tibble: 600 × 6
Product Year Month Price_Dollar Date Price_type
<chr> <dbl> <chr> <dbl> <date> <chr>
1 Whole 2013 January 2.38 2013-01-01 low
2 Whole 2013 February 2.38 2013-02-01 low
3 Whole 2013 March 2.38 2013-03-01 low
4 Whole 2013 April 2.38 2013-04-01 low
5 Whole 2013 May 2.38 2013-05-01 low
6 Whole 2013 June 2.38 2013-06-01 low
7 Whole 2013 July 2.38 2013-07-01 low
8 Whole 2013 August 2.38 2013-08-01 low
9 Whole 2013 September 2.38 2013-09-01 low
10 Whole 2013 October 2.38 2013-10-01 low
# … with 590 more rows
Any additional comments?
---
title: "Challenge 4"
author: "Prachiti Parkar"
description: "More data wrangling: pivoting"
date: "08/18/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_4
- abc_poll
- eggs
- fed_rates
- hotel_bookings
- debt
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(lubridate)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1) read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2) tidy data (as needed, including sanity checks)
3) identify variables that need to be mutated
4) mutate variables and sanity check all mutations
## Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- abc_poll.csv ⭐
- poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
- FedFundsRate.csv⭐⭐⭐
- hotel_bookings.csv⭐⭐⭐⭐
- debt_in_trillions.xlsx ⭐⭐⭐⭐⭐
```{r}
poultry_data <-readxl::read_excel("_data/poultry_tidy.xlsx")
poultry_data
summary(poultry_data)
missing_data <- filter(poultry_data, is.na(Price_Dollar))
#Checking if there is any missing data
missing_data
unique(poultry_data$Product)
```
### Briefly describe the data
The data consists of 4 columns and 600 rows from the year 2004 to 2013. Specifically it contains the products, Whole, B/S Breast, Bone-in Breast, Whole legs and Thighs. The price ranges from minimum 1.935 to 7.037. We can see from summary that some of the values of prices are missing, particularly 7. I tried to dive deeper and understood that most of these are of product Bone-in breast and thighs.
## Tidy Data (as needed)
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
Data is already tidy. We can choose to remove NA data but I believe since it has other column information, those rows are important too.
```{r}
```
Any additional comments?
## Identify variables that need to be mutated
Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Document your work here.
According to me, the year and months need to be joined so that we can sort them and also, we need to understand if price is low, medium or high instead of checking the exact price, we can just look at the variable.
```{r}
poultry_data <- poultry_data %>%
mutate(Date = ym(paste(Year, Month)))
poultry_data <- poultry_data %>%
mutate(Price_type = case_when(Price_Dollar < 3 ~ 'low',
Price_Dollar < 5 ~ 'med',
Price_Dollar > 5 ~ 'high'))
unique(poultry_data$Price_type)
poultry_data
```
Any additional comments?