Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Cristhian Barba Garzon
January 3, 2023
Today’s challenge is to:
Read in one (or more) of the following datasets, using the correct R package and command.
The data set that was chosen to be read was “poultry_tidy.csv”.
# A tibble: 600 × 4
Product Year Month Price_Dollar
<chr> <dbl> <chr> <dbl>
1 Whole 2013 January 2.38
2 Whole 2013 February 2.38
3 Whole 2013 March 2.38
4 Whole 2013 April 2.38
5 Whole 2013 May 2.38
6 Whole 2013 June 2.38
7 Whole 2013 July 2.38
8 Whole 2013 August 2.38
9 Whole 2013 September 2.38
10 Whole 2013 October 2.38
# … with 590 more rows
[1] 600 4
[1] "Product" "Year" "Month" "Price_Dollar"
The data set “poultry_tidy.csv” contains 600 observations and 4 columns. It seems to describe the price of poultry products in each year and month. It can be used to analyze the price increase or decrease for specific products.
The data set “poultry_tidy.csv” is already tidy. All columns have their own variables stored within them, meaning their column title refers to these observations. I can further tidy this data by changing the arrangement of values; I can arrange the years in ascending order. After viewing the arranged data, we can see that the data is now in ascending order of years, allowing us to view the earliest year.
# A tibble: 600 × 4
Product Year Month Price_Dollar
<chr> <dbl> <chr> <dbl>
1 Whole 2004 January 1.98
2 Whole 2004 February 1.98
3 Whole 2004 March 2.09
4 Whole 2004 April 2.12
5 Whole 2004 May 2.14
6 Whole 2004 June 2.16
7 Whole 2004 July 2.17
8 Whole 2004 August 2.17
9 Whole 2004 September 2.17
10 Whole 2004 October 2.17
# … with 590 more rows
In this data set, the column titled “Month” can have its variables changed to numerical values; the months can be mutated to their respective numerical values in order to simplify the data set. Furthermore, we simplify the data set by joining the month and the year into one column. This will leave three columns, with the newly mutated data set having a “date” column in the format of mm/yyyy.
mutated = poll %>%
mutate(Month = case_when(
Month == "January" ~ 1,
Month == "February" ~ 2,
Month == "March" ~ 3,
Month == "April" ~ 4,
Month == "May" ~ 5 ,
Month == "June" ~ 6,
Month == "July" ~ 7,
Month == "August" ~ 8,
Month == "September" ~ 9,
Month == "October" ~ 10,
Month == "November" ~ 11,
Month == "December" ~ 12
))
arranged_mutate = mutated %>%
arrange(Year) #arranged by year in ascending order; the values in the "Month" column are now numerical
arranged_mutate
# A tibble: 600 × 4
Product Year Month Price_Dollar
<chr> <dbl> <dbl> <dbl>
1 Whole 2004 1 1.98
2 Whole 2004 2 1.98
3 Whole 2004 3 2.09
4 Whole 2004 4 2.12
5 Whole 2004 5 2.14
6 Whole 2004 6 2.16
7 Whole 2004 7 2.17
8 Whole 2004 8 2.17
9 Whole 2004 9 2.17
10 Whole 2004 10 2.17
# … with 590 more rows
dates = c() # Create an empty vector to store the dates
for (i in 1:length(arranged_mutate$Year)) { #for loop -- in range 1 to total length of Year column
date_string = sprintf("%02d/%04d", arranged_mutate$Month[i], arranged_mutate$Year[i]) # Format the year and month as a date string
dates = c(dates, date_string) # Append the date string to the dates vector
}
final_set = arranged_mutate %>%
mutate(Date = dates) %>% #adds new column Date with the vector dates from for-loop
subset(select = -c(Year:Month)) %>% #drops Year and Month columns so new date column will replace
relocate(Date,.before = 2, .after = NULL) #moves the column to the left hand side before the previous 2nd
final_set
# A tibble: 600 × 3
Product Date Price_Dollar
<chr> <chr> <dbl>
1 Whole 01/2004 1.98
2 Whole 02/2004 1.98
3 Whole 03/2004 2.09
4 Whole 04/2004 2.12
5 Whole 05/2004 2.14
6 Whole 06/2004 2.16
7 Whole 07/2004 2.17
8 Whole 08/2004 2.17
9 Whole 09/2004 2.17
10 Whole 10/2004 2.17
# … with 590 more rows
---
title: "Challenge 4"
author: "Cristhian Barba Garzon"
description: "More data wrangling: mutate"
date: "01/03/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_4
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1) read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2) tidy data (as needed, including sanity checks)
3) identify variables that need to be mutated
4) mutate variables and sanity check all mutations
## Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- abc_poll.csv ⭐
- poultry_tidy.csv⭐⭐
- FedFundsRate.csv⭐⭐⭐
- hotel_bookings.csv⭐⭐⭐⭐
- debt_in_trillions ⭐⭐⭐⭐⭐
The data set that was chosen to be read was "poultry_tidy.csv".
```{r}
poll = read_csv("_data/poultry_tidy.csv")
poll
poll %>%
dim() # 600 rows by 4 columns
poll %>%
colnames()
```
### Briefly describe the data
The data set "poultry_tidy.csv" contains 600 observations and 4 columns. It seems to describe the price of poultry products in each year and month. It can be used to analyze the price increase or decrease for specific products.
## Tidying Data (as needed)
The data set "poultry_tidy.csv" is already tidy. All columns have their own variables stored within them, meaning their column title refers to these observations. I can further tidy this data by changing the arrangement of values; I can arrange the years in ascending order. After viewing the arranged data, we can see that the data is now in ascending order of years, allowing us to view the earliest year.
```{r}
arranged = poll %>%
arrange(Year) %>% #arranges data set by the year in ascending order
select(Product:Price_Dollar) #selects all columns in data set
arranged
```
## Identifying variables that need to be mutated
In this data set, the column titled "Month" can have its variables changed to numerical values; the months can be mutated to their respective numerical values in order to simplify the data set. Furthermore, we simplify the data set by joining the month and the year into one column. This will leave three columns, with the newly mutated data set having a "date" column in the format of mm/yyyy.
```{r}
mutated = poll %>%
mutate(Month = case_when(
Month == "January" ~ 1,
Month == "February" ~ 2,
Month == "March" ~ 3,
Month == "April" ~ 4,
Month == "May" ~ 5 ,
Month == "June" ~ 6,
Month == "July" ~ 7,
Month == "August" ~ 8,
Month == "September" ~ 9,
Month == "October" ~ 10,
Month == "November" ~ 11,
Month == "December" ~ 12
))
arranged_mutate = mutated %>%
arrange(Year) #arranged by year in ascending order; the values in the "Month" column are now numerical
arranged_mutate
dates = c() # Create an empty vector to store the dates
for (i in 1:length(arranged_mutate$Year)) { #for loop -- in range 1 to total length of Year column
date_string = sprintf("%02d/%04d", arranged_mutate$Month[i], arranged_mutate$Year[i]) # Format the year and month as a date string
dates = c(dates, date_string) # Append the date string to the dates vector
}
final_set = arranged_mutate %>%
mutate(Date = dates) %>% #adds new column Date with the vector dates from for-loop
subset(select = -c(Year:Month)) %>% #drops Year and Month columns so new date column will replace
relocate(Date,.before = 2, .after = NULL) #moves the column to the left hand side before the previous 2nd
final_set
```