challenge_4
More data wrangling: mutate
Author

Cristhian Barba Garzon

Published

January 3, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • abc_poll.csv ⭐
  • poultry_tidy.csv⭐⭐
  • FedFundsRate.csv⭐⭐⭐
  • hotel_bookings.csv⭐⭐⭐⭐
  • debt_in_trillions ⭐⭐⭐⭐⭐

The data set that was chosen to be read was “poultry_tidy.csv”.

Code
poll = read_csv("_data/poultry_tidy.csv")
poll
# A tibble: 600 × 4
   Product  Year Month     Price_Dollar
   <chr>   <dbl> <chr>            <dbl>
 1 Whole    2013 January           2.38
 2 Whole    2013 February          2.38
 3 Whole    2013 March             2.38
 4 Whole    2013 April             2.38
 5 Whole    2013 May               2.38
 6 Whole    2013 June              2.38
 7 Whole    2013 July              2.38
 8 Whole    2013 August            2.38
 9 Whole    2013 September         2.38
10 Whole    2013 October           2.38
# … with 590 more rows
Code
poll %>%
  dim() # 600 rows by 4 columns
[1] 600   4
Code
poll %>%
  colnames()
[1] "Product"      "Year"         "Month"        "Price_Dollar"

Briefly describe the data

The data set “poultry_tidy.csv” contains 600 observations and 4 columns. It seems to describe the price of poultry products in each year and month. It can be used to analyze the price increase or decrease for specific products.

Tidying Data (as needed)

The data set “poultry_tidy.csv” is already tidy. All columns have their own variables stored within them, meaning their column title refers to these observations. I can further tidy this data by changing the arrangement of values; I can arrange the years in ascending order. After viewing the arranged data, we can see that the data is now in ascending order of years, allowing us to view the earliest year.

Code
arranged = poll %>% 
  arrange(Year) %>% #arranges data set by the year in ascending order
  select(Product:Price_Dollar) #selects all columns in data set 

arranged
# A tibble: 600 × 4
   Product  Year Month     Price_Dollar
   <chr>   <dbl> <chr>            <dbl>
 1 Whole    2004 January           1.98
 2 Whole    2004 February          1.98
 3 Whole    2004 March             2.09
 4 Whole    2004 April             2.12
 5 Whole    2004 May               2.14
 6 Whole    2004 June              2.16
 7 Whole    2004 July              2.17
 8 Whole    2004 August            2.17
 9 Whole    2004 September         2.17
10 Whole    2004 October           2.17
# … with 590 more rows

Identifying variables that need to be mutated

In this data set, the column titled “Month” can have its variables changed to numerical values; the months can be mutated to their respective numerical values in order to simplify the data set. Furthermore, we simplify the data set by joining the month and the year into one column. This will leave three columns, with the newly mutated data set having a “date” column in the format of mm/yyyy.

Code
mutated = poll %>%
  mutate(Month = case_when(
    Month == "January" ~ 1,
    Month == "February" ~ 2,
    Month == "March" ~ 3,
    Month == "April" ~ 4,
    Month == "May" ~ 5 ,
    Month == "June" ~ 6,
    Month == "July" ~ 7,
    Month == "August" ~ 8,
    Month == "September" ~ 9,
    Month == "October" ~ 10,
    Month == "November" ~ 11,
    Month == "December" ~ 12
  ))

arranged_mutate = mutated %>%
  arrange(Year) #arranged by year in ascending order; the values in the "Month" column are now numerical 
arranged_mutate
# A tibble: 600 × 4
   Product  Year Month Price_Dollar
   <chr>   <dbl> <dbl>        <dbl>
 1 Whole    2004     1         1.98
 2 Whole    2004     2         1.98
 3 Whole    2004     3         2.09
 4 Whole    2004     4         2.12
 5 Whole    2004     5         2.14
 6 Whole    2004     6         2.16
 7 Whole    2004     7         2.17
 8 Whole    2004     8         2.17
 9 Whole    2004     9         2.17
10 Whole    2004    10         2.17
# … with 590 more rows
Code
dates = c() # Create an empty vector to store the dates

for (i in 1:length(arranged_mutate$Year)) { #for loop -- in range 1 to total length of Year column
  date_string = sprintf("%02d/%04d", arranged_mutate$Month[i], arranged_mutate$Year[i]) # Format the year and month as a date string
  dates = c(dates, date_string)  # Append the date string to the dates vector
}


final_set = arranged_mutate %>%
   mutate(Date = dates) %>% #adds new column Date with the vector dates from for-loop
  subset(select = -c(Year:Month)) %>% #drops Year and Month columns so new date column will replace
  relocate(Date,.before = 2, .after = NULL) #moves the column to the left hand side before the previous 2nd
final_set
# A tibble: 600 × 3
   Product Date    Price_Dollar
   <chr>   <chr>          <dbl>
 1 Whole   01/2004         1.98
 2 Whole   02/2004         1.98
 3 Whole   03/2004         2.09
 4 Whole   04/2004         2.12
 5 Whole   05/2004         2.14
 6 Whole   06/2004         2.16
 7 Whole   07/2004         2.17
 8 Whole   08/2004         2.17
 9 Whole   09/2004         2.17
10 Whole   10/2004         2.17
# … with 590 more rows