challenge_4
poultry_tidy
poobigan murugesan
More data wrangling: pivoting
Author

Poobigan Murugesan

Published

May 10, 2023

Code
library(tidyverse)
library(dplyr)
library(readxl)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • abc_poll.csv ⭐
  • poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
  • FedFundsRate.csv⭐⭐⭐
  • hotel_bookings.csv⭐⭐⭐⭐
  • debt_in_trillions.xlsx ⭐⭐⭐⭐⭐

Reading in poultry_tidy dataset

Code
df <- read_excel("_data/poultry_tidy.xlsx")
df
# A tibble: 600 × 4
   Product  Year Month     Price_Dollar
   <chr>   <dbl> <chr>            <dbl>
 1 Whole    2013 January           2.38
 2 Whole    2013 February          2.38
 3 Whole    2013 March             2.38
 4 Whole    2013 April             2.38
 5 Whole    2013 May               2.38
 6 Whole    2013 June              2.38
 7 Whole    2013 July              2.38
 8 Whole    2013 August            2.38
 9 Whole    2013 September         2.38
10 Whole    2013 October           2.38
# ℹ 590 more rows

Dimensions of the dataset

Code
dim(df)
[1] 600   4

Summary of dataset

Code
summary(df)
   Product               Year         Month            Price_Dollar  
 Length:600         Min.   :2004   Length:600         Min.   :1.935  
 Class :character   1st Qu.:2006   Class :character   1st Qu.:2.150  
 Mode  :character   Median :2008   Mode  :character   Median :2.350  
                    Mean   :2008                      Mean   :3.390  
                    3rd Qu.:2011                      3rd Qu.:3.905  
                    Max.   :2013                      Max.   :7.037  
                                                      NA's   :7      

We can see that there are NaN values present so we further filter the data.

Code
filter(df,is.na(Price_Dollar))
# A tibble: 7 × 4
  Product         Year Month    Price_Dollar
  <chr>          <dbl> <chr>           <dbl>
1 Bone-in Breast  2004 January            NA
2 Bone-in Breast  2004 February           NA
3 Bone-in Breast  2004 March              NA
4 Bone-in Breast  2004 April              NA
5 Bone-in Breast  2004 May                NA
6 Bone-in Breast  2004 June               NA
7 Thighs          2004 January            NA

Unique poultry product categories

Code
distinct(select(df, Product))
# A tibble: 5 × 1
  Product       
  <chr>         
1 Whole         
2 B/S Breast    
3 Bone-in Breast
4 Whole Legs    
5 Thighs        

Briefly describe the data

This dataset of 600 rows and 4 columns, contains information about the prices of 5 different types of poultry cuts, namely Bone-in Breast, B/S Breast, Thighs, Whole Legs, and Whole, on a monthly basis from January 2004 to December 2013. However from the summary it is evident that there are NaN values present in the dataset indicating that some of the prices from 2004 are not available.

Tidy Data (as needed)

Is your data already tidy, or is there work to be done?

The data is already tidy with each row containing the price during a specific month, year of a specific cut of meat.

Identify variables that need to be mutated

The data currently has separate columns for months and years. To sort the data chronologically, we need to add a date column.

Code
df1<-df %>%
  mutate(date = str_c(Month, Year, sep=" "),
         date = my(date))
df1<-select(df1, date, Product, Price_Dollar)
df1
# A tibble: 600 × 3
   date       Product Price_Dollar
   <date>     <chr>          <dbl>
 1 2013-01-01 Whole           2.38
 2 2013-02-01 Whole           2.38
 3 2013-03-01 Whole           2.38
 4 2013-04-01 Whole           2.38
 5 2013-05-01 Whole           2.38
 6 2013-06-01 Whole           2.38
 7 2013-07-01 Whole           2.38
 8 2013-08-01 Whole           2.38
 9 2013-09-01 Whole           2.38
10 2013-10-01 Whole           2.38
# ℹ 590 more rows

The first operation uses the mutate() function to create a new column called date. The str_c() function from the stringr package is used to concatenate the Month and Year columns, separated by a space. The my() function is then used to convert this concatenated string into a date object. Although the day of the month is set to the first for all observations, this will not affect the order of the data because it is consistent throughout the dataset. Therefore, we can use this new date column to sort the data chronologically.