challenge_4
poultry_tidy
Abhinav Reddy Yadatha
More data wrangling: pivoting
Author

Abhinav Reddy Yadatha

Published

May 6, 2023

Code
library(tidyverse)
library(dplyr)
library(lubridate)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations

Read in data

Reading poultry_tidy.xlsx dataset.

  • poultry_tidy.xlsx⭐⭐
Code
# Reading the poultry dataset
poultry_data <-readxl::read_excel("_data/poultry_tidy.xlsx")
poultry_data
# A tibble: 600 × 4
   Product  Year Month     Price_Dollar
   <chr>   <dbl> <chr>            <dbl>
 1 Whole    2013 January           2.38
 2 Whole    2013 February          2.38
 3 Whole    2013 March             2.38
 4 Whole    2013 April             2.38
 5 Whole    2013 May               2.38
 6 Whole    2013 June              2.38
 7 Whole    2013 July              2.38
 8 Whole    2013 August            2.38
 9 Whole    2013 September         2.38
10 Whole    2013 October           2.38
# ℹ 590 more rows
Code
# Display the summary of the dataset
summary(poultry_data)
   Product               Year         Month            Price_Dollar  
 Length:600         Min.   :2004   Length:600         Min.   :1.935  
 Class :character   1st Qu.:2006   Class :character   1st Qu.:2.150  
 Mode  :character   Median :2008   Mode  :character   Median :2.350  
                    Mean   :2008                      Mean   :3.390  
                    3rd Qu.:2011                      3rd Qu.:3.905  
                    Max.   :2013                      Max.   :7.037  
                                                      NA's   :7      
Code
# Check for any empty/missing data
na_data <- filter(poultry_data, is.na(Price_Dollar))

na_data
# A tibble: 7 × 4
  Product         Year Month    Price_Dollar
  <chr>          <dbl> <chr>           <dbl>
1 Bone-in Breast  2004 January            NA
2 Bone-in Breast  2004 February           NA
3 Bone-in Breast  2004 March              NA
4 Bone-in Breast  2004 April              NA
5 Bone-in Breast  2004 May                NA
6 Bone-in Breast  2004 June               NA
7 Thighs          2004 January            NA
Code
# Displaying all the unique poultry product names.
unique(poultry_data$Product)
[1] "Whole"          "B/S Breast"     "Bone-in Breast" "Whole Legs"    
[5] "Thighs"        

Briefly describe the data

The dataset contains 600 rows and four columns that span from 2004 to 2013. The columns represent various poultry products, including Whole, B/S Breast, Bone-in Breast, Whole Legs, and Thighs, and the corresponding prices range from 1.935 to 7.037 dollars. Upon reviewing the summary, it’s evident that some of the prices are missing, particularly for product 7. Further analysis revealed that most of the missing values belong to the Bone-in Breast and Thighs products.

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

The data is already tidy. Although we have the option to remove the NA data, I believe it’s essential to retain the rows with missing values as they contain other important column information Any additional comments?

Identify variables that need to be mutated

Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

Document your work here.

I believe that two mutations would be beneficial foe the analysis of the dataset. First, it can be observed that the prices are discrete values and it would be helpful if there were ranges so that the price can be categorized as high, low and medium. Second, i believe that the variables year and month can be joined so that we can use it for analysis directly.

Code
# Creating ranges for the prices to be categorized in ranges.
poultry_data_mutated <- poultry_data %>%
  mutate(Price_type = case_when(Price_Dollar < 3 ~ 'low',Price_Dollar < 5 ~ 'medium', Price_Dollar > 5 ~ 'high'))

# Combining the year and the month.
poultry_data_mutated <- poultry_data_mutated %>%
        mutate(Date = ym(paste(Year, Month)))

poultry_data_mutated
# A tibble: 600 × 6
   Product  Year Month     Price_Dollar Price_type Date      
   <chr>   <dbl> <chr>            <dbl> <chr>      <date>    
 1 Whole    2013 January           2.38 low        2013-01-01
 2 Whole    2013 February          2.38 low        2013-02-01
 3 Whole    2013 March             2.38 low        2013-03-01
 4 Whole    2013 April             2.38 low        2013-04-01
 5 Whole    2013 May               2.38 low        2013-05-01
 6 Whole    2013 June              2.38 low        2013-06-01
 7 Whole    2013 July              2.38 low        2013-07-01
 8 Whole    2013 August            2.38 low        2013-08-01
 9 Whole    2013 September         2.38 low        2013-09-01
10 Whole    2013 October           2.38 low        2013-10-01
# ℹ 590 more rows
Code
# Displaying the unique values of Price_type column after the mutation.
unique(poultry_data_mutated$Price_type)
[1] "low"    "high"   "medium" NA      

Any additional comments?