challenge_4
Jaswanth Reddy Kommuru
poultry_tidy
More data wrangling: pivoting
Author

Jaswanth Reddy Kommuru

Published

May 9, 2023

Code
library(lubridate)
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • abc_poll.csv ⭐
  • poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
  • FedFundsRate.csv⭐⭐⭐
  • hotel_bookings.csv⭐⭐⭐⭐
  • debt_in_trillions.xlsx ⭐⭐⭐⭐⭐
Code
poultrydata <-readxl::read_excel("~/Documents/601/601_Spring_2023/posts/_data/poultry_tidy.xlsx")
Code
head(poultrydata)
# A tibble: 6 × 4
  Product  Year Month    Price_Dollar
  <chr>   <dbl> <chr>           <dbl>
1 Whole    2013 January          2.38
2 Whole    2013 February         2.38
3 Whole    2013 March            2.38
4 Whole    2013 April            2.38
5 Whole    2013 May              2.38
6 Whole    2013 June             2.38
Code
dim(poultrydata)
[1] 600   4
Code
str(poultrydata)
tibble [600 × 4] (S3: tbl_df/tbl/data.frame)
 $ Product     : chr [1:600] "Whole" "Whole" "Whole" "Whole" ...
 $ Year        : num [1:600] 2013 2013 2013 2013 2013 ...
 $ Month       : chr [1:600] "January" "February" "March" "April" ...
 $ Price_Dollar: num [1:600] 2.38 2.38 2.38 2.38 2.38 ...
Code
 poultrydata%>%
  select("Product") %>%
  distinct(.)
# A tibble: 5 × 1
  Product       
  <chr>         
1 Whole         
2 B/S Breast    
3 Bone-in Breast
4 Whole Legs    
5 Thighs        
Code
 poultrydata%>%
  select("Year") %>%
  distinct(.)
# A tibble: 10 × 1
    Year
   <dbl>
 1  2013
 2  2012
 3  2011
 4  2010
 5  2009
 6  2008
 7  2007
 8  2006
 9  2005
10  2004
Code
colnames(poultrydata)
[1] "Product"      "Year"         "Month"        "Price_Dollar"
Code
nrow(poultrydata)
[1] 600
Code
summary(poultrydata)
   Product               Year         Month            Price_Dollar  
 Length:600         Min.   :2004   Length:600         Min.   :1.935  
 Class :character   1st Qu.:2006   Class :character   1st Qu.:2.150  
 Mode  :character   Median :2008   Mode  :character   Median :2.350  
                    Mean   :2008                      Mean   :3.390  
                    3rd Qu.:2011                      3rd Qu.:3.905  
                    Max.   :2013                      Max.   :7.037  
                                                      NA's   :7      

Briefly describe the data

The dataset comprises 4 columns and 600 rows spanning from 2004 to 2013. It includes various poultry products such as Whole, B/S Breast, Bone-in Breast, Whole Legs, and Thighs. The price range for these products extends from a minimum of 1.935 to 7.037. Upon examining the summary, it becomes evident that certain price values are missing, specifically those associated with product Bone-in Breast and Thighs. Upon further investigation, it was observed that a majority of the missing values are concentrated within these two product categories.

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

The provided dataset is already in a tidy format. While there is the possibility of removing the rows with missing data, but still I will preserve these rows. Despite the presence of missing values, these rows still contain valuable information in other columns, making them important for a comprehensive analysis.

Any additional comments?

Identify variables that need to be mutated

Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

Document your work here.

Making two modifications to the dataset would enhance the analysis. Firstly, since the prices are currently discrete values, it would be advantageous to introduce price ranges, allowing for categorization into high, low, and medium price categories. This would provide a more meaningful representation of the data. Secondly, I suggest combining the variables “year” and “month” into a single column to facilitate direct analysis using the merged time information. This would streamline the analysis process and enable more comprehensive insights.

Code
poultrydata_mutated <- poultrydata %>%
  mutate(Price_Range = ifelse(Price_Dollar < 3, 'low', ifelse(Price_Dollar < 5, 'medium', 'high')))

poultrydata_mutated <- poultrydata_mutated %>%
  mutate(Date = as.Date(paste(Year, Month, '01'), format = '%Y %B %d'))

poultrydata_mutated
# A tibble: 600 × 6
   Product  Year Month     Price_Dollar Price_Range Date      
   <chr>   <dbl> <chr>            <dbl> <chr>       <date>    
 1 Whole    2013 January           2.38 low         2013-01-01
 2 Whole    2013 February          2.38 low         2013-02-01
 3 Whole    2013 March             2.38 low         2013-03-01
 4 Whole    2013 April             2.38 low         2013-04-01
 5 Whole    2013 May               2.38 low         2013-05-01
 6 Whole    2013 June              2.38 low         2013-06-01
 7 Whole    2013 July              2.38 low         2013-07-01
 8 Whole    2013 August            2.38 low         2013-08-01
 9 Whole    2013 September         2.38 low         2013-09-01
10 Whole    2013 October           2.38 low         2013-10-01
# ℹ 590 more rows
Code
 poultrydata_mutated%>%
  select(`Price_Range`) %>%
  n_distinct(.)
[1] 4
Code
 poultrydata_mutated%>%
  select(`Price_Range`) %>%
  distinct(.)
# A tibble: 4 × 1
  Price_Range
  <chr>      
1 low        
2 high       
3 medium     
4 <NA>       

Any additional comments?