challenge_4
abc_poll
eggs
fed_rates
hotel_bookings
debt
More data wrangling: pivoting
Author

Prachiti Parkar

Published

August 18, 2022

Code
library(tidyverse)
library(lubridate)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • abc_poll.csv ⭐
  • poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
  • FedFundsRate.csv⭐⭐⭐
  • hotel_bookings.csv⭐⭐⭐⭐
  • debt_in_trillions.xlsx ⭐⭐⭐⭐⭐
Code
poultry_data <-readxl::read_excel("_data/poultry_tidy.xlsx")

poultry_data
# A tibble: 600 × 4
   Product  Year Month     Price_Dollar
   <chr>   <dbl> <chr>            <dbl>
 1 Whole    2013 January           2.38
 2 Whole    2013 February          2.38
 3 Whole    2013 March             2.38
 4 Whole    2013 April             2.38
 5 Whole    2013 May               2.38
 6 Whole    2013 June              2.38
 7 Whole    2013 July              2.38
 8 Whole    2013 August            2.38
 9 Whole    2013 September         2.38
10 Whole    2013 October           2.38
# … with 590 more rows
Code
summary(poultry_data)
   Product               Year         Month            Price_Dollar  
 Length:600         Min.   :2004   Length:600         Min.   :1.935  
 Class :character   1st Qu.:2006   Class :character   1st Qu.:2.150  
 Mode  :character   Median :2008   Mode  :character   Median :2.350  
                    Mean   :2008                      Mean   :3.390  
                    3rd Qu.:2011                      3rd Qu.:3.905  
                    Max.   :2013                      Max.   :7.037  
                                                      NA's   :7      
Code
missing_data <- filter(poultry_data, is.na(Price_Dollar))

#Checking if there is any missing data
missing_data 
# A tibble: 7 × 4
  Product         Year Month    Price_Dollar
  <chr>          <dbl> <chr>           <dbl>
1 Bone-in Breast  2004 January            NA
2 Bone-in Breast  2004 February           NA
3 Bone-in Breast  2004 March              NA
4 Bone-in Breast  2004 April              NA
5 Bone-in Breast  2004 May                NA
6 Bone-in Breast  2004 June               NA
7 Thighs          2004 January            NA
Code
unique(poultry_data$Product)
[1] "Whole"          "B/S Breast"     "Bone-in Breast" "Whole Legs"    
[5] "Thighs"        

Briefly describe the data

The data consists of 4 columns and 600 rows from the year 2004 to 2013. Specifically it contains the products, Whole, B/S Breast, Bone-in Breast, Whole legs and Thighs. The price ranges from minimum 1.935 to 7.037. We can see from summary that some of the values of prices are missing, particularly 7. I tried to dive deeper and understood that most of these are of product Bone-in breast and thighs.

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

Data is already tidy. We can choose to remove NA data but I believe since it has other column information, those rows are important too.

Any additional comments?

Identify variables that need to be mutated

Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

Document your work here.

According to me, the year and months need to be joined so that we can sort them and also, we need to understand if price is low, medium or high instead of checking the exact price, we can just look at the variable.

Code
poultry_data <- poultry_data %>%
        mutate(Date = ym(paste(Year, Month)))



poultry_data <- poultry_data %>%
  mutate(Price_type = case_when(Price_Dollar < 3 ~ 'low',
                             Price_Dollar < 5 ~ 'med',
                             Price_Dollar > 5 ~ 'high'))

unique(poultry_data$Price_type)
[1] "low"  "high" "med"  NA    
Code
poultry_data
# A tibble: 600 × 6
   Product  Year Month     Price_Dollar Date       Price_type
   <chr>   <dbl> <chr>            <dbl> <date>     <chr>     
 1 Whole    2013 January           2.38 2013-01-01 low       
 2 Whole    2013 February          2.38 2013-02-01 low       
 3 Whole    2013 March             2.38 2013-03-01 low       
 4 Whole    2013 April             2.38 2013-04-01 low       
 5 Whole    2013 May               2.38 2013-05-01 low       
 6 Whole    2013 June              2.38 2013-06-01 low       
 7 Whole    2013 July              2.38 2013-07-01 low       
 8 Whole    2013 August            2.38 2013-08-01 low       
 9 Whole    2013 September         2.38 2013-09-01 low       
10 Whole    2013 October           2.38 2013-10-01 low       
# … with 590 more rows

Any additional comments?