Challenge 4

challenge_4

PoChunYang

abc_poll

lubridate

stringr

Author

PoChun Yang

Published

March 22, 2023

Code

library(tidyverse)
library(lubridate)
library(stringr)
library(readxl)
library(dplyr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
tidy data (as needed, including sanity checks)
identify variables that need to be mutated
mutate variables and sanity check all mutations

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐

Read the organiceggpoultry.xls used the read_excel

Code

df <- read_excel("_data/organiceggpoultry.xls", sheet = "Data",skip=4)
df2 <- df[,!names(df) %in% c("...6")]
df2

# A tibble: 120 × 10
   ...1     Extra…¹ Extra…² Large…³ Large…⁴ Whole B/S B…⁵ Bone-…⁶ Whole…⁷ Thighs
   <chr>      <dbl>   <dbl>   <dbl>   <dbl> <dbl>   <dbl> <chr>     <dbl> <chr> 
 1 Jan 2004    230     132     230     126   198.    646. too few    194. too f…
 2 February    230     134.    226.    128.  198.    642. too few    194. 203   
 3 March       230     137     225     131   209     642. too few    194. 203   
 4 April       234.    137     225     131   212     642. too few    194. 203   
 5 May         236     137     225     131   214.    642. too few    194. 203   
 6 June        241     137     231.    134.  216.    641  too few    202. 200.3…
 7 July        241     137     234.    134.  217     642. 390.5      204. 199.5 
 8 August      241     137     234.    134.  217     642. 390.5      204. 199.5 
 9 Septemb…    241     136.    234.    130.  217     642. 390.5      204. 199.5 
10 October     241     136.    234.    128.  217     642. 390.5      204. 199.5 
# … with 110 more rows, and abbreviated variable names ¹`Extra Large \nDozen`,
#   ²`Extra Large 1/2 Doz.\n1/2 Dozen`, ³`Large \nDozen`, ⁴`Large \n1/2 Doz.`,
#   ⁵`B/S Breast`, ⁶`Bone-in Breast`, ⁷`Whole Legs`

Code

colnames(df2)

 [1] "...1"                            "Extra Large \nDozen"            
 [3] "Extra Large 1/2 Doz.\n1/2 Dozen" "Large \nDozen"                  
 [5] "Large \n1/2 Doz."                "Whole"                          
 [7] "B/S Breast"                      "Bone-in Breast"                 
 [9] "Whole Legs"                      "Thighs"

Briefly describe the data

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

This is the code that I try to tidy all the organiceggpoultry_xls. First of all, I used the mutate to copy the date which columns call …1. Then, I used the rename to change all the columns name. It would be easy to use when I want to used the pivot command. Next, I find that some of the date with /1 so I used str_remove to remove it. In addition, I want to change the date between year and month. Finally, I found that the Jan does not show the full vocabulary as other items so I change Jan to January. Beside that, I separate the part of chicken’s price and the type of egg’s price.

Code

df3<-df2%>%
  mutate(date=...1)%>%
  select(11,2:10)
df3<-df3%>%
  rename(xlarge_dozen=2,xlarge_halfdozen=3,large_dozen=4,large_halfdozen=5,chicken_whole=6
         ,chicken_BS_Breast=7,chicken_Bone_Breast=8,chicken_whole_Legs=9,chicken_tight=10)
df4<-df3%>%
  mutate(date = str_remove(date," /1"))
df4<-df4%>%
  separate(date,into = c("Month", "Year"),sep = " ")%>%
  fill(Year)
df4<-df4%>%
  mutate(Month = replace(Month, Month == 'Jan', 'January')) 
df4

# A tibble: 120 × 11
   Month   Year  xlarg…¹ xlarg…² large…³ large…⁴ chick…⁵ chick…⁶ chick…⁷ chick…⁸
   <chr>   <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>     <dbl>
 1 January 2004     230     132     230     126     198.    646. too few    194.
 2 Februa… 2004     230     134.    226.    128.    198.    642. too few    194.
 3 March   2004     230     137     225     131     209     642. too few    194.
 4 April   2004     234.    137     225     131     212     642. too few    194.
 5 May     2004     236     137     225     131     214.    642. too few    194.
 6 June    2004     241     137     231.    134.    216.    641  too few    202.
 7 July    2004     241     137     234.    134.    217     642. 390.5      204.
 8 August  2004     241     137     234.    134.    217     642. 390.5      204.
 9 Septem… 2004     241     136.    234.    130.    217     642. 390.5      204.
10 October 2004     241     136.    234.    128.    217     642. 390.5      204.
# … with 110 more rows, 1 more variable: chicken_tight <chr>, and abbreviated
#   variable names ¹xlarge_dozen, ²xlarge_halfdozen, ³large_dozen,
#   ⁴large_halfdozen, ⁵chicken_whole, ⁶chicken_BS_Breast, ⁷chicken_Bone_Breast,
#   ⁸chicken_whole_Legs

Code

chicken<-df4%>%
  select(1:2,7:11)
chicken

# A tibble: 120 × 7
   Month     Year  chicken_whole chicken_BS_Breast chicken_Bon…¹ chick…² chick…³
   <chr>     <chr>         <dbl>             <dbl> <chr>           <dbl> <chr>  
 1 January   2004           198.              646. too few          194. too few
 2 February  2004           198.              642. too few          194. 203    
 3 March     2004           209               642. too few          194. 203    
 4 April     2004           212               642. too few          194. 203    
 5 May       2004           214.              642. too few          194. 203    
 6 June      2004           216.              641  too few          202. 200.375
 7 July      2004           217               642. 390.5            204. 199.5  
 8 August    2004           217               642. 390.5            204. 199.5  
 9 September 2004           217               642. 390.5            204. 199.5  
10 October   2004           217               642. 390.5            204. 199.5  
# … with 110 more rows, and abbreviated variable names ¹chicken_Bone_Breast,
#   ²chicken_whole_Legs, ³chicken_tight

Code

eggs<-df4%>%
  select(1:6)
eggs

# A tibble: 120 × 6
   Month     Year  xlarge_dozen xlarge_halfdozen large_dozen large_halfdozen
   <chr>     <chr>        <dbl>            <dbl>       <dbl>           <dbl>
 1 January   2004          230              132         230             126 
 2 February  2004          230              134.        226.            128.
 3 March     2004          230              137         225             131 
 4 April     2004          234.             137         225             131 
 5 May       2004          236              137         225             131 
 6 June      2004          241              137         231.            134.
 7 July      2004          241              137         234.            134.
 8 August    2004          241              137         234.            134.
 9 September 2004          241              136.        234.            130.
10 October   2004          241              136.        234.            128.
# … with 110 more rows

Any additional comments?

Identify variables that need to be mutated

Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

Well, I try my best to combine the time between the year and month and date. When I search the detail is that variable must be “int”. However, I change that I still not works here. Therefore, I used the stringr which command call str_c to solve the problem. In addition, I used the “-” to insert between the month ,yaear, and date.

Code

egg_time<-eggs%>%
  mutate(date = 15)%>%
  select(1,2,7)
mutate(egg_time,'measure_date' = make_date(year = Year, month = Month, day = date))

# A tibble: 120 × 4
   Month     Year   date measure_date
   <chr>     <chr> <dbl> <date>      
 1 January   2004     15 NA          
 2 February  2004     15 NA          
 3 March     2004     15 NA          
 4 April     2004     15 NA          
 5 May       2004     15 NA          
 6 June      2004     15 NA          
 7 July      2004     15 NA          
 8 August    2004     15 NA          
 9 September 2004     15 NA          
10 October   2004     15 NA          
# … with 110 more rows

Code

##after the search on the Net the make_date must be a number 
#mutate(egg_time,produce_time=make_date(Month,Year,date))
#egg_tidy<-egg_time%>%
#mutate(Month = replace(Month, Month == 'January',01))%>%
#mutate(Month = replace(Month, Month == 'February', 02))%>%
#mutate(Month = replace(Month, Month == 'March', 03))%>%
#mutate(Month = replace(Month, Month == 'April', 04))%>%
#mutate(Month = replace(Month, Month == 'May', 05))%>%
#mutate(Month = replace(Month, Month == 'June', 06))%>%
#mutate(Month = replace(Month, Month == 'July', 07))%>%
#mutate(Month = replace(Month, Month == 'August', 08))%>%
#mutate(Month = replace(Month, Month == 'September', 09))%>%
#mutate(Month = replace(Month, Month == 'October', 10))%>%
#mutate(Month = replace(Month, Month == 'November', 11))%>%
#mutate(Month = replace(Month, Month == 'December', 12))%>%
#mutate(produce_time=make_date(Month,Year,date))
mutate(egg_time, produce_time=str_c(date,Month,Year,sep='-'))

# A tibble: 120 × 4
   Month     Year   date produce_time     
   <chr>     <chr> <dbl> <chr>            
 1 January   2004     15 15-January-2004  
 2 February  2004     15 15-February-2004 
 3 March     2004     15 15-March-2004    
 4 April     2004     15 15-April-2004    
 5 May       2004     15 15-May-2004      
 6 June      2004     15 15-June-2004     
 7 July      2004     15 15-July-2004     
 8 August    2004     15 15-August-2004   
 9 September 2004     15 15-September-2004
10 October   2004     15 15-October-2004  
# … with 110 more rows

In this part, I used the pivot_longer to make a new format for type of egg with their prices. Then, I used the summarise command to get the max, min, mean, median of the prices from 2004 to 2013.

Code

data_summary<-eggs%>%
pivot_longer(col= contains("large"),
              names_to="type_of_egg",
              values_to = "price")
data_summary

# A tibble: 480 × 4
   Month    Year  type_of_egg      price
   <chr>    <chr> <chr>            <dbl>
 1 January  2004  xlarge_dozen      230 
 2 January  2004  xlarge_halfdozen  132 
 3 January  2004  large_dozen       230 
 4 January  2004  large_halfdozen   126 
 5 February 2004  xlarge_dozen      230 
 6 February 2004  xlarge_halfdozen  134.
 7 February 2004  large_dozen       226.
 8 February 2004  large_halfdozen   128.
 9 March    2004  xlarge_dozen      230 
10 March    2004  xlarge_halfdozen  137 
# … with 470 more rows

Code

data_summary%>%
group_by(type_of_egg)%>%
summarise(Min=min(price),
          Max=max(price),
          Mean=mean(price),
          Median=median(price))

# A tibble: 4 × 5
  type_of_egg        Min   Max  Mean Median
  <chr>            <dbl> <dbl> <dbl>  <dbl>
1 large_dozen        225  278.  254.   268.
2 large_halfdozen    126  178   155.   174.
3 xlarge_dozen       230  290   267.   286.
4 xlarge_halfdozen   132  188.  164.   186.

Any additional comments?