Code
library(tidyverse)
library(lubridate)
library(stringr)
library(readxl)
library(dplyr)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
PoChun Yang
March 22, 2023
Today’s challenge is to:
Read in one (or more) of the following datasets, using the correct R package and command.
Read the organiceggpoultry.xls used the read_excel
# A tibble: 120 × 10
...1 Extra…¹ Extra…² Large…³ Large…⁴ Whole B/S B…⁵ Bone-…⁶ Whole…⁷ Thighs
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
1 Jan 2004 230 132 230 126 198. 646. too few 194. too f…
2 February 230 134. 226. 128. 198. 642. too few 194. 203
3 March 230 137 225 131 209 642. too few 194. 203
4 April 234. 137 225 131 212 642. too few 194. 203
5 May 236 137 225 131 214. 642. too few 194. 203
6 June 241 137 231. 134. 216. 641 too few 202. 200.3…
7 July 241 137 234. 134. 217 642. 390.5 204. 199.5
8 August 241 137 234. 134. 217 642. 390.5 204. 199.5
9 Septemb… 241 136. 234. 130. 217 642. 390.5 204. 199.5
10 October 241 136. 234. 128. 217 642. 390.5 204. 199.5
# … with 110 more rows, and abbreviated variable names ¹`Extra Large \nDozen`,
# ²`Extra Large 1/2 Doz.\n1/2 Dozen`, ³`Large \nDozen`, ⁴`Large \n1/2 Doz.`,
# ⁵`B/S Breast`, ⁶`Bone-in Breast`, ⁷`Whole Legs`
[1] "...1" "Extra Large \nDozen"
[3] "Extra Large 1/2 Doz.\n1/2 Dozen" "Large \nDozen"
[5] "Large \n1/2 Doz." "Whole"
[7] "B/S Breast" "Bone-in Breast"
[9] "Whole Legs" "Thighs"
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
This is the code that I try to tidy all the organiceggpoultry_xls. First of all, I used the mutate to copy the date which columns call …1. Then, I used the rename to change all the columns name. It would be easy to use when I want to used the pivot command. Next, I find that some of the date with /1 so I used str_remove to remove it. In addition, I want to change the date between year and month. Finally, I found that the Jan does not show the full vocabulary as other items so I change Jan to January. Beside that, I separate the part of chicken’s price and the type of egg’s price.
df3<-df2%>%
mutate(date=...1)%>%
select(11,2:10)
df3<-df3%>%
rename(xlarge_dozen=2,xlarge_halfdozen=3,large_dozen=4,large_halfdozen=5,chicken_whole=6
,chicken_BS_Breast=7,chicken_Bone_Breast=8,chicken_whole_Legs=9,chicken_tight=10)
df4<-df3%>%
mutate(date = str_remove(date," /1"))
df4<-df4%>%
separate(date,into = c("Month", "Year"),sep = " ")%>%
fill(Year)
df4<-df4%>%
mutate(Month = replace(Month, Month == 'Jan', 'January'))
df4
# A tibble: 120 × 11
Month Year xlarg…¹ xlarg…² large…³ large…⁴ chick…⁵ chick…⁶ chick…⁷ chick…⁸
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 January 2004 230 132 230 126 198. 646. too few 194.
2 Februa… 2004 230 134. 226. 128. 198. 642. too few 194.
3 March 2004 230 137 225 131 209 642. too few 194.
4 April 2004 234. 137 225 131 212 642. too few 194.
5 May 2004 236 137 225 131 214. 642. too few 194.
6 June 2004 241 137 231. 134. 216. 641 too few 202.
7 July 2004 241 137 234. 134. 217 642. 390.5 204.
8 August 2004 241 137 234. 134. 217 642. 390.5 204.
9 Septem… 2004 241 136. 234. 130. 217 642. 390.5 204.
10 October 2004 241 136. 234. 128. 217 642. 390.5 204.
# … with 110 more rows, 1 more variable: chicken_tight <chr>, and abbreviated
# variable names ¹xlarge_dozen, ²xlarge_halfdozen, ³large_dozen,
# ⁴large_halfdozen, ⁵chicken_whole, ⁶chicken_BS_Breast, ⁷chicken_Bone_Breast,
# ⁸chicken_whole_Legs
# A tibble: 120 × 7
Month Year chicken_whole chicken_BS_Breast chicken_Bon…¹ chick…² chick…³
<chr> <chr> <dbl> <dbl> <chr> <dbl> <chr>
1 January 2004 198. 646. too few 194. too few
2 February 2004 198. 642. too few 194. 203
3 March 2004 209 642. too few 194. 203
4 April 2004 212 642. too few 194. 203
5 May 2004 214. 642. too few 194. 203
6 June 2004 216. 641 too few 202. 200.375
7 July 2004 217 642. 390.5 204. 199.5
8 August 2004 217 642. 390.5 204. 199.5
9 September 2004 217 642. 390.5 204. 199.5
10 October 2004 217 642. 390.5 204. 199.5
# … with 110 more rows, and abbreviated variable names ¹chicken_Bone_Breast,
# ²chicken_whole_Legs, ³chicken_tight
# A tibble: 120 × 6
Month Year xlarge_dozen xlarge_halfdozen large_dozen large_halfdozen
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 January 2004 230 132 230 126
2 February 2004 230 134. 226. 128.
3 March 2004 230 137 225 131
4 April 2004 234. 137 225 131
5 May 2004 236 137 225 131
6 June 2004 241 137 231. 134.
7 July 2004 241 137 234. 134.
8 August 2004 241 137 234. 134.
9 September 2004 241 136. 234. 130.
10 October 2004 241 136. 234. 128.
# … with 110 more rows
Any additional comments?
Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Well, I try my best to combine the time between the year and month and date. When I search the detail is that variable must be “int”. However, I change that I still not works here. Therefore, I used the stringr which command call str_c to solve the problem. In addition, I used the “-” to insert between the month ,yaear, and date.
# A tibble: 120 × 4
Month Year date measure_date
<chr> <chr> <dbl> <date>
1 January 2004 15 NA
2 February 2004 15 NA
3 March 2004 15 NA
4 April 2004 15 NA
5 May 2004 15 NA
6 June 2004 15 NA
7 July 2004 15 NA
8 August 2004 15 NA
9 September 2004 15 NA
10 October 2004 15 NA
# … with 110 more rows
##after the search on the Net the make_date must be a number
#mutate(egg_time,produce_time=make_date(Month,Year,date))
#egg_tidy<-egg_time%>%
#mutate(Month = replace(Month, Month == 'January',01))%>%
#mutate(Month = replace(Month, Month == 'February', 02))%>%
#mutate(Month = replace(Month, Month == 'March', 03))%>%
#mutate(Month = replace(Month, Month == 'April', 04))%>%
#mutate(Month = replace(Month, Month == 'May', 05))%>%
#mutate(Month = replace(Month, Month == 'June', 06))%>%
#mutate(Month = replace(Month, Month == 'July', 07))%>%
#mutate(Month = replace(Month, Month == 'August', 08))%>%
#mutate(Month = replace(Month, Month == 'September', 09))%>%
#mutate(Month = replace(Month, Month == 'October', 10))%>%
#mutate(Month = replace(Month, Month == 'November', 11))%>%
#mutate(Month = replace(Month, Month == 'December', 12))%>%
#mutate(produce_time=make_date(Month,Year,date))
mutate(egg_time, produce_time=str_c(date,Month,Year,sep='-'))
# A tibble: 120 × 4
Month Year date produce_time
<chr> <chr> <dbl> <chr>
1 January 2004 15 15-January-2004
2 February 2004 15 15-February-2004
3 March 2004 15 15-March-2004
4 April 2004 15 15-April-2004
5 May 2004 15 15-May-2004
6 June 2004 15 15-June-2004
7 July 2004 15 15-July-2004
8 August 2004 15 15-August-2004
9 September 2004 15 15-September-2004
10 October 2004 15 15-October-2004
# … with 110 more rows
In this part, I used the pivot_longer to make a new format for type of egg with their prices. Then, I used the summarise command to get the max, min, mean, median of the prices from 2004 to 2013.
# A tibble: 480 × 4
Month Year type_of_egg price
<chr> <chr> <chr> <dbl>
1 January 2004 xlarge_dozen 230
2 January 2004 xlarge_halfdozen 132
3 January 2004 large_dozen 230
4 January 2004 large_halfdozen 126
5 February 2004 xlarge_dozen 230
6 February 2004 xlarge_halfdozen 134.
7 February 2004 large_dozen 226.
8 February 2004 large_halfdozen 128.
9 March 2004 xlarge_dozen 230
10 March 2004 xlarge_halfdozen 137
# … with 470 more rows
# A tibble: 4 × 5
type_of_egg Min Max Mean Median
<chr> <dbl> <dbl> <dbl> <dbl>
1 large_dozen 225 278. 254. 268.
2 large_halfdozen 126 178 155. 174.
3 xlarge_dozen 230 290 267. 286.
4 xlarge_halfdozen 132 188. 164. 186.
Any additional comments?
---
title: "Challenge 4"
author: "PoChun Yang"
desription: "More data wrangling: pivoting"
date: "03/22/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_4
- PoChunYang
- abc_poll
- lubridate
- stringr
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(lubridate)
library(stringr)
library(readxl)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1) read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2) tidy data (as needed, including sanity checks)
3) identify variables that need to be mutated
4) mutate variables and sanity check all mutations
## Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
Read the organiceggpoultry.xls used the read_excel
```{r}
df <- read_excel("_data/organiceggpoultry.xls", sheet = "Data",skip=4)
df2 <- df[,!names(df) %in% c("...6")]
df2
colnames(df2)
```
### Briefly describe the data
## Tidy Data (as needed)
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
This is the code that I try to tidy all the organiceggpoultry_xls. First of all, I used the mutate to copy the date which columns call ...1. Then, I used the rename to change all the columns name. It would be easy to use when I want to used the pivot command. Next, I find that some of the date with /1 so I used str_remove to remove it. In addition, I want to change the date between year and month. Finally, I found that the Jan does not show the full vocabulary as other items so I change Jan to January. Beside that, I separate the part of chicken's price and the type of egg's price.
```{r}
df3<-df2%>%
mutate(date=...1)%>%
select(11,2:10)
df3<-df3%>%
rename(xlarge_dozen=2,xlarge_halfdozen=3,large_dozen=4,large_halfdozen=5,chicken_whole=6
,chicken_BS_Breast=7,chicken_Bone_Breast=8,chicken_whole_Legs=9,chicken_tight=10)
df4<-df3%>%
mutate(date = str_remove(date," /1"))
df4<-df4%>%
separate(date,into = c("Month", "Year"),sep = " ")%>%
fill(Year)
df4<-df4%>%
mutate(Month = replace(Month, Month == 'Jan', 'January'))
df4
chicken<-df4%>%
select(1:2,7:11)
chicken
eggs<-df4%>%
select(1:6)
eggs
```
Any additional comments?
## Identify variables that need to be mutated
Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Well, I try my best to combine the time between the year and month and date. When I search the detail is that variable must be "int". However, I change that I still not works here. Therefore, I used the stringr which command call str_c to solve the problem. In addition, I used the "-" to insert between the month ,yaear, and date.
```{r}
egg_time<-eggs%>%
mutate(date = 15)%>%
select(1,2,7)
mutate(egg_time,'measure_date' = make_date(year = Year, month = Month, day = date))
##after the search on the Net the make_date must be a number
#mutate(egg_time,produce_time=make_date(Month,Year,date))
#egg_tidy<-egg_time%>%
#mutate(Month = replace(Month, Month == 'January',01))%>%
#mutate(Month = replace(Month, Month == 'February', 02))%>%
#mutate(Month = replace(Month, Month == 'March', 03))%>%
#mutate(Month = replace(Month, Month == 'April', 04))%>%
#mutate(Month = replace(Month, Month == 'May', 05))%>%
#mutate(Month = replace(Month, Month == 'June', 06))%>%
#mutate(Month = replace(Month, Month == 'July', 07))%>%
#mutate(Month = replace(Month, Month == 'August', 08))%>%
#mutate(Month = replace(Month, Month == 'September', 09))%>%
#mutate(Month = replace(Month, Month == 'October', 10))%>%
#mutate(Month = replace(Month, Month == 'November', 11))%>%
#mutate(Month = replace(Month, Month == 'December', 12))%>%
#mutate(produce_time=make_date(Month,Year,date))
mutate(egg_time, produce_time=str_c(date,Month,Year,sep='-'))
```
In this part, I used the pivot_longer to make a new format for type of egg with their prices. Then, I used the summarise command to get the max, min, mean, median of the prices from 2004 to 2013.
```{r}
data_summary<-eggs%>%
pivot_longer(col= contains("large"),
names_to="type_of_egg",
values_to = "price")
data_summary
data_summary%>%
group_by(type_of_egg)%>%
summarise(Min=min(price),
Max=max(price),
Mean=mean(price),
Median=median(price))
```
Any additional comments?