Challenge 3

challenge_3

organicpoultry

Author

Mekhala Kumar

Published

August 17, 2022

Code

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Cleaning data and the reasoning for using pivot

The dataset used was organicpoultry. It contains information about the quantity of different poultry types for all months across the years 2004 to 2013. Currently the month and year data all fall under a single column. In order to make the data easy to interpret, first, the column with the data of the month and year need to be separated into two columns.
Following which, the data needs to be pivoted in such a manner that the years become columns and the types of poultry become rows. This format will make it easier to select a subgroup within the types of poultry and compare the changes across years.

Code

library(readxl)
library(tidyverse)
eggpoul <- read_excel("_data/organiceggpoultry.xls",skip=4)
View(eggpoul)
colnames(eggpoul)

 [1] "...1"                            "Extra Large \nDozen"            
 [3] "Extra Large 1/2 Doz.\n1/2 Dozen" "Large \nDozen"                  
 [5] "Large \n1/2 Doz."                "...6"                           
 [7] "Whole"                           "B/S Breast"                     
 [9] "Bone-in Breast"                  "Whole Legs"                     
[11] "Thighs"

Code

eggpoul=subset(eggpoul,select=-c(...6))
tail(eggpoul, 10)

# A tibble: 10 × 10
   ...1     Extra…¹ Extra…² Large…³ Large…⁴ Whole B/S B…⁵ Bone-…⁶ Whole…⁷ Thighs
   <chr>      <dbl>   <dbl>   <dbl>   <dbl> <dbl>   <dbl> <chr>     <dbl> <chr> 
 1 March        290    188.    268.     178  238.    704. 390.5      204. 216.25
 2 April        290    188.    268.     178  238.    704. 390.5      204. 216.25
 3 May          290    188.    268.     178  238.    704. 390.5      204. 216.25
 4 June         290    188.    268.     178  238.    704. 390.5      204. 216.25
 5 July         290    188.    268.     178  238.    704. 390.5      204. 216.25
 6 August       290    188.    268.     178  238.    704. 390.5      204. 216.25
 7 Septemb…     290    188.    268.     178  238.    704. 390.5      204. 216.25
 8 October      290    188.    268.     178  238.    704. 390.5      204. 216.25
 9 November     290    188.    268.     178  238.    704. 390.5      204. 216.25
10 December     290    188.    268.     178  238.    704. 390.5      204. 216.25
# … with abbreviated variable names ¹`Extra Large \nDozen`,
#   ²`Extra Large 1/2 Doz.\n1/2 Dozen`, ³`Large \nDozen`, ⁴`Large \n1/2 Doz.`,
#   ⁵`B/S Breast`, ⁶`Bone-in Breast`, ⁷`Whole Legs`

Code

head(eggpoul)

# A tibble: 6 × 10
  ...1     Extra …¹ Extra…² Large…³ Large…⁴ Whole B/S B…⁵ Bone-…⁶ Whole…⁷ Thighs
  <chr>       <dbl>   <dbl>   <dbl>   <dbl> <dbl>   <dbl> <chr>     <dbl> <chr> 
1 Jan 2004     230     132     230     126   198.    646. too few    194. too f…
2 February     230     134.    226.    128.  198.    642. too few    194. 203   
3 March        230     137     225     131   209     642. too few    194. 203   
4 April        234.    137     225     131   212     642. too few    194. 203   
5 May          236     137     225     131   214.    642. too few    194. 203   
6 June         241     137     231.    134.  216.    641  too few    202. 200.3…
# … with abbreviated variable names ¹`Extra Large \nDozen`,
#   ²`Extra Large 1/2 Doz.\n1/2 Dozen`, ³`Large \nDozen`, ⁴`Large \n1/2 Doz.`,
#   ⁵`B/S Breast`, ⁶`Bone-in Breast`, ⁷`Whole Legs`

Code

str(eggpoul)

tibble [120 × 10] (S3: tbl_df/tbl/data.frame)
 $ ...1                           : chr [1:120] "Jan 2004" "February" "March" "April" ...
 $ Extra Large 
Dozen            : num [1:120] 230 230 230 234 236 ...
 $ Extra Large 1/2 Doz.
1/2 Dozen: num [1:120] 132 134 137 137 137 ...
 $ Large 
Dozen                  : num [1:120] 230 226 225 225 225 ...
 $ Large 
1/2 Doz.               : num [1:120] 126 128 131 131 131 ...
 $ Whole                          : num [1:120] 198 198 209 212 214 ...
 $ B/S Breast                     : num [1:120] 646 642 642 642 642 ...
 $ Bone-in Breast                 : chr [1:120] "too few" "too few" "too few" "too few" ...
 $ Whole Legs                     : num [1:120] 194 194 194 194 194 ...
 $ Thighs                         : chr [1:120] "too few" "203" "203" "203" ...

Code

eggpoul<-eggpoul%>%
   mutate(`Bone-in Breast` = parse_number(na_if(`Bone-in Breast`, "too few")),
           Thighs = parse_number(na_if(Thighs, "too few")))
eggpoul<-eggpoul %>% separate(1, c("Month", "Year"), extra = "drop", fill = "right")
vec<-rep(c(1,2,3,4,5,6,7,8,9,10),each=12)
eggpoul$Year[vec==1] <- 2004
eggpoul$Year[vec==2] <- 2005
eggpoul$Year[vec==3] <- 2006
eggpoul$Year[vec==4] <- 2007
eggpoul$Year[vec==5] <- 2008
eggpoul$Year[vec==6] <- 2009
eggpoul$Year[vec==7] <- 2010
eggpoul$Year[vec==8] <- 2011
eggpoul$Year[vec==9] <- 2012
eggpoul$Year[vec==10] <- 2013
dim(eggpoul)

[1] 120  11

Challenge: Describe the final dimensions

The original dataset has 120 rows and 11 columns. 2 of the variables are being used to identify a case. Hence,after pivoting, we expect to have 1080 rows and 4 columns. It is anticipated that the data will be long (taller).

Code

#existing rows/cases
nrow(eggpoul)

[1] 120

Code

#existing columns/cases
ncol(eggpoul)

[1] 11

Code

#expected rows/cases
nrow(eggpoul) * (ncol(eggpoul)-2)

[1] 1080

Code

# expected columns 
(11-9)+2

[1] 4

Challenge: Pivot the Chosen Data

After pivoting, the data has become taller. Pivoting has also ensured that all the variables of poultry types have been kept in a single column and the values corresponding to them are easy to access.

Code

eggpoul<-pivot_longer(eggpoul, 3:11, names_to = "Type of Poultry", values_to = "Amount")
dim(eggpoul)

[1] 1080    4

Code

View(eggpoul)