read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
tidy data (as needed, including sanity checks)
identify variables that need to be mutated
mutate variables and sanity check all mutations
Task 1) Read in, Tidy, and Describe the Data
I chose to read in the organic egg file, organiceggpoultry.xls. I use the same process as in my Challenge 3 submission. I left out the details in this post, and show only the first 18 of 1,080 lines of the final data frame.
I also use mutate and case_match to convert price strings to 0. I could also remove these rows but I think it’s better to leave them and document that 0 means there were too few cases to make an observation.
Code
listEggDFs <-map(set_names(excel_sheets("../posts/_data/organiceggpoultry.xls")),read_xls, path ="../posts/_data/organiceggpoultry.xls") # read in all sheets as dataframes in a listdf_eggPoultryData <-map(names(listEggDFs), ~assign(.x, listEggDFs[[.x]], envir = .GlobalEnv))[[1]] # assign the first df in the list to a separate dfproductname <-c(word(df_eggPoultryData[2,1], -1),word(df_eggPoultryData[2,7], -1)) # Store last word from header values in list variables for use after pivot.colnames(df_eggPoultryData) <- df_eggPoultryData[4, ] # Assign values from the 4th populated row to column namescolnames(df_eggPoultryData)[1] <-"month"# source data has no name in date columndf_eggPoultryData <- df_eggPoultryData %>%slice(-(1:4)) %>%# Slice off the first four rows, which do not have table data.select(!`NA`) %>%# Select all columns but the empty one in the middleseparate_wider_delim(month, delim =" ", names =c("month", "year"), too_few ="align_start", too_many ="merge") %>%mutate(month =replace(month, month =='Jan', 'January')) %>%mutate(year =replace(year, year =='/1', NA)) %>%fill(year) %>%# tidy the month column and separate year into its own column. pivot_longer(cols =-c(month,year), names_to ="item", values_to ="price") %>%# to pivot the egg and chicken product column headings into row values under a new `item` column, with their values listed in a `price` columnmutate(item =str_remove_all(item, '\\n')) %>%# tidy up item values and make labels consistentmutate(item =str_remove_all(item, '1/2 Dozen')) %>%mutate(item =str_replace_all(item, 'Doz\\.','Dozen')) %>%mutate(product =case_when( # assign the egg and chicken product variables that I recorded earlier according to each egg and chicken itemstr_detect(item, regex("^[EL]")) ~ productname[[1]],str_detect(item, regex("^[BTW]")) ~ productname[[2]] )) %>%mutate(price =case_match(price, "too few"~NA, .default =as.double(price))) %>%# make column all double and NAprint(n=18) # print a sample of the table
# A tibble: 1,080 × 5
month year item price product
<chr> <chr> <chr> <dbl> <chr>
1 January 2004 Extra Large Dozen 230 Eggs
2 January 2004 Extra Large 1/2 Dozen 132 Eggs
3 January 2004 Large Dozen 230 Eggs
4 January 2004 Large 1/2 Dozen 126 Eggs
5 January 2004 Whole 198. Chicken
6 January 2004 B/S Breast 646. Chicken
7 January 2004 Bone-in Breast NA Chicken
8 January 2004 Whole Legs 194. Chicken
9 January 2004 Thighs NA Chicken
10 February 2004 Extra Large Dozen 230 Eggs
11 February 2004 Extra Large 1/2 Dozen 134. Eggs
12 February 2004 Large Dozen 226. Eggs
13 February 2004 Large 1/2 Dozen 128. Eggs
14 February 2004 Whole 198. Chicken
15 February 2004 B/S Breast 642. Chicken
16 February 2004 Bone-in Breast NA Chicken
17 February 2004 Whole Legs 194. Chicken
18 February 2004 Thighs 203 Chicken
# … with 1,062 more rows
Briefly describe the data
The data describes the price of several types of USDA certified organic egg and chicken products from the years 2004 to 2013. The price was paid by ‘first receivers’, described in a footnote as “those entities that purchases the processed product from the poultry or egg company, such as a retailer, distributor, or manufacturer.” The cited data sources are the U.S. Department of Agriculture, Agricultural Marketing Service (AMS) Market News, Organic Poultry and Eggs (Weekly reports).
The data includes 5 variables (date, 4 egg carton items and 5 chicken items, and an empty column) with prices for 3 items for each month of 10 years in 1076 rows.
Task 2) Identify variables that need to be mutated
In the Tidying code block above, I pivot the source data columns to a single item column. I then use several mutate functions to make the data more consistent and apply labels from other metadata taken from the original header.
I also use mutate to create a month and year column, but these variables are still character data types.
The lubridate package wants to use dates or datetimes. It can do calculations with periods such as months, but the output is still dates or datetimes rather than months. I could pick the 1st of each month to work around this limitation, but there’s another option …
The zoo package yearmon class can return a year and month combo to represent dates on in monthly units. I like that! I decided to use zoo instead of lubridate for this specific case.
Code
df_eggPoultryData$monthYear <-as.yearmon(paste(df_eggPoultryData$year, df_eggPoultryData$month), "%Y %B") # turn combine month and year strings into a yearmon data type columndf_eggPoultryData <-subset(df_eggPoultryData, select =-c(month, year)) %>%relocate(monthYear) %>%# we no longer need the month and year string columnsprint(n=18)
# A tibble: 1,080 × 4
monthYear item price product
<yearmon> <chr> <dbl> <chr>
1 Jan 2004 Extra Large Dozen 230 Eggs
2 Jan 2004 Extra Large 1/2 Dozen 132 Eggs
3 Jan 2004 Large Dozen 230 Eggs
4 Jan 2004 Large 1/2 Dozen 126 Eggs
5 Jan 2004 Whole 198. Chicken
6 Jan 2004 B/S Breast 646. Chicken
7 Jan 2004 Bone-in Breast NA Chicken
8 Jan 2004 Whole Legs 194. Chicken
9 Jan 2004 Thighs NA Chicken
10 Feb 2004 Extra Large Dozen 230 Eggs
11 Feb 2004 Extra Large 1/2 Dozen 134. Eggs
12 Feb 2004 Large Dozen 226. Eggs
13 Feb 2004 Large 1/2 Dozen 128. Eggs
14 Feb 2004 Whole 198. Chicken
15 Feb 2004 B/S Breast 642. Chicken
16 Feb 2004 Bone-in Breast NA Chicken
17 Feb 2004 Whole Legs 194. Chicken
18 Feb 2004 Thighs 203 Chicken
# … with 1,062 more rows
Now I can summarize price information by item. There are 9 items, so this table has only 9 rows. (I also updated the find_mode function from the version I used in Challenge 2 – it now omits NA values, which makes the calculation of mode consistent with how I’m using other summary calculations.)
Code
#define function to calculate modefind_mode <-function(x) { u <-unique(x[!is.na(x)]) # unique list as an index, without NA tab <-tabulate(match(x[!is.na(x)], u)) # count how many times each index member occurs u[tab ==max(tab)] # the max occurrence is the modemean(u) # return mean in case the data is multimodal}df_eggPoultryData %>%group_by(item) %>%summarise(meanPrice =mean(price, na.rm =TRUE), modePrice =find_mode(price), minPrice =fivenum(price, na.rm =TRUE)[1], lowHingePrice =fivenum(price, na.rm =TRUE)[2], medianPrice =median(price, na.rm =TRUE), upHungePrice =fivenum(price, na.rm =TRUE)[4], maxPrice =fivenum(price, na.rm =TRUE)[5], count =n() )
---title: "Challenge 4"author: "Tim Shores"description: "More data wrangling: pivoting"date: "03/18/2023"format: html: toc: true code-fold: true code-copy: true code-tools: true link-external-icon: true link-external-newwindow: true link-external-filter: '^(?:http:|https:)\/\/www\.quarto\.org\/custom'categories: - challenge_4 - eggs---```{r}#| label: setup#| warning: false#| message: false#| label: setup#| warning: false#| message: falsemy_packages <-c("tidyverse", "readxl", "knitr", "zoo") # create vector of packagesinvisible(lapply(my_packages, require, character.only =TRUE)) # load multiple packagesknitr::opts_chunk$set(echo =TRUE, warning=FALSE, message=FALSE)```## Challenge OverviewChallenge 4 includes four tasks:1) read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)2) tidy data (as needed, including sanity checks)3) identify variables that need to be mutated4) mutate variables and sanity check all mutations## Task 1) Read in, Tidy, and Describe the DataI chose to read in the organic egg file, organiceggpoultry.xls. I use the same process as in [my Challenge 3 submission](https://dacss.github.io/601_Spring_2023/posts/TimShores_challenge3.html){.external target="_blank"}. I left out the details in this post, and show only the first 18 of 1,080 lines of the final data frame.I also use `mutate` and `case_match` to convert `price` strings to `0`. I could also remove these rows but I think it's better to leave them and document that `0` means there were too few cases to make an observation.```{r}listEggDFs <-map(set_names(excel_sheets("../posts/_data/organiceggpoultry.xls")),read_xls, path ="../posts/_data/organiceggpoultry.xls") # read in all sheets as dataframes in a listdf_eggPoultryData <-map(names(listEggDFs), ~assign(.x, listEggDFs[[.x]], envir = .GlobalEnv))[[1]] # assign the first df in the list to a separate dfproductname <-c(word(df_eggPoultryData[2,1], -1),word(df_eggPoultryData[2,7], -1)) # Store last word from header values in list variables for use after pivot.colnames(df_eggPoultryData) <- df_eggPoultryData[4, ] # Assign values from the 4th populated row to column namescolnames(df_eggPoultryData)[1] <-"month"# source data has no name in date columndf_eggPoultryData <- df_eggPoultryData %>%slice(-(1:4)) %>%# Slice off the first four rows, which do not have table data.select(!`NA`) %>%# Select all columns but the empty one in the middleseparate_wider_delim(month, delim =" ", names =c("month", "year"), too_few ="align_start", too_many ="merge") %>%mutate(month =replace(month, month =='Jan', 'January')) %>%mutate(year =replace(year, year =='/1', NA)) %>%fill(year) %>%# tidy the month column and separate year into its own column. pivot_longer(cols =-c(month,year), names_to ="item", values_to ="price") %>%# to pivot the egg and chicken product column headings into row values under a new `item` column, with their values listed in a `price` columnmutate(item =str_remove_all(item, '\\n')) %>%# tidy up item values and make labels consistentmutate(item =str_remove_all(item, '1/2 Dozen')) %>%mutate(item =str_replace_all(item, 'Doz\\.','Dozen')) %>%mutate(product =case_when( # assign the egg and chicken product variables that I recorded earlier according to each egg and chicken itemstr_detect(item, regex("^[EL]")) ~ productname[[1]],str_detect(item, regex("^[BTW]")) ~ productname[[2]] )) %>%mutate(price =case_match(price, "too few"~NA, .default =as.double(price))) %>%# make column all double and NAprint(n=18) # print a sample of the table```### Briefly describe the dataThe data describes the price of several types of USDA certified organic egg and chicken products from the years 2004 to 2013. The price was paid by 'first receivers', described in a footnote as "those entities that purchases the processed product from the poultry or egg company, such as a retailer, distributor, or manufacturer." The cited data sources are the U.S. Department of Agriculture, Agricultural Marketing Service (AMS) Market News, Organic Poultry and Eggs (Weekly reports).The data includes `r ncol(df_eggPoultryData)` variables (date, 4 egg carton items and 5 chicken items, and an empty column) with prices for `r ncol(df_eggPoultryData) - 2` items for each month of 10 years in `r nrow(df_eggPoultryData) - 4` rows.## Task 2) Identify variables that need to be mutatedIn the Tidying code block above, I pivot the source data columns to a single `item` column. I then use several mutate functions to make the data more consistent and apply labels from other metadata taken from the original header.I also use mutate to create a month and year column, but these variables are still character data types. The `lubridate` package wants to use dates or datetimes. It can do calculations with periods such as months, but the output is still dates or datetimes rather than months. I could pick the 1st of each month to work around this limitation, but there's another option ...The `zoo` package `yearmon` class can return a year and month combo to represent dates on in monthly units. I like that! I decided to use `zoo` instead of `lubridate` for this specific case.```{r}df_eggPoultryData$monthYear <-as.yearmon(paste(df_eggPoultryData$year, df_eggPoultryData$month), "%Y %B") # turn combine month and year strings into a yearmon data type columndf_eggPoultryData <-subset(df_eggPoultryData, select =-c(month, year)) %>%relocate(monthYear) %>%# we no longer need the month and year string columnsprint(n=18)```Now I can summarize price information by item. There are 9 items, so this table has only 9 rows. (I also updated the `find_mode` function from the version I used in Challenge 2 -- it now omits NA values, which makes the calculation of mode consistent with how I'm using other summary calculations.)```{r}#define function to calculate modefind_mode <-function(x) { u <-unique(x[!is.na(x)]) # unique list as an index, without NA tab <-tabulate(match(x[!is.na(x)], u)) # count how many times each index member occurs u[tab ==max(tab)] # the max occurrence is the modemean(u) # return mean in case the data is multimodal}df_eggPoultryData %>%group_by(item) %>%summarise(meanPrice =mean(price, na.rm =TRUE), modePrice =find_mode(price), minPrice =fivenum(price, na.rm =TRUE)[1], lowHingePrice =fivenum(price, na.rm =TRUE)[2], medianPrice =median(price, na.rm =TRUE), upHungePrice =fivenum(price, na.rm =TRUE)[4], maxPrice =fivenum(price, na.rm =TRUE)[5], count =n() )```Or by month and item (this time using `lubridate`!). I expect this table to show 108 rows = 9 items * 12 months.```{r}df_eggPoultryData %>%group_by(month =month(monthYear, label =TRUE), item) %>%summarise(meanPrice =mean(price, na.rm =TRUE), modePrice =find_mode(price), minPrice =fivenum(price, na.rm =TRUE)[1], lowHingePrice =fivenum(price, na.rm =TRUE)[2], medianPrice =median(price, na.rm =TRUE), upHungePrice =fivenum(price, na.rm =TRUE)[4], maxPrice =fivenum(price, na.rm =TRUE)[5], count =n() )```Or by product. One row each for eggs and chicken.```{r}df_eggPoultryData %>%group_by(product) %>%summarise(meanPrice =mean(price, na.rm =TRUE), modePrice =find_mode(price), minPrice =fivenum(price, na.rm =TRUE)[1], lowHingePrice =fivenum(price, na.rm =TRUE)[2], medianPrice =median(price, na.rm =TRUE), upHungePrice =fivenum(price, na.rm =TRUE)[4], maxPrice =fivenum(price, na.rm =TRUE)[5], count =n() )```Or by year and product. Ten years of eggs and chicken, in 20 rows.```{r}df_eggPoultryData %>%group_by(year =year(monthYear), product) %>%summarise(meanPrice =mean(price, na.rm =TRUE), modePrice =find_mode(price), minPrice =fivenum(price, na.rm =TRUE)[1], lowHingePrice =fivenum(price, na.rm =TRUE)[2], medianPrice =median(price, na.rm =TRUE), upHungePrice =fivenum(price, na.rm =TRUE)[4], maxPrice =fivenum(price, na.rm =TRUE)[5], count =n() )```