Reading in final project data - avocado.csv
Author

Shuqi Hong

Published

June 9, 2023

Introduction

The table below represents weekly 2015 - 2018 retail scan data for National retail volume (units) and price. Retail scan data comes directly from retailers’ cash registers based on actual retail sales of Hass avocados.

Some relevant columns in the dataset:

Date - The date of the observation AveragePrice - the average price of a single avocado type - conventional or organic year - the year Region - the city or region of the observation Total Volume - Total number of avocados sold 4046 - Total number of avocados with PLU 4046 sold 4225 - Total number of avocados with PLU 4225 sold 4770 - Total number of avocados with PLU 4770 sold

I read in file first and delete the first column because it represents row number.

Code
library(tidyverse)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

avocado <- read_csv("_data/avocado.csv")

avocado <- avocado %>% select(-1)

avocado
# A tibble: 18,249 × 13
   Date       AveragePrice `Total Volume` `4046`  `4225` `4770` `Total Bags`
   <date>            <dbl>          <dbl>  <dbl>   <dbl>  <dbl>        <dbl>
 1 2015-12-27         1.33         64237.  1037.  54455.   48.2        8697.
 2 2015-12-20         1.35         54877.   674.  44639.   58.3        9506.
 3 2015-12-13         0.93        118220.   795. 109150.  130.         8145.
 4 2015-12-06         1.08         78992.  1132   71976.   72.6        5811.
 5 2015-11-29         1.28         51040.   941.  43838.   75.8        6184.
 6 2015-11-22         1.26         55980.  1184.  48068.   43.6        6684.
 7 2015-11-15         0.99         83454.  1369.  73673.   93.3        8319.
 8 2015-11-08         0.98        109428.   704. 101815.   80          6829.
 9 2015-11-01         1.02         99811.  1022.  87316.   85.3       11388.
10 2015-10-25         1.07         74339.   842.  64757.  113          8626.
# ℹ 18,239 more rows
# ℹ 6 more variables: `Small Bags` <dbl>, `Large Bags` <dbl>,
#   `XLarge Bags` <dbl>, type <chr>, year <dbl>, region <chr>

I separate “Date” into year, month and day column which shows clearer. Then I sorted them chronologically and mutate the variables. the column of 4046, 4225, 4770 are the size of Hass avocado, respectively small, large, extra large. So I also renamed them. I think I won’t analyze this data in each week, so I also delete it.

Code
clean_avocado <- avocado %>% select(-`year`) %>% 
  separate(Date, into= c("year","month","day"), sep="-") %>% arrange(year, month, day) %>%
  fill(year,month) %>% 
  mutate( month = case_when(
  month == "01" ~ "January", 
  month == "02" ~ "Feburary",
  month == "03" ~ "March",
  month == "04" ~ "April",
  month == "05" ~ "May",
  month == "06" ~ "June",
  month == "07" ~ "July",
  month == "08" ~ "August",
  month == "09" ~ "September",
  month == "10" ~ "October",
  month == "11" ~ "November",
  month == "12" ~ "December"
)) %>%
  mutate(
    month = as_factor(month),
    day = as_factor(day),
    type = as_factor(type)
  ) %>% 
  rename(c( small = `4046`, `large` = `4225`, `Extra_large` = `4770`)) %>% select(-`day`)
                  
clean_avocado   
# A tibble: 18,249 × 13
   year  month   AveragePrice `Total Volume`    small    large Extra_large
   <chr> <fct>          <dbl>          <dbl>    <dbl>    <dbl>       <dbl>
 1 2015  January         1.22         40873.    2820.   28287.        49.9
 2 2015  January         1           435021.  364302.   23821.        82.2
 3 2015  January         1.08        788025.   53987.  552906.     39995. 
 4 2015  January         1.01         80034.   44562.   24964.      2752. 
 5 2015  January         1.02        491738     7194.  396752.       129. 
 6 2015  January         1.4         116253.    3268.   55693.       110. 
 7 2015  January         0.93       5777335. 2843648. 2267755.    137480. 
 8 2015  January         1.19        166006.   29419.   47221.     38569. 
 9 2015  January         1.11        783068.   30270.  550752.    124506. 
10 2015  January         0.88        228570.    3274.  168765.      1447. 
# ℹ 18,239 more rows
# ℹ 6 more variables: `Total Bags` <dbl>, `Small Bags` <dbl>,
#   `Large Bags` <dbl>, `XLarge Bags` <dbl>, type <fct>, region <chr>
Code
glimpse(clean_avocado)
Rows: 18,249
Columns: 13
$ year           <chr> "2015", "2015", "2015", "2015", "2015", "2015", "2015",…
$ month          <fct> January, January, January, January, January, January, J…
$ AveragePrice   <dbl> 1.22, 1.00, 1.08, 1.01, 1.02, 1.40, 0.93, 1.19, 1.11, 0…
$ `Total Volume` <dbl> 40873.28, 435021.49, 788025.06, 80034.32, 491738.00, 11…
$ small          <dbl> 2819.50, 364302.39, 53987.31, 44562.12, 7193.87, 3267.9…
$ large          <dbl> 28287.42, 23821.16, 552906.04, 24964.23, 396752.18, 556…
$ Extra_large    <dbl> 49.90, 82.15, 39995.03, 2752.35, 128.82, 109.55, 137479…
$ `Total Bags`   <dbl> 9716.46, 46815.79, 141136.68, 7755.62, 87663.13, 57182.…
$ `Small Bags`   <dbl> 9186.93, 16707.15, 137146.07, 6064.30, 87406.84, 57182.…
$ `Large Bags`   <dbl> 529.53, 30108.64, 3990.61, 1691.32, 256.29, 0.00, 47882…
$ `XLarge Bags`  <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 3375.80, 0.00, 0.00…
$ type           <fct> conventional, conventional, conventional, conventional,…
$ region         <chr> "Albany", "Atlanta", "BaltimoreWashington", "Boise", "B…
Code
unique(clean_avocado$type)
[1] conventional organic     
Levels: conventional organic
Code
clean_avocado <- clean_avocado %>% group_by(year, month, region, type) %>%
  mutate(mean_price_permonth = mean(AveragePrice)) %>% 
  ungroup() %>% 
  group_by(year, region, type) %>%
  mutate(mean_price_peryear = mean(AveragePrice))%>% 
  ungroup() %>% select(-AveragePrice)

clean_avocado #%>%   summarise(mean_AveragePrice = mean(AveragePrice))
# A tibble: 18,249 × 14
   year  month   `Total Volume`    small    large Extra_large `Total Bags`
   <chr> <fct>            <dbl>    <dbl>    <dbl>       <dbl>        <dbl>
 1 2015  January         40873.    2820.   28287.        49.9        9716.
 2 2015  January        435021.  364302.   23821.        82.2       46816.
 3 2015  January        788025.   53987.  552906.     39995.       141137.
 4 2015  January         80034.   44562.   24964.      2752.         7756.
 5 2015  January        491738     7194.  396752.       129.        87663.
 6 2015  January        116253.    3268.   55693.       110.        57183.
 7 2015  January       5777335. 2843648. 2267755.    137480.       528452.
 8 2015  January        166006.   29419.   47221.     38569.        50798.
 9 2015  January        783068.   30270.  550752.    124506.        77539.
10 2015  January        228570.    3274.  168765.      1447.        55083.
# ℹ 18,239 more rows
# ℹ 7 more variables: `Small Bags` <dbl>, `Large Bags` <dbl>,
#   `XLarge Bags` <dbl>, type <fct>, region <chr>, mean_price_permonth <dbl>,
#   mean_price_peryear <dbl>

The original average price is calculated per week which is useless for me. So I caculate the average price of each year and month.

Code
summarise_avocado <- clean_avocado %>% group_by(year, region, type) %>%
  summarise(mean_price_peryear = mean(mean_price_peryear), mean_total_volume = mean(`Total Volume`), mean_total_bags = mean(`Total Bags`) )

summarise_avocado
# A tibble: 432 × 6
# Groups:   year, region [216]
   year  region       type  mean_price_peryear mean_total_volume mean_total_bags
   <chr> <chr>        <fct>              <dbl>             <dbl>           <dbl>
 1 2015  Albany       conv…               1.17            76209.          12738.
 2 2015  Albany       orga…               1.91             1289.           1102.
 3 2015  Atlanta      conv…               1.05           440346.          56460.
 4 2015  Atlanta      orga…               1.71             6417.           1174.
 5 2015  BaltimoreWa… conv…               1.17           768141.         179069.
 6 2015  BaltimoreWa… orga…               1.57            13504.           3398.
 7 2015  Boise        conv…               1.05            70886.           9472.
 8 2015  Boise        orga…               1.69             1890.            490.
 9 2015  Boston       conv…               1.14           523781.         107586.
10 2015  Boston       orga…               1.80             4200.           3158.
# ℹ 422 more rows

Potential questions

This table is about the general things which are the average value of price, total volume and total bags in each year. I will plot some graphs about this tabble in the future.

From this table, I will focus on below information in the future.

  1. The change in annual purchase volume of avocados in the whole country.
  2. people prefer organic avocados or conventional one. (Analyze it from regions, months and years.)
  3. The relationship between the purchase of the size of avocados and purchase of the type of it.
  4. The relationship between the purchase of the total volume and average price.
  5. Annual change of purchase of the total volume in each region.
  6. Each region purchased different size of bags ratios each year.

This is what I think of now, and I will continue to update it in the future.