Challenge 5

challenge_5

Work Data

Introduction to Visualization

Author

Michaela Bowen

Published

October 19, 2022

library(tidyverse)
library(ggplot2)
library(readxl)
library(lubridate)
library(scales)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

In today’s Challenge I’ve attempted to:

read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
tidy data (as needed, including sanity checks)
mutate variables as needed (including sanity checks)
create at least two univariate visualizations
Create at least one bivariate visualization

Below I have ready in a two week transaction period of cannabis sales at the local dispensary.

On the Read in I have:

Renamed several variables to “delete” as they will not be used in our analysis and thus ignored:
Product Sku: Not necessary because it is replicated with the product name which carries more information
BioTrackId, BioTrack Response, RxNum: These are all categories unused by the business, and do not contain useful information
Remaining Qty: Does not contain any accurate information

Inventory Transactions

#Read in Excel Data
transactions_9_22_2022_10_20_2022_orig <- read_excel("Inventory Transactions 9_22_2022-10_20_2022.xlsx", 
    skip = 5,
    col_names = c("pos_id","product","delete","patient_name","transaction_date","qty_sold","daily_allottment_oz","weight_grams","cost","price","owner_name","owner_location","vendor","sold_by","receipt_no","delete","delete","delete","delete","delete"))%>%
  select(!contains("delete"))

The transactions data frame we are working with consists of the completed sales transaction at Resinate, Northampton Spanning from 9/22/2022 through 10/05/2022. There are 5,591 instances of 13 variables, meaning that nearly 5,600 transactions were completed during this time period. The variables describe the product type, category, date, patient name, receipt number, budtender, and other transaction information.

Column Mutations

Here I am mutating several variables:

date: I am separating out this date column into hour, minute, and second in order to pin point time of day in which customers are ordering certain products
category, category_names: I created these two variables from the 3 letter abbreviation at the beginning of the product name

#mutating date field and formatting as a date
transactions_9_22_2022_10_20_2022 <- transactions_9_22_2022_10_20_2022_orig%>%
  mutate(
    date = as.Date(transaction_date),
    hour = hour(transaction_date),
    minute = minute(transaction_date),
    second = second(transaction_date))%>%
  mutate(
    format_date = format(date, "%m/%d/%Y"),
    format_hour = paste(hour, minute, second, sep = ":")
  )%>%
#pulling the category abbreviation to determine category and create a category column
  mutate(
    category = substr(product,1,3)
  )%>%
  mutate(
    category_names = case_when(
      category == "FLO" | category == "Flo" ~ "Flower",
      category == "PRJ" ~ "Joint",
      category == "EDI" ~ "Edible",
      category == "MIP" ~ "Marijuana Infused Product",
      category == "CON" | category == "Con" ~ "Concentrate",
      category == "VAP" | category == "Vap" ~ "Vaporizer",
      category == "ACC" | category == "Pax" | category == "PAX" | category == "Hig" | category == "Bov" ~ "Accessories",
      category == "CLO" | category == "Res" ~ "Clothing",
      category == "HTC" ~ "HTCC Promotion",
      category == "SAM" ~ "Samples",
      category == "TOP" ~ "Topical",
      category == "REW" ~ "Rewards")
  )

Tidy Data (Sanity Checks)

#ensuring that the category, and category names columns contain no NA values and are accurately coded
unique(transactions_9_22_2022_10_20_2022$category_names)

 [1] "Edible"                    "Flower"                   
 [3] "Joint"                     "Accessories"              
 [5] "Vaporizer"                 "Concentrate"              
 [7] "Marijuana Infused Product" "Topical"                  
 [9] "HTCC Promotion"            "Clothing"                 
[11] "Samples"                   "Rewards"

unique(transactions_9_22_2022_10_20_2022$category)

 [1] "EDI" "FLO" "PRJ" "ACC" "VAP" "CON" "MIP" "TOP" "HTC" "Bov" "Res" "Pax"
[13] "CLO" "SAM" "Hig" "REW" "PAX" "Flo" "Con" "Vap"

Data Summary

print(summarytools::dfSummary(transactions_9_22_2022_10_20_2022,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

transactions_9_22_2022_10_20_2022

Dimensions: 11870 x 22
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

pos_id [numeric]

Mean (sd) : 184185292 (9231537)

min ≤ med ≤ max:

48660808 ≤ 184882944 ≤ 188477719

IQR (CV) : 3559642 (0.1)

11870 distinct values

0 (0.0%)

product [character]

1. PRJ Cappadonna Joint Capp

2. PRJ Resinate Joint Grandd

3. FLO Resinate Flower Phone

4. FLO Resinate Flower Jiggl

5. PRJ Resinate Joint Truthb

6. PRJ Resinate Joint Citral

7. FLO Resinate Flower Lava

8. PRJ Resinate Joint Ghost

9. PRJ Resinate Joint Lava C

10. PRJ Resinate Joint Lemon

[ 419 others ]

464	(	3.9%	)
352	(	3.0%	)
344	(	2.9%	)
266	(	2.2%	)
265	(	2.2%	)
252	(	2.1%	)
231	(	1.9%	)
230	(	1.9%	)
215	(	1.8%	)
215	(	1.8%	)
9036	(	76.1%	)

0 (0.0%)

patient_name [character]

1. jane babcook-fererra

2. Marc Girard

3. Joshua Hiller

4. Joseph Andrews

5. Kristina Tarkovskaya

6. Michaela REC Bowen

7. Resinate Inc - Worcester

8. Marilyn Jean Mikna

9. Brian Avery

10. Frank Parks

[ 2554 others ]

131	(	1.1%	)
104	(	0.9%	)
87	(	0.7%	)
74	(	0.6%	)
74	(	0.6%	)
73	(	0.6%	)
68	(	0.6%	)
65	(	0.5%	)
58	(	0.5%	)
58	(	0.5%	)
11078	(	93.3%	)

0 (0.0%)

transaction_date [POSIXct, POSIXt]

min : 2022-09-22 09:02:33

med : 2022-10-07 09:52:48

max : 2022-10-20 20:00:05

range : 28d 10H 57M 31.6S

3567 distinct values

0 (0.0%)

qty_sold [numeric]

1 distinct value

11870

(

100.0%

)

0 (0.0%)

daily_allottment_oz [numeric]

Mean (sd) : 3.1 (12.6)

min ≤ med ≤ max:

0 ≤ 2.5 ≤ 451.5

IQR (CV) : 2.5 (4.1)

73 distinct values

0 (0.0%)

weight_grams [numeric]

Mean (sd) : 2 (11.7)

min ≤ med ≤ max:

0 ≤ 1 ≤ 451.5

IQR (CV) : 2.4 (6)

62 distinct values

0 (0.0%)

cost [numeric]

Mean (sd) : 7.8 (14)

min ≤ med ≤ max:

0 ≤ 5 ≤ 232

IQR (CV) : 3.3 (1.8)

296 distinct values

0 (0.0%)

price [numeric]

Mean (sd) : 18.6 (20.5)

min ≤ med ≤ max:

0 ≤ 12 ≤ 290

IQR (CV) : 17 (1.1)

73 distinct values

0 (0.0%)

owner_name [character]

1. Resinate Inc

11870

(

100.0%

)

0 (0.0%)

owner_location [character]

1. Resinate Inc - Northampto

11870

(

100.0%

)

0 (0.0%)

vendor [character]

1. Resinate, Inc.

2. Bask

3. Green Gold Group, Inc.

4. Cultivauna, LLC

5. Cultivate Leicester, Inc

6. Luv Buds

7. ARL Healthcare

8. Nova Farms, LLC

9. T. Bear Inc.

10. Apothca, Inc.

[ 34 others ]

8600	(	72.5%	)
435	(	3.7%	)
364	(	3.1%	)
267	(	2.2%	)
220	(	1.9%	)
193	(	1.6%	)
192	(	1.6%	)
150	(	1.3%	)
149	(	1.3%	)
121	(	1.0%	)
1179	(	9.9%	)

0 (0.0%)

sold_by [character]

1. David Dumas

2. Eric Rueli

3. Gerardo Ramos

4. Haley Thomas

5. Kristina Tarkovskaya

6. Marc Girard

7. Michaela Bowen

8. Tiffany Jones

1179	(	9.9%	)
2499	(	21.1%	)
1421	(	12.0%	)
1450	(	12.2%	)
1896	(	16.0%	)
1796	(	15.1%	)
68	(	0.6%	)
1561	(	13.2%	)

0 (0.0%)

receipt_no [character]

1. 60856974
2. 61785156
3. 61425860
4. 61992747
5. 16561565
6. 61509826
7. 60754173
8. 60157153
9. 61138055
10. 61188329
[ 3560 others ]

75	(	0.6%	)
59	(	0.5%	)
56	(	0.5%	)
56	(	0.5%	)
52	(	0.4%	)
40	(	0.3%	)
31	(	0.3%	)
30	(	0.3%	)
28	(	0.2%	)
28	(	0.2%	)
11415	(	96.2%	)

0 (0.0%)

date [Date]

min : 2022-09-22

med : 2022-10-07

max : 2022-10-20

range : 28d

29 distinct values

0 (0.0%)

hour [integer]

Mean (sd) : 14.4 (3)

min ≤ med ≤ max:

9 ≤ 15 ≤ 21

IQR (CV) : 5 (0.2)

13 distinct values

0 (0.0%)

minute [integer]

Mean (sd) : 29.3 (17.6)

min ≤ med ≤ max:

0 ≤ 29 ≤ 59

IQR (CV) : 31 (0.6)

60 distinct values

0 (0.0%)

second [numeric]

Mean (sd) : 30.1 (17.3)

min ≤ med ≤ max:

0 ≤ 30.6 ≤ 60

IQR (CV) : 30.1 (0.6)

3246 distinct values

0 (0.0%)

format_date [character]

1. 10/14/2022

2. 10/07/2022

3. 09/24/2022

4. 10/08/2022

5. 09/23/2022

6. 10/10/2022

7. 09/30/2022

8. 10/03/2022

9. 10/09/2022

10. 10/15/2022

[ 19 others ]

618	(	5.2%	)
613	(	5.2%	)
584	(	4.9%	)
573	(	4.8%	)
532	(	4.5%	)
518	(	4.4%	)
510	(	4.3%	)
478	(	4.0%	)
446	(	3.8%	)
437	(	3.7%	)
6561	(	55.3%	)

0 (0.0%)

format_hour [character]

1. 15:48:44.6600000858307

2. 20:18:19.6570000648499

3. 13:46:56.9630000591278

4. 15:9:34.1700000762939

5. 10:56:38.0999999046326

6. 17:15:47.7630000114441

7. 10:52:29.6900000572205

8. 19:17:53.0569999217987

9. 11:10:10.7999999523163

10. 11:43:47.2799999713898

[ 3560 others ]

75	(	0.6%	)
59	(	0.5%	)
56	(	0.5%	)
56	(	0.5%	)
52	(	0.4%	)
40	(	0.3%	)
31	(	0.3%	)
30	(	0.3%	)
28	(	0.2%	)
28	(	0.2%	)
11415	(	96.2%	)

0 (0.0%)

category [character]

1. PRJ

2. EDI

3. FLO

4. VAP

5. CON

6. ACC

7. MIP

8. HTC

9. Bov

10. Res

[ 10 others ]

4873	(	41.1%	)
2716	(	22.9%	)
2596	(	21.9%	)
715	(	6.0%	)
446	(	3.8%	)
254	(	2.1%	)
94	(	0.8%	)
67	(	0.6%	)
39	(	0.3%	)
15	(	0.1%	)
55	(	0.5%	)

0 (0.0%)

category_names [character]

1. Joint

2. Edible

3. Flower

4. Vaporizer

5. Concentrate

6. Accessories

7. Marijuana Infused Product

8. HTCC Promotion

9. Clothing

10. Topical

[ 2 others ]

4873	(	41.1%	)
2716	(	22.9%	)
2603	(	21.9%	)
716	(	6.0%	)
447	(	3.8%	)
315	(	2.7%	)
94	(	0.8%	)
67	(	0.6%	)
19	(	0.2%	)
11	(	0.1%	)
9	(	0.1%	)

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-11-16

In this farily simple univariate visualization, we can see the number of sales by category. It is clear that Joints are the most popular category followed by Edibles and Flower. This is to be expected in a dispensary as Joints and Edibles are some of the most convenient ways to consume cannabis, followed by Flower that can be ground up and used however the customer pleases.

#univariate visualization of the categories
ggplot(data = transactions_9_22_2022_10_20_2022, mapping = aes(category_names, fill = category_names), size = 1) +
  geom_bar() +
  ylim(0, NA) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  ylab("Number") +
  xlab("Categories")+
  ggtitle("Sale Number by Category")

In our first Bivariate visualization we can see the volume of sales by category over time with a violin graph. The volume graph gives a visual representation of popularity of category over time. This can tell us popularity trends, or interest trends by category as well.

ggplot(data = transactions_9_22_2022_10_20_2022, mapping = aes(x = date, y = category_names), size = 1) +
  geom_violin() +
  theme(legend.title=element_blank(),axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  ylab("Cannabis Categories") +
  scale_x_date(date_labels="%b %d", breaks = unique(transactions_9_22_2022_10_20_2022$date)) +
  ggtitle("Categorical Volume Sales by date")

In this second Bivariate Visualization I chose a scatterplot to compare the Average price by product to the quantity sold. I was able to gain the following inside:

Most products are priced between $0 and $90
Within a two week period we are selling approximately 50 units of our most popular products within that price range
We can also see that Joints are the most popular category, with nearly 300 units sold within the given time period.

#creating a flower transactions dataframe
sold_by_price <- transactions_9_22_2022_10_20_2022%>%
  group_by(product)%>%
  mutate(avg_price = mean(price),
         products_sold = sum(qty_sold))%>%
  select(-qty_sold)%>%
  distinct()%>%
  ggplot(aes(x = avg_price, y = products_sold, color = category_names)) +
  ylim(0,NA) +
  scale_x_continuous(breaks = c(0,30,60,90,120,150,180,210,240,270,300)) +
  geom_point(size = 1)+
  labs(title =  "Quantity of Products sold by Average Price")+
  xlab("Average Price") +
  ylab("Quantity Sold") +
  theme(legend.title=element_blank(),axis.text.x = element_text(angle = 45, vjust = 0.5) )

sold_by_price