DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 5

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Challenge Overview

Challenge 5

challenge_5
Work Data
Introduction to Visualization
Author

Michaela Bowen

Published

October 19, 2022

library(tidyverse)
library(ggplot2)
library(readxl)
library(lubridate)
library(scales)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

In today’s Challenge I’ve attempted to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. mutate variables as needed (including sanity checks)
  4. create at least two univariate visualizations
  5. Create at least one bivariate visualization
  • Read in and tidy data
  • Briefly describe the data
  • Univariate Visualizations
  • Bivariate Visualization(s)

Below I have ready in a two week transaction period of cannabis sales at the local dispensary.

On the Read in I have:

  • Renamed several variables to “delete” as they will not be used in our analysis and thus ignored:
  • Product Sku: Not necessary because it is replicated with the product name which carries more information
  • BioTrackId, BioTrack Response, RxNum: These are all categories unused by the business, and do not contain useful information
  • Remaining Qty: Does not contain any accurate information

Inventory Transactions

#Read in Excel Data
transactions_9_22_2022_10_20_2022_orig <- read_excel("Inventory Transactions 9_22_2022-10_20_2022.xlsx", 
    skip = 5,
    col_names = c("pos_id","product","delete","patient_name","transaction_date","qty_sold","daily_allottment_oz","weight_grams","cost","price","owner_name","owner_location","vendor","sold_by","receipt_no","delete","delete","delete","delete","delete"))%>%
  select(!contains("delete"))

The transactions data frame we are working with consists of the completed sales transaction at Resinate, Northampton Spanning from 9/22/2022 through 10/05/2022. There are 5,591 instances of 13 variables, meaning that nearly 5,600 transactions were completed during this time period. The variables describe the product type, category, date, patient name, receipt number, budtender, and other transaction information.

Column Mutations

Here I am mutating several variables:

  • date: I am separating out this date column into hour, minute, and second in order to pin point time of day in which customers are ordering certain products
  • category, category_names: I created these two variables from the 3 letter abbreviation at the beginning of the product name
#mutating date field and formatting as a date
transactions_9_22_2022_10_20_2022 <- transactions_9_22_2022_10_20_2022_orig%>%
  mutate(
    date = as.Date(transaction_date),
    hour = hour(transaction_date),
    minute = minute(transaction_date),
    second = second(transaction_date))%>%
  mutate(
    format_date = format(date, "%m/%d/%Y"),
    format_hour = paste(hour, minute, second, sep = ":")
  )%>%
#pulling the category abbreviation to determine category and create a category column
  mutate(
    category = substr(product,1,3)
  )%>%
  mutate(
    category_names = case_when(
      category == "FLO" | category == "Flo" ~ "Flower",
      category == "PRJ" ~ "Joint",
      category == "EDI" ~ "Edible",
      category == "MIP" ~ "Marijuana Infused Product",
      category == "CON" | category == "Con" ~ "Concentrate",
      category == "VAP" | category == "Vap" ~ "Vaporizer",
      category == "ACC" | category == "Pax" | category == "PAX" | category == "Hig" | category == "Bov" ~ "Accessories",
      category == "CLO" | category == "Res" ~ "Clothing",
      category == "HTC" ~ "HTCC Promotion",
      category == "SAM" ~ "Samples",
      category == "TOP" ~ "Topical",
      category == "REW" ~ "Rewards")
  )

Tidy Data (Sanity Checks)

#ensuring that the category, and category names columns contain no NA values and are accurately coded
unique(transactions_9_22_2022_10_20_2022$category_names)
 [1] "Edible"                    "Flower"                   
 [3] "Joint"                     "Accessories"              
 [5] "Vaporizer"                 "Concentrate"              
 [7] "Marijuana Infused Product" "Topical"                  
 [9] "HTCC Promotion"            "Clothing"                 
[11] "Samples"                   "Rewards"                  
unique(transactions_9_22_2022_10_20_2022$category)
 [1] "EDI" "FLO" "PRJ" "ACC" "VAP" "CON" "MIP" "TOP" "HTC" "Bov" "Res" "Pax"
[13] "CLO" "SAM" "Hig" "REW" "PAX" "Flo" "Con" "Vap"

Data Summary

print(summarytools::dfSummary(transactions_9_22_2022_10_20_2022,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

transactions_9_22_2022_10_20_2022

Dimensions: 11870 x 22
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
pos_id [numeric]
Mean (sd) : 184185292 (9231537)
min ≤ med ≤ max:
48660808 ≤ 184882944 ≤ 188477719
IQR (CV) : 3559642 (0.1)
11870 distinct values 0 (0.0%)
product [character]
1. PRJ Cappadonna Joint Capp
2. PRJ Resinate Joint Grandd
3. FLO Resinate Flower Phone
4. FLO Resinate Flower Jiggl
5. PRJ Resinate Joint Truthb
6. PRJ Resinate Joint Citral
7. FLO Resinate Flower Lava
8. PRJ Resinate Joint Ghost
9. PRJ Resinate Joint Lava C
10. PRJ Resinate Joint Lemon
[ 419 others ]
464(3.9%)
352(3.0%)
344(2.9%)
266(2.2%)
265(2.2%)
252(2.1%)
231(1.9%)
230(1.9%)
215(1.8%)
215(1.8%)
9036(76.1%)
0 (0.0%)
patient_name [character]
1. jane babcook-fererra
2. Marc Girard
3. Joshua Hiller
4. Joseph Andrews
5. Kristina Tarkovskaya
6. Michaela REC Bowen
7. Resinate Inc - Worcester
8. Marilyn Jean Mikna
9. Brian Avery
10. Frank Parks
[ 2554 others ]
131(1.1%)
104(0.9%)
87(0.7%)
74(0.6%)
74(0.6%)
73(0.6%)
68(0.6%)
65(0.5%)
58(0.5%)
58(0.5%)
11078(93.3%)
0 (0.0%)
transaction_date [POSIXct, POSIXt]
min : 2022-09-22 09:02:33
med : 2022-10-07 09:52:48
max : 2022-10-20 20:00:05
range : 28d 10H 57M 31.6S
3567 distinct values 0 (0.0%)
qty_sold [numeric] 1 distinct value
1:11870(100.0%)
0 (0.0%)
daily_allottment_oz [numeric]
Mean (sd) : 3.1 (12.6)
min ≤ med ≤ max:
0 ≤ 2.5 ≤ 451.5
IQR (CV) : 2.5 (4.1)
73 distinct values 0 (0.0%)
weight_grams [numeric]
Mean (sd) : 2 (11.7)
min ≤ med ≤ max:
0 ≤ 1 ≤ 451.5
IQR (CV) : 2.4 (6)
62 distinct values 0 (0.0%)
cost [numeric]
Mean (sd) : 7.8 (14)
min ≤ med ≤ max:
0 ≤ 5 ≤ 232
IQR (CV) : 3.3 (1.8)
296 distinct values 0 (0.0%)
price [numeric]
Mean (sd) : 18.6 (20.5)
min ≤ med ≤ max:
0 ≤ 12 ≤ 290
IQR (CV) : 17 (1.1)
73 distinct values 0 (0.0%)
owner_name [character] 1. Resinate Inc
11870(100.0%)
0 (0.0%)
owner_location [character] 1. Resinate Inc - Northampto
11870(100.0%)
0 (0.0%)
vendor [character]
1. Resinate, Inc.
2. Bask
3. Green Gold Group, Inc.
4. Cultivauna, LLC
5. Cultivate Leicester, Inc
6. Luv Buds
7. ARL Healthcare
8. Nova Farms, LLC
9. T. Bear Inc.
10. Apothca, Inc.
[ 34 others ]
8600(72.5%)
435(3.7%)
364(3.1%)
267(2.2%)
220(1.9%)
193(1.6%)
192(1.6%)
150(1.3%)
149(1.3%)
121(1.0%)
1179(9.9%)
0 (0.0%)
sold_by [character]
1. David Dumas
2. Eric Rueli
3. Gerardo Ramos
4. Haley Thomas
5. Kristina Tarkovskaya
6. Marc Girard
7. Michaela Bowen
8. Tiffany Jones
1179(9.9%)
2499(21.1%)
1421(12.0%)
1450(12.2%)
1896(16.0%)
1796(15.1%)
68(0.6%)
1561(13.2%)
0 (0.0%)
receipt_no [character]
1. 60856974
2. 61785156
3. 61425860
4. 61992747
5. 16561565
6. 61509826
7. 60754173
8. 60157153
9. 61138055
10. 61188329
[ 3560 others ]
75(0.6%)
59(0.5%)
56(0.5%)
56(0.5%)
52(0.4%)
40(0.3%)
31(0.3%)
30(0.3%)
28(0.2%)
28(0.2%)
11415(96.2%)
0 (0.0%)
date [Date]
min : 2022-09-22
med : 2022-10-07
max : 2022-10-20
range : 28d
29 distinct values 0 (0.0%)
hour [integer]
Mean (sd) : 14.4 (3)
min ≤ med ≤ max:
9 ≤ 15 ≤ 21
IQR (CV) : 5 (0.2)
13 distinct values 0 (0.0%)
minute [integer]
Mean (sd) : 29.3 (17.6)
min ≤ med ≤ max:
0 ≤ 29 ≤ 59
IQR (CV) : 31 (0.6)
60 distinct values 0 (0.0%)
second [numeric]
Mean (sd) : 30.1 (17.3)
min ≤ med ≤ max:
0 ≤ 30.6 ≤ 60
IQR (CV) : 30.1 (0.6)
3246 distinct values 0 (0.0%)
format_date [character]
1. 10/14/2022
2. 10/07/2022
3. 09/24/2022
4. 10/08/2022
5. 09/23/2022
6. 10/10/2022
7. 09/30/2022
8. 10/03/2022
9. 10/09/2022
10. 10/15/2022
[ 19 others ]
618(5.2%)
613(5.2%)
584(4.9%)
573(4.8%)
532(4.5%)
518(4.4%)
510(4.3%)
478(4.0%)
446(3.8%)
437(3.7%)
6561(55.3%)
0 (0.0%)
format_hour [character]
1. 15:48:44.6600000858307
2. 20:18:19.6570000648499
3. 13:46:56.9630000591278
4. 15:9:34.1700000762939
5. 10:56:38.0999999046326
6. 17:15:47.7630000114441
7. 10:52:29.6900000572205
8. 19:17:53.0569999217987
9. 11:10:10.7999999523163
10. 11:43:47.2799999713898
[ 3560 others ]
75(0.6%)
59(0.5%)
56(0.5%)
56(0.5%)
52(0.4%)
40(0.3%)
31(0.3%)
30(0.3%)
28(0.2%)
28(0.2%)
11415(96.2%)
0 (0.0%)
category [character]
1. PRJ
2. EDI
3. FLO
4. VAP
5. CON
6. ACC
7. MIP
8. HTC
9. Bov
10. Res
[ 10 others ]
4873(41.1%)
2716(22.9%)
2596(21.9%)
715(6.0%)
446(3.8%)
254(2.1%)
94(0.8%)
67(0.6%)
39(0.3%)
15(0.1%)
55(0.5%)
0 (0.0%)
category_names [character]
1. Joint
2. Edible
3. Flower
4. Vaporizer
5. Concentrate
6. Accessories
7. Marijuana Infused Product
8. HTCC Promotion
9. Clothing
10. Topical
[ 2 others ]
4873(41.1%)
2716(22.9%)
2603(21.9%)
716(6.0%)
447(3.8%)
315(2.7%)
94(0.8%)
67(0.6%)
19(0.2%)
11(0.1%)
9(0.1%)
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-11-16

In this farily simple univariate visualization, we can see the number of sales by category. It is clear that Joints are the most popular category followed by Edibles and Flower. This is to be expected in a dispensary as Joints and Edibles are some of the most convenient ways to consume cannabis, followed by Flower that can be ground up and used however the customer pleases.

#univariate visualization of the categories
ggplot(data = transactions_9_22_2022_10_20_2022, mapping = aes(category_names, fill = category_names), size = 1) +
  geom_bar() +
  ylim(0, NA) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  ylab("Number") +
  xlab("Categories")+
  ggtitle("Sale Number by Category")

In our first Bivariate visualization we can see the volume of sales by category over time with a violin graph. The volume graph gives a visual representation of popularity of category over time. This can tell us popularity trends, or interest trends by category as well.

ggplot(data = transactions_9_22_2022_10_20_2022, mapping = aes(x = date, y = category_names), size = 1) +
  geom_violin() +
  theme(legend.title=element_blank(),axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  ylab("Cannabis Categories") +
  scale_x_date(date_labels="%b %d", breaks = unique(transactions_9_22_2022_10_20_2022$date)) +
  ggtitle("Categorical Volume Sales by date")

In this second Bivariate Visualization I chose a scatterplot to compare the Average price by product to the quantity sold. I was able to gain the following inside:

  • Most products are priced between $0 and $90
  • Within a two week period we are selling approximately 50 units of our most popular products within that price range
  • We can also see that Joints are the most popular category, with nearly 300 units sold within the given time period.
#creating a flower transactions dataframe
sold_by_price <- transactions_9_22_2022_10_20_2022%>%
  group_by(product)%>%
  mutate(avg_price = mean(price),
         products_sold = sum(qty_sold))%>%
  select(-qty_sold)%>%
  distinct()%>%
  ggplot(aes(x = avg_price, y = products_sold, color = category_names)) +
  ylim(0,NA) +
  scale_x_continuous(breaks = c(0,30,60,90,120,150,180,210,240,270,300)) +
  geom_point(size = 1)+
  labs(title =  "Quantity of Products sold by Average Price")+
  xlab("Average Price") +
  ylab("Quantity Sold") +
  theme(legend.title=element_blank(),axis.text.x = element_text(angle = 45, vjust = 0.5) )

sold_by_price 

Source Code
---
title: "Challenge 5"
author: "Michaela Bowen"
description: "Introduction to Visualization"
date: "10/19/2022"
format:
  html:
    toc: true
    code-copy: true
    code-tools: true
categories:
  - challenge_5
  - Work Data
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)
library(ggplot2)
library(readxl)
library(lubridate)
library(scales)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

In today's Challenge I've attempted to:

1)  read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2)  tidy data (as needed, including sanity checks)
3)  mutate variables as needed (including sanity checks)
4)  create at least two univariate visualizations
5)  Create at least one bivariate visualization


::: {.panel-tabset}
## Read in and tidy data

Below I have ready in a two week transaction period of cannabis sales at the local dispensary. 

On the Read in I have: 

- Renamed several variables to "delete" as they will not be used in our analysis and thus ignored:
+ `Product Sku`: Not necessary because it is replicated with the product name which carries more information 
+ `BioTrackId`, `BioTrack Response`, `RxNum`: These are all categories unused by the business, and do not contain useful information
+ `Remaining  Qty`: Does not contain *any* accurate information


### Inventory Transactions

```{r}
#| label: read_in
#| warning: false
#| message: false
#Read in Excel Data
transactions_9_22_2022_10_20_2022_orig <- read_excel("Inventory Transactions 9_22_2022-10_20_2022.xlsx", 
    skip = 5,
    col_names = c("pos_id","product","delete","patient_name","transaction_date","qty_sold","daily_allottment_oz","weight_grams","cost","price","owner_name","owner_location","vendor","sold_by","receipt_no","delete","delete","delete","delete","delete"))%>%
  select(!contains("delete"))

```


## Briefly describe the data
The transactions data frame we are working with consists of the completed sales transaction at Resinate, Northampton Spanning from 9/22/2022 through 10/05/2022. There are 5,591 instances of 13 variables, meaning that nearly 5,600 transactions were completed during this time period. The variables describe the product type, category, date, patient name, receipt number, budtender, and other transaction information. 


### Column Mutations
Here I am mutating several variables:

- `date`: I am separating out this date column into hour, minute, and second in order to pin point time of day in which customers are ordering certain products
- `category`, `category_names`: I created these two variables from the 3 letter abbreviation at the beginning of the product name

```{r}
#| label: mutation
#| warning: false
#| message: false
#mutating date field and formatting as a date
transactions_9_22_2022_10_20_2022 <- transactions_9_22_2022_10_20_2022_orig%>%
  mutate(
    date = as.Date(transaction_date),
    hour = hour(transaction_date),
    minute = minute(transaction_date),
    second = second(transaction_date))%>%
  mutate(
    format_date = format(date, "%m/%d/%Y"),
    format_hour = paste(hour, minute, second, sep = ":")
  )%>%
#pulling the category abbreviation to determine category and create a category column
  mutate(
    category = substr(product,1,3)
  )%>%
  mutate(
    category_names = case_when(
      category == "FLO" | category == "Flo" ~ "Flower",
      category == "PRJ" ~ "Joint",
      category == "EDI" ~ "Edible",
      category == "MIP" ~ "Marijuana Infused Product",
      category == "CON" | category == "Con" ~ "Concentrate",
      category == "VAP" | category == "Vap" ~ "Vaporizer",
      category == "ACC" | category == "Pax" | category == "PAX" | category == "Hig" | category == "Bov" ~ "Accessories",
      category == "CLO" | category == "Res" ~ "Clothing",
      category == "HTC" ~ "HTCC Promotion",
      category == "SAM" ~ "Samples",
      category == "TOP" ~ "Topical",
      category == "REW" ~ "Rewards")
  )
```

### Tidy Data (Sanity Checks)


```{r}
#ensuring that the category, and category names columns contain no NA values and are accurately coded
unique(transactions_9_22_2022_10_20_2022$category_names)
unique(transactions_9_22_2022_10_20_2022$category)
```

### Data Summary
```{r}
#| label: Data Summary
#| warning: false
#| message: false
print(summarytools::dfSummary(transactions_9_22_2022_10_20_2022,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')
```


## Univariate Visualizations
In this farily simple univariate visualization, we can see the number of sales by category. It is clear that Joints are the most popular category followed by Edibles and Flower. This is to be expected in a dispensary as Joints and Edibles are some of the most convenient ways to consume cannabis, followed by Flower that can be ground up and used however the customer pleases.

```{r}
#| label: Sales by Categorie Univariate
#| warning: false
#| message: false
#univariate visualization of the categories
ggplot(data = transactions_9_22_2022_10_20_2022, mapping = aes(category_names, fill = category_names), size = 1) +
  geom_bar() +
  ylim(0, NA) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  ylab("Number") +
  xlab("Categories")+
  ggtitle("Sale Number by Category")
```

## Bivariate Visualization(s)
In our first Bivariate visualization we can see the volume of sales by category over time with a violin graph. The volume graph gives a visual representation of popularity of category over time. This can tell us popularity trends, or interest trends by category as well.
```{r}
#| label: Volume Sales by Date
#| warning: false
#| message: false
ggplot(data = transactions_9_22_2022_10_20_2022, mapping = aes(x = date, y = category_names), size = 1) +
  geom_violin() +
  theme(legend.title=element_blank(),axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  ylab("Cannabis Categories") +
  scale_x_date(date_labels="%b %d", breaks = unique(transactions_9_22_2022_10_20_2022$date)) +
  ggtitle("Categorical Volume Sales by date")
```
In this second Bivariate Visualization I chose a scatterplot to compare the Average price by product to the quantity sold. I was able to gain the following inside:

- Most products are priced between $0 and $90 
- Within a two week period we are selling approximately 50 units of our most popular products within that price range
- We can also see that Joints are the most popular category, with nearly 300 units sold within the given time period. 

```{r}
#| label: Average Price by Quantity Sold
#| warning: false
#| message: false

#creating a flower transactions dataframe
sold_by_price <- transactions_9_22_2022_10_20_2022%>%
  group_by(product)%>%
  mutate(avg_price = mean(price),
         products_sold = sum(qty_sold))%>%
  select(-qty_sold)%>%
  distinct()%>%
  ggplot(aes(x = avg_price, y = products_sold, color = category_names)) +
  ylim(0,NA) +
  scale_x_continuous(breaks = c(0,30,60,90,120,150,180,210,240,270,300)) +
  geom_point(size = 1)+
  labs(title =  "Quantity of Products sold by Average Price")+
  xlab("Average Price") +
  ylab("Quantity Sold") +
  theme(legend.title=element_blank(),axis.text.x = element_text(angle = 45, vjust = 0.5) )

sold_by_price 


```


:::