Author

Siddharth Goel

Published

February 2, 2023

Introduction

Analysis of online retail data is one of the most important factors in setting up a successful online business. Unlike offline stores, where footfall can be increased with physical marketing strategies, online stores rely mainly on online advertisements and data analysis for business growth.

While advertisements are key to attracting new customers to a store, data analysis can provide valuable information such as top-selling products, the most valuable customer demographics, sales patterns, and details of dead and top stock. This information can help management decide which areas of the business need investment and which products account for most of the store's sales.

So, in this project, I have taken a small sample of the transactions of an online retail store and will analyse the data to identify patterns that may help answer some important business questions.

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(ggplot2)
library(dplyr)
library(psych)

Attaching package: 'psych'

The following objects are masked from 'package:ggplot2':

    %+%, alpha
Code
library(readr)
library(summarytools)
Warning: no DISPLAY variable so Tk is not available
system might not have X11 capabilities; in case of errors when using dfSummary(), set st_options(use.x11 = FALSE)

Attaching package: 'summarytools'

The following object is masked from 'package:tibble':

    view
Code
library(stringr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Import the Data

Importing the online retail dataset

Code
retail <- readr::read_csv("_data/online_retail.csv")
head(retail)
# A tibble: 6 × 8
  InvoiceNo StockC…¹ Descr…² Quant…³ InvoiceDate         UnitP…⁴ Custo…⁵ Country
  <chr>     <chr>    <chr>     <dbl> <dttm>                <dbl>   <dbl> <chr>  
1 536365    85123A   WHITE …       6 2010-12-01 08:26:00    2.55   17850 United…
2 536365    71053    WHITE …       6 2010-12-01 08:26:00    3.39   17850 United…
3 536365    84406B   CREAM …       8 2010-12-01 08:26:00    2.75   17850 United…
4 536365    84029G   KNITTE…       6 2010-12-01 08:26:00    3.39   17850 United…
5 536365    84029E   RED WO…       6 2010-12-01 08:26:00    3.39   17850 United…
6 536365    22752    SET 7 …       2 2010-12-01 08:26:00    7.65   17850 United…
# … with abbreviated variable names ¹​StockCode, ²​Description, ³​Quantity,
#   ⁴​UnitPrice, ⁵​CustomerID
Code
spec(retail)
cols(
  InvoiceNo = col_character(),
  StockCode = col_character(),
  Description = col_character(),
  Quantity = col_double(),
  InvoiceDate = col_datetime(format = ""),
  UnitPrice = col_double(),
  CustomerID = col_double(),
  Country = col_character()
)
Code
dim(retail)
[1] 541909      8

The input dataset has 541909 records of goods sold in an online retail store and 8 columns.

Dataset Summary

The columns and their descriptions are as follows :

  1. InvoiceNo: A 6-digit number which is the invoice number of the transaction in which the item was sold. As a transaction can contain multiple items, the InvoiceNo column is not unique (see the quick check after this list).

  2. StockCode: Item code. A 5-digit integral number uniquely assigned to each product.

  3. Description: Item name.

  4. Quantity: The quantity of the item in the transaction.

  5. InvoiceDate: Invoice Date and time.

  6. UnitPrice: Product price per unit in sterling.

  7. CustomerID: A 5-digit integral number uniquely assigned to each customer.

  8. Country: The name of the country where the customer resides.
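
Since a single invoice can contain multiple items, InvoiceNo should have far fewer distinct values than there are rows. A quick sketch to confirm this on the already loaded data:

Code
# Compare the number of distinct invoices to the total number of rows;
# a large gap confirms that invoices span multiple line items.
c(rows = nrow(retail),
  distinct_invoices = n_distinct(retail$InvoiceNo))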

Now we use the dfSummary function to get a summary of each column in the dataset.

Code
print(dfSummary(retail, 
                varnumbers = FALSE, 
                plain.ascii = FALSE, 
                style = "grid", 
                graph.magnif = 0.70, 
                valid.col= TRUE),
      method= 'render', 
      table.classes= 'table-condensed')

Data Frame Summary

retail

Dimensions: 541909 x 8
Duplicates: 5268
InvoiceNo [character]
    Top 10: 573585 (1114, 0.2%), 581219 (749, 0.1%), 581492 (731, 0.1%), 580729 (721, 0.1%), 558475 (705, 0.1%), 579777 (687, 0.1%), 581217 (676, 0.1%), 537434 (675, 0.1%), 580730 (662, 0.1%), 538071 (652, 0.1%); 25890 others (534537, 98.6%)
    Valid: 541909 (100.0%)    Missing: 0 (0.0%)

StockCode [character]
    Top 10: 85123A (2313, 0.4%), 22423 (2203, 0.4%), 85099B (2159, 0.4%), 47566 (1727, 0.3%), 20725 (1639, 0.3%), 84879 (1502, 0.3%), 22720 (1477, 0.3%), 22197 (1476, 0.3%), 21212 (1385, 0.3%), 20727 (1350, 0.2%); 4060 others (524678, 96.8%)
    Valid: 541909 (100.0%)    Missing: 0 (0.0%)

Description [character]
    Top 10: WHITE HANGING HEART T-LIG (2369, 0.4%), REGENCY CAKESTAND 3 TIER (2200, 0.4%), JUMBO BAG RED RETROSPOT (2159, 0.4%), PARTY BUNTING (1727, 0.3%), LUNCH BAG RED RETROSPOT (1638, 0.3%), ASSORTED COLOUR BIRD ORNA (1501, 0.3%), SET OF 3 CAKE TINS PANTRY (1473, 0.3%), PACK OF 72 RETROSPOT CAKE (1385, 0.3%), LUNCH BAG BLACK SKULL. (1350, 0.2%), NATURAL SLATE HEART CHALK (1280, 0.2%); 4201 others (523373, 96.8%)
    Valid: 540455 (99.7%)    Missing: 1454 (0.3%)

Quantity [numeric]
    Mean (sd): 9.6 (218.1); min ≤ med ≤ max: -80995 ≤ 3 ≤ 80995; IQR (CV): 9 (22.8); 722 distinct values
    Valid: 541909 (100.0%)    Missing: 0 (0.0%)

InvoiceDate [POSIXct, POSIXt]
    min: 2010-12-01 08:26:00; med: 2011-07-19 17:17:00; max: 2011-12-09 12:50:00; range: 1y 0m 8d 4H 24M 0S; 23260 distinct values
    Valid: 541909 (100.0%)    Missing: 0 (0.0%)

UnitPrice [numeric]
    Mean (sd): 4.6 (96.8); min ≤ med ≤ max: -11062.1 ≤ 2.1 ≤ 38970; IQR (CV): 2.9 (21); 1630 distinct values
    Valid: 541909 (100.0%)    Missing: 0 (0.0%)

CustomerID [numeric]
    Mean (sd): 15287.7 (1713.6); min ≤ med ≤ max: 12346 ≤ 15152 ≤ 18287; IQR (CV): 2838 (0.1); 4372 distinct values
    Valid: 406829 (75.1%)    Missing: 135080 (24.9%)

Country [character]
    Top 10: United Kingdom (495478, 91.4%), Germany (9495, 1.8%), France (8557, 1.6%), EIRE (8196, 1.5%), Spain (2533, 0.5%), Netherlands (2371, 0.4%), Belgium (2069, 0.4%), Switzerland (2002, 0.4%), Portugal (1519, 0.3%), Australia (1259, 0.2%); 28 others (8430, 1.6%)
    Valid: 541909 (100.0%)    Missing: 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-02-09

From the summary, we can see that there are 25900 unique invoice numbers, which implies that this data covers 25900 transactions between 2010-12-01 and 2011-12-09. We can also observe that 91.4% of the records come from customers in the United Kingdom. Moreover, the mean quantity per line item is 9.6, which suggests that this store mostly caters to the bulk-ordering needs of its customers.

On reading the item descriptions, it is clear that most of the SKUs in this store are large multi-item sets, e.g. PACK OF 72 RETROSPOT CAKE CASES. This is another indicator that the store mainly caters to wholesale/bulk orders.
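
As a quick cross-check of these observations, the United Kingdom share and the Quantity distribution can be computed directly from the data; a minimal sketch:

Code
# Share of records from the United Kingdom and summary stats for Quantity.
retail %>%
  summarise(uk_share = mean(Country == "United Kingdom"),
            mean_qty = mean(Quantity),
            median_qty = median(Quantity))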

Tidying the dataset

Cleaning the null values

Checking the total number of missing (NA) values in the dataset.

Code
sum(is.na(retail))
[1] 136534

The dataset contains 136534 NA values in total, which is fairly high. Next, we will check whether specific columns contain the bulk of these values or whether they are spread across columns.

Code
na_cols <- colSums(is.na(retail))
colnames(retail) [na_cols > 0]
[1] "Description" "CustomerID" 

Only the columns Description and CustomerID contain NA values.

Checking the number of NA values in each of these columns.

Code
sum(is.na(retail$Description))
[1] 1454
Code
sum(is.na(retail$CustomerID))
[1] 135080

About 25% of the values in CustomerID are NA, whereas only 0.3% of the values in Description are NA. So we should be careful when basing our analysis on customers and should rely more on the other fields.
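
These percentages can be checked directly from the per-column NA proportions; a minimal sketch:

Code
# Percentage of missing values in each column.
round(colMeans(is.na(retail)) * 100, 1)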

Also, since 25% of CustomerID values are NA, we cannot simply drop every row with a missing CustomerID, as we might lose important information. An important thing to look for, then, is entries where both CustomerID and Description are NA; such entries can safely be removed.

Code
retail.cust_desc_null <- retail %>% filter(is.na(CustomerID), is.na(Description), .preserve = TRUE)
head(retail.cust_desc_null)
# A tibble: 6 × 8
  InvoiceNo StockC…¹ Descr…² Quant…³ InvoiceDate         UnitP…⁴ Custo…⁵ Country
  <chr>     <chr>    <chr>     <dbl> <dttm>                <dbl>   <dbl> <chr>  
1 536414    22139    <NA>         56 2010-12-01 11:52:00       0      NA United…
2 536545    21134    <NA>          1 2010-12-01 14:32:00       0      NA United…
3 536546    22145    <NA>          1 2010-12-01 14:33:00       0      NA United…
4 536547    37509    <NA>          1 2010-12-01 14:33:00       0      NA United…
5 536549    85226A   <NA>          1 2010-12-01 14:34:00       0      NA United…
6 536550    85044    <NA>          1 2010-12-01 14:34:00       0      NA United…
# … with abbreviated variable names ¹​StockCode, ²​Description, ³​Quantity,
#   ⁴​UnitPrice, ⁵​CustomerID
Code
count(retail.cust_desc_null)
# A tibble: 1 × 1
      n
  <int>
1  1454

An interesting thing we see here is that UnitPrice is also 0 whenever Description and CustomerID are NA.

To verify this, let us count the rows that meet all three conditions.

Code
retail %>% filter(is.na(CustomerID), is.na(Description), UnitPrice == 0, .preserve = TRUE) %>% count
# A tibble: 1 × 1
      n
  <int>
1  1454

The counts match, which means it is safe to delete the entries that match all three conditions.

Code
mutated_retail <- retail %>% subset(!(is.na(CustomerID) & is.na(Description) & UnitPrice == 0))
head(mutated_retail)
# A tibble: 6 × 8
  InvoiceNo StockC…¹ Descr…² Quant…³ InvoiceDate         UnitP…⁴ Custo…⁵ Country
  <chr>     <chr>    <chr>     <dbl> <dttm>                <dbl>   <dbl> <chr>  
1 536365    85123A   WHITE …       6 2010-12-01 08:26:00    2.55   17850 United…
2 536365    71053    WHITE …       6 2010-12-01 08:26:00    3.39   17850 United…
3 536365    84406B   CREAM …       8 2010-12-01 08:26:00    2.75   17850 United…
4 536365    84029G   KNITTE…       6 2010-12-01 08:26:00    3.39   17850 United…
5 536365    84029E   RED WO…       6 2010-12-01 08:26:00    3.39   17850 United…
6 536365    22752    SET 7 …       2 2010-12-01 08:26:00    7.65   17850 United…
# … with abbreviated variable names ¹​StockCode, ²​Description, ³​Quantity,
#   ⁴​UnitPrice, ⁵​CustomerID
Code
nrow(mutated_retail)
[1] 540455

We can similarly drop all records with an NA CustomerID and a UnitPrice of 0.

Code
mutated_retail <- retail %>% subset(!(is.na(CustomerID) & UnitPrice == 0))
head(mutated_retail)
# A tibble: 6 × 8
  InvoiceNo StockC…¹ Descr…² Quant…³ InvoiceDate         UnitP…⁴ Custo…⁵ Country
  <chr>     <chr>    <chr>     <dbl> <dttm>                <dbl>   <dbl> <chr>  
1 536365    85123A   WHITE …       6 2010-12-01 08:26:00    2.55   17850 United…
2 536365    71053    WHITE …       6 2010-12-01 08:26:00    3.39   17850 United…
3 536365    84406B   CREAM …       8 2010-12-01 08:26:00    2.75   17850 United…
4 536365    84029G   KNITTE…       6 2010-12-01 08:26:00    3.39   17850 United…
5 536365    84029E   RED WO…       6 2010-12-01 08:26:00    3.39   17850 United…
6 536365    22752    SET 7 …       2 2010-12-01 08:26:00    7.65   17850 United…
# … with abbreviated variable names ¹​StockCode, ²​Description, ³​Quantity,
#   ⁴​UnitPrice, ⁵​CustomerID

Processing and Visualization

Analysing Invoice Numbers and Categorizing Transactions

Let us start our analysis by looking at the InvoiceNo column. This is an important field to be considered because the values in this column are repeated and we will use this field to get data about unique transactions.

Looking at the values of this column, most InvoiceNo values are 6-digit integers, whereas the remaining values start with either the character A or C. Let us count these values and dig into what they mean.

Code
distinct_invoice_no <- distinct(mutated_retail, InvoiceNo, .keep_all = TRUE)
distinct_invoice_no['InvoiceStart'] <- substr(distinct_invoice_no$InvoiceNo, 0, 1)
unique(distinct_invoice_no$InvoiceStart)
[1] "5" "C" "A"
Code
# Counting the frequency of the first character of the InvoiceNo
invoice_freq <- distinct_invoice_no %>% group_by(InvoiceStart) %>% 
  summarise(total_count=n(),
            .groups = 'drop')
invoice_freq
# A tibble: 3 × 2
  InvoiceStart total_count
  <chr>              <int>
1 5                  19963
2 A                      3
3 C                   3836

We can observe from the above analysis that roughly 16% of the distinct invoice numbers start with C, which is a significant share. Now, let us try to find the significance of these starting letters by looking at the records where the InvoiceNo starts with A and C.

Code
retail.char_invoice_A <- mutated_retail %>% filter(startsWith(InvoiceNo, 'A'))

retail.char_invoice_C <- mutated_retail %>% filter(startsWith(InvoiceNo, 'C'))

head(retail.char_invoice_A)
# A tibble: 3 × 8
  InvoiceNo StockC…¹ Descr…² Quant…³ InvoiceDate         UnitP…⁴ Custo…⁵ Country
  <chr>     <chr>    <chr>     <dbl> <dttm>                <dbl>   <dbl> <chr>  
1 A563185   B        Adjust…       1 2011-08-12 14:50:00  11062.      NA United…
2 A563186   B        Adjust…       1 2011-08-12 14:51:00 -11062.      NA United…
3 A563187   B        Adjust…       1 2011-08-12 14:52:00 -11062.      NA United…
# … with abbreviated variable names ¹​StockCode, ²​Description, ³​Quantity,
#   ⁴​UnitPrice, ⁵​CustomerID
Code
head(retail.char_invoice_C)
# A tibble: 6 × 8
  InvoiceNo StockC…¹ Descr…² Quant…³ InvoiceDate         UnitP…⁴ Custo…⁵ Country
  <chr>     <chr>    <chr>     <dbl> <dttm>                <dbl>   <dbl> <chr>  
1 C536379   D        Discou…      -1 2010-12-01 09:41:00   27.5    14527 United…
2 C536383   35004C   SET OF…      -1 2010-12-01 09:49:00    4.65   15311 United…
3 C536391   22556    PLASTE…     -12 2010-12-01 10:24:00    1.65   17548 United…
4 C536391   21984    PACK O…     -24 2010-12-01 10:24:00    0.29   17548 United…
5 C536391   21983    PACK O…     -24 2010-12-01 10:24:00    0.29   17548 United…
6 C536391   21980    PACK O…     -24 2010-12-01 10:24:00    0.29   17548 United…
# … with abbreviated variable names ¹​StockCode, ²​Description, ³​Quantity,
#   ⁴​UnitPrice, ⁵​CustomerID

As we can see from the tables retail.char_invoice_A and retail.char_invoice_C above, the records whose InvoiceNo starts with A are adjustments for bad debt, whereas the records whose InvoiceNo starts with C have negative Quantity values. This means those are not sale transactions but cancelled transactions in which the goods were returned to the seller.
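
To double-check the reading of the C-prefixed invoices as returns, we can count how many of those rows have a non-negative Quantity; a minimal sketch (if the reading is right, the count should be zero or very close to it):

Code
# Count C-prefixed rows whose Quantity is not negative; a result of (or near)
# zero supports treating these invoices as cancellations/returns.
retail.char_invoice_C %>%
  filter(Quantity >= 0) %>%
  count()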

I will plot this in a graph for a better overview.

Code
invoice_status <- invoice_freq %>%
  mutate(Type = case_when(
    InvoiceStart == "C" ~ "Cancelled",
    InvoiceStart == "A" ~ "Adjust Debt",
    InvoiceStart == "5" ~ "Sell",
    ))

ggplot(data = invoice_status, aes(x = Type, y=total_count, fill = Type)) +
  geom_bar(stat = "identity") + 
  scale_fill_brewer(palette = "Oranges") +
  ggtitle("Transaction type vs their count") +
  xlab("Transaction type") +
  ylab("Count")

Also, let's add a Type column to the dataset so it will be easier to filter the data if needed.

Code
mutated_retail <- mutated_retail %>%
  mutate(Type = case_when(
    startsWith(InvoiceNo,'C') ~ "Cancelled",
    startsWith(InvoiceNo,'A') ~ "AdjustDebt",
    startsWith(InvoiceNo,'5') ~ "Sale",
    ))

unique(mutated_retail$Type)
[1] "Sale"       "Cancelled"  "AdjustDebt"
Code
head(mutated_retail)
# A tibble: 6 × 9
  InvoiceNo StockC…¹ Descr…² Quant…³ InvoiceDate         UnitP…⁴ Custo…⁵ Country
  <chr>     <chr>    <chr>     <dbl> <dttm>                <dbl>   <dbl> <chr>  
1 536365    85123A   WHITE …       6 2010-12-01 08:26:00    2.55   17850 United…
2 536365    71053    WHITE …       6 2010-12-01 08:26:00    3.39   17850 United…
3 536365    84406B   CREAM …       8 2010-12-01 08:26:00    2.75   17850 United…
4 536365    84029G   KNITTE…       6 2010-12-01 08:26:00    3.39   17850 United…
5 536365    84029E   RED WO…       6 2010-12-01 08:26:00    3.39   17850 United…
6 536365    22752    SET 7 …       2 2010-12-01 08:26:00    7.65   17850 United…
# … with 1 more variable: Type <chr>, and abbreviated variable names
#   ¹​StockCode, ²​Description, ³​Quantity, ⁴​UnitPrice, ⁵​CustomerID

One more adjustment that I would like to make to the data for future purposes is to split the InvoiceDate column into Year, Month and Day columns.

Code
mutated_retail <- mutated_retail %>%
  mutate(Year = as.numeric(format(InvoiceDate, format = "%Y"))) %>%
  mutate(Month = as.numeric(format(InvoiceDate, format = "%m"))) %>%
  mutate(Day = as.numeric(format(InvoiceDate, format = "%d")))

head(mutated_retail)
# A tibble: 6 × 12
  InvoiceNo StockC…¹ Descr…² Quant…³ InvoiceDate         UnitP…⁴ Custo…⁵ Country
  <chr>     <chr>    <chr>     <dbl> <dttm>                <dbl>   <dbl> <chr>  
1 536365    85123A   WHITE …       6 2010-12-01 08:26:00    2.55   17850 United…
2 536365    71053    WHITE …       6 2010-12-01 08:26:00    3.39   17850 United…
3 536365    84406B   CREAM …       8 2010-12-01 08:26:00    2.75   17850 United…
4 536365    84029G   KNITTE…       6 2010-12-01 08:26:00    3.39   17850 United…
5 536365    84029E   RED WO…       6 2010-12-01 08:26:00    3.39   17850 United…
6 536365    22752    SET 7 …       2 2010-12-01 08:26:00    7.65   17850 United…
# … with 4 more variables: Type <chr>, Year <dbl>, Month <dbl>, Day <dbl>, and
#   abbreviated variable names ¹​StockCode, ²​Description, ³​Quantity, ⁴​UnitPrice,
#   ⁵​CustomerID
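
As an aside, the same Year/Month/Day columns could also be derived with lubridate's accessor functions; a minimal sketch (lubridate is not attached above, so it needs to be loaded, and mutated_retail_alt is just an illustrative name):

Code
library(lubridate)

# Equivalent Year/Month/Day split using lubridate accessors instead of format().
mutated_retail_alt <- mutated_retail %>%
  mutate(Year = year(InvoiceDate),
         Month = month(InvoiceDate),
         Day = day(InvoiceDate))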

Now that we know there are multiple types of transactions (Sale and Cancelled), one interesting thing we can do with this data is calculate the most sought-after products (the items with the highest net quantity sold) as well as the dead stock (items with a negative net quantity, i.e. more units returned than sold).

Code
retail.quantity <- mutated_retail %>% filter(str_detect(substr(StockCode, 1, 1), '^[0-9]')) %>% group_by(StockCode, Description) %>% summarise(net_quantity = sum(Quantity))
hot_stock <- retail.quantity %>% arrange(desc(net_quantity))
hot_stock
# A tibble: 4,150 × 3
# Groups:   StockCode [3,914]
   StockCode Description                        net_quantity
   <chr>     <chr>                                     <dbl>
 1 84077     WORLD WAR 2 GLIDERS ASSTD DESIGNS         53847
 2 85099B    JUMBO BAG RED RETROSPOT                   47359
 3 84879     ASSORTED COLOUR BIRD ORNAMENT             36381
 4 22197     POPCORN HOLDER                            36334
 5 21212     PACK OF 72 RETROSPOT CAKE CASES           36039
 6 85123A    WHITE HANGING HEART T-LIGHT HOLDER        35021
 7 23084     RABBIT NIGHT LIGHT                        30680
 8 22492     MINI PAINT SET VINTAGE                    26437
 9 22616     PACK OF 12 LONDON TISSUES                 26111
10 21977     PACK OF 60 PINK PAISLEY CAKE CASES        24753
# … with 4,140 more rows
Code
dead_stock <- retail.quantity %>% arrange(net_quantity)
dead_stock
# A tibble: 4,150 × 3
# Groups:   StockCode [3,914]
   StockCode Description                     net_quantity
   <chr>     <chr>                                  <dbl>
 1 21645     ASSORTED TUTTI FRUTTI ROUND BOX          -24
 2 79323W    WHITE CHERRY LIGHTS                       -8
 3 21144     PINK POODLE HANGING DECORATION            -7
 4 20703     BLUE PADDED SOFT MOBILE                   -6
 5 21412     VINTAGE GOLD TINSEL REEL                  -6
 6 35400     WOODEN BOX ADVENT CALENDAR                -6
 7 79323P    PINK CHERRY LIGHTS                        -4
 8 84750A    PINK SMALL GLASS CAKE STAND               -3
 9 85042     ANTIQUE LILY FAIRY LIGHTS                 -3
10 85063     CREAM SWEETHEART MAGAZINE RACK            -3
# … with 4,140 more rows

We now have the top-selling products and the most returned products for this online store. Now, let us present this a bit better. First, let's find the top 10 products in the store.

Code
top_10_prods <- head(hot_stock, 10)
top_10_prods
# A tibble: 10 × 3
# Groups:   StockCode [10]
   StockCode Description                        net_quantity
   <chr>     <chr>                                     <dbl>
 1 84077     WORLD WAR 2 GLIDERS ASSTD DESIGNS         53847
 2 85099B    JUMBO BAG RED RETROSPOT                   47359
 3 84879     ASSORTED COLOUR BIRD ORNAMENT             36381
 4 22197     POPCORN HOLDER                            36334
 5 21212     PACK OF 72 RETROSPOT CAKE CASES           36039
 6 85123A    WHITE HANGING HEART T-LIGHT HOLDER        35021
 7 23084     RABBIT NIGHT LIGHT                        30680
 8 22492     MINI PAINT SET VINTAGE                    26437
 9 22616     PACK OF 12 LONDON TISSUES                 26111
10 21977     PACK OF 60 PINK PAISLEY CAKE CASES        24753

Now, let's put them in a pie chart:

Code
top_prods_pie <- ggplot(top_10_prods, aes(x = "", y = net_quantity, fill = Description)) +
  geom_bar(stat="identity", width=1, color="white") +
  coord_polar("y", start=0) +
  theme_void() + 
  labs(title= "Top 10 Best selling Products")

top_prods_pie
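
Since pie slices of similar size are hard to compare visually, the same data could also be shown as a horizontal bar chart; a sketch of one possible alternative:

Code
# Same top-10 products as a horizontal bar chart, ordered by net quantity.
ggplot(top_10_prods,
       aes(x = reorder(Description, net_quantity), y = net_quantity)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Top 10 Best Selling Products",
       x = "Product", y = "Net quantity sold")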

Analysing Country-wise transactions

First, I will filter the transactions based on StockCode so that only transactions related to products (and not extra fees, postage, etc.) are kept; product stock codes start with a digit.

Code
retail.product_transactions <- mutated_retail %>% filter(str_detect(substr(StockCode, 1, 1), '^[0-9]'))
head(retail.product_transactions)
# A tibble: 6 × 12
  InvoiceNo StockC…¹ Descr…² Quant…³ InvoiceDate         UnitP…⁴ Custo…⁵ Country
  <chr>     <chr>    <chr>     <dbl> <dttm>                <dbl>   <dbl> <chr>  
1 536365    85123A   WHITE …       6 2010-12-01 08:26:00    2.55   17850 United…
2 536365    71053    WHITE …       6 2010-12-01 08:26:00    3.39   17850 United…
3 536365    84406B   CREAM …       8 2010-12-01 08:26:00    2.75   17850 United…
4 536365    84029G   KNITTE…       6 2010-12-01 08:26:00    3.39   17850 United…
5 536365    84029E   RED WO…       6 2010-12-01 08:26:00    3.39   17850 United…
6 536365    22752    SET 7 …       2 2010-12-01 08:26:00    7.65   17850 United…
# … with 4 more variables: Type <chr>, Year <dbl>, Month <dbl>, Day <dbl>, and
#   abbreviated variable names ¹​StockCode, ²​Description, ³​Quantity, ⁴​UnitPrice,
#   ⁵​CustomerID
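
To see which stock codes this filter actually drops (they should be service/adjustment codes such as the D and B codes seen earlier), a quick sketch:

Code
# Distinct stock codes that do not start with a digit, i.e. the rows
# excluded by the product filter above.
mutated_retail %>%
  filter(!str_detect(substr(StockCode, 1, 1), '^[0-9]')) %>%
  distinct(StockCode)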

Now, let us find the top 5 countries with the most transactions.

Code
retail.transactions_by_country <- retail.product_transactions %>% group_by(Country, InvoiceNo) %>% count
retail.transactions_by_country <- subset(retail.transactions_by_country, select = -c(n))
retail.transactions_by_country <- retail.transactions_by_country %>% group_by(Country) %>% count %>% arrange(desc(n))
retail.top_country <- head(retail.transactions_by_country, 5)
retail.top_country
# A tibble: 5 × 2
# Groups:   Country [5]
  Country            n
  <chr>          <int>
1 United Kingdom 20932
2 Germany          578
3 France           441
4 EIRE             347
5 Belgium          117
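
The same per-country invoice counts can be obtained a little more directly with n_distinct; a sketch of an equivalent query:

Code
# Count distinct invoices per country in a single summarise step.
retail.product_transactions %>%
  group_by(Country) %>%
  summarise(n = n_distinct(InvoiceNo), .groups = "drop") %>%
  arrange(desc(n)) %>%
  head(5)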

Here’s the plot for the same:

Code
transactions_by_country <- ggplot(retail.top_country, aes(x = Country, y = n)) +
  geom_bar(stat = "identity", fill = "Blue") +
  theme_minimal() + 
  labs(title="Top 5 Countries by transaction volume", 
         x="Country", y = "No of Transactions")

transactions_by_country

Now, let us look at the top countries by number of customers.

Code
retail.customers_by_country <- retail.product_transactions %>% group_by(Country, CustomerID) %>% count
retail.customers_by_country <- subset(retail.customers_by_country, select = -c(n))
retail.customers_by_country <- retail.customers_by_country %>% group_by(Country) %>% count %>% arrange(desc(n))
retail.top_country <- head(retail.customers_by_country, 5)
retail.top_country
# A tibble: 5 × 2
# Groups:   Country [5]
  Country            n
  <chr>          <int>
1 United Kingdom  3944
2 Germany           95
3 France            88
4 Spain             30
5 Belgium           25
Code
customer_by_country <- ggplot(retail.top_country, aes(x = Country, y = n)) +
  geom_bar(stat = "identity", fill = "Orange") +
  theme_minimal() + 
  labs(title="Top 5 Countries by Customer volume", 
         x="Country", y = "No of Customers")

customer_by_country
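
One caveat with the customer counts above: rows with a missing CustomerID form a group of their own, so a country with any missing IDs picks up one extra "customer". A variant that drops those rows first is sketched below:

Code
# Count distinct known customers per country, excluding missing CustomerID.
retail.product_transactions %>%
  filter(!is.na(CustomerID)) %>%
  group_by(Country) %>%
  summarise(n = n_distinct(CustomerID), .groups = "drop") %>%
  arrange(desc(n)) %>%
  head(5)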

Analysis based on total sales

There is another statistic we can calculate per country: the total sales amount. First, I create another data frame, retail.total_sales, which adds a TotalSales column (Quantity multiplied by UnitPrice).

Code
retail.total_sales <- retail.product_transactions %>% 
  mutate(TotalSales = Quantity * UnitPrice)

retail.sales_by_country <- retail.total_sales %>% group_by(Country) %>% summarise(TotalSales = sum(TotalSales)) %>% arrange(desc(TotalSales)) %>% head(5)
retail.sales_by_country
# A tibble: 5 × 2
  Country        TotalSales
  <chr>               <dbl>
1 United Kingdom   8301320.
2 Netherlands       283480.
3 EIRE              259663.
4 Germany           200808.
5 France            182163.
Code
sales_by_country <- ggplot(retail.sales_by_country, aes(x = Country, y = TotalSales)) +
  geom_bar(stat = "identity", fill = "Maroon") +
  theme_minimal() + 
  labs(title="Top 5 Countries by Sales", 
         x="Country", y = "Sales")

sales_by_country

Now we will look at the sales per month.

Code
retail.monthly_sales <- retail.total_sales %>% group_by(Year, Month) %>% summarise(Sales = sum(TotalSales))
retail.monthly_sales
# A tibble: 13 × 3
# Groups:   Year [2]
    Year Month    Sales
   <dbl> <dbl>    <dbl>
 1  2010    12  760258.
 2  2011     1  580389.
 3  2011     2  500494.
 4  2011     3  680457.
 5  2011     4  482878 
 6  2011     5  732206.
 7  2011     6  724991.
 8  2011     7  677881.
 9  2011     8  702564.
10  2011     9 1013424.
11  2011    10 1062674.
12  2011    11 1432732.
13  2011    12  441363.
Code
ggplot(data=retail.monthly_sales, aes(x=Month, y=Sales)) +
  geom_bar(stat="identity") +
  theme_minimal() + 
  labs(title="Monthly Sales", 
         x="Month", y = "Sales")

We can see from the bar plot that sales are highest during November and December, which can be attributed to the holiday season. Note that the x-axis uses the month number only, so the December bar combines December 2010 and December 2011, and the data for December 2011 only runs up to 9 December.
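
A variant of the plot that keeps the year on the axis, so the two Decembers stay separate, is sketched below:

Code
# Plot monthly sales on a year-month axis so that December 2010 and
# December 2011 appear as separate bars.
retail.monthly_sales %>%
  mutate(YearMonth = as.Date(paste(Year, Month, 1, sep = "-"))) %>%
  ggplot(aes(x = YearMonth, y = Sales)) +
  geom_col() +
  theme_minimal() +
  labs(title = "Monthly Sales by Year and Month", x = "Month", y = "Sales")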

Code
head(retail.total_sales)
# A tibble: 6 × 13
  InvoiceNo StockC…¹ Descr…² Quant…³ InvoiceDate         UnitP…⁴ Custo…⁵ Country
  <chr>     <chr>    <chr>     <dbl> <dttm>                <dbl>   <dbl> <chr>  
1 536365    85123A   WHITE …       6 2010-12-01 08:26:00    2.55   17850 United…
2 536365    71053    WHITE …       6 2010-12-01 08:26:00    3.39   17850 United…
3 536365    84406B   CREAM …       8 2010-12-01 08:26:00    2.75   17850 United…
4 536365    84029G   KNITTE…       6 2010-12-01 08:26:00    3.39   17850 United…
5 536365    84029E   RED WO…       6 2010-12-01 08:26:00    3.39   17850 United…
6 536365    22752    SET 7 …       2 2010-12-01 08:26:00    7.65   17850 United…
# … with 5 more variables: Type <chr>, Year <dbl>, Month <dbl>, Day <dbl>,
#   TotalSales <dbl>, and abbreviated variable names ¹​StockCode, ²​Description,
#   ³​Quantity, ⁴​UnitPrice, ⁵​CustomerID

Reflection

Being inexperienced in the field of data science, I feel I have learned a lot by partaking in this course and doing this project.

My initial intention when I started this project was to choose a sports dataset to analyse. I went with this dataset instead, after having a look at it, because of its simplicity. That same simplicity meant it was not going to be easy to find categories within the data. After completing the analysis, I still feel that with some additional data a lot more information could be extracted. For example, if product categories were available, we could have learned much more about sales trends by category. More information about the customers would also have been helpful.

For this project, I approached the analysis by looking at each column individually and trying to extract as many trends from it as I could. After that, I tried to combine these trends to derive relevant information.

I faced many challenges in this project, and the class tutorials as well as online resources like the R documentation and the documentation of libraries such as ggplot2 helped a lot.

I believe that I still have a lot to learn about Data Science, but this course has been a wonderful first step into the field and will immensely help me in this journey.

Bibliography

Dataset from Kaggle - https://www.kaggle.com/datasets/ulrikthygepedersen/online-retail-dataset

ggplot2 Documentation - https://ggplot2.tidyverse.org/reference/

Wickham, H., & Grolemund, G. (2016). R for Data Science: Visualize, Model, Transform, Tidy, and Import Data. O'Reilly Media.

Wickham, H. (2010). A layered grammar of graphics. Journal of Computational and Graphical Statistics, 19(1), 3-28.