Author

Siddharth Goel

Published

February 2, 2023

Introduction

Analysis of online retail data is one of the most important factors in setting up a successful online business. Unlike offline stores, where footfall can be increased with physical marketing strategies, online stores rely mainly on online advertisements and data analysis for business growth.

While advertisements are key to attracting new customers to a store, data analysis can provide valuable information such as top-selling products, the most valuable customer demographics, sales patterns, and details of dead and top stock. This information can help management decide which areas of the business need investment and which products account for most of the store's sales.

So, in this project, I have taken a small sample of the transactions of an online retail store and will analyse the data to identify patterns that may help answer some important business questions.

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(ggplot2)
library(dplyr)
library(psych)

Attaching package: 'psych'

The following objects are masked from 'package:ggplot2':

    %+%, alpha
Code
library(readr)
library(summarytools)
Warning: no DISPLAY variable so Tk is not available
system might not have X11 capabilities; in case of errors when using dfSummary(), set st_options(use.x11 = FALSE)

Attaching package: 'summarytools'

The following object is masked from 'package:tibble':

    view
Code
library(stringr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Import the Data

Importing the online retail dataset

Code
retail <- readr::read_csv("_data/online_retail.csv")
head(retail)
# A tibble: 6 × 8
  InvoiceNo StockC…¹ Descr…² Quant…³ InvoiceDate         UnitP…⁴ Custo…⁵ Country
  <chr>     <chr>    <chr>     <dbl> <dttm>                <dbl>   <dbl> <chr>  
1 536365    85123A   WHITE …       6 2010-12-01 08:26:00    2.55   17850 United…
2 536365    71053    WHITE …       6 2010-12-01 08:26:00    3.39   17850 United…
3 536365    84406B   CREAM …       8 2010-12-01 08:26:00    2.75   17850 United…
4 536365    84029G   KNITTE…       6 2010-12-01 08:26:00    3.39   17850 United…
5 536365    84029E   RED WO…       6 2010-12-01 08:26:00    3.39   17850 United…
6 536365    22752    SET 7 …       2 2010-12-01 08:26:00    7.65   17850 United…
# … with abbreviated variable names ¹​StockCode, ²​Description, ³​Quantity,
#   ⁴​UnitPrice, ⁵​CustomerID
Code
spec(retail)
cols(
  InvoiceNo = col_character(),
  StockCode = col_character(),
  Description = col_character(),
  Quantity = col_double(),
  InvoiceDate = col_datetime(format = ""),
  UnitPrice = col_double(),
  CustomerID = col_double(),
  Country = col_character()
)
Code
dim(retail)
[1] 541909      8

The input dataset has 541909 records of goods sold in an online retail store and 8 columns.

Dataset Summary

The columns and their descriptions are as follows :

  1. InvoiceNo: A 6-digit number which is the invoice number of the transaction in which the item was sold. As a transaction can contain multiple items, the InvoiceNo column is not unique (see the quick check after this list).

  2. StockCode: Item code. A 5-digit integral number uniquely assigned to each product.

  3. Description: Item name.

  4. Quantity: The quantity of the item in the transaction.

  5. InvoiceDate: Invoice Date and time.

  6. UnitPrice: Product price per unit in sterling.

  7. CustomerID: A 5-digit integral number uniquely assigned to each customer.

  8. Country: The name of the country where the customer resides.
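
Since a single invoice can contain multiple items, InvoiceNo should have far fewer distinct values than there are rows. A quick sketch to confirm this on the already loaded data:

Code
# Compare the number of distinct invoices to the total number of rows;
# a large gap confirms that invoices span multiple line items.
c(rows = nrow(retail),
  distinct_invoices = n_distinct(retail$InvoiceNo))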

Now we use the dfSummary function to get a summary of each column in the dataset.

Code
print(dfSummary(retail, 
                varnumbers = FALSE, 
                plain.ascii = FALSE, 
                style = "grid", 
                graph.magnif = 0.70, 
                valid.col= TRUE),
      method= 'render', 
      table.classes= 'table-condensed')

Data Frame Summary

retail

Dimensions: 541909 x 8
Duplicates: 5268
InvoiceNo [character]
    Top 10: 573585 (1114, 0.2%), 581219 (749, 0.1%), 581492 (731, 0.1%), 580729 (721, 0.1%), 558475 (705, 0.1%), 579777 (687, 0.1%), 581217 (676, 0.1%), 537434 (675, 0.1%), 580730 (662, 0.1%), 538071 (652, 0.1%); 25890 others (534537, 98.6%)
    Valid: 541909 (100.0%)    Missing: 0 (0.0%)

StockCode [character]
    Top 10: 85123A (2313, 0.4%), 22423 (2203, 0.4%), 85099B (2159, 0.4%), 47566 (1727, 0.3%), 20725 (1639, 0.3%), 84879 (1502, 0.3%), 22720 (1477, 0.3%), 22197 (1476, 0.3%), 21212 (1385, 0.3%), 20727 (1350, 0.2%); 4060 others (524678, 96.8%)
    Valid: 541909 (100.0%)    Missing: 0 (0.0%)

Description [character]
    Top 10: WHITE HANGING HEART T-LIG (2369, 0.4%), REGENCY CAKESTAND 3 TIER (2200, 0.4%), JUMBO BAG RED RETROSPOT (2159, 0.4%), PARTY BUNTING (1727, 0.3%), LUNCH BAG RED RETROSPOT (1638, 0.3%), ASSORTED COLOUR BIRD ORNA (1501, 0.3%), SET OF 3 CAKE TINS PANTRY (1473, 0.3%), PACK OF 72 RETROSPOT CAKE (1385, 0.3%), LUNCH BAG BLACK SKULL. (1350, 0.2%), NATURAL SLATE HEART CHALK (1280, 0.2%); 4201 others (523373, 96.8%)
    Valid: 540455 (99.7%)    Missing: 1454 (0.3%)

Quantity [numeric]
    Mean (sd): 9.6 (218.1); min ≤ med ≤ max: -80995 ≤ 3 ≤ 80995; IQR (CV): 9 (22.8); 722 distinct values
    Valid: 541909 (100.0%)    Missing: 0 (0.0%)

InvoiceDate [POSIXct, POSIXt]
    min: 2010-12-01 08:26:00; med: 2011-07-19 17:17:00; max: 2011-12-09 12:50:00; range: 1y 0m 8d 4H 24M 0S; 23260 distinct values
    Valid: 541909 (100.0%)    Missing: 0 (0.0%)

UnitPrice [numeric]
    Mean (sd): 4.6 (96.8); min ≤ med ≤ max: -11062.1 ≤ 2.1 ≤ 38970; IQR (CV): 2.9 (21); 1630 distinct values
    Valid: 541909 (100.0%)    Missing: 0 (0.0%)

CustomerID [numeric]
    Mean (sd): 15287.7 (1713.6); min ≤ med ≤ max: 12346 ≤ 15152 ≤ 18287; IQR (CV): 2838 (0.1); 4372 distinct values
    Valid: 406829 (75.1%)    Missing: 135080 (24.9%)

Country [character]
    Top 10: United Kingdom (495478, 91.4%), Germany (9495, 1.8%), France (8557, 1.6%), EIRE (8196, 1.5%), Spain (2533, 0.5%), Netherlands (2371, 0.4%), Belgium (2069, 0.4%), Switzerland (2002, 0.4%), Portugal (1519, 0.3%), Australia (1259, 0.2%); 28 others (8430, 1.6%)
    Valid: 541909 (100.0%)    Missing: 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-02-09

From the summary, we can see that there are 25900 unique invoice numbers, which implies that this data covers 25900 transactions between 2010-12-01 and 2011-12-09. We can also observe that 91.4% of the records come from customers in the United Kingdom. Moreover, the mean quantity per line item is 9.6, which suggests that this store mostly caters to the bulk-ordering needs of its customers.

On reading the item descriptions, it is clear that most of the SKUs in this store are large multi-item sets, e.g. PACK OF 72 RETROSPOT CAKE CASES. This is another indicator that the store mainly caters to wholesale/bulk orders.
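
As a quick cross-check of these observations, the United Kingdom share and the Quantity distribution can be computed directly from the data; a minimal sketch:

Code
# Share of records from the United Kingdom and summary stats for Quantity.
retail %>%
  summarise(uk_share = mean(Country == "United Kingdom"),
            mean_qty = mean(Quantity),
            median_qty = median(Quantity))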

Tidying the dataset

Cleaning the null values

Checking the total number of missing (NA) values in the dataset.

Code
sum(is.na(retail))
[1] 136534

The dataset contains 136534 NA values in total, which is fairly high. Next, we will check whether specific columns contain the bulk of these values or whether they are spread across columns.

Code
na_cols <- colSums(is.na(retail))
colnames(retail) [na_cols > 0]
[1] "Description" "CustomerID" 

Only the columns Description and CustomerID contain NA values.

Checking the number of NA values in each of these columns.

Code
sum(is.na(retail$Description))
[1] 1454
Code
sum(is.na(retail$CustomerID))
[1] 135080

About 25% of the values in CustomerID are NA, whereas only 0.3% of the values in Description are NA. So we should be careful when basing our analysis on customers and should rely more on the other fields.
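
These percentages can be checked directly from the per-column NA proportions; a minimal sketch:

Code
# Percentage of missing values in each column.
round(colMeans(is.na(retail)) * 100, 1)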

Also, since 25% of CustomerID values are NA, we cannot simply drop every row with a missing CustomerID, as we might lose important information. An important thing to look for, then, is entries where both CustomerID and Description are NA; such entries can safely be removed.

Code
retail.cust_desc_null <- retail %>% filter(is.na(CustomerID), is.na(Description), .preserve = TRUE)
head(retail.cust_desc_null)
# A tibble: 6 × 8
  InvoiceNo StockC…¹ Descr…² Quant…³ InvoiceDate         UnitP…⁴ Custo…⁵ Country
  <chr>     <chr>    <chr>     <dbl> <dttm>                <dbl>   <dbl> <chr>  
1 536414    22139    <NA>         56 2010-12-01 11:52:00       0      NA United…
2 536545    21134    <NA>          1 2010-12-01 14:32:00       0      NA United…
3 536546    22145    <NA>          1 2010-12-01 14:33:00       0      NA United…
4 536547    37509    <NA>          1 2010-12-01 14:33:00       0      NA United…
5 536549    85226A   <NA>          1 2010-12-01 14:34:00       0      NA United…
6 536550    85044    <NA>          1 2010-12-01 14:34:00       0      NA United…
# … with abbreviated variable names ¹​StockCode, ²​Description, ³​Quantity,
#   ⁴​UnitPrice, ⁵​CustomerID
Code
count(retail.cust_desc_null)
# A tibble: 1 × 1
      n
  <int>
1  1454

An interesting thing we see here is that UnitPrice is also 0 whenever Description and CustomerID are NA.

To verify this, let us count the rows that meet all three conditions.

Code
retail %>% filter(is.na(CustomerID), is.na(Description), UnitPrice == 0, .preserve = TRUE) %>% count
# A tibble: 1 × 1
      n
  <int>
1  1454

The counts match, which means it is safe to delete the entries that match all three conditions.

Code
mutated_retail <- retail %>% subset(!(is.na(CustomerID) & is.na(Description) & UnitPrice == 0))
head(mutated_retail)
# A tibble: 6 × 8
  InvoiceNo StockC…¹ Descr…² Quant…³ InvoiceDate         UnitP…⁴ Custo…⁵ Country
  <chr>     <chr>    <chr>     <dbl> <dttm>                <dbl>   <dbl> <chr>  
1 536365    85123A   WHITE …       6 2010-12-01 08:26:00    2.55   17850 United…
2 536365    71053    WHITE …       6 2010-12-01 08:26:00    3.39   17850 United…
3 536365    84406B   CREAM …       8 2010-12-01 08:26:00    2.75   17850 United…
4 536365    84029G   KNITTE…       6 2010-12-01 08:26:00    3.39   17850 United…
5 536365    84029E   RED WO…       6 2010-12-01 08:26:00    3.39   17850 United…
6 536365    22752    SET 7 …       2 2010-12-01 08:26:00    7.65   17850 United…
# … with abbreviated variable names ¹​StockCode, ²​Description, ³​Quantity,
#   ⁴​UnitPrice, ⁵​CustomerID
Code
nrow(mutated_retail)
[1] 540455

We can similarly drop all records with an NA CustomerID and a UnitPrice of 0.

Code
mutated_retail <- retail %>% subset(!(is.na(CustomerID) & UnitPrice == 0))
head(mutated_retail)
# A tibble: 6 × 8
  InvoiceNo StockC…¹ Descr…² Quant…³ InvoiceDate         UnitP…⁴ Custo…⁵ Country
  <chr>     <chr>    <chr>     <dbl> <dttm>                <dbl>   <dbl> <chr>  
1 536365    85123A   WHITE …       6 2010-12-01 08:26:00    2.55   17850 United…
2 536365    71053    WHITE …       6 2010-12-01 08:26:00    3.39   17850 United…
3 536365    84406B   CREAM …       8 2010-12-01 08:26:00    2.75   17850 United…
4 536365    84029G   KNITTE…       6 2010-12-01 08:26:00    3.39   17850 United…
5 536365    84029E   RED WO…       6 2010-12-01 08:26:00    3.39   17850 United…
6 536365    22752    SET 7 …       2 2010-12-01 08:26:00    7.65   17850 United…
# … with abbreviated variable names ¹​StockCode, ²​Description, ³​Quantity,
#   ⁴​UnitPrice, ⁵​CustomerID

Processing and Visualization

Analysing Invoice Numbers and Categorizing Transactions

Let us start our analysis by looking at the InvoiceNo column. This is an important field to be considered because the values in this column are repeated and we will use this field to get data about unique transactions.

Looking at the values of this column, most InvoiceNo values are 6-digit integers, whereas the remaining values start with either the character A or C. Let us count these values and dig into what they mean.

Code
distinct_invoice_no <- distinct(mutated_retail, InvoiceNo, .keep_all = TRUE)
distinct_invoice_no['InvoiceStart'] <- substr(distinct_invoice_no$InvoiceNo, 0, 1)
unique(distinct_invoice_no$InvoiceStart)
[1] "5" "C" "A"
Code
# Counting the frequency of the first character of the InvoiceNo
invoice_freq <- distinct_invoice_no %>% group_by(InvoiceStart) %>% 
  summarise(total_count=n(),
            .groups = 'drop')
invoice_freq
# A tibble: 3 × 2
  InvoiceStart total_count
  <chr>              <int>
1 5                  19963
2 A                      3
3 C                   3836

We can observe from the above analysis that roughly 16% of the distinct invoice numbers start with C, which is a significant share. Now, let us try to find the significance of these starting letters by looking at the records where the InvoiceNo starts with A and C.

Code
retail.char_invoice_A <- mutated_retail %>% filter(startsWith(InvoiceNo, 'A'))

retail.char_invoice_C <- mutated_retail %>% filter(startsWith(InvoiceNo, 'C'))

head(retail.char_invoice_A)
# A tibble: 3 × 8
  InvoiceNo StockC…¹ Descr…² Quant…³ InvoiceDate         UnitP…⁴ Custo…⁵ Country
  <chr>     <chr>    <chr>     <dbl> <dttm>                <dbl>   <dbl> <chr>  
1 A563185   B        Adjust…       1 2011-08-12 14:50:00  11062.      NA United…
2 A563186   B        Adjust…       1 2011-08-12 14:51:00 -11062.      NA United…
3 A563187   B        Adjust…       1 2011-08-12 14:52:00 -11062.      NA United…
# … with abbreviated variable names ¹​StockCode, ²​Description, ³​Quantity,
#   ⁴​UnitPrice, ⁵​CustomerID
Code
head(retail.char_invoice_C)
# A tibble: 6 × 8
  InvoiceNo StockC…¹ Descr…² Quant…³ InvoiceDate         UnitP…⁴ Custo…⁵ Country
  <chr>     <chr>    <chr>     <dbl> <dttm>                <dbl>   <dbl> <chr>  
1 C536379   D        Discou…      -1 2010-12-01 09:41:00   27.5    14527 United…
2 C536383   35004C   SET OF…      -1 2010-12-01 09:49:00    4.65   15311 United…
3 C536391   22556    PLASTE…     -12 2010-12-01 10:24:00    1.65   17548 United…
4 C536391   21984    PACK O…     -24 2010-12-01 10:24:00    0.29   17548 United…
5 C536391   21983    PACK O…     -24 2010-12-01 10:24:00    0.29   17548 United…
6 C536391   21980    PACK O…     -24 2010-12-01 10:24:00    0.29   17548 United…
# … with abbreviated variable names ¹​StockCode, ²​Description, ³​Quantity,
#   ⁴​UnitPrice, ⁵​CustomerID

As we can see from the tables retail.char_invoice_A and retail.char_invoice_C above, the records whose InvoiceNo starts with A are adjustments for bad debt, whereas the records whose InvoiceNo starts with C have negative Quantity values. This means those are not sale transactions but cancelled transactions in which the goods were returned to the seller.
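
To double-check the reading of the C-prefixed invoices as returns, we can count how many of those rows have a non-negative Quantity; a minimal sketch (if the reading is right, the count should be zero or very close to it):

Code
# Count C-prefixed rows whose Quantity is not negative; a result of (or near)
# zero supports treating these invoices as cancellations/returns.
retail.char_invoice_C %>%
  filter(Quantity >= 0) %>%
  count()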

I will plot this in a graph for a better overview.

Code
invoice_status <- invoice_freq %>%
  mutate(Type = case_when(
    InvoiceStart == "C" ~ "Cancelled",
    InvoiceStart == "A" ~ "Adjust Debt",
    InvoiceStart == "5" ~ "Sell",
    ))

ggplot(data = invoice_status, aes(x = Type, y=total_count, fill = Type)) +
  geom_bar(stat = "identity") + 
  scale_fill_brewer(palette = "Oranges") +
  ggtitle("Transaction type vs their count") +
  xlab("Transaction type") +
  ylab("Count")

Also, let's add a Type column to the dataset so it will be easier to filter the data if needed.

Code
mutated_retail <- mutated_retail %>%
  mutate(Type = case_when(
    startsWith(InvoiceNo,'C') ~ "Cancelled",
    startsWith(InvoiceNo,'A') ~ "AdjustDebt",
    startsWith(InvoiceNo,'5') ~ "Sale",
    ))

unique(mutated_retail$Type)
[1] "Sale"       "Cancelled"  "AdjustDebt"
Code
head(mutated_retail)
# A tibble: 6 × 9
  InvoiceNo StockC…¹ Descr…² Quant…³ InvoiceDate         UnitP…⁴ Custo…⁵ Country
  <chr>     <chr>    <chr>     <dbl> <dttm>                <dbl>   <dbl> <chr>  
1 536365    85123A   WHITE …       6 2010-12-01 08:26:00    2.55   17850 United…
2 536365    71053    WHITE …       6 2010-12-01 08:26:00    3.39   17850 United…
3 536365    84406B   CREAM …       8 2010-12-01 08:26:00    2.75   17850 United…
4 536365    84029G   KNITTE…       6 2010-12-01 08:26:00    3.39   17850 United…
5 536365    84029E   RED WO…       6 2010-12-01 08:26:00    3.39   17850 United…
6 536365    22752    SET 7 …       2 2010-12-01 08:26:00    7.65   17850 United…
# … with 1 more variable: Type <chr>, and abbreviated variable names
#   ¹​StockCode, ²​Description, ³​Quantity, ⁴​UnitPrice, ⁵​CustomerID

One more adjustment that I would like to make to the data for future purposes is to split the InvoiceDate column into Year, Month and Day columns.

Code
mutated_retail <- mutated_retail %>%
  mutate(Year = as.numeric(format(InvoiceDate, format = "%Y"))) %>%
  mutate(Month = as.numeric(format(InvoiceDate, format = "%m"))) %>%
  mutate(Day = as.numeric(format(InvoiceDate, format = "%d")))

head(mutated_retail)
# A tibble: 6 × 12
  InvoiceNo StockC…¹ Descr…² Quant…³ InvoiceDate         UnitP…⁴ Custo…⁵ Country
  <chr>     <chr>    <chr>     <dbl> <dttm>                <dbl>   <dbl> <chr>  
1 536365    85123A   WHITE …       6 2010-12-01 08:26:00    2.55   17850 United…
2 536365    71053    WHITE …       6 2010-12-01 08:26:00    3.39   17850 United…
3 536365    84406B   CREAM …       8 2010-12-01 08:26:00    2.75   17850 United…
4 536365    84029G   KNITTE…       6 2010-12-01 08:26:00    3.39   17850 United…
5 536365    84029E   RED WO…       6 2010-12-01 08:26:00    3.39   17850 United…
6 536365    22752    SET 7 …       2 2010-12-01 08:26:00    7.65   17850 United…
# … with 4 more variables: Type <chr>, Year <dbl>, Month <dbl>, Day <dbl>, and
#   abbreviated variable names ¹​StockCode, ²​Description, ³​Quantity, ⁴​UnitPrice,
#   ⁵​CustomerID
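
As an aside, the same Year/Month/Day columns could also be derived with lubridate's accessor functions; a minimal sketch (lubridate is not attached above, so it needs to be loaded, and mutated_retail_alt is just an illustrative name):

Code
library(lubridate)

# Equivalent Year/Month/Day split using lubridate accessors instead of format().
mutated_retail_alt <- mutated_retail %>%
  mutate(Year = year(InvoiceDate),
         Month = month(InvoiceDate),
         Day = day(InvoiceDate))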

Now that we know there are multiple types of transactions (Sale and Cancelled), one interesting thing we can do with this data is calculate the most sought-after products (the items with the highest net quantity sold) as well as the dead stock (items with a negative net quantity, i.e. more units returned than sold).

Code
retail.quantity <- mutated_retail %>% filter(str_detect(substr(StockCode, 1, 1), '^[0-9]')) %>% group_by(StockCode, Description) %>% summarise(net_quantity = sum(Quantity))
hot_stock <- retail.quantity %>% arrange(desc(net_quantity))
hot_stock
# A tibble: 4,150 × 3
# Groups:   StockCode [3,914]
   StockCode Description                        net_quantity
   <chr>     <chr>                                     <dbl>
 1 84077     WORLD WAR 2 GLIDERS ASSTD DESIGNS         53847
 2 85099B    JUMBO BAG RED RETROSPOT                   47359
 3 84879     ASSORTED COLOUR BIRD ORNAMENT             36381
 4 22197     POPCORN HOLDER                            36334
 5 21212     PACK OF 72 RETROSPOT CAKE CASES           36039
 6 85123A    WHITE HANGING HEART T-LIGHT HOLDER        35021
 7 23084     RABBIT NIGHT LIGHT                        30680
 8 22492     MINI PAINT SET VINTAGE                    26437
 9 22616     PACK OF 12 LONDON TISSUES                 26111
10 21977     PACK OF 60 PINK PAISLEY CAKE CASES        24753
# … with 4,140 more rows
Code
dead_stock <- retail.quantity %>% arrange(net_quantity)
dead_stock
# A tibble: 4,150 × 3
# Groups:   StockCode [3,914]
   StockCode Description                     net_quantity
   <chr>     <chr>                                  <dbl>
 1 21645     ASSORTED TUTTI FRUTTI ROUND BOX          -24
 2 79323W    WHITE CHERRY LIGHTS                       -8
 3 21144     PINK POODLE HANGING DECORATION            -7
 4 20703     BLUE PADDED SOFT MOBILE                   -6
 5 21412     VINTAGE GOLD TINSEL REEL                  -6
 6 35400     WOODEN BOX ADVENT CALENDAR                -6
 7 79323P    PINK CHERRY LIGHTS                        -4
 8 84750A    PINK SMALL GLASS CAKE STAND               -3
 9 85042     ANTIQUE LILY FAIRY LIGHTS                 -3
10 85063     CREAM SWEETHEART MAGAZINE RACK            -3
# … with 4,140 more rows

We now have the top-selling products and the most returned products for this online store. Now, let us present this a bit better. First, let's find the top 10 products in the store.

Code
top_10_prods <- head(hot_stock, 10)
top_10_prods
# A tibble: 10 × 3
# Groups:   StockCode [10]
   StockCode Description                        net_quantity
   <chr>     <chr>                                     <dbl>
 1 84077     WORLD WAR 2 GLIDERS ASSTD DESIGNS         53847
 2 85099B    JUMBO BAG RED RETROSPOT                   47359
 3 84879     ASSORTED COLOUR BIRD ORNAMENT             36381
 4 22197     POPCORN HOLDER                            36334
 5 21212     PACK OF 72 RETROSPOT CAKE CASES           36039
 6 85123A    WHITE HANGING HEART T-LIGHT HOLDER        35021
 7 23084     RABBIT NIGHT LIGHT                        30680
 8 22492     MINI PAINT SET VINTAGE                    26437
 9 22616     PACK OF 12 LONDON TISSUES                 26111
10 21977     PACK OF 60 PINK PAISLEY CAKE CASES        24753

Now, let's put them in a pie chart:

Code
top_prods_pie <- ggplot(top_10_prods, aes(x = "", y = net_quantity, fill = Description)) +
  geom_bar(stat="identity", width=1, color="white") +
  coord_polar("y", start=0) +
  theme_void() + 
  labs(title= "Top 10 Best selling Products")

top_prods_pie
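
Since pie slices of similar size are hard to compare visually, the same data could also be shown as a horizontal bar chart; a sketch of one possible alternative:

Code
# Same top-10 products as a horizontal bar chart, ordered by net quantity.
ggplot(top_10_prods,
       aes(x = reorder(Description, net_quantity), y = net_quantity)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Top 10 Best Selling Products",
       x = "Product", y = "Net quantity sold")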

Analysing Country-wise transactions

First, I will filter the transactions based on StockCode so that only transactions related to products (and not extra fees, postage, etc.) are kept; product stock codes start with a digit.

Code
retail.product_transactions <- mutated_retail %>% filter(str_detect(substr(StockCode, 1, 1), '^[0-9]'))
head(retail.product_transactions)
# A tibble: 6 × 12
  InvoiceNo StockC…¹ Descr…² Quant…³ InvoiceDate         UnitP…⁴ Custo…⁵ Country
  <chr>     <chr>    <chr>     <dbl> <dttm>                <dbl>   <dbl> <chr>  
1 536365    85123A   WHITE …       6 2010-12-01 08:26:00    2.55   17850 United…
2 536365    71053    WHITE …       6 2010-12-01 08:26:00    3.39   17850 United…
3 536365    84406B   CREAM …       8 2010-12-01 08:26:00    2.75   17850 United…
4 536365    84029G   KNITTE…       6 2010-12-01 08:26:00    3.39   17850 United…
5 536365    84029E   RED WO…       6 2010-12-01 08:26:00    3.39   17850 United…
6 536365    22752    SET 7 …       2 2010-12-01 08:26:00    7.65   17850 United…
# … with 4 more variables: Type <chr>, Year <dbl>, Month <dbl>, Day <dbl>, and
#   abbreviated variable names ¹​StockCode, ²​Description, ³​Quantity, ⁴​UnitPrice,
#   ⁵​CustomerID
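
To see which stock codes this filter actually drops (they should be service/adjustment codes such as the D and B codes seen earlier), a quick sketch:

Code
# Distinct stock codes that do not start with a digit, i.e. the rows
# excluded by the product filter above.
mutated_retail %>%
  filter(!str_detect(substr(StockCode, 1, 1), '^[0-9]')) %>%
  distinct(StockCode)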

Now, let us find the top 5 countries with the most transactions.

Code
retail.transactions_by_country <- retail.product_transactions %>% group_by(Country, InvoiceNo) %>% count
retail.transactions_by_country <- subset(retail.transactions_by_country, select = -c(n))
retail.transactions_by_country <- retail.transactions_by_country %>% group_by(Country) %>% count %>% arrange(desc(n))
retail.top_country <- head(retail.transactions_by_country, 5)
retail.top_country
# A tibble: 5 × 2
# Groups:   Country [5]
  Country            n
  <chr>          <int>
1 United Kingdom 20932
2 Germany          578
3 France           441
4 EIRE             347
5 Belgium          117
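
The same per-country invoice counts can be obtained a little more directly with n_distinct; a sketch of an equivalent query:

Code
# Count distinct invoices per country in a single summarise step.
retail.product_transactions %>%
  group_by(Country) %>%
  summarise(n = n_distinct(InvoiceNo), .groups = "drop") %>%
  arrange(desc(n)) %>%
  head(5)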

Here’s the plot for the same:

Code
transactions_by_country <- ggplot(retail.top_country, aes(x = Country, y = n)) +
  geom_bar(stat = "identity", fill = "Blue") +
  theme_minimal() + 
  labs(title="Top 5 Countries by transaction volume", 
         x="Country", y = "No of Transactions")

transactions_by_country

Now, let us look at the top countries by number of customers.

Code
retail.customers_by_country <- retail.product_transactions %>% group_by(Country, CustomerID) %>% count
retail.customers_by_country <- subset(retail.customers_by_country, select = -c(n))
retail.customers_by_country <- retail.customers_by_country %>% group_by(Country) %>% count %>% arrange(desc(n))
retail.top_country <- head(retail.customers_by_country, 5)
retail.top_country
# A tibble: 5 × 2
# Groups:   Country [5]
  Country            n
  <chr>          <int>
1 United Kingdom  3944
2 Germany           95
3 France            88
4 Spain             30
5 Belgium           25
Code
customer_by_country <- ggplot(retail.top_country, aes(x = Country, y = n)) +
  geom_bar(stat = "identity", fill = "Orange") +
  theme_minimal() + 
  labs(title="Top 5 Countries by Customer volume", 
         x="Country", y = "No of Customers")

customer_by_country
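
One caveat with the customer counts above: rows with a missing CustomerID form a group of their own, so a country with any missing IDs picks up one extra "customer". A variant that drops those rows first is sketched below:

Code
# Count distinct known customers per country, excluding missing CustomerID.
retail.product_transactions %>%
  filter(!is.na(CustomerID)) %>%
  group_by(Country) %>%
  summarise(n = n_distinct(CustomerID), .groups = "drop") %>%
  arrange(desc(n)) %>%
  head(5)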

Analysis based on total sales

There is another statistic we can calculate per country: the total sales amount. First, I create another data frame, retail.total_sales, which adds a TotalSales column (Quantity multiplied by UnitPrice).

Code
retail.total_sales <- retail.product_transactions %>% 
  mutate(TotalSales = Quantity * UnitPrice)

retail.sales_by_country <- retail.total_sales %>% group_by(Country) %>% summarise(TotalSales = sum(TotalSales)) %>% arrange(desc(TotalSales)) %>% head(5)
retail.sales_by_country
# A tibble: 5 × 2
  Country        TotalSales
  <chr>               <dbl>
1 United Kingdom   8301320.
2 Netherlands       283480.
3 EIRE              259663.
4 Germany           200808.
5 France            182163.
Code
sales_by_country <- ggplot(retail.sales_by_country, aes(x = Country, y = TotalSales)) +
  geom_bar(stat = "identity", fill = "Maroon") +
  theme_minimal() + 
  labs(title="Top 5 Countries by Sales", 
         x="Country", y = "Sales")

sales_by_country

Now we will look at the sales per month.

Code
retail.monthly_sales <- retail.total_sales %>% group_by(Year, Month) %>% summarise(Sales = sum(TotalSales))
retail.monthly_sales
# A tibble: 13 × 3
# Groups:   Year [2]
    Year Month    Sales
   <dbl> <dbl>    <dbl>
 1  2010    12  760258.
 2  2011     1  580389.
 3  2011     2  500494.
 4  2011     3  680457.
 5  2011     4  482878 
 6  2011     5  732206.
 7  2011     6  724991.
 8  2011     7  677881.
 9  2011     8  702564.
10  2011     9 1013424.
11  2011    10 1062674.
12  2011    11 1432732.
13  2011    12  441363.
Code
ggplot(data=retail.monthly_sales, aes(x=Month, y=Sales)) +
  geom_bar(stat="identity") +
  theme_minimal() + 
  labs(title="Monthly Sales", 
         x="Month", y = "Sales")

We can see from the bar plot that sales are highest during November and December, which can be attributed to the holiday season. Note that the x-axis uses the month number only, so the December bar combines December 2010 and December 2011, and the data for December 2011 only runs up to 9 December.
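
A variant of the plot that keeps the year on the axis, so the two Decembers stay separate, is sketched below:

Code
# Plot monthly sales on a year-month axis so that December 2010 and
# December 2011 appear as separate bars.
retail.monthly_sales %>%
  mutate(YearMonth = as.Date(paste(Year, Month, 1, sep = "-"))) %>%
  ggplot(aes(x = YearMonth, y = Sales)) +
  geom_col() +
  theme_minimal() +
  labs(title = "Monthly Sales by Year and Month", x = "Month", y = "Sales")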

Code
head(retail.total_sales)
# A tibble: 6 × 13
  InvoiceNo StockC…¹ Descr…² Quant…³ InvoiceDate         UnitP…⁴ Custo…⁵ Country
  <chr>     <chr>    <chr>     <dbl> <dttm>                <dbl>   <dbl> <chr>  
1 536365    85123A   WHITE …       6 2010-12-01 08:26:00    2.55   17850 United…
2 536365    71053    WHITE …       6 2010-12-01 08:26:00    3.39   17850 United…
3 536365    84406B   CREAM …       8 2010-12-01 08:26:00    2.75   17850 United…
4 536365    84029G   KNITTE…       6 2010-12-01 08:26:00    3.39   17850 United…
5 536365    84029E   RED WO…       6 2010-12-01 08:26:00    3.39   17850 United…
6 536365    22752    SET 7 …       2 2010-12-01 08:26:00    7.65   17850 United…
# … with 5 more variables: Type <chr>, Year <dbl>, Month <dbl>, Day <dbl>,
#   TotalSales <dbl>, and abbreviated variable names ¹​StockCode, ²​Description,
#   ³​Quantity, ⁴​UnitPrice, ⁵​CustomerID

Reflection

Being inexperienced in the field of data science, I feel I have learned a lot by partaking in this course and doing this project.

My initial intention when I started this project was to choose a sports dataset to analyse. I went with this dataset instead, after having a look at it, because of its simplicity. That same simplicity meant it was not going to be easy to find categories within the data. After completing the analysis, I still feel that with some additional data a lot more information could be extracted. For example, if product categories were available, we could have learned much more about sales trends by category. More information about the customers would also have been helpful.

For this project, I approached the analysis by looking at each column individually and trying to extract as many trends from it as I could. After that, I tried to combine these trends to derive relevant information.

I faced many challenges in this project, and the class tutorials as well as online resources like the R documentation and the documentation of libraries such as ggplot2 helped a lot.

I believe that I still have a lot to learn about Data Science, but this course has been a wonderful first step into the field and will immensely help me in this journey.

Bibliography

Dataset from Kaggle - https://www.kaggle.com/datasets/ulrikthygepedersen/online-retail-dataset

ggplot2 Documentation - https://ggplot2.tidyverse.org/reference/

Wickham, H., & Grolemund, G. (2016). R for Data Science: Visualize, Model, Transform, Tidy, and Import Data. O'Reilly Media.

Wickham, H. (2010). A layered grammar of graphics. Journal of Computational and Graphical Statistics, 19(1), 3-28.