
Predict Sales with Machine Learning for Efficient Inventory

Overview

Deep Learning (also known as an Artificial Neural Network, or ANN) is arguably the most sophisticated Machine Learning algorithm to date. It is designed to mimic how the human brain is believed to function. Deep Learning is primarily Supervised Learning (with the exception of Autoencoders), meaning that it needs pre-classified examples to learn from.

In most scenarios, data experts manually set the “response column”, which can be a class label or a numeric value. Supervised Machine Learning is great, but not without a caveat: labelling the examples is both laborious and time consuming. On the other hand, providing fewer examples can adversely impact reliability.

So, the question here: is it possible to provide supervised training data with no supervision? Fortunately, the answer is yes (though not in every case).


Time Series

A time series is a set of data points, each of which represents the value of the same variable at different times (Oxford Dictionary). Examples include share prices, sales, temperature, air pressure and so on. Events are typically recorded at equal time or date intervals. In this blog, product sales will be used as a practical example.

In most cases, sales data are stored in relational databases, where they are not organised as time series. A sample is illustrated in Table 1.

| Trans_No | Date     | Item_ID  | Qty      |
|----------|----------|----------|----------|
| 33       | 20-03-17 | 10000144 | 1        |
| 128      | 20-03-17 | 10004227 | 2        |
| 113      | 20-03-17 | 10004227 | 1        |
| 184      | 20-03-17 | 10004708 | 1        |
| 184      | 20-03-17 | 10004715 | 1        |
| 19       | 20-03-17 | 20005405 | 20005405 |
| 33       | 20-03-17 | 80305170 | 1        |
| 33       | 20-03-17 | 80305170 | 1        |
| 96       | 20-03-17 | 80308225 | 1        |
| 120      | 20-03-17 | 80323792 | 4        |
| 34       | 20-03-17 | 80341574 | 1        |
| 193      | 20-03-17 | 80341574 | 1        |
| 204      | 20-03-17 | 80341574 | 1        |
| 204      | 20-03-17 | 80341574 | 1        |
| 204      | 20-03-17 | 80341574 | 1        |
| 257      | 20-03-17 | 80341574 | 3        |
| 275      | 20-03-17 | 80341574 | 1        |
| 3        | 20-03-17 | 80341574 | 1        |
| 74       | 20-03-17 | 80341574 | 1        |
| 88       | 20-03-17 | 80341574 | 1        |

Table 1: raw sales data as seen in relational database


Each data sample, or observation, represents sold items. One thing should be obvious here: items appear at random intervals. While this may be the norm from a sales manager’s perspective, it is quite problematic for the data scientist because prediction (in the sense of inventory management) is impossible unless the time intervals are identical.

This problem, however, can be easily tackled by data aggregation techniques. Table 2 illustrates the dataset after aggregation; in this case, item sales are summed up on a weekly basis. Below is the R code that aggregates item sales (with the dataset already loaded into H2O).
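The loop assumes that H2O is already running, that the raw transactions have been imported into an H2O frame named Sales (with a derived WeekNo column and quantities in sum_Total_TRN_QTY), and that Weekly_ITEM_SALES has been seeded with one row per item. A minimal sketch of that setup, in which the file path and the ITEM_CATEGORY column name are purely illustrative, might look like this:

library(h2o)
h2o.init()  # start or connect to a local H2O instance

# Import the raw transactions (the file path is illustrative)
Sales <- h2o.importFile("weekly_sales.csv")

# Seed the result frame with one row per item (and its category) so that each
# week's totals can later be merged in as a new column
Weekly_ITEM_SALES <- h2o.group_by(Sales, by = c("ITEM_CODE", "ITEM_CATEGORY"), nrow("ITEM_CODE"))
Weekly_ITEM_SALES <- Weekly_ITEM_SALES[, c("ITEM_CODE", "ITEM_CATEGORY")]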

j <- 3 #Number of weeks to aggregate (last week = 3)
i <- 1 #Loop counter, starting from the first week
while(i <= j){
  #Week number currently being processed
  iWeek <- i

  #Create a mask to filter out transactions of weeks other than i
  mask_WeekNo <- (Sales[,"WeekNo"] == iWeek)

  #Apply the mask; save the result to a temporary frame
  TempDF <- Sales[mask_WeekNo,]

  #Get the sum of week (i) sales for each item
  ITEM_SALES <- h2o.group_by(TempDF, by = c("ITEM_CODE"), sum("sum_Total_TRN_QTY"))

  #Round sales to the nearest integer
  ITEM_SALES$sum_sum_Total_TRN_QTY <- round(ITEM_SALES$sum_sum_Total_TRN_QTY)

  #Rename the quantity column to indicate the week number
  names(ITEM_SALES)[names(ITEM_SALES) == "sum_sum_Total_TRN_QTY"] <- paste0("Week_", toString(iWeek))

  #Append the week's sales as a new column by merging on the item code
  Weekly_ITEM_SALES <- h2o.merge(Weekly_ITEM_SALES, ITEM_SALES, by.x = "ITEM_CODE", by.y = "ITEM_CODE", all.x = TRUE)

  #Move to the next week
  i <- i + 1
}

The output of the R code is presented in Table 2.


| Item_ID | Item_Category | Week_1 | Week_2 | Week_3 |
|---------|---------------|--------|--------|--------|
| 000007  | 2             | 3      | 3      | 1      |
| 000010  | 10            | NA     | 12     | NA     |
| 000015  | 36            | 5      | 1      | 6      |
| 000022  | 24            | 50     | 62     | 23     |
| 000023  | 90            | NA     | NA     | NA     |
| 000036  | 7             | 1      | 28     | 20     |
| 000043  | 73            | 1      | NA     | NA     |
| 000052  | 10            | NA     | 2      | NA     |
| 000053  | 13            | 2      | 5      | 7      |
| 000054  | 33            | NA     | NA     | NA     |
| 000055  | 2             | NA     | NA     | 2      |
| 000063  | 2             | NA     | 1      | 2      |
| 000070  | 27            | NA     | 36     | NA     |
| 000086  | 13            | 34     | 36     | 25     |
| 000087  | 13            | 22     | 21     | 27     |
| 000094  | 13            | 3      | 2      | 1      |
| 000101  | 73            | 2      | 4      | 6      |
| 000113  | 24            | NA     | 3      | NA     |
| 000116  | 36            | 1      | 2      | 2      |
| 000121  | 11            | 41     | 85     | 88     |

Table 2: semi processed data sample with basic aggregation


Table 2 represents a more “proper” time series dataset and is ready to train our ANN once noise and outliers (i.e. errors and abnormal events) have been removed. Noise is common in large datasets; for example, an item code stored in the Quantity field due to human error. Outliers may reflect sales promotions or out-of-stock situations.
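As an illustration, a very simple clean-up pass might discard transactions whose quantity is implausibly large (for instance an item code typed into the quantity field) and cap extreme weekly totals; the cut-off and quantile below are assumptions for this sketch, not values from the original pipeline:

#Drop transactions whose quantity looks like a data-entry error
#(the 1,000-unit cut-off is an illustrative assumption)
Sales <- Sales[Sales[, "sum_Total_TRN_QTY"] < 1000, ]

#Cap extreme weekly totals (e.g. promotion spikes) at the 99th percentile
q99 <- h2o.quantile(Weekly_ITEM_SALES$Week_1, probs = 0.99)
Weekly_ITEM_SALES$Week_1 <- h2o.ifelse(Weekly_ITEM_SALES$Week_1 > q99, q99, Weekly_ITEM_SALES$Week_1)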

At this point, the target column (also known as the response column) is Week_3. This is pretty good news for a data scientist for two reasons:

  • The response column data is readily available.
  • There is plenty of training data: in fact, as many examples as there are inventory items.


You may wonder why we would bother to predict Week_3 when we already have it. Makes no sense at all! I will come back to this point in a minute.

Our model managed to score a validation deviance of 230, which is not good enough for commercial use. Increasing the number of epochs or hidden layers does not always improve accuracy (and instead risks overfitting). So, the question here is: how can we improve the model's performance?


One way to do so is to provide more item attributes such as item subcategory, product family, and so on. In this blog, I assume no useful attributes are available (as is the case with many real datasets). Fortunately, time series data actually allows us to make something out of (almost) nothing! The same dataset in Table 2 can be used to create new statistical attributes, namely: min, max, mean and median. The new and enhanced dataset is illustrated in Table 3. Note that some items did not generate any sales at all during the period in question, hence the “NA” values.
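A minimal sketch of how those statistics might be computed, assuming the weekly frame is pulled back into R with as.data.frame() and that the column names match Table 2 (weekly_df is an illustrative name):

#Bring the weekly sales back into R for row-wise feature engineering
weekly_df <- as.data.frame(Weekly_ITEM_SALES)
week_cols <- c("Week_1", "Week_2", "Week_3")

#Row-wise statistics over the weekly sales, ignoring missing weeks
weekly_df$Min    <- apply(weekly_df[, week_cols], 1, min,    na.rm = TRUE)
weekly_df$Max    <- apply(weekly_df[, week_cols], 1, max,    na.rm = TRUE)
weekly_df$Mean   <- apply(weekly_df[, week_cols], 1, mean,   na.rm = TRUE)
weekly_df$Median <- apply(weekly_df[, week_cols], 1, median, na.rm = TRUE)

#Items with no sales in any week produce non-finite values above;
#reset them to NA so they match Table 3
no_sales <- !is.finite(weekly_df$Mean)
weekly_df[no_sales, c("Min", "Max", "Mean", "Median")] <- NA

#Push the enriched frame back to H2O for training
Weekly_ITEM_SALES <- as.h2o(weekly_df)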


| Item_ID | Item_Category | Week_1 | Week_2 | Week_3 | Min | Max | Mean    | Median |
|---------|---------------|--------|--------|--------|-----|-----|---------|--------|
| 7       | 2             | 3      | 3      | 1      | 1   | 3   | 2.33333 | 3      |
| 10      | 10            | NA     | 12     | NA     | 12  | 12  | 12      | 12     |
| 15      | 36            | 5      | 1      | 6      | 1   | 6   | 4       | 5      |
| 22      | 24            | 50     | 62     | 23     | 23  | 62  | 45      | 50     |
| 23      | 90            | NA     | NA     | NA     | NA  | NA  | NA      | NA     |
| 36      | 7             | 1      | 28     | 20     | 1   | 28  | 16.3333 | 20     |
| 43      | 73            | 1      | NA     | NA     | 1   | 1   | 1       | 1      |
| 52      | 10            | NA     | 2      | NA     | 2   | 2   | 2       | 2      |
| 53      | 13            | 2      | 5      | 7      | 2   | 7   | 4.66667 | 5      |
| 54      | 33            | NA     | NA     | NA     | NA  | NA  | NA      | NA     |
| 55      | 2             | NA     | NA     | 2      | 2   | 2   | 2       | 2      |
| 63      | 2             | NA     | 1      | 2      | 1   | 2   | 1.5     | 1.5    |
| 70      | 27            | NA     | 36     | NA     | 36  | 36  | 36      | 36     |
| 86      | 13            | 34     | 36     | 25     | 25  | 36  | 31.6667 | 34     |
| 87      | 13            | 22     | 21     | 27     | 21  | 27  | 23.3333 | 22     |
| 94      | 13            | 3      | 2      | 1      | 1   | 3   | 2       | 2      |
| 101     | 73            | 2      | 4      | 6      | 2   | 6   | 4       | 4      |
| 113     | 24            | NA     | 3      | NA     | 3   | 3   | 3       | 3      |
| 116     | 36            | 1      | 2      | 2      | 1   | 2   | 1.66667 | 2      |
| 121     | 11            | 41     | 85     | 88     | 41  | 88  | 71.3333 | 85     |

Table 3: data sample with additional features


Feeding the data from Table 3 into our model resulted in much improved accuracy, with a validation deviance of only 53 (compared with 230): a more than fourfold improvement. This is not perfect and additional tuning might be required (I typically stop when the validation deviance is below 10), but it should make the point clear.


#Train a Deep Learning Model (ANN) on Weekly_ITEM_SALES
#(the response column is quoted and excluded from the predictor list)
Predict_Sales <- h2o.deeplearning(x = setdiff(names(Weekly_ITEM_SALES), "Week_3"),
                        y = "Week_3",
                        training_frame = Weekly_ITEM_SALES,
                        nfolds = 6,
                        fold_assignment = "Modulo",
                        ignore_const_cols = TRUE,
                        model_id = "Sales_Prediction.dl",
                        overwrite_with_best_model = TRUE,
                        standardize = TRUE,
                        activation = c("Rectifier"),
                        hidden = c(100),
                        epochs = 9000,
                        train_samples_per_iteration = -2,
                        adaptive_rate = TRUE,
                        stopping_rounds = 3,
                        score_interval = 3,
                        stopping_metric = c("deviance"),
                        missing_values_handling = c("MeanImputation"),
                        max_runtime_secs = 7200,
                        fast_mode = TRUE)
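
To obtain deviance figures like those quoted above, the cross-validated performance of the trained model can be inspected; this is a sketch using standard H2O accessors:

#Cross-validated (hold-out) metrics of the trained model
perf <- h2o.performance(Predict_Sales, xval = TRUE)
h2o.mean_residual_deviance(perf)  #the "validation deviance" referred to in the text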

Now it’s time to go back to the question: why would we predict Week_3 when it’s already known?


It is essential to understand how Deep Learning works in the first place. Broadly speaking, Deep Learning uses the response column during the training phase to adjust the weights between the neurons iteratively until the predictions and the response column are closely matched. How “close” is measured in a number of ways, such as validation deviance (for numeric prediction) and the confusion matrix (for classification).

In the prediction phase, the column Week_1 is discarded: Weeks 2 and 3 become input variables, while Week_4 (the future) becomes the new response column.
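A sketch of that shift, assuming the trained model keeps its original column names and that Scoring_Frame and Week_4_Forecast are illustrative names (the statistical features would also be recomputed over the new window):

#Build the scoring frame by sliding the window forward one week:
#the model's "Week_1" slot now holds Week_2, and its "Week_2" slot holds Week_3
Scoring_Frame <- Weekly_ITEM_SALES
Scoring_Frame$Week_1 <- Weekly_ITEM_SALES$Week_2
Scoring_Frame$Week_2 <- Weekly_ITEM_SALES$Week_3

#The value predicted for "Week_3" is, in effect, the forecast for Week_4
Week_4_Forecast <- h2o.predict(Predict_Sales, Scoring_Frame)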


| Item_ID | Item_Category | Week_2 | Week_3 | Week_4 | Min | Max | Mean    | Median |
|---------|---------------|--------|--------|--------|-----|-----|---------|--------|
| 7       | 2             | 3      | 1      | 3      | 1   | 3   | 2.33333 | 3      |
| 10      | 10            | 12     | NA     | 10     | 12  | 12  | 12      | 12     |
| 15      | 36            | 1      | 6      | 5      | 1   | 6   | 4       | 5      |
| 22      | 24            | 62     | 23     | 23     | 23  | 62  | 45      | 50     |
| 23      | 90            | NA     | NA     | NA     | NA  | NA  | NA      | NA     |
| 36      | 7             | 28     | 20     | 21     | 1   | 28  | 16.3333 | 20     |
| 43      | 73            | NA     | NA     | 1      | 1   | 1   | 1       | 1      |
| 52      | 10            | 2      | NA     | 0      | 2   | 2   | 2       | 2      |
| 53      | 13            | 5      | 7      | 7      | 2   | 7   | 4.66667 | 5      |
| 54      | 33            | NA     | NA     | NA     | NA  | NA  | NA      | NA     |
| 55      | 2             | NA     | 2      | 2      | 2   | 2   | 2       | 2      |
| 63      | 2             | 1      | 2      | 2      | 1   | 2   | 1.5     | 1.5    |
| 70      | 27            | 36     | NA     | 32     | 36  | 36  | 36      | 36     |
| 86      | 13            | 36     | 25     | 25     | 25  | 36  | 31.6667 | 34     |
| 87      | 13            | 21     | 27     | 27     | 21  | 27  | 23.3333 | 22     |
| 94      | 13            | 2      | 1      | 1      | 1   | 3   | 2       | 2      |
| 101     | 73            | 4      | 6      | 6      | 2   | 6   | 4       | 4      |
| 113     | 24            | 3      | NA     | 1      | 3   | 3   | 3       | 3      |
| 116     | 36            | 2      | 2      | 2      | 1   | 2   | 1.66667 | 2      |
| 121     | 11            | 85     | 88     | 88     | 41  | 88  | 71.3333 | 85     |

Table 4: prediction. Note that Week_1 was discarded and Week_4 is the new response column


Processing Performance

It is hard to talk about Machine Learning without mentioning High Performance Computing (HPC). Anyone involved in Machine Learning appreciates this for an obvious reason: how much time do you spend frustratingly gazing at your screen while waiting for the progress bar to hit 100%? The rule of thumb here is that “faster” or “bigger” machines yield better performance. In practice, however, it is desirable to find a balance between cost and performance.

For the purpose of this blog, the same computation on the same dataset was run on two common but very different hardware platforms: a fairly old Intel i3 laptop with 6GB of RAM, and a beefy workstation with an X-series i7-6800K and 32GB of RAM. R was used for data pre-processing and manipulation. The actual Deep Learning model was trained on the same machines using H2O. The performance is summarised in the figure below:


Figure 1: Training throughput. Higher values indicate better performance


It is important to understand the performance characteristics of the intended Machine Learning platform as well as the dataset at hand. For example, does a faster machine (in terms of GHz) perform better, or a machine with more cores at a lower clock speed? Is the workload CPU-bound or memory-bound? Very large datasets can overstress the disks (especially mechanical disks), causing IO bottlenecks. Large Hadoop clusters can suffer from synchronization overhead, especially on relatively slow networks.

H2O is an efficient in-memory MapReduce platform that benefits from larger CPU caches and scales very nicely with more CPU cores.
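For example, the number of threads and the heap size can be set explicitly when H2O is started; the values below are illustrative rather than a recommendation:

library(h2o)

#nthreads = -1 uses every available core; max_mem_size is an illustrative value
h2o.init(nthreads = -1, max_mem_size = "16g")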


Figure 2: H2O puts all 6 cores (12 threads) to work


For more insight into performance, the interested reader can refer to my previous blogs, Scaling Performance and Impact of Virtualization on Machine Learning, for a more detailed discussion.