Predict Sales with Machine Learning for Efficient Inventory Management
Overview
Deep Learning (also known as an Artificial Neural Network, or ANN) is arguably the most sophisticated Machine Learning technique in use today. It is designed to mimic how the human brain is believed to function. Deep Learning is essentially Supervised Learning (with the exception of Autoencoders), meaning that it needs pre-labelled examples to learn from.
In most scenarios, data experts manually populate the “response column”, which can be a class label or a numeric value. Supervised Machine Learning is powerful, but not without a caveat: labelling is both laborious and time-consuming. On the other hand, providing fewer examples can adversely affect reliability.
So, the question is: can we produce supervised training data with no manual supervision? Fortunately, the answer is yes (though not in every case).
Time Series
A time series is a set of data points, each of which represents the value of the same variable at different times (Oxford dictionary). Examples include share prices, sales, temperature, air pressure and so on. Events are typically recorded at equal time or date intervals. In this blog, product sales will be used as a practical example.
In most cases, sales data are stored in relational databases, which do not hold them as time series. A sample is illustrated in Table 1.
Trans_No | Date | Item_ID | Qty |
--- | --- | --- | --- |
33 | 20-03-17 | 10000144 | 1 |
128 | 20-03-17 | 10004227 | 2 |
113 | 20-03-17 | 10004227 | 1 |
184 | 20-03-17 | 10004708 | 1 |
184 | 20-03-17 | 10004715 | 1 |
19 | 20-03-17 | 20005405 | 20005405 |
33 | 20-03-17 | 80305170 | 1 |
33 | 20-03-17 | 80305170 | 1 |
96 | 20-03-17 | 80308225 | 1 |
120 | 20-03-17 | 80323792 | 4 |
34 | 20-03-17 | 80341574 | 1 |
193 | 20-03-17 | 80341574 | 1 |
204 | 20-03-17 | 80341574 | 1 |
204 | 20-03-17 | 80341574 | 1 |
204 | 20-03-17 | 80341574 | 1 |
257 | 20-03-17 | 80341574 | 3 |
275 | 20-03-17 | 80341574 | 1 |
3 | 20-03-17 | 80341574 | 1 |
74 | 20-03-17 | 80341574 | 1 |
88 | 20-03-17 | 80341574 | 1 |
Table 1: Raw sales data as stored in a relational database
Each data sample, or observation, represents items sold in a transaction. One thing should be obvious here: items appear at irregular intervals. While this may be the norm from a sales manager’s perspective, it is quite problematic for the data scientist, because prediction (in the inventory management sense) is impossible unless the observations fall at equal time intervals.
This problem, however, can be easily tackled with data aggregation techniques. Table 2 illustrates the dataset after aggregation; in this case, item sales are summed up on a weekly basis. Below is the R code that aggregates item sales (with the dataset already loaded into H2O).
```r
library(h2o)  # assumes an H2O cluster is already running and the data are loaded
# 'Sales' holds the raw transactions; 'Weekly_ITEM_SALES' is assumed to be
# initialised with one row per ITEM_CODE before the loop.

j <- 3  # Set last week to 3
i <- 1  # Set the previous x weeks, used to increment the loop

while (i <= j) {
  # Week number in data
  iWeek <- i

  # Create a mask to filter out transactions of weeks other than i
  mask_WeekNo <- (Sales[, "WeekNo"] == iWeek)

  # Apply the mask; save the result to a temporary data frame
  TempDF <- Sales[mask_WeekNo, ]

  # Get the sum of week (i) sales for each item
  ITEM_SALES <- h2o.group_by(TempDF, by = c("ITEM_CODE"), sum("sum_Total_TRN_QTY"))

  # Round sales to the nearest integer
  ITEM_SALES$sum_sum_Total_TRN_QTY <- round(ITEM_SALES$sum_sum_Total_TRN_QTY)

  # Rename the QTY column to indicate the week number
  names(ITEM_SALES)[names(ITEM_SALES) == "sum_sum_Total_TRN_QTY"] <- paste0("Week_", toString(iWeek))

  # Append the week's sales as a new column by merging on the item code
  Weekly_ITEM_SALES <- h2o.merge(Weekly_ITEM_SALES, ITEM_SALES,
                                 by.x = "ITEM_CODE", by.y = "ITEM_CODE", all.x = TRUE)

  # Move the pointer to the next week
  i <- i + 1
}
```
The output of the R code is presented in Table 2.
Item_ID | Item_Category | Week_1 | Week_2 | Week_3 |
--- | --- | --- | --- | --- |
000007 | 2 | 3 | 3 | 1 |
000010 | 10 | NA | 12 | NA |
000015 | 36 | 5 | 1 | 6 |
000022 | 24 | 50 | 62 | 23 |
000023 | 90 | NA | NA | NA |
000036 | 7 | 1 | 28 | 20 |
000043 | 73 | 1 | NA | NA |
000052 | 10 | NA | 2 | NA |
000053 | 13 | 2 | 5 | 7 |
000054 | 33 | NA | NA | NA |
000055 | 2 | NA | NA | 2 |
000063 | 2 | NA | 1 | 2 |
000070 | 27 | NA | 36 | NA |
000086 | 13 | 34 | 36 | 25 |
000087 | 13 | 22 | 21 | 27 |
000094 | 13 | 3 | 2 | 1 |
000101 | 73 | 2 | 4 | 6 |
000113 | 24 | NA | 3 | NA |
000116 | 36 | 1 | 2 | 2 |
000121 | 11 | 41 | 85 | 88 |
Table 2: Semi-processed data sample after basic weekly aggregation
Table 2 represents a more “proper” time series dataset, and it is indeed ready to train our ANN once noise and outliers (i.e. errors and abnormal events) have been removed. Noise is common in large datasets; for example, an item code may end up in the Quantity field due to human error (see transaction 19 in Table 1). Outliers may reflect sales promotions or out-of-stock situations.
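As a rough illustration of the kind of clean-up meant here (the quantity threshold is an arbitrary assumption, and the column name follows the aggregation code above):

```r
# Noise: quantities so large that they are almost certainly data-entry errors,
# e.g. an item code keyed into the quantity field (see transaction 19 in Table 1).
# The cut-off of 1,000 units per transaction is purely illustrative.
Sales <- Sales[Sales[, "sum_Total_TRN_QTY"] < 1000, ]

# Outliers such as promotion spikes or stock-out weeks are business-specific;
# a simple option is to cap or drop values beyond a chosen percentile, but the
# exact rule has to come from domain knowledge.
```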
At this point, the target column (also known as the response column) is Week_3. This is pretty good news for a data scientist, for two reasons:
- The response column data is readily available.
- There is plenty of training data: as many examples as there are inventory items.
You may wonder why we would bother to predict Week_3 when we already have it. Makes no sense at all! I will come back to this point in a minute.
Our model scored a validation deviance of 230, which is not good enough for commercial use. Increasing the number of epochs or hidden layers does not always improve accuracy (and instead risks overfitting). So, the question is: how can we improve the model’s performance?
One way to do so is to provide more item attributes, such as item subcategory, product family, and so on. In this blog, I assume no useful attributes are available (because many real datasets have none). Fortunately, a time series actually allows us to make something out of (almost) nothing! The same dataset in Table 2 can be used to create new statistical attributes, namely the min, max, mean and median of each item’s weekly sales. The new and enhanced dataset is illustrated in Table 3, and a short sketch of how these features can be derived follows the table. Note that some items did not generate sales at all during the period in question, hence the “NA” values.
Item_ID | Item_Category | Week_1 | Week_2 | Week_3 | Min | Max | Mean | Median |
--- | --- | --- | --- | --- | --- | --- | --- | --- |
7 | 2 | 3 | 3 | 1 | 1 | 3 | 2.33333 | 3 |
10 | 10 | NA | 12 | NA | 12 | 12 | 12 | 12 |
15 | 36 | 5 | 1 | 6 | 1 | 6 | 4 | 5 |
22 | 24 | 50 | 62 | 23 | 23 | 62 | 45 | 50 |
23 | 90 | NA | NA | NA | NA | NA | NA | NA |
36 | 7 | 1 | 28 | 20 | 1 | 28 | 16.3333 | 20 |
43 | 73 | 1 | NA | NA | 1 | 1 | 1 | 1 |
52 | 10 | NA | 2 | NA | 2 | 2 | 2 | 2 |
53 | 13 | 2 | 5 | 7 | 2 | 7 | 4.66667 | 5 |
54 | 33 | NA | NA | NA | NA | NA | NA | NA |
55 | 2 | NA | NA | 2 | 2 | 2 | 2 | 2 |
63 | 2 | NA | 1 | 2 | 1 | 2 | 1.5 | 1.5 |
70 | 27 | NA | 36 | NA | 36 | 36 | 36 | 36 |
86 | 13 | 34 | 36 | 25 | 25 | 36 | 31.6667 | 34 |
87 | 13 | 22 | 21 | 27 | 21 | 27 | 23.3333 | 22 |
94 | 13 | 3 | 2 | 1 | 1 | 3 | 2 | 2 |
101 | 73 | 2 | 4 | 6 | 2 | 6 | 4 | 4 |
113 | 24 | NA | 3 | NA | 3 | 3 | 3 | 3 |
116 | 36 | 1 | 2 | 2 | 1 | 2 | 1.66667 | 2 |
121 | 11 | 41 | 85 | 88 | 41 | 88 | 71.3333 | 85 |
Table 3: Data sample with the additional statistical features
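For completeness, here is a minimal sketch of how such row-wise statistics could be derived in R: it pulls the weekly frame out of H2O, computes the features with base R, and pushes the result back. The frame and column names follow the naming used above; helper names such as weekly_df are illustrative, and the treatment of items with no sales in any week is an assumption.

```r
library(h2o)

# Pull the (small) weekly frame into R for row-wise feature engineering
weekly_df <- as.data.frame(Weekly_ITEM_SALES)
week_cols <- grep("^Week_", names(weekly_df), value = TRUE)

# Row-wise statistics over the weekly sales columns, ignoring missing weeks
weekly_df$Min    <- apply(weekly_df[, week_cols], 1, min,    na.rm = TRUE)
weekly_df$Max    <- apply(weekly_df[, week_cols], 1, max,    na.rm = TRUE)
weekly_df$Mean   <- apply(weekly_df[, week_cols], 1, mean,   na.rm = TRUE)
weekly_df$Median <- apply(weekly_df[, week_cols], 1, median, na.rm = TRUE)

# Items with no sales in any week produce Inf/NaN above; map these back to NA
stat_cols <- c("Min", "Max", "Mean", "Median")
weekly_df[stat_cols] <- lapply(weekly_df[stat_cols],
                               function(v) ifelse(is.finite(v), v, NA))

# Push the enriched frame back into H2O for training
Weekly_ITEM_SALES <- as.h2o(weekly_df)
```

On a very large frame these statistics would be better computed inside H2O itself, but keeping the feature engineering in plain R keeps the sketch short.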
Feeding the data from Table 3 into our model resulted in much improved accuracy, with a validation deviance of only 53 (compared with 230). That is more than a fourfold improvement. It isn’t perfect and additional tuning might be required (I typically stop when validation deviance is below 10), but it should make the point clear.
```r
# Train a Deep Learning model (ANN) on Weekly_ITEM_SALES
Predict_Sales <- h2o.deeplearning(
  x = setdiff(names(Weekly_ITEM_SALES), "Week_3"),  # predictors: everything except the response
  y = "Week_3",                                     # response column
  training_frame = Weekly_ITEM_SALES,
  nfolds = 6,
  fold_assignment = "Modulo",
  ignore_const_cols = TRUE,
  model_id = "Sales_Prediction.dl",
  overwrite_with_best_model = TRUE,
  standardize = TRUE,
  activation = "Rectifier",
  hidden = c(100),                                  # a single hidden layer of 100 neurons
  epochs = 9000,
  train_samples_per_iteration = -2,
  adaptive_rate = TRUE,
  stopping_rounds = 3,
  score_interval = 3,
  stopping_metric = "deviance",
  missing_values_handling = "MeanImputation",
  max_runtime_secs = 7200,
  fast_mode = TRUE
)
```
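Once training completes, the deviance figures quoted earlier can be read back from the model, for example:

```r
# Cross-validated mean residual deviance of the trained model
h2o.mean_residual_deviance(Predict_Sales, xval = TRUE)

# Full set of cross-validation metrics
h2o.performance(Predict_Sales, xval = TRUE)
```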
Now it’s time to go back to the question: why would we predict Week_3 when it’s already known?
It is essential to understand how Deep Learning works in the first place. Broadly speaking, Deep Learning uses the response column during the training phase to adjust the weights between the neurons iteratively, until the predictions and the response column closely match. How “close” is measured in a number of ways, such as validation deviance (for numeric prediction) and a confusion matrix (for classification).
In the prediction phase, the column Week_1 is discarded; Week_2 and Week_3 become the input variables, while Week_4 (the future) becomes the new response column, as shown in Table 4.
Item_ID | Item_Category | Week_2 | Week_3 | Week_4 | Min | Max | Mean | Median |
--- | --- | --- | --- | --- | --- | --- | --- | --- |
7 | 2 | 3 | 1 | 3 | 1 | 3 | 2.33333 | 3 |
10 | 10 | 12 | NA | 10 | 12 | 12 | 12 | 12 |
15 | 36 | 1 | 6 | 5 | 1 | 6 | 4 | 5 |
22 | 24 | 62 | 23 | 23 | 23 | 62 | 45 | 50 |
23 | 90 | NA | NA | NA | NA | NA | NA | NA |
36 | 7 | 28 | 20 | 21 | 1 | 28 | 16.3333 | 20 |
43 | 73 | NA | NA | 1 | 1 | 1 | 1 | 1 |
52 | 10 | 2 | NA | 0 | 2 | 2 | 2 | 2 |
53 | 13 | 5 | 7 | 7 | 2 | 7 | 4.66667 | 5 |
54 | 33 | NA | NA | NA | NA | NA | NA | NA |
55 | 2 | NA | 2 | 2 | 2 | 2 | 2 | 2 |
63 | 2 | 1 | 2 | 2 | 1 | 2 | 1.5 | 1.5 |
70 | 27 | 36 | NA | 32 | 36 | 36 | 36 | 36 |
86 | 13 | 36 | 25 | 25 | 25 | 36 | 31.6667 | 34 |
87 | 13 | 21 | 27 | 27 | 21 | 27 | 23.3333 | 22 |
94 | 13 | 2 | 1 | 1 | 1 | 3 | 2 | 2 |
101 | 73 | 4 | 6 | 6 | 2 | 6 | 4 | 4 |
113 | 24 | 3 | NA | 1 | 3 | 3 | 3 | 3 |
116 | 36 | 2 | 2 | 2 | 1 | 2 | 1.66667 | 2 |
121 | 11 | 85 | 88 | 88 | 41 | 88 | 71.3333 | 85 |
Table 4: Prediction setup. Note that Week_1 has been discarded and Week_4 is the new response column
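A minimal sketch of how this scoring step could look, reusing the frame and model objects from the code above (Scoring_Frame and Week_4_Forecast are illustrative names; shifting the columns in place and leaving the statistical features unchanged is a simplification):

```r
# The network was trained with predictors Week_1 and Week_2 (plus the statistics)
# and response Week_3, so to forecast Week_4 the window is shifted by one week
# into the column positions the model was trained on.
Scoring_Frame <- Weekly_ITEM_SALES

Scoring_Frame$Week_1 <- Scoring_Frame$Week_2   # last-but-one week -> Week_1 slot
Scoring_Frame$Week_2 <- Scoring_Frame$Week_3   # most recent week  -> Week_2 slot
# Week_3, the trained response slot, is simply ignored at prediction time

# Each predicted value is interpreted as the Week_4 forecast for that item
Week_4_Forecast <- h2o.predict(Predict_Sales, newdata = Scoring_Frame)
head(Week_4_Forecast)
```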
Processing Performance
It is hard to talk about Machine Learning without mentioning High Performance Computing (HPC). Anyone involved in Machine Learning appreciates this for an obvious reason: how much time do you spend gazing at your screen in frustration while waiting for the progress bar to hit 100%? The rule of thumb is that “faster” or “bigger” machines yield better performance. In practice, however, it is desirable to find a balance between cost and performance.
For the purpose of this blog, the same computation with the same dataset was run on two common but very different hardware platforms: a fairly old Intel i3 laptop with 6GB of RAM, and a beefy workstation with an X-series Intel Core i7-6800K and 32GB of RAM. R was used for data pre-processing and manipulation. The actual Deep Learning model was trained on both machines using H2O. The performance is summarised in the figure below:
Figure 1: Training throughput. Higher values indicate better performance
It is important to understand the performance characteristics of the intended Machine Learning platform as well as the dataset at hand. For example, does a faster machine (in terms of GHz) perform better, or a machine with more cores at a lower clock speed? Is the workload CPU-bound or memory-bound? Very large datasets can overstress the disks (especially mechanical disks), causing I/O bottlenecks. Large Hadoop clusters can suffer from synchronisation overhead, especially on relatively slow networks.
H2O is an efficient in-memory Map-Reduce platform that benefits from larger CPU caches and scales very nicely with additional cores.
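For reference, the number of threads and the amount of memory available to H2O are set when the cluster is started; a minimal example (the memory figure is illustrative, not a recommendation):

```r
library(h2o)

# Start a local H2O cluster using all available CPU cores
# and an illustrative 16 GB memory cap
h2o.init(nthreads = -1, max_mem_size = "16g")
```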
Figure 2: H2O puts all 6 cores (12 threads) to work
For a more detailed discussion of performance, the interested reader can refer to my previous blogs, Scaling Performance and Impact of Virtualization on Machine Learning.