Predict Sales with Machine Learning for Efficient Inventory Management
Overview
Deep Learning (also known as an Artificial Neural Network, or ANN) is arguably the most sophisticated Machine Learning technique in use today. It is designed to mimic how the human brain is believed to function. Deep Learning is essentially Supervised Learning (with the exception of Autoencoders), meaning that it needs pre-labelled examples to learn from.
In most scenarios, data experts manually populate the “response column”, which can be a class label or a numeric value. Supervised Machine Learning is powerful, but not without a caveat: labelling is both laborious and time-consuming. On the other hand, providing fewer examples can adversely affect reliability.
So, the question is: can we produce supervised training data with no manual supervision? Fortunately, the answer is yes (though not in every case).
Time Series
A time series is a set of data points, each of which represents the value of the same variable at different times (Oxford dictionary). Examples include share prices, sales, temperature, air pressure and so on. Events are typically recorded at equal time or date intervals. In this blog, product sales will be used as a practical example.
In most cases, sales data are stored in relational databases, which do not hold them as time series. A sample is illustrated in Table 1.
Trans_No | Date | Item_ID | Qty |
--- | --- | --- | --- |
33 | 20-03-17 | 10000144 | 1 |
128 | 20-03-17 | 10004227 | 2 |
113 | 20-03-17 | 10004227 | 1 |
184 | 20-03-17 | 10004708 | 1 |
184 | 20-03-17 | 10004715 | 1 |
19 | 20-03-17 | 20005405 | 20005405 |
33 | 20-03-17 | 80305170 | 1 |
33 | 20-03-17 | 80305170 | 1 |
96 | 20-03-17 | 80308225 | 1 |
120 | 20-03-17 | 80323792 | 4 |
34 | 20-03-17 | 80341574 | 1 |
193 | 20-03-17 | 80341574 | 1 |
204 | 20-03-17 | 80341574 | 1 |
204 | 20-03-17 | 80341574 | 1 |
204 | 20-03-17 | 80341574 | 1 |
257 | 20-03-17 | 80341574 | 3 |
275 | 20-03-17 | 80341574 | 1 |
3 | 20-03-17 | 80341574 | 1 |
74 | 20-03-17 | 80341574 | 1 |
88 | 20-03-17 | 80341574 | 1 |
Table 1: Raw sales data as stored in a relational database
Each data sample, or observation, represents items sold in a transaction. One thing should be obvious here: items appear at irregular intervals. While this may be the norm from a sales manager’s perspective, it is quite problematic for the data scientist, because prediction (in the inventory management sense) is impossible unless the observations fall at equal time intervals.
This problem, however, can be easily tackled with data aggregation techniques. Table 2 illustrates the dataset after aggregation; in this case, item sales are summed up on a weekly basis. Below is the R code that aggregates item sales (with the dataset already loaded into H2O).
```r
library(h2o)  # assumes an H2O cluster is already running and the data are loaded
# 'Sales' holds the raw transactions; 'Weekly_ITEM_SALES' is assumed to be
# initialised with one row per ITEM_CODE before the loop.

j <- 3  # Set last week to 3
i <- 1  # Set the previous x weeks, used to increment the loop

while (i <= j) {
  # Week number in data
  iWeek <- i

  # Create a mask to filter out transactions of weeks other than i
  mask_WeekNo <- (Sales[, "WeekNo"] == iWeek)

  # Apply the mask; save the result to a temporary data frame
  TempDF <- Sales[mask_WeekNo, ]

  # Get the sum of week (i) sales for each item
  ITEM_SALES <- h2o.group_by(TempDF, by = c("ITEM_CODE"), sum("sum_Total_TRN_QTY"))

  # Round sales to the nearest integer
  ITEM_SALES$sum_sum_Total_TRN_QTY <- round(ITEM_SALES$sum_sum_Total_TRN_QTY)

  # Rename the QTY column to indicate the week number
  names(ITEM_SALES)[names(ITEM_SALES) == "sum_sum_Total_TRN_QTY"] <- paste0("Week_", toString(iWeek))

  # Append the week's sales as a new column by merging on the item code
  Weekly_ITEM_SALES <- h2o.merge(Weekly_ITEM_SALES, ITEM_SALES,
                                 by.x = "ITEM_CODE", by.y = "ITEM_CODE", all.x = TRUE)

  # Move the pointer to the next week
  i <- i + 1
}
```
The output of the R code is presented in Table 2.
Item_ID | Item_Category | Week_1 | Week_2 | Week_3 |
--- | --- | --- | --- | --- |
000007 | 2 | 3 | 3 | 1 |
000010 | 10 | NA | 12 | NA |
000015 | 36 | 5 | 1 | 6 |
000022 | 24 | 50 | 62 | 23 |
000023 | 90 | NA | NA | NA |
000036 | 7 | 1 | 28 | 20 |
000043 | 73 | 1 | NA | NA |
000052 | 10 | NA | 2 | NA |
000053 | 13 | 2 | 5 | 7 |
000054 | 33 | NA | NA | NA |
000055 | 2 | NA | NA | 2 |
000063 | 2 | NA | 1 | 2 |
000070 | 27 | NA | 36 | NA |
000086 | 13 | 34 | 36 | 25 |
000087 | 13 | 22 | 21 | 27 |
000094 | 13 | 3 | 2 | 1 |
000101 | 73 | 2 | 4 | 6 |
000113 | 24 | NA | 3 | NA |
000116 | 36 | 1 | 2 | 2 |
000121 | 11 | 41 | 85 | 88 |
Table 2: Semi-processed data sample after basic weekly aggregation
Table 2 represents a more “proper” time series dataset, and it is indeed ready to train our ANN once noise and outliers (i.e. errors and abnormal events) have been removed. Noise is common in large datasets; for example, an item code may end up in the Quantity field due to human error (see transaction 19 in Table 1). Outliers may reflect sales promotions or out-of-stock situations.
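As a rough illustration of the kind of clean-up meant here (the quantity threshold is an arbitrary assumption, and the column name follows the aggregation code above):

```r
# Noise: quantities so large that they are almost certainly data-entry errors,
# e.g. an item code keyed into the quantity field (see transaction 19 in Table 1).
# The cut-off of 1,000 units per transaction is purely illustrative.
Sales <- Sales[Sales[, "sum_Total_TRN_QTY"] < 1000, ]

# Outliers such as promotion spikes or stock-out weeks are business-specific;
# a simple option is to cap or drop values beyond a chosen percentile, but the
# exact rule has to come from domain knowledge.
```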
At this point, the target column (also known as the response column) is Week_3. This is pretty good news for a data scientist, for two reasons:
- The response column data is readily available.
- There is plenty of training data: as many examples as there are inventory items.
You may wonder why we would bother to predict Week_3 when we already have it. Makes no sense at all! I will come back to this point in a minute.
Our model scored a validation deviance of 230, which is not good enough for commercial use. Increasing the number of epochs or hidden layers does not always improve accuracy (and instead risks overfitting). So, the question is: how can we improve the model’s performance?
One way to do so is to provide more item attributes, such as item subcategory, product family, and so on. In this blog, I assume no useful attributes are available (because many real datasets have none). Fortunately, a time series actually allows us to make something out of (almost) nothing! The same dataset in Table 2 can be used to create new statistical attributes, namely the min, max, mean and median of each item’s weekly sales. The new and enhanced dataset is illustrated in Table 3, and a short sketch of how these features can be derived follows the table. Note that some items did not generate sales at all during the period in question, hence the “NA” values.
Item_ID | Item_Category | Week_1 | Week_2 | Week_3 | Min | Max | Mean | Median |
--- | --- | --- | --- | --- | --- | --- | --- | --- |
7 | 2 | 3 | 3 | 1 | 1 | 3 | 2.33333 | 3 |
10 | 10 | NA | 12 | NA | 12 | 12 | 12 | 12 |
15 | 36 | 5 | 1 | 6 | 1 | 6 | 4 | 5 |
22 | 24 | 50 | 62 | 23 | 23 | 62 | 45 | 50 |
23 | 90 | NA | NA | NA | NA | NA | NA | NA |
36 | 7 | 1 | 28 | 20 | 1 | 28 | 16.3333 | 20 |
43 | 73 | 1 | NA | NA | 1 | 1 | 1 | 1 |
52 | 10 | NA | 2 | NA | 2 | 2 | 2 | 2 |
53 | 13 | 2 | 5 | 7 | 2 | 7 | 4.66667 | 5 |
54 | 33 | NA | NA | NA | NA | NA | NA | NA |
55 | 2 | NA | NA | 2 | 2 | 2 | 2 | 2 |
63 | 2 | NA | 1 | 2 | 1 | 2 | 1.5 | 1.5 |
70 | 27 | NA | 36 | NA | 36 | 36 | 36 | 36 |
86 | 13 | 34 | 36 | 25 | 25 | 36 | 31.6667 | 34 |
87 | 13 | 22 | 21 | 27 | 21 | 27 | 23.3333 | 22 |
94 | 13 | 3 | 2 | 1 | 1 | 3 | 2 | 2 |
101 | 73 | 2 | 4 | 6 | 2 | 6 | 4 | 4 |
113 | 24 | NA | 3 | NA | 3 | 3 | 3 | 3 |
116 | 36 | 1 | 2 | 2 | 1 | 2 | 1.66667 | 2 |
121 | 11 | 41 | 85 | 88 | 41 | 88 | 71.3333 | 85 |
Table 3: Data sample with the additional statistical features
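For completeness, here is a minimal sketch of how such row-wise statistics could be derived in R: it pulls the weekly frame out of H2O, computes the features with base R, and pushes the result back. The frame and column names follow the naming used above; helper names such as weekly_df are illustrative, and the treatment of items with no sales in any week is an assumption.

```r
library(h2o)

# Pull the (small) weekly frame into R for row-wise feature engineering
weekly_df <- as.data.frame(Weekly_ITEM_SALES)
week_cols <- grep("^Week_", names(weekly_df), value = TRUE)

# Row-wise statistics over the weekly sales columns, ignoring missing weeks
weekly_df$Min    <- apply(weekly_df[, week_cols], 1, min,    na.rm = TRUE)
weekly_df$Max    <- apply(weekly_df[, week_cols], 1, max,    na.rm = TRUE)
weekly_df$Mean   <- apply(weekly_df[, week_cols], 1, mean,   na.rm = TRUE)
weekly_df$Median <- apply(weekly_df[, week_cols], 1, median, na.rm = TRUE)

# Items with no sales in any week produce Inf/NaN above; map these back to NA
stat_cols <- c("Min", "Max", "Mean", "Median")
weekly_df[stat_cols] <- lapply(weekly_df[stat_cols],
                               function(v) ifelse(is.finite(v), v, NA))

# Push the enriched frame back into H2O for training
Weekly_ITEM_SALES <- as.h2o(weekly_df)
```

On a very large frame these statistics would be better computed inside H2O itself, but keeping the feature engineering in plain R keeps the sketch short.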
Feeding the data from Table 3 into our model resulted in much improved accuracy, with a validation deviance of only 53 (compared with 230). That is more than a fourfold improvement. It isn’t perfect and additional tuning might be required (I typically stop when validation deviance is below 10), but it should make the point clear.
```r
# Train a Deep Learning model (ANN) on Weekly_ITEM_SALES
Predict_Sales <- h2o.deeplearning(
  x = setdiff(names(Weekly_ITEM_SALES), "Week_3"),  # predictors: everything except the response
  y = "Week_3",                                     # response column
  training_frame = Weekly_ITEM_SALES,
  nfolds = 6,
  fold_assignment = "Modulo",
  ignore_const_cols = TRUE,
  model_id = "Sales_Prediction.dl",
  overwrite_with_best_model = TRUE,
  standardize = TRUE,
  activation = "Rectifier",
  hidden = c(100),                                  # a single hidden layer of 100 neurons
  epochs = 9000,
  train_samples_per_iteration = -2,
  adaptive_rate = TRUE,
  stopping_rounds = 3,
  score_interval = 3,
  stopping_metric = "deviance",
  missing_values_handling = "MeanImputation",
  max_runtime_secs = 7200,
  fast_mode = TRUE
)
```
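Once training completes, the deviance figures quoted earlier can be read back from the model, for example:

```r
# Cross-validated mean residual deviance of the trained model
h2o.mean_residual_deviance(Predict_Sales, xval = TRUE)

# Full set of cross-validation metrics
h2o.performance(Predict_Sales, xval = TRUE)
```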
Now it’s time to go back to the question: why would we predict Week_3 when it’s already known?
It is essential to understand how Deep Learning works in the first place. Broadly speaking, Deep Learning uses the response column during the training phase to adjust the weights between the neurons iteratively, until the predictions and the response column closely match. How “close” is measured in a number of ways, such as validation deviance (for numeric prediction) and a confusion matrix (for classification).
In the prediction phase, the column Week_1 is discarded; Week_2 and Week_3 become the input variables, while Week_4 (the future) becomes the new response column, as shown in Table 4.
Item_ID | Item_Category | Week_2 | Week_3 | Week_4 | Min | Max | Mean | Median |
--- | --- | --- | --- | --- | --- | --- | --- | --- |
7 | 2 | 3 | 1 | 3 | 1 | 3 | 2.33333 | 3 |
10 | 10 | 12 | NA | 10 | 12 | 12 | 12 | 12 |
15 | 36 | 1 | 6 | 5 | 1 | 6 | 4 | 5 |
22 | 24 | 62 | 23 | 23 | 23 | 62 | 45 | 50 |
23 | 90 | NA | NA | NA | NA | NA | NA | NA |
36 | 7 | 28 | 20 | 21 | 1 | 28 | 16.3333 | 20 |
43 | 73 | NA | NA | 1 | 1 | 1 | 1 | 1 |
52 | 10 | 2 | NA | 0 | 2 | 2 | 2 | 2 |
53 | 13 | 5 | 7 | 7 | 2 | 7 | 4.66667 | 5 |
54 | 33 | NA | NA | NA | NA | NA | NA | NA |
55 | 2 | NA | 2 | 2 | 2 | 2 | 2 | 2 |
63 | 2 | 1 | 2 | 2 | 1 | 2 | 1.5 | 1.5 |
70 | 27 | 36 | NA | 32 | 36 | 36 | 36 | 36 |
86 | 13 | 36 | 25 | 25 | 25 | 36 | 31.6667 | 34 |
87 | 13 | 21 | 27 | 27 | 21 | 27 | 23.3333 | 22 |
94 | 13 | 2 | 1 | 1 | 1 | 3 | 2 | 2 |
101 | 73 | 4 | 6 | 6 | 2 | 6 | 4 | 4 |
113 | 24 | 3 | NA | 1 | 3 | 3 | 3 | 3 |
116 | 36 | 2 | 2 | 2 | 1 | 2 | 1.66667 | 2 |
121 | 11 | 85 | 88 | 88 | 41 | 88 | 71.3333 | 85 |
Table 4: Prediction setup. Note that Week_1 has been discarded and Week_4 is the new response column
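A minimal sketch of how this scoring step could look, reusing the frame and model objects from the code above (Scoring_Frame and Week_4_Forecast are illustrative names; shifting the columns in place and leaving the statistical features unchanged is a simplification):

```r
# The network was trained with predictors Week_1 and Week_2 (plus the statistics)
# and response Week_3, so to forecast Week_4 the window is shifted by one week
# into the column positions the model was trained on.
Scoring_Frame <- Weekly_ITEM_SALES

Scoring_Frame$Week_1 <- Scoring_Frame$Week_2   # last-but-one week -> Week_1 slot
Scoring_Frame$Week_2 <- Scoring_Frame$Week_3   # most recent week  -> Week_2 slot
# Week_3, the trained response slot, is simply ignored at prediction time

# Each predicted value is interpreted as the Week_4 forecast for that item
Week_4_Forecast <- h2o.predict(Predict_Sales, newdata = Scoring_Frame)
head(Week_4_Forecast)
```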
Processing Performance
It is hard to talk about Machine Learning without mentioning High Performance Computing (HPC). Anyone involved in Machine Learning appreciates this for an obvious reason: how much time do you spend gazing at your screen in frustration while waiting for the progress bar to hit 100%? The rule of thumb is that “faster” or “bigger” machines yield better performance. In practice, however, it is desirable to find a balance between cost and performance.
For the purpose of this blog, the same computation with the same dataset was run on two common but very different hardware platforms: a fairly old Intel i3 laptop with 6GB of RAM, and a beefy workstation with an X-series Intel Core i7-6800K and 32GB of RAM. R was used for data pre-processing and manipulation. The actual Deep Learning model was trained on both machines using H2O. The performance is summarised in the figure below:
Figure 1: Training throughput. Higher values indicate better performance
It is important to understand the performance characteristics of the intended Machine Learning platform as well as the dataset at hand. For example, does a faster machine (in terms of GHz) perform better, or a machine with more cores at a lower clock speed? Is the workload CPU-bound or memory-bound? Very large datasets can overstress the disks (especially mechanical disks), causing I/O bottlenecks. Large Hadoop clusters can suffer from synchronisation overhead, especially on relatively slow networks.
H2O is an efficient in-memory Map-Reduce platform that benefits from larger CPU caches and scales very nicely with additional cores.
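For reference, the number of threads and the amount of memory available to H2O are set when the cluster is started; a minimal example (the memory figure is illustrative, not a recommendation):

```r
library(h2o)

# Start a local H2O cluster using all available CPU cores
# and an illustrative 16 GB memory cap
h2o.init(nthreads = -1, max_mem_size = "16g")
```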
Figure 2: H2O puts all 6 cores (12 threads) to work
For a more detailed discussion of performance, the interested reader can refer to my previous blogs, Scaling Performance and Impact of Virtualization on Machine Learning.