

Scaling Performance

Obtaining data has never been easier. Individuals share their daily  activities on social networks. That includes places they visit, things  they like (or dislike), items they purchase, movies they watch and  photos and videos they take anywhere at any time with their smartphones’  high-res cameras. Organisations aren’t much different in principle.  They share data about events they organize or attend, data on their  products and services and data on their financial performance.  Researchers also have a lot to share. Data on scientific discoveries and  breakthroughs, inventions and the like can be found in publicly and  freely accessible databases.

There is simply more data at our disposal than we ever cared for. However, the problem remains: how do we turn this massive amount of data into meaningful information, and ultimately into insights that can make a difference?

To make the matter a tiny bit more inconvenient, not all data are  relevant – at least for the problem at hand – and not all relevant data  are useful. Relevant data often contain “noise” and “outliers” that  aren’t useful at best (mind you, we are more interested in outliers than  normal data at times). After all, garbage in, garbage out.

The first step in the long journey to attain an insight is to sort  the data out into “relevant” and “irrelevant”, in which case some  computation needs to be done (actually in many cases a lot of  computation).

With data volumes that keep growing, fitting all of it in a single computer is becoming increasingly difficult.

This lengthy introduction should’ve made one point clear: the problem of scalability cannot be overstated.


Current Approaches to Scalability

Generally speaking, scalability approaches fall into two broad categories: Vertical and Horizontal.

Scaling up vertically refers to the process of adding hardware resources to the same computer. In other words, making an existing computer more capable. Horizontal scalability, on the other hand, adds “nodes” to an existing “cluster”, making the cluster as a whole more capable.

Both approaches allow more data to be crunched given the same amount  of time, and this is where their similarity ends. Each approach will be  discussed in turn.


Vertical Scalability

Upgrading a single computer is highly dependent on the architecture  of that particular computer. For instance, computers built for desktop  usage often include a single CPU socket, a few memory slots and limited  PCI-E lanes. They do incorporate plenty of external ports though.  Desktop computer platforms aren’t built with reliability in mind. They  lack the basic kinds of error recovery options such as ECC (Error  Correcting Code). As of the time of this blog, the most capable desktop  CPU from Intel is the i7-6950X Extreme Edition. It has 10 cores and  supports Hyperthreading. It can access up to 128GB of RAM in  quad-channel configuration.

Storage capacity is limited by the number of available ports (typically SATA3) or by the available disk bays in the chassis, whichever is smaller. While it is possible to attach fast Solid State Disks (SSDs) to a Desktop computer, performance is likely to be capped by the SATA3 controller itself, which can handle a data rate of at most 6Gbps (approx. 600MBps). Indeed, a single SSD nowadays can saturate the disk controller. Adding more disks simply means the bandwidth is split up and each disk runs at a lower speed.
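
The 600MBps figure follows from how SATA3 encodes data on the wire. Below is a minimal sketch of the arithmetic, assuming 8b/10b encoding (10 line bits per data byte) and, as the paragraph above describes, a controller whose bandwidth is shared evenly among active disks. The function names are illustrative only.

```python
SATA3_LINE_RATE_GBPS = 6.0  # raw line rate quoted above, in gigabits per second


def sata_payload_mbps(line_rate_gbps: float) -> float:
    """Usable throughput in MB/s, assuming 8b/10b encoding."""
    bits_per_second = line_rate_gbps * 1e9
    bytes_per_second = bits_per_second / 10  # 10 line bits carry one data byte
    return bytes_per_second / 1e6


def per_disk_share_mbps(total_mbps: float, disks: int) -> float:
    """Even split of a shared controller's bandwidth (a simplifying assumption)."""
    if disks < 1:
        raise ValueError("need at least one disk")
    return total_mbps / disks


total = sata_payload_mbps(SATA3_LINE_RATE_GBPS)
for n in (1, 2, 4):
    print(f"{n} disk(s): {per_disk_share_mbps(total, n):.0f} MB/s each")
```

Under these assumptions, four disks behind the same controller would each see roughly a quarter of the bandwidth a single disk would get.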

It must be noted that a “seemingly” equivalent Mobile processor is considerably less capable than its Desktop counterpart. Mobile processors are designed to prioritize efficiency over performance. In addition, laptops have fewer memory slots and work in dual-channel configuration.


Server as a Better Option

The server architecture is far more robust and reliable in every single aspect compared to the Desktop architecture (apart from the price of course). A typical server contains 2 CPU sockets (4 is becoming more popular), plenty of memory slots and many more PCI-E lanes. For example, the Intel Xeon E5-2699v4 has 22 cores and supports two-way Hyperthreading. It can access as much as 1.5TB of RAM in quad-channel configuration with 76GBps total memory bandwidth per socket. Servers don’t rely on the limited SATA controller. Instead, a SAS (Serial Attached SCSI) controller is used to drive SAS disks. SAS provides double the bandwidth of SATA and supports more sophisticated features to increase both performance and reliability. With more PCI-E lanes at its disposal, a single server can host multiple SAS controllers without degrading I/O performance. Ultimately, servers are designed with mission-critical tasks in mind. As such, they are more resilient in handling long-running tasks without crashing.
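
The 76GBps per-socket figure is consistent with four channels of DDR4-2400 memory. Here is a quick check, assuming DDR4-2400 (2400 million transfers per second) and a 64-bit (8-byte) channel; both parameters are assumptions brought in for illustration, not stated in the paragraph above.

```python
def peak_memory_bandwidth_gb_per_s(channels: int,
                                   mega_transfers_per_s: int,
                                   bytes_per_transfer: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s for one CPU socket."""
    bytes_per_second = channels * mega_transfers_per_s * 1e6 * bytes_per_transfer
    return bytes_per_second / 1e9


per_socket = peak_memory_bandwidth_gb_per_s(channels=4, mega_transfers_per_s=2400)
print(f"per socket: {per_socket:.1f} GB/s")
print(f"two sockets: {2 * per_socket:.1f} GB/s")
```

This is a theoretical peak; sustained bandwidth in practice is lower.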


The table below compares the main features of Desktop and Server platforms

Feature                        Desktop            Server
Number of Sockets              1                  2 or more
Max Cores per Socket           10                 22
Max RAM per Socket             128GB              1.5TB
ECC support                    No                 Yes
PCI-E Lanes per Socket         40                 40
Disk Controller / Bandwidth    SATA3 / 600MBps    SAS3 / 1200MBps
Disk Bays (typical)            3 ~ 6              4 ~ 24


Horizontal Scalability

44 cores (88 threads), 3TB of RAM and 24 high-performance SAS disks sounds behemothic, and indeed it is. However, there are situations when even this much is insufficient. Crunching tens of terabytes of data can be so demanding that a single server takes too long, or even runs out of memory. Another scenario is when the data mining platform is shared between many data scientists and researchers, which is the common (and more realistic) case in academic and corporate environments. In such situations, multiple servers are joined together to form a “Cluster”. From an end user's perspective, a cluster should look just like a single computer with ample computing resources. Such flexibility, however, adds a layer of complexity, as will be explained in a moment.

In a solo server (or desktop) scenario, a single instance of the data  mining software accesses and processes the entire dataset. On the other  hand, multiple instances of the data mining software must run in the  cluster (an instance per cluster node). This introduces two new  problems:

  1. The dataset must be split evenly and distributed to all cluster nodes over the network. This is known as the Mapping phase.
  2. The outputs from all cluster nodes must be accumulated at a single cluster node. This is known as the Reduction phase.
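
The two phases can be sketched on a single machine. This is a toy illustration, not a cluster implementation: worker threads stand in for cluster nodes, and `split_evenly`, `node_task` and `run_cluster` are hypothetical names. It follows the terminology used above, where Mapping is the split-and-distribute step and Reduction is the accumulation step.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(data, nodes):
    """Mapping phase: cut the dataset into `nodes` chunks whose sizes differ by at most one."""
    base, extra = divmod(len(data), nodes)
    chunks, start = [], 0
    for i in range(nodes):
        end = start + base + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def node_task(chunk):
    """Work done by one node on its own chunk: a partial sum and a count."""
    return sum(chunk), len(chunk)


def run_cluster(data, nodes=4):
    """Compute the mean of `data` using `nodes` simulated cluster nodes."""
    if not data:
        raise ValueError("dataset is empty")
    chunks = split_evenly(data, nodes)
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        # list() blocks until every simulated node has returned its result
        partials = list(pool.map(node_task, chunks))
    # Reduction phase: accumulate all partial outputs in one place
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


print(run_cluster(list(range(1, 101))))
```

Each node returns a partial sum and count rather than a partial mean, so the reduction stays correct even when chunk sizes differ.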

The two problems above create further hard-to-crack challenges. For instance, in datasets that exhibit strong trends, splitting the data can lead to significant variance between the outputs of individual nodes. Distribution also adds another source of latency: the network. An inefficient network topology can substantially degrade overall performance. However, the most challenging problem is that the Reduction phase can only start when all nodes have finished processing. A slow node can stall the entire cluster.
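
The effect of a slow node can be shown with a small calculation. Assuming, as described above, that the Reduction phase starts only after every node has finished, the waiting time is set by the slowest node, not the average one. The timings below are made-up numbers for illustration.

```python
def time_until_reduction(node_seconds):
    """The Reduction phase can start only when the slowest node is done."""
    return max(node_seconds)


def total_idle_seconds(node_seconds):
    """Time the faster nodes spend waiting for the slowest one."""
    slowest = max(node_seconds)
    return sum(slowest - t for t in node_seconds)


balanced = [60, 61, 59, 60]        # hypothetical per-node processing times
one_straggler = [60, 61, 59, 180]  # same cluster with one slow node

for name, times in (("balanced", balanced), ("one straggler", one_straggler)):
    print(name, time_until_reduction(times), total_idle_seconds(times))
```

In this example a single straggler triples the wait before Reduction can begin, while the other three nodes sit idle.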



A high-end desktop computer is a great way to start. It’s affordable  and can handle small datasets with acceptable performance. The mobile  platform (laptops and tablets) is inefficient and very limited and  should be avoided unless no alternatives are available. When the dataset  is too large for a desktop computer, or when the algorithm is very CPU  intensive (which is typical for Deep Learning algorithms), the first  alternative to seek should be a server with at least 2 CPU sockets and  plenty of RAM slots.

Because there is no one-size-fits-all solution to cluster performance and scalability challenges, the cluster option should only be considered when the task at hand exceeds the capability of a single server. In this case, special attention must be paid to every detail: the hardware, the data mining software and the dataset to be crunched.


So, What about GPUs?

The introduction of GP-GPUs (General Purpose Graphics Processing Units) has caused a major shift (or hype, if you like) in the data mining arena. Vendors like Nvidia have been making bold marketing claims around their Deep Learning GPU-powered appliances.

GP-GPUs promise performance boosts of as much as three digits (a hundredfold or more)! However, the notion of “no one-size-fits-all solution” still applies. GP-GPUs excel in scenarios where the “active” dataset (which can be the entire dataset) fits in the GPU’s local memory, which is very limited at best, and when the algorithm is parallel enough to utilize the vast number of small GPU cores. A proper assessment should be carried out before investing in GPU-enabled data mining solutions.
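
A rough first-pass check of the two conditions above might look like the sketch below. It rests on stated assumptions: the memory sizes are hypothetical inputs, the headroom factor is arbitrary, and Amdahl's law is brought in here as a simple stand-in for estimating how much a partly parallel algorithm can gain. It is not a substitute for benchmarking.

```python
def fits_in_gpu_memory(active_dataset_gb: float,
                       gpu_memory_gb: float,
                       headroom: float = 0.8) -> bool:
    """True if the active dataset fits within a fraction of GPU memory.

    The 0.8 headroom is an arbitrary allowance for working buffers.
    """
    return active_dataset_gb <= gpu_memory_gb * headroom


def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only part of the work runs in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical example: an 8GB active dataset on a 12GB card, 1000 GPU cores
print(fits_in_gpu_memory(8, 12))
for fraction in (0.5, 0.9, 0.99):
    print(fraction, round(amdahl_speedup(fraction, 1000), 1))
```

Under this model, an algorithm that is only half parallel cannot even double its speed, no matter how many cores the GPU offers.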

Note: This article first appeared on CognitionX blog