Impact of Virtualization on Machine Learning

Almost every data scientist new to the field kicks off with the  resources they already have at hand: their Personal Computers. However,  they quickly run into the common problem of very poor performance (which  can be very frustrating). It’s not the Personal Computer’s fault.  Personal Computers are generally intended for light usage and mobility.

The next logical step is to move to the cloud. And why not indeed? The cloud is fast, cheap and very flexible. A typical Amazon EC2 GPU Instance (p2.xlarge) costs only $0.90 per hour at the time of writing this blog. What made Public Cloud computing so cheap was the introduction of the hypervisor, a software layer that sits between the hardware and the operating systems of the individual instances, or Virtual Machines.

This blogpost sheds some light on the impact of the hypervisor on Machine Learning applications.

The Test Environment

For the study to be meaningful, I set up two (almost) identical  environments and ran exactly the same tests with the same datasets. One  environment was a virtualized private cloud based on VMware vSphere 6.  The other was a Bare Metal cloud with no virtualization. The Metal cloud  was provided by Bigstep.

Each environment consisted of three servers running 64-bit CentOS 7. Each server had 40 processors and 32GB of RAM. (The physical RAM was higher on the Metal Cloud nodes, but H2O was configured to use only 30GB to match the virtualized environment.) The virtualized cloud had traditional mechanical disks spinning at 10k rpm while the Metal Cloud had Solid State Disks. This difference did not affect the computing time, as will be clarified shortly.

For the actual Machine Learning computing, H2O was installed on all servers in both environments (H2O can be downloaded for free here). An H2O cluster was formed to distribute the computing workload. In total, each environment had 120 processors and 96GB of RAM (to be precise, 2GB of RAM was reserved for the Operating System on each node, leaving 30GB per node, or 90GB in total, available to H2O).
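The cluster totals quoted above follow from simple arithmetic. The snippet below is only a sketch of that calculation; the function and variable names are made up for illustration and are not part of H2O.

```python
# Per-node figures taken from the text; names are illustrative only.
NODES = 3
CPUS_PER_NODE = 40
RAM_PER_NODE_GB = 32
OS_RESERVED_GB = 2  # RAM left to the operating system on each node


def cluster_totals(nodes, cpus, ram_gb, os_reserved_gb):
    """Return total processors, total RAM and RAM usable by H2O (in GB)."""
    return {
        "processors": nodes * cpus,
        "ram_gb": nodes * ram_gb,
        "h2o_ram_gb": nodes * (ram_gb - os_reserved_gb),
    }


totals = cluster_totals(NODES, CPUS_PER_NODE, RAM_PER_NODE_GB, OS_RESERVED_GB)
print(totals)
```

So each environment exposes 120 processors and 96GB of RAM, of which 90GB (30GB per node) is available to H2O.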

H2O is a very efficient memory-based Machine Learning platform. It loads the entire dataset into memory and compresses it. Thus, once the dataset is loaded, disk performance becomes irrelevant. This is crucial for Big Data applications because disks can’t keep up with the high performance of today’s processors, especially in clustered environments.

The Test

To ensure consistency between the two environments, I used the “Airline Delay” example that is available in H2O. Two datasets were used: a large dataset consisting of 152 million observations (about 15GB in size) and a small subset of it, consisting of 2000 observations (about 4MB in size). The small dataset was used to see how the two environments behave when the dataset can fit into the cache memory (spoiler alert: this turned out to be very interesting and totally unexpected!).

The test itself consisted of three parts:

  1. Parsing the Data
  2. Training GLM Model
  3. Training Deep Learning Model

Parsing loads the dataset from disk into memory, converts it into H2O’s native format and then performs in-memory compression. The first step is very IO intensive, while the other two are both processor and memory intensive. In clustered environments, the data is parsed in parallel. Each node loads only a portion of the data (typically 1/N of it, where N is the number of nodes).
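To make the "1/N per node" idea concrete, here is a minimal sketch of splitting a row count evenly across nodes. It is an illustration only: `split_rows` is a hypothetical helper, and H2O's real parser distributes data in its own internal chunks rather than in the contiguous row ranges assumed here.

```python
def split_rows(total_rows, num_nodes):
    """Split total_rows into num_nodes contiguous (start, end) ranges.

    Ranges are half-open, and their sizes differ by at most one row.
    """
    base, extra = divmod(total_rows, num_nodes)
    ranges = []
    start = 0
    for node in range(num_nodes):
        # The first `extra` nodes take one additional row each.
        size = base + (1 if node < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


# 152 million observations spread over the three-node cluster.
parts = split_rows(152_000_000, 3)
for node, (start, end) in enumerate(parts):
    print(f"node {node}: rows {start}..{end - 1} ({end - start} rows)")
```

Each node ends up with roughly 50.7 million rows of the large dataset.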

H2O’s Generalized Linear Models (GLM) estimate regression models for outcomes following exponential-family distributions. Training is largely single-threaded, meaning that only one processor out of the 120 can be used at a time.
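For readers unfamiliar with GLMs, the sketch below fits the simplest binary-outcome case (logistic regression) on a tiny made-up dataset. It uses plain batch gradient ascent for brevity, which is an assumption of this example and not the solver H2O uses. It does show that every iteration depends on the coefficients produced by the previous one, which is one reason such a fit is hard to spread across many processors.

```python
import math


def fit_logistic_glm(xs, ys, learning_rate=0.1, iterations=2000):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            grad0 += y - p
            grad1 += (y - p) * x
        # Each update needs the result of the previous one (a sequential chain).
        b0 += learning_rate * grad0 / n
        b1 += learning_rate * grad1 / n
    return b0, b1


# Toy data (invented for illustration): larger x makes y = 1 more likely.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic_glm(xs, ys)
print(f"intercept={b0:.3f}, slope={b1:.3f}")
```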

H2O’s Deep Learning is based on a multi-layer feed-forward Artificial Neural Network that is trained with stochastic gradient descent using back-propagation. As is typical of ANNs, it’s embarrassingly parallel and indeed fully hammered all available processors.

It is important to note that I had to disable the “early stopping” option to force H2O to perform the same amount of computation while training the network. This was a necessary measure due to the stochastic nature of the Deep Learning implementation.
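As a rough sketch, the Deep Learning settings used in this post could be expressed with H2O's Python API as shown below. This is an assumption about how the runs might be reproduced, not a record of how they were actually launched. The parameter names `epochs`, `fast_mode` and `stopping_rounds` belong to `H2ODeepLearningEstimator`, where `stopping_rounds=0` disables early stopping; the helper `train_deep_learning` and the dictionary names are hypothetical.

```python
# Settings taken from the test descriptions in this post.
DL_PARAMS_TEST1 = {"epochs": 10, "fast_mode": True, "stopping_rounds": 0}
DL_PARAMS_TEST2 = {"epochs": 0.1, "fast_mode": False, "stopping_rounds": 0}


def train_deep_learning(training_frame, response, params):
    """Train an H2O Deep Learning model with early stopping disabled.

    `training_frame` is an H2OFrame that has already been parsed and
    `response` is the name of the column to predict. Both are supplied
    by the caller; nothing about the dataset is assumed here.
    """
    # Imported lazily so the settings above can be inspected without H2O.
    from h2o.estimators import H2ODeepLearningEstimator

    model = H2ODeepLearningEstimator(**params)
    model.train(y=response, training_frame=training_frame)
    return model
```

With a running cluster, a call would look like `train_deep_learning(frame, "<response column>", DL_PARAMS_TEST1)`, where the response column depends on the dataset being used.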

Each test was repeated three times and the best time was recorded.
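A best-of-three measurement like this can be scripted in a few lines. The helper below is a generic sketch (the name `best_of` is made up) that times any callable with Python's standard `time.perf_counter` and keeps the fastest run; the workload in the example is just a placeholder.

```python
import time


def best_of(func, repeats=3):
    """Run func() `repeats` times and return (best_time, all_times) in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings), timings


# Placeholder workload; in practice this would be a parse or training call.
best, timings = best_of(lambda: sum(range(100_000)))
print(f"best of {len(timings)} runs: {best:.6f} s")
```

Taking the minimum rather than the mean filters out runs that were slowed down by unrelated background activity.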

Test 1: 15GB file, DL Fast Mode: True, Number of Epochs: 10

Parsing large files is a very IO-intensive task. In this respect, even a relatively fast Hard Disk (spinning at 10k rpm compared to 7.2k rpm on desktops and 5.4k rpm on laptops) is showing its age.

The Fast Mode in H2O’s Deep Learning enables a minor approximation in back-propagation. This basically means that less computation is performed on each observation while training the Neural Network. Effectively, this makes the test memory bound. The Bare Metal cluster managed to crunch 260 thousand samples per second against 198 thousand for the virtualized cluster, a shortfall of roughly a quarter (about 24%; you can think of it as some 30 of the 120 processors being wasted by the hypervisor!).
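The size of that gap can be checked with a short calculation. The sketch below uses only the two throughput figures reported above; the function name is made up for illustration.

```python
def throughput_gap(bare_metal, virtualized):
    """Fraction of bare-metal throughput that the virtualized cluster loses."""
    return (bare_metal - virtualized) / bare_metal


BARE_METAL_SPS = 260_000   # samples per second, Bare Metal cluster
VIRTUALIZED_SPS = 198_000  # samples per second, virtualized cluster
TOTAL_PROCESSORS = 120

gap = throughput_gap(BARE_METAL_SPS, VIRTUALIZED_SPS)
equivalent_processors = gap * TOTAL_PROCESSORS
speedup = BARE_METAL_SPS / VIRTUALIZED_SPS

print(f"throughput lost to virtualization: {gap:.1%}")
print(f"equivalent idle processors: {equivalent_processors:.1f}")
print(f"bare metal is {speedup:.2f}x faster")
```

Measured the other way round, the Bare Metal cluster is about 31% faster than the virtualized one; both figures describe the same pair of measurements.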

The performance gap was much more visible in the GLM test, which, as indicated previously, is largely single-threaded.

Test 2: 15GB file, DL Fast Mode: False, Number of Epochs: 0.1


Note: For this test, the number of epochs was reduced to just 0.1.

Here, the Deep Learning test was repeated but with Fast Mode disabled to force H2O to perform more computation on every observation. Thus, this test is largely processor bound. The performance gap was reduced to about 15% thanks to the hardware-assisted virtualization features of the Intel Xeon processors. The Intel Virtualization Technology bundle (VT-x, VT-d, VT-x with Extended Page Tables) significantly reduces virtualization overhead by offloading certain workloads from the hypervisor to dedicated functional units in the processor itself.

Test 3: 4MB file, GLM


The outcome of this test was very interesting (and rather controversial!). The Bare Metal cluster performed exactly as one would expect, taking about 2 seconds to train the model. The virtualized cluster had enormous trouble converging the model and took dramatically longer. During the test, I observed that the busy thread kept jumping from one node to another. Since this behavior was not observed in the Bare Metal cluster, it is most plausibly attributable to the hypervisor’s scheduler. Interestingly, H2O took noticeably less time to train the same GLM model using the larger dataset!


Most cloud computing providers rely on virtualization to deliver cheap virtual machines to consumers, data scientists included. Originally, virtualization was developed to increase the efficiency and utilization of computing resources in typical business environments. The virtualization overhead can be justified in these environments because computing resources are normally underutilized. However, for Machine Learning tasks, where computing resources can be pushed to the extreme, the impact of virtualization can be overwhelming and unpredictable. Data scientists who are considering a virtualized cloud platform such as Amazon EC2, Microsoft Azure or Google Cloud for Deep Learning or similar workloads should consider buying extra resources to make up for the overhead of the hypervisor.

An alternative to the hypervisor is to use containers such as Docker. While this may eliminate the overhead of the hypervisor, performance can still suffer due to oversubscription of hardware resources. Containers are also considered less secure than virtual machines.

The ultimate option for Machine Learning tasks remains dedicated hardware, either on-premises or off-premises, such as Bigstep’s Bare Metal cloud.

Note: This article first appeared on CognitionX blog