Which is faster for Machine Learning and AI: CPU or GPU?
“If you were plowing a field, which would you rather use: [eight] strong oxen or [32768] chickens?”* Seymour Cray (1925–1996)
The short answer is: it all depends.
You have probably come across countless articles praising the power of GPUs (Graphics Processing Units) and claiming that they are far more capable than CPUs (Central Processing Units). While this is true in many use cases, it is quite the contrary in many others.
In order to determine when GPUs should be preferred over CPUs (or the other way around), it helps to have a basic understanding of the architecture that underpins each platform.
The original architecture of CPUs assumed that software would be executed sequentially, line after line, and the Organisation, Instruction Set and Memory Management were designed accordingly. The emphasis was on executing every single instruction as fast as possible. In practice, however, there is a limit to how fast a CPU can operate: the fastest CPUs from Intel and AMD today are “officially” clocked well below 5GHz. This limitation is felt very clearly in Multimedia, Gaming and Big Data applications, where even the fastest CPUs perform poorly.
In order to overcome this performance bottleneck, a new architecture was proposed. In sharp contrast to the traditional CPU architecture, the emphasis was put on executing as many instructions as possible in parallel, at the cost of slow sequential execution. This also has implications for Memory Management: GPU memory delivers much higher bandwidth (more gigabytes per second) but at significantly higher latency (the time it takes for data to be transferred). This access pattern is why the GPU memory architecture is commonly described as “Gather/Scatter”. GPUs are also clocked at a considerably lower frequency, typically around 1GHz, to keep energy consumption and heat under control.
Today, a typical high-end CPU has 24 Cores with Hyperthreading and 85GB/s of memory bandwidth (Intel Xeon E7-8894 v4), whereas a high-end GPU has 3584 Cores and over 700GB/s of memory bandwidth (Nvidia Tesla P100).
To summarise, these architectural trade-offs place CPUs and GPUs at opposite ends of the design spectrum. To determine which one is most appropriate for a given task, Amdahl’s Law can be of great help: it quantifies the maximum speedup achievable when only part of a task can be parallelised. In essence, not every task can fully benefit from parallel execution (for various reasons). Ideally, the “sequential” part should be assigned to the CPU, and the parallelisable part to the GPU.
The key question here is how much of a program is sequential, and how much is parallel?
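As a minimal sketch of how Amdahl’s Law answers that question, the Python snippet below computes the theoretical speedup for a few parallel fractions, using the 3584 cores of the Tesla P100 mentioned above as the number of parallel workers. The figures are illustrative upper bounds, not measurements:

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Maximum speedup when only `parallel_fraction` of the work
    can be spread across `n_workers` parallel units (Amdahl's Law)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)

# A program that is only 90% parallel tops out at roughly 10x,
# no matter how many thousands of cores the GPU offers.
for p in (0.50, 0.90, 0.99):
    print(f"parallel fraction {p:.0%}: speedup <= {amdahl_speedup(p, 3584):.1f}x")
```

The serial fraction dominates very quickly, which is why a handful of fast CPU cores can still outrun a GPU on mostly sequential work.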
From a Machine Learning perspective, algorithms such as Artificial Neural Networks and Random Forests exhibit a very high degree of parallelism, making them an excellent fit for GPUs. On the other hand, algorithms such as GLMs (Generalised Linear Models) tend to be largely sequential and would perform very poorly on GPUs.
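As a rough illustration (not a rigorous benchmark), the sketch below times a large matrix multiplication, the core operation inside a neural network layer, first on the CPU and then on the GPU. It assumes PyTorch is installed and a CUDA-capable GPU is present; the actual speedup will vary widely with hardware and matrix size:

```python
import time
import torch

def time_matmul(device, size=4096, repeats=5):
    """Average time for one large matrix multiply, a highly parallel workload."""
    x = torch.randn(size, size, device=device)
    torch.matmul(x, x)  # warm-up run so one-off setup costs are not measured
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(x, x)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for the asynchronous GPU kernels to finish
    return (time.perf_counter() - start) / repeats

print(f"CPU: {time_matmul('cpu'):.4f} s per multiply")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul('cuda'):.4f} s per multiply")
```

Because every element of the output can be computed independently, thousands of GPU cores can be kept busy at once; a mostly sequential fitting procedure offers no such opportunity.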
After all, 32768 chickens could beat eight strong oxen, but only if all of the chickens were put to work simultaneously.
* This is a slightly modified version of Cray’s original quote.