TPU vs GPU vs Cerebras vs Graphcore: A Fair Comparison of ML Hardware

Mahmoud Khairy
29 min read · Jul 23, 2020

Over the last few years, several hardware platforms have been developed to accelerate machine learning training. These include platforms from giant companies, like Nvidia's GPU and Google's TPU, and from startups, like Graphcore's IPU and the Cerebras Wafer-Scale Engine (WSE). Unfortunately, companies use biased measurement methods and do not ensure a fair apples-to-apples comparison when comparing their new designs against competitors' solutions. To address this issue, MLPerf is a recent effort from industry and academia to build standardized benchmarks for comparing different deep learning hardware and systems. However, MLPerf only considers raw performance (i.e., training time) as the key measurement and does not take into account other important metrics, such as compute efficiency (hardware utilization), power efficiency (performance/watt), area efficiency (performance/rack), and cost efficiency (performance/dollar).

A summary of this article's contents can be found in these slides.

Performance per Dollar per Watt per Unit

In this article, I try to analyze and compare the state-of-the-art deep learning hardware platforms on the market today. My analysis is based on publicly available information from vendors and other sources. Instead of relying only on training performance as the key metric, I look into efficiency and scalability metrics. That is, we do not care how big the chip is or how many theoretical TFLOPS the hardware provides; what we really care about is: How efficient is the hardware? How many of the theoretical TFLOPS are actually achievable? What does the performance look like when the chips are given the same cost, area, and power budgets? And how well can the system scale and run large DL models? These are the key questions that customers and machine learning scientists care about. In reality, the majority of ML scientists work either in academic labs or in ML startups with limited budgets, so they are interested in selecting the most efficient hardware with the lowest purchase and maintenance cost. Before showing the efficiency comparison, I would first like to discuss the design philosophy of Nvidia's GPU and Google's TPU (data parallelism) versus Graphcore's IPU and the Cerebras WSE (model parallelism).

Design Philosophy (Model vs Data Parallelism):

There are two kinds of parallelism to exploit in deep learning (DL) training: data parallelism (improving training throughput) and model parallelism (improving training iteration latency). A hybrid of model and data parallelism can also be applied.

Data Parallelism

Exploiting data parallelism on multi-GPU nodes in the minibatch stochastic gradient descent (SGD) algorithm. At the end of each iteration, an all-reduce synchronization is performed to accumulate the model gradients.

Data parallelism relies on weak scaling: the batch size of the minibatch-SGD algorithm is increased and a sub-batch is assigned to each node. The whole model has to be duplicated on each node, and the model is executed layer by layer in forward and backward propagation to calculate the error and the model gradients. An all-reduce synchronization has to occur at the end of each iteration to accumulate the model gradients (i.e., an MPI_Allreduce-style operation). Nvidia's GPU and Google's TPU are built to rely more on large-scale data parallelism. They run on a large number of nodes with large DRAM capacity to hold the duplicated model and deploy high-speed interconnects to ensure fast all-reduce synchronization. Nvidia relies on high-speed NVSwitch and InfiniBand technologies for interconnecting the nodes, while it is not publicly known what interconnect Google's TPU employs. To scale to a large number of nodes, a tedious tuning process is required to adjust the hyperparameters and increase the batch size without hurting the training accuracy. The larger the batch size, the higher the throughput and training performance.
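
To make the mechanics concrete, here is a minimal, framework-free sketch of data-parallel minibatch SGD on a toy linear-regression problem: each simulated worker computes a gradient on its own sub-batch, and an explicit averaging step plays the role of the all-reduce. The worker count, model, and learning rate are made-up toy values; a real system would perform the reduction with NCCL, MPI, or a similar library.

```python
import numpy as np

# Toy linear-regression model trained with data-parallel minibatch SGD.
# Each "worker" stands in for one chip holding a full copy of the weights.
rng = np.random.default_rng(0)
n_workers, global_batch, n_features, lr = 4, 64, 8, 0.1
sub_batch = global_batch // n_workers                        # sub-batch per chip
true_w = rng.normal(size=n_features)
weights = [np.zeros(n_features) for _ in range(n_workers)]   # replicated model

for step in range(100):
    local_grads = []
    for w in weights:                       # each worker: forward/backward on its own sub-batch
        x = rng.normal(size=(sub_batch, n_features))
        y = x @ true_w
        local_grads.append(2 * x.T @ (x @ w - y) / sub_batch)
    # All-reduce: every worker ends up with the average gradient over the global batch.
    avg_grad = np.mean(local_grads, axis=0)
    weights = [w - lr * avg_grad for w in weights]

print("weight error:", np.linalg.norm(weights[0] - true_w))
```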

Model Parallelism

Exploiting Model Parallelism on Wafer-scale Engine (left image), and on multiple Graphcore IPU chips (right image), image sources: [6,8]

Model parallelism exploits intra-layer (GEMM) and inter-layer (pipeline) parallelism by running multiple layers in parallel and transferring data between layers in a dataflow manner. In this scenario, there is only one copy of the model to be updated. Graphcore's IPU and the Cerebras WSE are built to exploit model parallelism efficiently. The design philosophy is to map the whole model onto a large on-chip SRAM (a few GB) and avoid expensive DRAM and I/O transfers. Cerebras builds a single big die (chip) using wafer-scale technology to accommodate the largest possible model; the Cerebras compute units are connected with an on-wafer 2D mesh topology. Graphcore, on the other hand, builds multiple relatively smaller chips connected via a high-speed custom fabric (IPU-Links) with a bidirectional ring topology.
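
To illustrate the dataflow idea, below is a minimal sketch of inter-layer (pipeline) model parallelism in plain Python/NumPy: two simulated "devices" each own half of a tiny MLP, and only activations move between them while the weights stay put. The layer sizes, the two-stage split, and the micro-batching are illustrative assumptions, not a description of the Graphcore or Cerebras toolchains.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny MLP split across two "devices": each device owns some layers,
# and only activations flow between them (the weights never move).
class Stage:
    def __init__(self, layer_shapes):
        self.layers = [rng.normal(scale=0.1, size=s) for s in layer_shapes]

    def forward(self, x):
        for w in self.layers:
            x = np.maximum(x @ w, 0.0)        # ReLU(x @ W)
        return x

stage0 = Stage([(32, 64), (64, 64)])          # first half of the model on "device 0"
stage1 = Stage([(64, 64), (64, 10)])          # second half on "device 1"

# Pipeline: stream micro-batches through the stages; real hardware overlaps
# stage 0's work on micro-batch i+1 with stage 1's work on micro-batch i.
micro_batches = [rng.normal(size=(8, 32)) for _ in range(4)]
outputs = []
for mb in micro_batches:
    activations = stage0.forward(mb)          # stage 0 computes and hands activations over
    outputs.append(stage1.forward(activations))

print(outputs[0].shape)                       # (8, 10)
```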

Hybrid Parallelism

Thanks to CUDA unified memory addressing and NVSwitch technology, model parallelism can also be exploited on Nvidia's GPUs by running a model's layers (i.e., matrix multiplication operations) on multiple GPUs [53], but it will probably not be as efficient as the Graphcore and WSE dataflow chips. Similarly, Graphcore and Cerebras can still support data parallelism, but only for a small batch size that fits in their SRAM capacity. Moreover, Graphcore and Cerebras claim they can also support large-scale data parallelism using multiple servers connected via Ethernet; however, I believe it will not be as efficient as TPUs and GPUs, which have better IO interconnects (e.g., Nvidia Mellanox InfiniBand), large-scale tightly coupled nodes (e.g., Google's TPU Pod and Nvidia's SuperPOD), and highly tuned software and hardware synchronization (e.g., Nvidia's NCCL and Mellanox SHARP technology). In short, each platform can still exploit the other type of parallelism, but not as efficiently as its counterparts.

Around the time this article was written, Graphcore announced its second-generation IPU (IPU2). Interestingly, they improved the IPU to efficiently support large-scale data parallelism and large models. First, the second-generation IPU comes with DDR4 external memory (up to 450 GB per node) to overcome the limited on-chip SRAM capacity. Second, a custom high-speed fabric within the IPU-POD is used to efficiently scale out IPUs for large-scale data parallelism, similar to the GPU and TPU Pods.

Batch Size and Accuracy Tradeoff

The tradeoff between batch size and training accuracy is debatable. Increasing the batch size can hurt the target accuracy by missing the minimum error point [46,47]. Thus, MLPerf defines a quality threshold for each benchmark (for example, 74.9% Top-1 accuracy for ResNet-50 training [37]). That is, companies can tune the modifiable hyperparameters and scale the batch size as long as they meet the specified accuracy. This hyperparameter optimization process is expensive and time-consuming. On the other hand, many startups, including Cerebras and Graphcore, try to avoid scaling the batch size, aiming for better training accuracy than the large-scale data-parallelism approaches, and rely more on model parallelism to improve training time.

Batch Size Trick

Increasing the number of chips in data parallelism training by (1) increasing the batch size or (2) decreasing the sub-batch per chip

In data-parallelism training, increasing the throughput (i.e., the number of chips) can be achieved in two ways: (1) scaling the batch size, or (2) decreasing the sub-batch per chip. Increasing the batch size makes training converge more slowly, while decreasing the sub-batch may hurt hardware utilization.
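
A trivial arithmetic sketch of the two options (all numbers are illustrative):

```python
# Two ways to go from 16 to 64 chips in data-parallel training (illustrative numbers).
def sub_batch(global_batch: int, n_chips: int) -> int:
    return global_batch // n_chips

# Option 1: scale the global batch, keep the per-chip sub-batch constant.
print(sub_batch(4096, 16))    # 256 images/chip
print(sub_batch(16384, 64))   # still 256 images/chip, but convergence may slow down

# Option 2: keep the global batch, shrink the per-chip sub-batch.
print(sub_batch(4096, 64))    # 64 images/chip, so hardware utilization may drop
```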

The image is adapted from Google and is based on the MLPerf v0.6 submission. It shows the TPU-v3 outperforming the Nvidia V100 GPU on the Transformer and SSD workloads using a large-scale TPU system (TPU Pod). As we can read from the image caption, the number of TPU and GPU chips is not the same.

MLPerf allows submitters to tune the batch and sub-batch sizes to increase system throughput. For example, as shown on the MLPerf website here, when GPUs and TPUs run similar batch sizes and use the same number of chips, they show almost the same training performance on the SSD and Transformer benchmarks. Interestingly, Google announced here (the image is attached above) that they outperform the GPU by 84% on these workloads. The primary reason is not that the TPU chip is more powerful than the GPU, but that Google did a better job than Nvidia of tuning the hyperparameters, batch, and sub-batch sizes for the underlying system, thereby increasing training throughput. As demonstrated in the image caption above, the number of TPU and GPU chips is not the same.

As we can see, MLPerf folds in system-level and software implications, and thus it does not tell us which hardware platform is more efficient. In the next sections, I am going to do an apples-to-apples comparison focusing on efficiency metrics.

Chip-to-Chip Comparison:

In this section, I compare the state-of-the-art chips from each vendor. I will do a chip-to-chip comparison, then show server-to-server and rack-to-rack comparisons in the subsequent sections. I compare Google's TPU-v3, Nvidia's Volta V100, Graphcore's Colossus first-generation IPU (IPU1), and Cerebras WSE chips. I also include the recently announced Nvidia Ampere A100 and Graphcore second-generation IPU (IPU2).

Chip-to-chip comparison of different DL hardware. *Notes: (1) The Google TPU technology node, SRAM capacity, and TDP are based on recent announcements from Google [33,34]. (2) The TPU die area and transistor count are estimated based on [31,33,34]. (3) The Graphcore frequency is estimated from the theoretical TFLOPS and the number of MACs, i.e., freq = TFLOPS / (2 * #MACs). (4) The achievable GEMM TFLOPS of the TPU, IPU1, and GPU are taken from [12, 25, 32]. (5) The total SRAM capacity of the GPU includes the register file, shared memory, and the L1 and L2 caches. (6) The theoretical TFLOPS of Cerebras is taken from [2].

The table above shows the raw metrics (the first 11 entries) that are announced by the vendors or estimated from other public resources, along with the efficiency metrics (entries 12–19) that are calculated from the raw metrics and other public benchmarking resources. I discuss these efficiency metrics in detail in the following points.

(1) Compute Efficiency (entry 12)

GEMM achievable performance on TPU-v3, V100, A100, and Graphcore IPU1 chips. Source: [32, 23, 25, 12]

Hardware vendors typically announce the theoretical peak TFLOPS (entry 5). However, it is important to measure how many TFLOPS we are actually able to achieve out of that theoretical performance. This metric shows how efficiently the hardware architecture and memory system can keep the MAC units fed with data. The achievable performance is measured by running a large dense matrix multiplication (GEMM), which fully utilizes the underlying hardware resources; we use GEMM because it is the backbone of ML workloads. In the next section, when comparing server-to-server, I will instead use ML training time to measure hardware performance. The achievable GEMM TFLOPS of the TPU, GPU, and IPU1 are plotted in the image above and are taken from [32, 23, 25, 12], respectively. We report 16-bit mixed-precision FLOPS for all the hardware and use the results of the optimized linear algebra libraries released by the hardware vendors (for example, cuBLAS for Nvidia and Poplin for Graphcore).
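
For reference, this is roughly what such a measurement looks like on a GPU: a minimal sketch, assuming PyTorch and a CUDA-capable device, that times an FP16 GEMM (dispatched to cuBLAS under the hood) and converts the elapsed time into achieved TFLOPS. The matrix size and iteration count are arbitrary choices, not the settings used in the referenced studies.

```python
import time
import torch

def achieved_tflops(n=8192, iters=20):
    """Time an n x n FP16 GEMM and report achieved TFLOPS (2*n^3 FLOPs per matmul)."""
    a = torch.randn(n, n, device="cuda", dtype=torch.float16)
    b = torch.randn(n, n, device="cuda", dtype=torch.float16)
    for _ in range(3):                       # warm-up so the first kernel launch is excluded
        torch.matmul(a, b)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters
    return 2 * n**3 / elapsed / 1e12

print(f"achieved: {achieved_tflops():.1f} TFLOPS")   # compare against the chip's theoretical peak
```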

  • The TPU has the highest hardware utilization, thanks to its systolic array architecture, and is able to achieve 80–100% of the theoretical performance depending on the GEMM input size [32]. A matrix dimension (or batch size) of at least 16K is required to reach 98% of peak.
  • The GPU is able to achieve 70% to 93% of the theoretical TFLOPS. Although Nvidia's GPU can reach 99% utilization for 32-bit single-precision operations [25], this is not the case for the 16-bit Tensor Cores. Performance is best when the matrix dimensions are aligned with (i.e., a multiple of) the tile boundaries and SM count. In that case, up to 88% (110 TFLOPS) is achievable on the V100 and 93% (290 TFLOPS) on the A100 when running 4K and 8K matrices, respectively. See references [23,25] for further details.
  • The titanic Cerebras chip comes with 2.5 PFLOPS of theoretical peak performance. A recent study [2] shows that Cerebras can attain 33% of peak performance when solving a linear system arising from a finite-difference stencil, claiming higher utilization than a GPU cluster on the same problem. However, we do not have enough information about the achievable performance when running GEMM or ML workloads.
  • On the Graphcore IPU1, only 50% of the performance (58 TFLOPS) is achievable when executing GEMM [12], lower than its GPU and TPU counterparts. This low achievable performance can be attributed to poor instruction scheduling in the backend libraries and compiler. In the IPU2, the compute efficiency has improved to 61% (154 TFLOPS).

(2) Energy Efficiency (entries 13-14)

For this metric, I measure energy efficiency (TFLOPS/watt) by dividing the achievable GEMM TFLOPS by the max TDP (i.e., entry 13 = entry 12 / entry 11). I conservatively use the theoretical max TDP; for an accurate evaluation we should measure the actual power drawn by the hardware at run time, but that number is not reported and is hard to obtain. I also calculate the theoretical energy efficiency in entry 14 (i.e., entry 5 / entry 11).
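
The calculation itself is a simple ratio; the sketch below applies it to V100-like inputs (the 110 achievable TFLOPS quoted above and the V100's 300 W TDP) purely as a worked example, not a reproduction of the table.

```python
def tflops_per_watt(achievable_tflops: float, max_tdp_watts: float) -> float:
    """Entry 13: achievable GEMM TFLOPS divided by the chip's max TDP."""
    return achievable_tflops / max_tdp_watts

# Worked example with V100-like numbers: ~110 achievable TFLOPS at a 300 W TDP.
print(f"{tflops_per_watt(110, 300):.2f} TFLOPS/W")   # ~0.37
```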

  • Interestingly, and thanks to Tensor Cores, the Tesla V100 has energy efficiency similar to the TPU-v3. However, if we take into account that the TPU is fabricated on an older technology node (see entry 1, 16nm vs 12nm), then the TPU architecture is probably more energy-efficient than the V100 by a small margin (an estimated 25%).
  • The Graphcore IPU1 is 62% more energy-efficient than the V100 and the TPU, even though the Graphcore chip is fabricated on an older technology node (TSMC 16nm) than the more advanced V100 node (TSMC 12nm). Graphcore's efficiency is primarily due to its energy-efficient on-chip memory accesses.
  • The Cerebras WSE shows the lowest theoretical energy efficiency, even compared to the chips fabricated on the same technology node.
  • Thanks to the advanced 7nm technology and other architectural improvements, the Nvidia Ampere A100 and Graphcore IPU2 are the most energy-efficient chips of all (roughly 3x more energy-efficient).

(3) Memory/Model Size (entry 15)

The memory size determines the maximum matrix dimensions, ML model, and batch size that can run on the hardware (model size * sub-batch size < memory size). Due to the higher density of DRAM versus SRAM, GPU and TPU chips can execute larger matrices, ML models, and batch sizes than the Cerebras and Graphcore IPU1 chips, which do not contain any off-chip DRAM and rely on limited SRAM as their main memory. The largest square matrix operands fitting in the 300 MB of the IPU1 are 2,944×2,944, while on a 32 GB GPU they are roughly 50,000×50,000 [12]. So the energy-efficient on-chip accesses of the IPU do not come for free: they limit the input data size.
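
A back-of-the-envelope check of those limits: assuming three square FP32 operands (A, B, and C) and nothing else in memory, the naive bound below ignores code, per-tile buffers, and workspace, which is why the practical IPU1 limit measured in [12] falls well below it. The choice of FP32 operands is my assumption, not something stated in [12].

```python
from math import sqrt

def max_square_gemm_dim(capacity_bytes: float, bytes_per_elem: int) -> int:
    """Largest n such that three n x n operands (A, B, C) fit in capacity_bytes."""
    return int(sqrt(capacity_bytes / (3 * bytes_per_elem)))

print(max_square_gemm_dim(300e6, 4))   # ~5,000 upper bound vs the 2,944 measured in [12]
print(max_square_gemm_dim(32e9, 4))    # ~51,600 upper bound vs the ~50,000 reported for a 32 GB GPU
```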

The Graphcore IPU2 overcomes the limited SRAM capacity by augmenting the IPU2 chip with 112 GB of DDR4 external memory. The runtime framework is responsible for transparently transferring data between the DDR4 and the on-chip SRAM (this technology is known as Exchange Memory [58]). However, we do not know the achievable GEMM performance of the IPU when the matrices reside in the external DRAM; the achievable 50% and 61% figures listed in the table were collected with the matrices in SRAM. It is worth noting that the IPU's DDR4 has much lower bandwidth than the TPU's and GPU's HBM (see entry 8), so I expect the IPU2 performance to drop significantly if the data is allocated in DRAM.

(4) Memory Efficiency (entries 16-17)

For this metric, I measure how efficiently the memory system exploits the data locality found in GEMM operations (i.e., achieving the highest possible TFLOPS with the lowest memory bandwidth budget). It is calculated by dividing the achievable GEMM performance by the DRAM memory bandwidth (FLOP per DRAM byte, i.e., entry 12 / entry 8). This is also known in the literature as "operational intensity". Again, I conservatively use the theoretical memory bandwidth; for an accurate evaluation we should measure the actual memory bandwidth attained at run time.
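
The metric is a one-line calculation; the example below uses V100-like inputs (110 achievable TFLOPS from the text and roughly 900 GB/s of HBM2 bandwidth, the latter being my own approximate figure) purely for illustration.

```python
def operational_intensity(achievable_tflops: float, mem_bw_gb_per_s: float) -> float:
    """Entry 16: sustained FLOPs per DRAM byte (achievable TFLOPS / DRAM bandwidth)."""
    return (achievable_tflops * 1e12) / (mem_bw_gb_per_s * 1e9)

# Worked example with V100-like numbers: 110 achievable TFLOPS over ~900 GB/s of HBM2.
print(f"{operational_intensity(110, 900):.0f} FLOP per DRAM byte")   # ~122
```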

  • The V100 GPU and the TPU achieve similar memory efficiency, as they deliver similar performance at the same memory bandwidth budget.
  • The A100 achieves 2.6x more TFLOPS while its memory bandwidth has only increased 1.7x over the previous generation, making the A100 50% more memory-efficient than the V100, thanks to the architectural improvements in Ampere's cache hierarchy [18].
  • Compared to Graphcore, GPUs are significantly more memory-efficient when measuring FLOP per SRAM byte (entry 17 = entry 12 / entry 10). Graphcore has higher on-chip local memory bandwidth (45 TB/s) than the GPU's L2 cache bandwidth (3 TB/s in Volta and 7 TB/s in Ampere). The multi-level memory and cache hierarchy of GPUs (DRAM -> L2 -> L1 -> shared memory -> register file) is able to exploit the data locality in dense GEMM operations efficiently, reducing the bandwidth requirements on the lower memory levels and improving overall energy efficiency.

(5) Area Efficiency (entries 18-19)

To reduce the effect of the technology node, we use TFLOPS per billion transistors (TFLOPS/BTran) as the area-efficiency metric. By this measure, the TPU has the highest performance density, because it employs domain-specific systolic arrays for matrix multiplication. On the other side, the GPU allocates a considerable transistor budget to other, non-ML domains (e.g., graphics, the crossbar interconnect, TLBs, 64-bit HPC precision, etc.), and Graphcore dedicates a large area to its SRAM, decreasing the space allocated to compute units.

Server-to-Server Comparison:

Typically, ML scientists run their experiments on multiple chips to increase computing power. Thus, ML hardware vendors combine multiple chips, a high-speed interconnect fabric, multi-socket CPUs, and system storage in a server box. In this section, I compare the server performance of each vendor. Specifically, I include: (1) a TPU-v3 server with two shelves, where each shelf has two boards and each board contains four TPU chips, a total of 16x TPU chips; (2) an Nvidia DGX-2H V100 server with 16x V100 chips; (3) an Nvidia DGX A100 server with 8x A100 chips; (4) a Cerebras CS-1 server with one WSE chip; (5) a Graphcore Dell EMC IPU server with 8x IPU1 cards, each card holding 2 IPU1 chips, a total of 16x IPU1 chips; and (6) a Graphcore IPU-POD-16 with 4x 1U IPU-M2000 server machines (4U in total), each machine holding 4x IPU2 chips, a total of 16x IPU2 chips. It is worth noting that the Graphcore IPU2 has a disaggregated, flexible, and modular configuration in which you can seamlessly build and connect from 1x up to 16x IPU-M2000 machines into one big server. For the purpose of this comparison, I selected the pre-packaged 4x IPU-M2000 server (IPU-POD-16) connected to 1U of the DELL R6526 CPU server, as described in the IPU datasheet [59].

From left to right: (1) TPU server (shelf) with two boards, each board has four chips. (2) Nvidia DGX A100 server with 8x A100 chips. (3) Cerebras CS-1 server with one WSE chip. (4) Graphcore Dell EMC IPU server with 8x cards, each card has 2 IPU1 chips, 16 IPU1 chips in total. (5) Graphcore IPU-M2000 1U server machine with 4x IPU2 chips. Note that the images/servers are not on the same scale.
Server-to-server comparison. Notes: (1) The Google TPU server power and size are adapted from [33]. (2) The Nvidia DGX V100, DGX A100, Cerebras CS-1, Graphcore IPU1, and IPU2 server details are taken from [16, 17, 1, 7, 59], respectively. (3) The TPU-v3 and V100 GPU ResNet MLPerf results are taken from the MLPerf website [37], whereas the A100 result is estimated from the Nvidia website [38]. The A100 data shown is for the A100-40GB card, as results for the A100-80GB have not been published yet. (4) The TPU and GPU cloud prices are taken from Google Cloud, whereas the Graphcore price is taken from Cirrascale cloud services. (5) The DGX A100 server theoretical performance (2.5 PFLOPS) is without sparsity, as we are interested in training performance. (6) The performance per cloud price (entry# 23) is calculated as ResNet performance / (chip cloud price * #chips), i.e., entry 14 / (entry 12 * entry 1).

The table above shows the raw metrics (the first 13 entries) that are announced by the hardware vendors or estimated from other public resources, along with the efficiency metrics (entries 14–23) that are calculated from the raw metrics and other public benchmarking resources. I discuss these metrics in detail in the following points.

(1) ML Training Achievable Performance (entry 14)

16x GPU server (DGX-2H) vs 16x TPU-v3 server normalized performance on MLPerf training benchmarks. The data is collected from the MLPerf website. All the TPU results use TensorFlow; all the GPU results use PyTorch, except for ResNet, which uses MXNet.

I compare the training throughput (images/sec) on the MLPerf ResNet-50 v1.5 benchmark. I collected the results from the MLPerf v0.6 and Nvidia websites. To remove the batch-size effect, I made sure to select the performance results where both the TPU-v3 and GPU servers have the same number of chips (16x chips), the same total memory size (512 GB of DRAM), and almost the same compute throughput (2 PFLOPS).

  • Based on the results submitted by the vendors, the TPU-v3 outperforms the GPU V100 by 23% on ResNet training. However, it is important to note that this is not the general case, as the best hardware depends largely on the DL model itself (as shown in the image above). For example, the GPU V100 matches or even outperforms the TPU on object detection, transformer, and recurrent DL models. For complete results, please see the MLPerf website. For the rest of the article, I use the ResNet-50 benchmark (entry 14) for comparison, as it is widely popular.
  • The DGX A100 server comes with fewer chips than the DGX-2 V100 server. At the time this article was written, only the 8-chip DGX A100 server was available; Nvidia may later release a DGX A100 with 16 chips. Although the DGX A100 has half the number of chips of the DGX-2 V100, it still comes within 87% of its performance on the ResNet-50 training benchmark (see entry 14).
  • Unfortunately, Cerebras and Graphcore have not yet submitted MLPerf results that we could compare with the TPU and GPU. On the Graphcore website here, they claim the 16x IPU server outperforms an unspecified GPU server by 1.3x on the ResNeXt-50 benchmark.

(2) Model/Memory size (entry 15)

Recent NLP models along with their number of parameters. The recent GPT-3 has 175B parameters. NLP scientists aim to build a model with 1 trillion parameters. Image: Source

The memory capacity determines the model and batch sizes that can run on the hardware (model size * sub-batch size < memory size). Thanks to high DRAM density, the GPU and TPU have larger memory capacities, on the order of 32 GB. More importantly, thanks to CUDA unified addressing and the non-blocking communication of NVSwitch, Nvidia GPUs can transparently access each other's DRAM. This allows multiple GPUs to be combined to run models larger than a single GPU's memory capacity (up to 512 GB in the DGX-2 V100 and 640 GB in the DGX A100), by exploiting model parallelism and partitioning the model parameters over multiple GPUs. Thus, it is no wonder that all the large, recently published NLP breakthroughs were done on GPUs, including Nvidia's Megatron-LM (8.3B parameters), Microsoft's Turing-NLG (17B parameters), and the giant OpenAI GPT-3 (175B parameters). A single batch of the GPT-3 model (batch size = 1) is estimated to require 400 GB of memory! Moving forward, NLP scientists aim to build a 1-trillion-parameter model.
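
As a rough sanity check on such numbers, the sketch below estimates a model's memory footprint from its parameter count. The per-parameter byte counts are common rules of thumb (2 bytes for FP16 weights; roughly 16 bytes per parameter once gradients, FP32 master weights, and Adam moments are included) and are my assumptions, not figures taken from the article.

```python
def model_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Rough memory footprint in GB for a model with n_params parameters."""
    return n_params * bytes_per_param / 1e9

# FP16 weights alone: 2 bytes per parameter.
print(model_memory_gb(175e9, 2))    # ~350 GB just to hold GPT-3's weights
print(model_memory_gb(8.3e9, 2))    # ~17 GB for Megatron-LM's weights

# Training needs more: FP16 weights and gradients plus FP32 master weights and Adam
# moments are often estimated at ~16 bytes/param, before counting any activations.
print(model_memory_gb(8.3e9, 16))   # ~133 GB, already beyond a single 32 GB GPU
```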

The TPU is effectively limited to 16 GB of memory because the current version of the TPU does not support unified memory addressing: the TPU chip comes with 32 GB, but each chip contains two separate cores, and each core has its own dedicated 16 GB. For more information about the TPU training model, please see this detailed article [32] from Harvard University.

UPDATE (Jan 2021): Based on recent papers from Google (GShard [56] and Switch Transformer [57]), TPUs can still train large NLP models (up to a 1-trillion-parameter model) by exploiting model parallelism (partitioning the independent MoE layers over multiple nodes) and synchronizing the TPUs via their fast interconnect. For further details, see [56,57].

On the other side, the limited SRAM capacity of the Graphcore IPU1 and Cerebras is clearly a bottleneck for running these big models: even a single batch of Megatron-LM (an 8.3B-parameter model with 23 GB of data) cannot fit in the 18 GB of the Cerebras WSE or the roughly 5 GB of total SRAM in the Graphcore IPU1 server.

As mentioned earlier, to enable large-DL-model training and reduce the memory capacity gap with GPUs, Graphcore IPU2 servers come with high-capacity DDR4 memory modules: up to 1.8 TB of DRAM along with 14.4 GB of on-chip SRAM is available on the IPU-POD-16 machine.

(3) Performance/Area (entries 16–18)

In these metrics, I calculate how much performance we obtain per unit of area, which tells us how good the server's integration and packaging technology is. The area measurement here is the standard rack form factor U (see entry 4). I measure three metrics: the compute transistor density (compute transistors per U), the theoretical PFLOPS per U, and the ResNet training throughput per U.

  • Interestingly, Graphcore has the highest compute transistor density and the highest theoretical PFLOPS/U. One would expect Cerebras to be the highest on these metrics because of its wafer-scale technology; however, it seems the complex and large cooling and packaging system associated with the WSE chip has wiped out most of the space the WSE saves, so the 46,225 mm² chip comes in a 15U server. It is worth mentioning that both Cerebras and the Graphcore IPU1 are built on the same technology node (16nm). Further, the Graphcore server comes with a built-in CPU and system storage, whereas the Cerebras server does not come with any system components; instead, the customer connects the Cerebras CS-1 to a separate front-end server, as was done in the Neocortex system at the Pittsburgh Supercomputing Center. Of course, the on-wafer scale-up interconnect of the Cerebras WSE should provide higher bandwidth than off-chip interconnect technologies (see entry 7; the WSE has the highest TFLOPS per unit of scale-up interconnect bandwidth), and thus better scaling and compute efficiency, but we do not have any performance data (for example, MLPerf) to support this conclusion.
  • The TPU outperforms both the V100 and the A100 by 2x and 1.5x, respectively, on ResNet training throughput per area. Thanks to the TPU's dense packaging, liquid cooling, advanced fabric interconnect, and the separation of system storage into a different rack from the compute rack, 8x TPU chips fit in only a 2U server shelf. On the other hand, the NVSwitches and the large system and storage components of the DGX put the 16x V100 chips in a 10U server and the 8x A100 chips in a 6U server. It is worth mentioning that denser A100 server options are available from other OEM vendors; for example, Supermicro offers 8x A100 chips in only a 4U server and at a lower price than Nvidia's DGX.

(4) Performance/Power (entries 19–21)

In these metrics, I calculate the energy efficiency of the servers. Similarly, I calculate three metrics: the power density (watts per U), the theoretical TFLOPS per watt, and the ResNet training throughput per watt. I use the compute TDP reported in entry 10 (the power consumed only by the computing resources) rather than the system TDP, because the servers come with different CPU and storage configurations, as shown in entry 9.

  • For the power density metric, lower is better; the A100 server has the lowest, and thus the best, power density.
  • Theoretically, the Graphcore IPU1 server has higher TFLOPS/watt than the TPU-v3 and GPU V100 servers. Moving to 7nm, the IPU2 server is still 2x more energy-efficient than the A100 server.
  • Based on the practical results submitted to MLPerf for ResNet training, the TPU-v3 server is roughly 23% more energy-efficient than the V100 server, whereas the A100 server is better by 50%.

As I mentioned earlier, for an accurate evaluation we should measure the actual power drawn by the hardware during training rather than relying on the theoretical max TDP; however, that number is hard to obtain.

(5) Performance/Dollar (entries 22–23)

As an ML scientist, you would like to get the best training performance within the lowest possible budget. All the hardware is available for on-premise purchase (entry 13) except for the TPU, which is only available in Google Cloud. Thus, I calculate the performance/dollar metrics (entries 22–23) based on the cloud price.
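
The entry-23 metric boils down to a simple ratio; the sketch below mirrors the formula with made-up throughput and price inputs (none of these numbers come from the table).

```python
def images_per_dollar(images_per_sec: float, price_per_chip_hour: float, n_chips: int) -> float:
    """Entry# 23: training throughput divided by the hourly cloud cost of the whole server."""
    dollars_per_sec = price_per_chip_hour * n_chips / 3600.0
    return images_per_sec / dollars_per_sec

# Hypothetical 16-chip server: 20,000 images/sec at $2.00 per chip-hour.
print(f"{images_per_dollar(20_000, 2.00, 16):,.0f} images per dollar")
```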

  • Entry# 12 shows the cloud rental price per chip. In the cloud, the TPU is 20% cheaper than the GPU, and Graphcore is almost 2x cheaper than both. The TPU and GPU cloud prices are taken from Google Cloud Platform (on-demand, non-preemptible pricing), whereas the Graphcore cloud price is taken from Cirrascale cloud services. The GPU price on Google Cloud is similar to that of other cloud providers (e.g., AWS and Azure).
  • Entry# 13 shows the on-premise purchase price; the Cerebras system is the most expensive, with a cost estimated at 2M dollars per server [3]. If TPU servers were available for on-premise purchase, I suspect they would be more expensive than their GPU counterparts, as they employ advanced liquid cooling and packaging that increases the total integration cost.
  • The Graphcore IPU1 and IPU-POD-16 servers cost only 105K and 130K USD, respectively, for on-premise purchase and have the lowest cloud price. Thus, Graphcore has the cheapest theoretical PFLOPS per dollar.
  • For achievable ResNet training throughput per dollar (entry# 23), the TPU-v3 achieves 50% more throughput than the GPU V100 at the same cloud price, since the TPU is 25% faster at ResNet training and 20% cheaper on Google Cloud. However, the TPU-v3 is only 6% more price-efficient than the A100. Again, I would like to stress that this is not the general case; the best hardware may vary from one DL model to another, as shown in the training performance metric. Also, Nvidia has no say in the cloud price, which is determined solely by the cloud provider. This gives Google (as a cloud provider) the ability to control the cloud price and ensure the TPU is always cheaper than the GPU.

(6) Programmability and Software Ecosystem

It is worth mentioning that GPUs are programmable general-purpose accelerators that can be used in different domains, such as gaming, data visualization, data analytics, HPC, and deep learning training and inference. In fact, we should no longer interpret GPU as "Graphics Processing Unit"; we might as well call it a "General-Purpose Unit" accelerator. This one-for-all strategy is very beneficial for cloud providers: they have one piece of silicon that can be rented to customers from different domains, improving overall cloud resource utilization and thus increasing the profit margin. This is in contrast to having a domain-specific accelerator for each domain, especially if the performance gain of the accelerator over the programmable GPU is not large enough to justify the deployment cost in the data center.

Further, the GPU has a more powerful software ecosystem. GPUs are accessible from almost all DL frameworks (TensorFlow, PyTorch, etc.); the popular CUDA programming model allows scientists to write their own kernel and layer implementations; and, last but not least, a complete set of high-level ML frameworks (Nvidia's TensorRT, Triton, Jarvis, Merlin, etc.) empowers users with rich domain-specific APIs, increasing productivity and reducing time-to-market.

It is obvious that Nvidia's GPU leads in programmability and the software stack, something Nvidia's CEO always stresses during his GTC keynotes. While the startups have recently invested more in their software stacks to narrow this gap, I believe Nvidia will remain the leader here and will keep the more mature tools, owing to the relatively large number of software developers it has.

Rack-to-Rack Comparison:

When increasing the batch size further, a system larger than a single server is required. In this scenario, multiple servers/nodes are combined in a rack, similar to datacenter and supercomputer scaling.

From left to right: (1) TPU v3 rack, (2) Nvidia V100 DGX-2 server rack (two racks are shown), (3) Nvidia DGX A100 server rack (two racks are shown), (4) Cerebras CS-1 server racks, and (5) IPU server rack. Image sources: [21,35,1]
Rack-to-Rack Comparison

The table above shows the key characteristics of the racks from the different vendors. The efficiency metrics (performance per unit/dollar/watt) are similar to those obtained in the server comparison, so I did not recompute them in the table: if a vendor does a good job at the chip and server level, it should achieve the same efficiency at the rack and system levels. Interestingly, the TPU-v3 and Graphcore IPU2 racks pack more chips than Nvidia's: 128x chips in the TPU and IPU racks versus 64x chips in the V100 and A100 racks. It is worth mentioning that, based on the TPU rack image and other public resources [33], the TPU rack seems to be wider than the standard rack dimensions.

Pod-to-Pod Comparison:

Multiple racks can be put together to form a Pod (Google calls it a Pod, while Nvidia calls it a SuperPOD). A Pod contains from 4 to hundreds of tightly coupled racks, depending on the vendor configuration. For example, a TPU-v3 Pod contains eight racks. The table below shows the key characteristics of the Pods from the different vendors.

From left to right: (1) TPU v3 Pod (2) Nvidia SuperPOD and (3) IPU-POD. images source: [30,27,65]
Pod-to-Pod Comparison

The most important factor in a large-scale system is to ensure the nodes have enough network bandwidth to communicate efficiently. In data-parallel training, this is critical for the all-reduce synchronization that occurs after each training iteration. Google and Nvidia are aware of this, and each vendor uses a different approach. While Google uses an undisclosed custom fabric with a 2D toroidal mesh topology, Nvidia relies on Mellanox 100–200 Gb/s InfiniBand with a fat-tree topology built from InfiniBand switches (see entries 8–9 of the rack table for further details). Cerebras and the Graphcore IPU1 employ standard 100 Gb/s Ethernet links and switches. In the second-generation IPU, Graphcore has harnessed Ethernet tunneling and built a 100 Gb/s IPU-Gateway-Link; they can use a switch-less 3D-ring topology or an Ethernet switch-based network to connect up to 512 racks.

Performance Scaling

Performance scaling as we increase the number of chips for the TPU-v3 vs the GPU V100 (performance is normalized to the 16x-chip performance) on MLPerf-ResNet (left image) and MLPerf-Transformer (right image). Notes: (1) All data points are taken from MLPerf v0.6. (2) Perfect scaling means linear scaling. (3) Sub-batch size per chip = total_batch_size / #chips. (4) The batch sizes of the Nvidia V100 were hard to determine from the MLPerf submission.

The image above shows the performance scaling as we increase the number of chips, for both the TPU-v3 and the GPU V100, on the ResNet-50 and Transformer models. All data points are taken from the MLPerf v0.6 website. Both the TPU and GPU show sub-linear performance scaling on ResNet-50 and Transformer, and similar trends are observed for the other benchmarks. For the TPU, when we move from 16x chips to 128x chips (an 8x increase in chips), MLPerf-ResNet shows just a 6x speedup. Moreover, the scaling gap grows with the number of chips: when moving to 1024x chips (a 64x increase), MLPerf-ResNet shows only a 32x speedup. The primary causes of this gap are twofold. First, as we increase the number of chips, training becomes limited by communication. Second, starting from the 128-chip point, the batch size stays constant (32K) and the sub-batch per chip is decreased instead to increase throughput, which reduces the amount of work per TPU and lowers hardware utilization. As shown on the right side of the image, the scaling gap of the Transformer model is even worse, as the maximum batch size reached is 2K.
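
Those speedups translate directly into a scaling-efficiency figure; here is a quick sketch using the ResNet data points just quoted, relative to the 16-chip baseline.

```python
def scaling_efficiency(speedup: float, n_chips: int, base_chips: int = 16) -> float:
    """Measured speedup divided by the ideal linear speedup over the baseline chip count."""
    return speedup / (n_chips / base_chips)

print(f"{scaling_efficiency(6, 128):.0%}")     # ~75% of linear scaling at 128 chips
print(f"{scaling_efficiency(32, 1024):.0%}")   # ~50% at 1024 chips (cf. the 52% from MLPerf v0.6)
```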

UPDATE (Jan 2021): Based on David Patterson's slides [34], the TPU is able to achieve 77% of perfect scale-up at 1024 chips on ResNet-50 training, not the 52% implied by the MLPerf v0.6 results. This is because MLPerf includes evaluation time. See the slide below from [34] for further details. Thus, in reality, the scaling of TPUs and GPUs is probably better than what was plotted above.
Published TPU scaling vs MLPerf TPU scaling on ResNet50 training. Source: [34]

Other Startups:

A list of some startups and giant companies working on ML training hardware. For a complete list of ML hardware startups, please see this link.

In this article, I mainly focused on ML training hardware, not inference. I also discussed only the vendors that have already shipped products to the market and for which there is enough public information. There are dozens of startups and companies working on ML hardware out there, including the recently acquired Intel Habana Labs, Huawei's Ascend 910, and the skyrocketing SambaNova Systems; however, they had not shipped their products yet (as of July 2020). For a complete list of ML hardware startups, please see this link.

HW startups plateau over time

The AI chip market is anticipated to be around 91B USD by 2025 [45], including both inference and training. All the startups and giant companies seek to grab a large slice of this pie. This is reminiscent of what happened in the past. In the 1990s, a bunch of startups worked on graphics accelerators until the market reached a plateau: a few startups succeeded, some failed, and others were acquired by the successful ones, and the market was ultimately dominated by two companies (Nvidia and AMD's ATI). In the 2000s, networking was a major concern, which pushed engineers to build more efficient networking hardware (e.g., InfiniBand technology, switches, wireless and software-defined networks). That time, several startups succeeded and the market share was divided among them, including Mellanox, Juniper, and Arista, among others. We are undoubtedly witnessing the third wave of hardware startups these days, and this time the hot topic is machine learning. So, let's see how the ML market will play out!

Conclusion and Final Remarks:

In this article, I tried to perform a fair comparison between the different ML hardware platforms on the market today by doing an apples-to-apples match, focusing on efficiency metrics (performance per dollar/unit/watt).

The conclusions of my analysis can be listed as follows:

  • Theoretically speaking, Graphcore seems to be the most efficient; however, it is not obvious how much achievable performance we can obtain when running standardized ML workloads, like MLPerf. In fact, based on a recent study, Graphcore shows low hardware utilization when running GEMM operations.
  • Practically speaking, and based on the data submitted to MLPerf v0.6, the TPU looks more efficient than the GPU for ResNet training; however, this is not the general case, and the best hardware depends largely on the DL model itself. For example, Nvidia's GPU is more efficient on object detection workloads.
  • GPU and TPU have an advantage in running larger DL models.
  • Nvidia GPU provides a more powerful software ecosystem and programmability.
  • I argue that MLPerf should consider the efficiency metrics as key measurements rather than relying only on training time.

Finally, these are some remarks that I would like to point to:

  • ML training seems to be limited by (sorted from most important to least): (1) Memory capacity: memory capacity can affect the hardware's functionality and limit the range of models the user can run. (2) Communication: this includes on-chip, scale-up, and scale-out interconnects. If the compute and memory resources cannot communicate efficiently, then it does not matter how many FLOPS the hardware provides, as the compute resources will sit idle most of the time. (3) Compute: and yes, of course, we will still be limited by computing power.
  • Data vs model parallelism: which one is better? It is hard to answer this question. Scaling the batch size does not always work, and a system relying only on model-parallelism techniques can always be outperformed by a high-throughput system. Thus, a hardware platform should support both efficiently and not trade one off for the other.

Data Sheet and Slides:

An Excel sheet with all the data reported in this article can be found here.

PDF slides summarizing the contents of this article can be found here.

Updates:

Update (Feb 2020): Around the time this article was written, Graphcore announced its second-generation IPU in July 2020. Interestingly, they addressed the two shortcomings discussed in this article. First, the second-generation IPU comes with DDR4 external memory to overcome the limited on-chip SRAM capacity. Second, a custom high-speed fabric within the IPU-POD is used to efficiently scale out IPUs for large-scale data parallelism, similar to the GPU and TPU Pods. I hope these new advances encourage the Graphcore folks to submit their MLPerf results. The article has been updated accordingly to include the IPU2.

Cerebras and Google will also present their next generations of the WSE and TPU at the Hot Chips conference in August 2020. The author will update this article accordingly to keep it up to date with the recent advances.

About the Author:

Mahmoud is a Ph.D. student at Purdue University. He is interested in HPC, Computer Architecture, and Deep Learning. Contact: Linkedin Website

If you enjoy reading my article, please remember to like and share :)

Acknowledgments:

I would like to thank my Ph.D. advisor, Prof. Timothy Rogers, for his insightful feedback on the early versions of this article.

References:

Cerebras:

[1] Cerebras CS-1 Server datasheet: https://secureservercdn.net/198.12.145.239/a7b.fcb.myftpupload.com/wp-content/uploads/2020/03/Cerebras-Systems-Overview.pdf?time=1591708923

[2] Cerebras theoretical and achievable performance on a stencil workload:

https://arxiv.org/pdf/2010.03660.pdf

[3] Cerebras Pricing and Integration, https://www.anandtech.com/show/15838/cerebras-wafer-scale-engine-scores-a-sale-5m-buys-two-for-the-pittsburgh-supercomputing-center

[4] Cerebras Neocortex system, https://www.hpcwire.com/2020/06/09/neocortex-will-be-first-of-its-kind-800000-core-ai-supercomputer/

[5] Inside the Cerebras CS-1 System, https://www.tomshardware.com/news/worlds-largest-chip-gets-a-new-home-cerebras-launches-cs-1-system

[6] Cerebras talk at Hotchip 2018, https://www.hotchips.org/hc31/HC31_1.13_Cerebras.SeanLie.v02.pdf

An example of dataflow on Cerebras “Pipelined Backpropagation at Scale: Training Large Models without Batches” https://arxiv.org/pdf/2003.11666.pdf

Graphcore 1st generation:

[7] Graphcore DELL IPU Server Datasheet, https://cdn2.hubspot.net/hubfs/729091/assets/pdf/Dell%20EMC%20Product%20Brief.pdf

[8] Graphcore Documentation, https://www.graphcore.ai/hubfs/Lead%20gen%20assets/DSS8440%20IPU%20Server%20White%20Paper_2020.pdf

[9] Graphcore vs GPU benchmarking, https://www.graphcore.ai/benchmarks

[10] Graphcore vs GPU benchmarking (2), https://cdn2.hubspot.net/hubfs/729091/Graphcore%20Public%20Benchmarks%20-%20Nov%202019%20.pdf

[11] Graphcore Pricing, https://cirrascale.com/graphcore-cloud-pricing.php

[12] Dissecting the Graphcore IPU Architecture and IPU vs GPU comparison, https://arxiv.org/pdf/1912.03413.pdf

[13] Graphcore Chip details presentation, https://www.kisacoresearch.com/sites/default/files/presentations/14.10_-_graphcore_-_victoria_rege.pdf

[14] Graphcore Chip Idea, https://cdn2.hubspot.net/hubfs/729091/NIPS2017/NIPS%2017%20-%20IPU.pdf

[15] Graphcore Chip details, https://www.hpcwire.com/2017/07/20/graphcore-readies-launch-16nm-colossus-ipu-chip/

NVIDIA GPUs:

[16] NVIDIA DGX A100 Server Datasheet, https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-dgx-a100-datasheet.pdf

[17] NVIDIA DGX-2 V100 Server Datasheet, https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/dgx-2/dgx-2-print-datasheet-738070-nvidia-a4-web-uk.pdf

[18] NVIDIA Ampere A100 Architecture, https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21730-inside-the-nvidia-ampere-architecture.pdf

[19] NVIDIA A100 Rack size and pricing, https://datacenterfrontier.com/nvidia-unveils-beefed-up-ai-hardware-for-data-centers/

[20] NVIDIA DGX A100 details, https://www.hardwarezone.com.sg/tech-news-nvidia-dgx-a100-supercomputer-super-performance-fight-covid-19

[21] NVIDIA A100 Server Integration, https://blog.netapp.com/nvidia-dgx-a100

[22] Running large models on GPU, https://nv-adlr.github.io/MegatronLM

[23] GEMM performance on Volta, https://www.zdnet.com/article/openais-gigantic-gpt-3-hints-at-the-limits-of-language-models-for-ai/

https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9926-tensor-core-performance-the-ultimate-guide.pdf

[24] GEMM performance on Turing, https://www.cse.ust.hk/~weiwa/papers/yan-ipdps20.pdf

[25] GEMM performance on Ampere, https://developer.nvidia.com/blog/cuda-11-features-revealed/

https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21745-developing-cuda-kernels-to-push-tensor-cores-to-the-absolute-limit-on-nvidia-a100.pdf

[26] Nvidia GPU pricing on the cloud, https://cloud.google.com/compute/gpus-pricing

[27] Nvidia A100 SuperPoD datasheet, https://images.nvidia.com/aem-dam/Solutions/Data-Center/nvidia-dgx-superpod-datasheet.pdf

https://blogs.nvidia.com/blog/2020/05/14/dgx-superpod-a100/

[28] Nvidia V100 SuperPoD details, https://nvidianews.nvidia.com/news/nvidia-completes-acquisition-of-mellanox-creating-major-force-driving-next-gen-data-centers

[29] Nvidia NCCL, https://developer.nvidia.com/nccl

Google TPUs:

[30] Google TPU Architecture Documentation, https://cloud.google.com/tpu/docs/system-architecture

[31] TPU v1 original paper, https://arxiv.org/ftp/arxiv/papers/1704/1704.04760.pdf

[32] Benchmarking TPU vs GPU, https://arxiv.org/pdf/1907.10701.pdf

[33] Google papers on TPU in details

“A Domain Specific Supercomputer for Training Deep Neural Networks”, CACM 2019

https://dl.acm.org/doi/pdf/10.1145/3360307

Norrie, Thomas, et al. “The Design Process for Google’s Training Chips: TPUv2 and TPUv3.” IEEE Micro 01 (2021): 1–1.

https://www.computer.org/csdl/magazine/mi/5555/01/09351692/1r50VAsNljq

[34] David Patterson’s Talk at Paul Allen School about TPU v-3 in details and comparison with Volta architecture

https://www.youtube.com/watch?v=VCScWh966u4&t=2297s&ab_channel=PaulG.AllenSchool

[34] TPU-v3 presentation at Hotchip19, https://www.hotchips.org/hc31/HC31_T3_Cloud_TPU_Codesign.pdf

[35] TPU v3 performance on MLPerf, https://cloud.google.com/blog/products/ai-machine-learning/cloud-tpu-pods-break-ai-training-records

[36] TPU pricing, https://cloud.google.com/tpu/pricing, https://cloud.google.com/tpu

MLPerf results:

[37] MLPerf results, https://mlperf.org/training-results-0-6/

MLPerf-train original paper, https://arxiv.org/pdf/1910.01500.pdf

[38] complete list of Nvidia GPU performance for MLPerf, including A100, https://developer.nvidia.com/deep-learning-performance-training-inference

Other startups:

[39] Huawei Ascend 910 chip, https://medium.com/syncedreview/huaweis-first-commercial-ai-chip-doubles-the-training-performance-of-nvidia-s-flagship-gpu-86e4d0078f6f

[40] Intel Habana Training platform in details using standard ethernet, https://habana.ai/training/

https://habana.ai/wp-content/uploads/2019/06/Habana-Offers-Gaudi-for-AI-Training.pdf

[41] Intel acquires Habana, https://newsroom.intel.com/news-releases/intel-ai-acquisition/

[42] SambaNova raises a $250M investment, https://venturebeat.com/2020/02/25/sambanova-systems-raises-250-million-for-software-defined-ai-hardware/, https://www.forbes.com/sites/jilliandonfro/2019/04/01/sambanova-systems-a-startup-in-the-hot-ai-hardware-space-scores-150-million-investment-from-intel-and-alphabet/#624b70532d28

[43] SambaNova system in detail, https://mlhardware.github.io/2020/sambanova.pdf

https://sambanova.ai/

[44] List of all startups, https://github.com/basicmi/AI-Chip

[45] AI Chip market is anticipated to be 91B by 2025

https://www.prnewswire.com/news-releases/artificial-intelligence-ai-chip-market-worth-91-18-billion-by-2025-at-45-2-cagr-allied-market-research-300903452.html

Batch Size Scaling:

[46] measuring large batch size impact on training accuracy, https://mlhardware.github.io/2020/sambanova.pdf

[47] small batch size is better for accuracy, https://arxiv.org/pdf/1804.07612.pdf

[48] large batch size for CNN, https://arxiv.org/pdf/1709.05011.pdf

[49] large batch size for LSTM, https://arxiv.org/pdf/1901.08256.pdf

[50] large batch size for transformers, https://nv-adlr.github.io/MegatronLM

[51] GPT-3 model, https://arxiv.org/pdf/2005.14165.pdf

[52] Microsoft’s NLG, https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/

[53] Model vs Data Parallelism strategies on GPUs, https://arxiv.org/pdf/1907.13257.pdf

Updates:

[54] Graphcore second generation, https://www.graphcore.ai/posts/introducing-second-generation-ipu-systems-for-ai-at-scale

[55] Cerebras and Google at Hotchip 2020, https://www.hotchips.org/program/

[56] Google GShard, https://arxiv.org/pdf/2006.16668.pdf

[57] Google Switch Transformer, https://arxiv.org/pdf/2101.03961.pdf

Graphcore 2nd generation:

[58] IPU2 introduction: https://www.graphcore.ai/posts/introducing-second-generation-ipu-systems-for-ai-at-scale

[59] IPU2 MK2000 server datasheet: https://docs.graphcore.ai/projects/graphcore-ipu-m2000-datasheet/en/latest/_static/IPU-Machine_M2000_datasheet.pdf

[60] IPU2 guide datasheet: https://docs.graphcore.ai/projects/ipu-m2000-build-test/en/latest/_static/GC-000579-UG-1-M2000-direct-attach-build-and-test-guide.pdf

[61] IPU2 benchmarking: https://www.graphcore.ai/posts/graphcore-sets-new-ai-performance-standards-with-mk2-ipu-systems

[62] IPU2 vs GPU benchmarking: https://www.eetasia.com/graphcore-ipu-vs-nvidia-gpus-how-theyre-different/

[63] IPU2 details and pricing: https://wccftech.com/graphcores-colossus-mk2-gc200-7nm-ai-chip-rivals-nvidia-a100-gpu/

[64] IPU2 POD scaling https://www.nextbigfuture.com/2019/02/graphcore-chips-could-speed-up-ai-by-100-times.html

[65] Moor&Insight report about IPU2: https://www.graphcore.ai/hubfs/MK2-%20The%20Graphcore%202nd%20Generation%20IPU%20Final%20v7.14.2020.pdf
