In the artificial intelligence chip market, and especially in training chips, Nvidia is the undisputed leader.

Thanks to its years of investment in general-purpose GPU computing and the CUDA ecosystem, Nvidia has won the vast majority of the open market for AI training chips. With the exception of Google’s own TPU, which ships in substantial volume for internal use, other AI chips have so far been unable to shake Nvidia’s position in the training market. But because the market is enormous, many manufacturers are investing in it, hoping to break this pattern and claim a piece of the pie. Graphcore, from the UK, is one of the strongest challengers.

In the middle of this year, Graphcore released its new-generation IPU chip, the Colossus MK2 GC200 IPU, along with a system solution equipped with four MK2 IPUs, the IPU-Machine M2000 (IPU-M2000). According to the company, the product can scale up to 1,024 IPU-PODs (512 racks), for a maximum of 64,000 MK2 IPUs and 16 ExaFLOPs of FP16 compute.
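The cluster-scale figure is easy to sanity-check. A minimal sketch, assuming Graphcore’s commonly quoted figure of roughly 250 TFLOPS of FP16 compute per GC200 chip (a per-chip figure not stated in this article):

```python
# Sanity check of the quoted cluster-scale compute.
# Assumption (not from this article): ~250 TFLOPS FP16 per MK2 GC200 IPU.
PER_CHIP_TFLOPS_FP16 = 250
NUM_IPUS = 64_000

total_tflops = NUM_IPUS * PER_CHIP_TFLOPS_FP16   # total FP16 TFLOPS
total_exaflops = total_tflops / 1_000_000        # 1 ExaFLOP = 1,000,000 TFLOPS

print(total_exaflops)  # 16.0, matching the 16 ExaFLOPs quoted above
```

Under that assumption, the 64,000-IPU maximum lines up exactly with the advertised 16 ExaFLOPs.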

Judging from the basic numbers, the IPU poses an unprecedented threat to Nvidia. In a set of benchmarks recently released by Graphcore, systems based on the chip also lead the reigning king of AI chips in many applications.

  Leading the GPU on multiple fronts

Jin Chen, chief engineer of Graphcore China and an AI algorithm scientist, told reporters at a recent media conference that Graphcore’s latest AI computing systems, the IPU-M2000 and the vertically scalable IPU-POD64, outperform Nvidia’s A100 (in DGX form) in both training and inference of a range of popular models.

  First, the training side:

From the data Jin Chen provided, we can see that in end-to-end training of BERT-Large (a Transformer-based natural language processing model), the NVIDIA DGX-A100 needs 69.5 hours. On the IPU-POD64, end-to-end training of BERT-Large in PopART takes only 13.2 hours.

  NVIDIA’s strongest challenger: Graphcore conquers developers with data

“In this way, compared to one DGX-A100, BERT-Large achieves a 5.3-fold improvement on the IPU-POD64,” Jin Chen told reporters. She further pointed out that even compared to three DGX-A100s, one IPU-POD64 achieves a 1.8x improvement. “One IPU-POD64 and three DGX-A100s are basically the same in power and price, yet it delivers nearly twice the performance, which is a very significant advantage,” Jin Chen added.
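The quoted speedups follow directly from the reported wall-clock times. A quick check, using only the numbers reported above (the three-DGX figure assumes near-linear scaling of the GPU system, an assumption on our part):

```python
# Reproduce the quoted BERT-Large training speedups from the reported times.
dgx_a100_hours = 69.5    # end-to-end on one NVIDIA DGX-A100 (as reported)
ipu_pod64_hours = 13.2   # end-to-end on one IPU-POD64 (as reported)

# Speedup over a single DGX-A100.
print(round(dgx_a100_hours / ipu_pod64_hours, 1))        # 5.3

# Speedup over three DGX-A100s, assuming near-linear GPU scaling.
print(round(dgx_a100_hours / 3 / ipu_pod64_hours, 1))    # 1.8
```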

According to Jin Chen, the IPU’s training performance on Deep Voice 3 is also significantly better than NVIDIA’s GPU: as shown in the chart above, the throughput of the IPU-M2000 is 13.6 times that of the latest GPU.

When it comes to machine vision training, the IPU is not inferior either.

As shown in the figure above, on the familiar ResNet-50, the IPU-M2000 delivers a 2.6x throughput improvement over the A100. On ResNet-101, the IPU-M2000 achieves a 3.7x throughput improvement over the A100.

In EfficientNet-B4 training, the IPU-M2000 achieves an 18x performance improvement over the A100. From Jin Chen’s introduction, we learn that such a leap is possible mainly because EfficientNet is built from depthwise-separable convolutions whose kernels are relatively small; on a GPU this drives up scheduling overhead and drives down operator utilization, whereas such operators may map better onto the IPU. ResNet-50, by contrast, is composed essentially of standard convolutions.

“If the operators are small and numerous, the scheduling overhead on the GPU also brings the overhead of moving data in and out of HBM memory, which can degrade performance. This also shows, from another angle, that moving new generations of models to the IPU is actually the more general path,” Jin Chen said.

Having seen the IPU’s advantages in training, let’s look at its strong performance in inference, starting with EfficientNet. Jin Chen said the model family, developed by Google in 2019, comes in eight sizes: B0 is the smallest, with roughly 5 million parameters, while B7 is the largest, with roughly 60 to 70 million parameters.

“Under the two different frameworks of PyTorch and TensorFlow, the throughput of EfficientNet-B0 on one IPU-M2000 reaches the tens of thousands, with latency well under 5 milliseconds. On the latest GPU, even with latency pushed to its maximum, throughput falls far short of the tens-of-thousands level, which fully reflects the IPU’s latency advantage,” Jin Chen told reporters.

“In BERT-Large inference, with both IPU and GPU at their lowest latency, the IPU-M2000 also achieves 3.4 times the throughput of the A100,” Jin Chen further pointed out.

According to Jin Chen, the IPU-M2000 also achieves far better latency and throughput than the GPU in LSTM inference and ResNeXt-101 inference: the former delivers a throughput improvement of more than 600x at lower latency, while the latter improves throughput by 40x and cuts latency by a factor of 10.

Commenting on the test results, Matt Fyles, senior vice president of software at Graphcore, said that this comprehensive set of benchmarks shows that Graphcore’s IPU-M2000 and IPU-POD64 outperform GPUs on many popular models. He further noted that benchmarks for novel models such as EfficientNet are particularly instructive, as they demonstrate that AI is moving more and more toward the specialized architecture of IPUs rather than the traditional designs of graphics processors.

  Hardware and software are the foundation

From the introduction above, the strength of the Graphcore IPU, and of systems built on it, is clear. There is no doubt that Graphcore’s uniquely designed IPU chip is the foundation. Beyond the chip, however, Graphcore has built an extensible hardware ecosystem around it, and a software ecosystem that makes developers’ work easier; together, these are what power Graphcore’s challenge to Nvidia.

Lu Tao, senior vice president and general manager of Graphcore China, told reporters that Graphcore’s IPU-POD64 is a solution consisting of 16 IPU-M2000s, and the solution has been delivered globally. One of the advantages of this solution is the decoupling of x86 and IPU computing.

“In the IPU-M2000 we use IPU-Fabric to dynamically match IPUs with x86 server chips. When you use the IPU for computer vision applications, you can set the x86 ratio somewhat higher. For natural language processing workloads, one x86 server may drive one IPU-POD64, or even two. That is what decoupling buys you,” Lu Tao said by way of example.

In addition, the IPU-POD64 is one of the very few products on the market today that can scale both vertically and horizontally, making it an excellent AI computing platform.

According to Lu Tao, vertical expansion means the IPU-POD64 offers software-transparent scaling from one IPU-M2000 to one IPU-POD16 (4 IPU-M2000s), and on to one IPU-POD64 (16 IPU-M2000s). In other words, software compiled for one IPU-M2000 can run unchanged on the expanded IPU-POD64.

Asked why the expansion stops at IPU-POD64, Lu Tao said this reflects their exchanges with many leading Internet companies, in whose view most current single workloads will not exceed one IPU-POD64. That is, for today’s most mainstream workloads, one IPU-POD64 with software-transparent scaling lets most engineers avoid dealing with distributed machine learning, distributed frameworks, and distributed communication altogether.

“By contrast, if you expand a machine such as the DGX-A100 from 1 to 4, you need a distributed machine learning framework and must adapt your algorithm model accordingly before it can run on the new system,” Lu Tao continued.

From the perspective of horizontal expansion, multiple IPU-POD64s can support AI computing clusters consisting of up to 64,000 IPUs. This provides developers with more options.

After introducing the hardware expansion capability of Graphcore’s IPU, Lu Tao also shared the company’s progress in software. First, he mentioned that when the company released Poplar SDK 1.4, it also released a production version of PyTorch for IPU.

“In PyTorch code, Graphcore introduces a lightweight interface called PopTorch. Through this interface, users apply a lightweight wrapper to their existing PyTorch model; once wrapped, the model can execute on the IPU as well as on the CPU. The current Poplar SDK 1.4 supports both model parallelism and data parallelism.”

He further pointed out that this wrapping lets the model generate an intermediate representation compatible with both the IPU and PyTorch, which PopART then compiles into a binary that executes on the IPU. “Poplar SDK 1.4 supports scaling a model out from 1 IPU to 64 IPUs; the next-generation Poplar SDK may scale out to 128 IPUs,” added Jin Chen.

In addition to further improving its own software, Graphcore is also conducting open source-related cooperation with Microsoft and Alibaba Cloud. In the words of Lu Tao, the purpose of their cooperation with the two is to enable users to achieve the smoothest possible migration between GPU and IPU from the perspective of AI compilation.

Jin Chen also explained that NNFusion is a project from Microsoft Research Asia. Its purpose is to spare users repetitive interface work when developing on different chip platforms, allowing models to run seamlessly on chips from different hardware vendors.

“In the middle of the figure above, ideally, NNFusion does the cross-platform work: it can take in models generated by TensorFlow, PyTorch, or other frameworks, and users need only the one NNFusion interface to connect to different AI chips for training or inference,” Jin Chen told reporters.

As for Alibaba Cloud’s HALO, its intent is the same as NNFusion’s: to build an overall framework that spans AI frameworks above and connects to chips from different hardware vendors below through a common hardware interface, ODLA. According to Jin Chen, Alibaba Cloud’s goal is to take different models, whether TensorFlow, ONNX, or PyTorch, and run them on a system or cluster with one click.

When asked for his views on the competition, Lu Tao told reporters that, in his opinion, Graphcore’s only real rival in the AI chip market is Nvidia, mainly because of the software and hardware ecosystem for AI-accelerated computing, spanning the GPU and CUDA, that Nvidia has built together with developers and communities over the years. Even so, Graphcore remains confident about the future: on the one hand, its processors have already demonstrated their value in a range of deployments; on the other, Graphcore is solving problems that GPUs cannot.

“At present there is certainly still a gap between us and NVIDIA in volume and ecosystem. But as long as we run faster in the areas we focus on, the distance will keep shrinking, and in some areas we may even surpass Nvidia. Graphcore hopes that within the next few years it can truly become, in shipments and volume, the leading company besides Nvidia in large-scale data center deployment of AI training and inference. Those are our short- and medium-term goals,” Lu Tao said in closing.
