Key Takeaways
- Having a powerful hardware architecture is just part of the solution. In fact, a proper software framework is required to unleash this power for developers and end-users.
- Besides overall architecture improvements, such as an increase in execution units, higher frequencies, and memory bandwidth, the Xe-LP introduces hardware acceleration for low-precision inference on DNN models, which is fully enabled with the OpenVINO™ toolkit.
- One of the signature features of the OpenVINO™ toolkit is “multi-device” execution. With this feature, developers can run inference on a combination of “compute devices” in one system transparently to maximize inference performance.
- By using the multi-device plugin, we can harness the compounding effect of hardware and software improvements. This multi-device mode keeps both the CPU and the integrated GPU busy, utilizing the full system.
Overview
Intel processors are best known for their powerful x86 cores, but most of them also ship with integrated graphics. Integrated graphics represent a potential target for computation offload: you can move computations to the Intel integrated GPU built right into the processor while keeping the CPU side free for interactive tasks or low-latency functions. AI inference workloads can take particular advantage of this kind of offload. You can use the runtime in the Intel® Distribution of OpenVINO™ toolkit to run inference tasks on integrated graphics as if it were any other supported target, such as a CPU.
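As a concrete illustration, the minimal sketch below loads a model on the integrated GPU using the Inference Engine Python API from the toolkit; the model file names, input name, and dummy input are placeholders rather than parts of a real application.

```python
# Minimal sketch: offload inference to the integrated GPU with the
# Inference Engine Python API (OpenVINO(TM) toolkit). Model files are placeholders.
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
print(ie.available_devices)  # typically includes 'CPU' and 'GPU' on these processors

net = ie.read_network(model="model.xml", weights="model.bin")
input_name = next(iter(net.input_info))
input_shape = net.input_info[input_name].input_data.shape

# Load the network on the integrated GPU instead of the CPU.
exec_net = ie.load_network(network=net, device_name="GPU")

# Run a single synchronous inference on dummy data.
dummy = np.zeros(input_shape, dtype=np.float32)
results = exec_net.infer({input_name: dummy})
```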
As part of the previously announced 11th gen Intel® Core™ Processor (formerly code named Tiger Lake), Intel delivers the first representatives with Xe-LP microarchitecture, also known as Gen12LP. Besides overall architecture improvements, such as an increase in execution units, higher frequencies, and memory bandwidth, the Xe-LP introduces hardware acceleration for low-precision inference on DNN models, which is fully enabled with the OpenVINO™ toolkit.
Let’s walk through how the new Intel® GPU architecture may benefit you and look at specific use cases for deep learning.
Xe-LP Microarchitecture Overview
Historically, Intel graphics chips have been divided into generations (GenX), and each generation is subdivided into tiers of increasing performance, denoted GTx. The 11th gen Intel® Core™ Processor has a GT2 chip, which is the highest-performing Intel GPU with the Xe-LP microarchitecture at the time of writing.
The high-end Gen12 GT2 GPU has 96 execution units, or EUs (compared to 64 EUs in Gen11 GT2 and 24 EUs in Gen9 GT2 GPUs). Each of these units has SIMD8-wide floating-point and integer arithmetic logic units (ALUs). In addition, Xe-LP GPUs have a larger L3 cache (3.8 MB) and a separate shared local memory (768 KB) that is no longer part of the L3 cache.
The improvements do not stop at the number of EUs and the larger caches. Xe-LP GPUs can operate at higher frequencies at the same voltage, which improves the performance and power efficiency of all workloads.
Many GPU kernels suffer from low occupancy, meaning the GPU is underutilized. To address this, the Xe-LP GPU can run two execution contexts concurrently, which improves performance in such cases.
The most important feature for neural network inference is a new instruction added in Xe-LP called DP4A. Its specification is similar to the Vector Neural Network Instructions (VNNI) available on Intel CPUs, and it allows 64 operations per EU per clock at INT8 precision.
Figure 1. Architecture Day 2020 presentation showing diverse data types for AI
The compute throughput of the DP4A instruction is twice that of the FP16 multiply-add (MAD) instruction, which translates into significant performance gains for inference. This instruction makes Intel GPUs well suited to running networks quantized to 8 bits.
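To put these numbers in perspective, here is a back-of-the-envelope estimate of the theoretical peak throughput implied by 96 EUs and 64 INT8 operations per EU per clock; the 1.3 GHz clock is an illustrative assumption, not a measured figure.

```python
# Rough theoretical peak throughput for a Gen12 GT2 GPU (illustrative only).
eus = 96                          # execution units in Gen12 GT2
int8_ops_per_eu_per_clock = 64    # via the DP4A instruction
fp16_ops_per_eu_per_clock = 32    # FP16 MAD path, half the DP4A rate
clock_ghz = 1.3                   # assumed GPU frequency for illustration

int8_tops = eus * int8_ops_per_eu_per_clock * clock_ghz / 1000
fp16_tflops = eus * fp16_ops_per_eu_per_clock * clock_ghz / 1000
print(f"~{int8_tops:.1f} INT8 TOPS vs ~{fp16_tflops:.1f} FP16 TFLOPS (theoretical peak)")
```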
Low Precision GPU Runtime
The Intel® Distribution of OpenVINO™ toolkit is a software tool suite that accelerates applications and algorithms with high-performance deep learning inference deployed from the edge to the cloud. In the 2020.1 release, we introduced a redesign of the low-precision flow; however, the runtime supported only CPU devices. The latest standard release, 2021.1, adds a low-precision inference runtime for Gen12 GPUs.
The low-precision pipeline implementation for the GPU is aligned with the CPU flow. In practice, this means users can generate a quantized IR for CPU devices using the Post-Training Optimization Tool (POT) or the Neural Network Compression Framework (NNCF) and then simply run it on the integrated GPU to benefit from the DP4A instruction in the Gen12 ISA.
One of the key advantages of the “FakeQuantize”-based approach is model portability: the same quantized IR can be used by both the CPU and GPU plugins, including GPU generations that predate Xe-LP.
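The sketch below illustrates what that portability looks like in practice: the same quantized IR is loaded by both plugins without modification. The file names are placeholders, and the quantized IR is assumed to have been produced with POT or NNCF beforehand.

```python
# Sketch: one quantized IR, two devices. File names are placeholders.
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model_int8.xml", weights="model_int8.bin")

# On Xe-LP the GPU plugin keeps the low-precision path and uses DP4A;
# on older GPUs the same IR runs in floating point automatically.
exec_net_gpu = ie.load_network(network=net, device_name="GPU")
exec_net_cpu = ie.load_network(network=net, device_name="CPU")
```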
Figure 2. The simplified model flow in the GPU plugin of the Intel® Distribution of OpenVINO™ toolkit.
If the current GPU device doesn’t support Intel® DL Boost technology, low-precision transformations are disabled automatically and the model is executed in its original floating-point precision. Since the plugin is not able to simply skip the FakeQuantize operations inserted into the IR, it executes them explicitly, which naturally adds to the execution time. However, the internal graph optimizer fuses FakeQuantize operations with neighboring operations wherever possible to minimize the quantization overhead on such GPUs and deliver performance close to the non-quantized version of the model.
Multi-Device and Asynchronous Execution
One of the signature features of the OpenVINO™ toolkit is multi-device execution. With this feature, developers can run inference on a combination of devices transparently, as if it were a single accelerator.
The 11th gen Intel® Core™ Processors are no exception: you can run the network on the CPU and GPU together to get their combined throughput. Moreover, this now applies to networks quantized to 8 bits as well.
Thanks to the support for multiple hardware contexts, it is now possible to benefit from running multiple inference requests simultaneously. We described the advantages of this execution model in our previous blog, but for the GPU it had limited applicability because only a single hardware context was available.
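The sketch below shows one way to combine both ideas with the Inference Engine Python API: the “MULTI” device exposes the GPU and CPU as a single virtual accelerator, and several asynchronous inference requests are kept in flight so both devices stay busy. The model files, input data, and request count are placeholders.

```python
# Sketch: multi-device ("MULTI:GPU,CPU") plus asynchronous inference requests.
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model_int8.xml", weights="model_int8.bin")
input_name = next(iter(net.input_info))
input_shape = net.input_info[input_name].input_data.shape

# MULTI presents the GPU and CPU as one device; requests are scheduled
# across both transparently.
exec_net = ie.load_network(network=net, device_name="MULTI:GPU,CPU",
                           num_requests=4)

frame = np.zeros(input_shape, dtype=np.float32)

# Start all requests asynchronously, then collect the results.
for request in exec_net.requests:
    request.async_infer({input_name: frame})
for request in exec_net.requests:
    request.wait()               # blocks until this request completes
    outputs = request.output_blobs
```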
Performance Results
Let’s have a look at the GPU performance of 11th gen Intel® Core™ Processors that can be achieved using the Intel® Distribution of OpenVINO™ toolkit and compare it with one of the most widespread Intel® GPUs, the Gen9 GT2.
Gen9 vs Gen12
Figure 3. Speed up comparison of GPUs in 9th gen Intel® Core™ Processors (formerly code named Coffee Lake) vs 11th gen Intel® Core™ Processors (formerly code named Tiger Lake) in throughput mode. See performance benchmarks for reference.
Figure 4. Frames per second (FPS) comparison of GPUs in 9th gen Intel® Core™ Processors (formerly code named Coffee Lake) vs 11th gen Intel® Core™ Processors (formerly code named Tiger Lake) in throughput mode. See performance benchmarks for reference.
Throughput Mode
Now, let’s check how much performance we get from the throughput mode on the Gen12 GPU.
Figure 5. Performance improvement factors in throughput mode vs latency mode for Intel® Core™ i5-1145G7E iGPU. See performance benchmarks for reference.
As you can see, for big models with many compute operations we gain only ~10% extra performance, since GPU resources are already well utilized by a single stream. For more lightweight models, however, multiple compute streams deliver significant performance improvements.
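For reference, the sketch below shows one way to enable throughput mode on the GPU from the Python API by letting the plugin pick the number of streams and inference requests; the configuration key names and the use of 0 for num_requests are assumptions based on the GPU plugin of this toolkit generation, and the model files are placeholders.

```python
# Sketch: GPU throughput mode via multiple streams (config key names assumed).
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model_int8.xml", weights="model_int8.bin")

exec_net = ie.load_network(
    network=net,
    device_name="GPU",
    config={"GPU_THROUGHPUT_STREAMS": "GPU_THROUGHPUT_AUTO"},  # let the plugin pick the stream count
    num_requests=0,  # 0 asks the plugin to create the optimal number of requests
)
print(len(exec_net.requests), "inference requests created for the GPU streams")
```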
Multi-device Mode
Figure 6. Multi-device mode efficiency for GPU and CPU, Intel® Core™ i5-1145G7E, comparing the performance of individual devices vs using the multi-device mode for INT8 precision. See performance benchmarks for reference.
By using the multi-device plugin, we can harness the compounding effect of hardware and software improvements. This multi-device mode keeps both the available CPU and GPU busy, utilizing the full system.
Conclusion
Having a powerful hardware architecture is just part of the solution. A proper software framework is required to unleash this power for developers and end-users. We have done a tremendous amount of work to enable all features of the OpenVINO™ toolkit on our new GPUs, and these investments are paying off in the form of high performance, improved throughput, and cross-platform portability.
Get the Intel® Distribution of OpenVINO™ toolkit today and start deploying high-performance deep learning applications with write-once, deploy-anywhere efficiency. The latest performance numbers can be found in our documentation. If you have ideas for improving the product, we welcome contributions to the open-source OpenVINO™ toolkit. Finally, join the conversation about all things deep learning and the OpenVINO™ toolkit in our community forum.
Notices & Disclaimers
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.