
Faster, Easier Optimization with Intel® Neural Compressor

MaryT_Intel
Employee

At a Glance

  • New Intel Neural Compressor makes it simpler to optimize trained deep learning (DL) models.
  • Intel Neural Compressor provides low-precision quantization, pruning, knowledge distillation, and other compression techniques.
  • Engineers can increase inference performance and engineering productivity.

Introduction

Good news for deep learning (DL) engineers who want to simplify post-training optimization across multiple frameworks. Intel has launched Intel® Neural Compressor to help developers increase performance and productivity for DL compression on Intel platforms. Formerly known as Intel Low Precision Optimization Tool (LPOT), Intel Neural Compressor now provides pruning, knowledge distillation, and other compression techniques along with low-precision quantization.

Intel Neural Compressor software helps deliver the value of Intel hardware advancements for DL, including Intel Deep Learning Boost (Intel DL Boost) and Intel Advanced Matrix Extensions (Intel AMX). Engineers can explore various neural network compression technologies across different DL frameworks with a few lines of code. Beginning and experienced DL engineers can unlock Intel’s performance features and deliver fast, efficient inference.

Intel Neural Compressor is part of Intel® oneAPI AI Analytics Toolkit (AI Kit). An open-source Python library running on Intel CPUs and GPUs, the tool delivers unified interfaces across DL frameworks for diverse network compression technologies. This tool:

  • Supports automatic, accuracy-driven tuning strategies to help users quickly find the best quantized model.
  • Implements different weight-pruning algorithms to generate pruned models with predefined sparsity goals.
  • Provides knowledge distillation to transfer knowledge from a teacher model to a student model.

This article describes the major compression technologies the Intel Neural Compressor tool supports and lists validated models for each. We also share results from organizations that are using the tool to improve inference performance with minimal loss of accuracy.

Intel® Neural Compressor Architecture

Figure 1 is a high-level diagram of the Intel Neural Compressor tool. The tool is built on top of the deep learning frameworks, taking an FP32 framework model as input and yielding a compressed framework model for deployment. Intel Neural Compressor supports popular network compression technologies, such as quantization, mixed precision, sparsity/pruning, and knowledge distillation, through well-designed, user-facing APIs. You can also create a pipeline that combines these compression methods and executes them automatically and sequentially.

[Figure 1: High-level architecture of Intel Neural Compressor]
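
As a rough illustration of such a pipeline, the sketch below chains pruning and post-training quantization using the 1.x-era experimental Python API. The YAML file names, the fp32_model object, and the exact method names (append, the callable scheduler) are assumptions drawn from that API and may differ between releases.

    # Sketch: sequentially combining pruning and quantization (experimental 1.x API).
    from neural_compressor.experimental import Pruning, Quantization, Scheduler, common

    prune = Pruning("prune_conf.yaml")            # pruning pass configured in YAML
    quantizer = Quantization("quant_conf.yaml")   # post-training quantization pass

    scheduler = Scheduler()
    scheduler.model = common.Model(fp32_model)    # fp32_model: your trained framework model
    scheduler.append(prune)                       # prune first...
    scheduler.append(quantizer)                   # ...then quantize the pruned model
    optimized_model = scheduler()                 # run the pipeline end to end
    optimized_model.save("./compressed_model")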

Quantization

Intel Neural Compressor supports post-training static quantization, post-training dynamic quantization, and quantization-aware training by unifying the different quantization APIs across different deep learning frameworks. The tool also provides an auto-tuning mechanism to help the user quickly find the best combination of performance and accuracy.

Figure 2 shows the workflow that the Intel Neural Compressor’s quantization component uses to generate a quantized model. The tool queries the framework’s quantization capability and constructs an operator-wise (also known as layer-wise) quantization tuning space.

[Figure 2: Quantization workflow in Intel Neural Compressor]

Guided by the tuning strategy, the component selects a quantization configuration for each operator and generates a quantized model by invoking the framework's quantization interface. The component then evaluates the accuracy of this quantized model and determines whether it meets the predefined accuracy goal.

If the quantized model doesn’t meet the accuracy goal, the tuning strategy selects the next quantization configuration combination and generates a new quantized model. This flow continues until the component finds a quantized model that meets the accuracy goal.

As an alternative to auto-tuning, Intel Neural Compressor has a performance-only mode that directly generates an INT8 model without tuning. This feature can be useful for benchmarking purposes.
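
For orientation, here is a minimal sketch of launching accuracy-driven post-training quantization with the 1.x-era experimental API. The conf.yaml file (which would hold the accuracy criterion and tuning strategy), the calibration dataloader, and the evaluation function are placeholders you would supply.

    # Sketch: accuracy-driven post-training quantization (experimental 1.x API).
    # conf.yaml defines the accuracy goal (e.g., at most 1% relative loss) and the tuning strategy.
    from neural_compressor.experimental import Quantization, common

    quantizer = Quantization("conf.yaml")
    quantizer.model = common.Model(fp32_model)   # trained FP32 framework model
    quantizer.calib_dataloader = calib_loader    # data used to calibrate activation ranges
    quantizer.eval_func = evaluate_accuracy      # returns a scalar accuracy for each candidate
    q_model = quantizer.fit()                    # tuning stops once the accuracy goal is met
    q_model.save("./int8_model")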

For detailed examples, refer to the Intel Neural Compressor GitHub repo.

Validated INT8 Models

Figure 3 shows a subset of the performance and accuracy data for validated quantized models produced by Intel Neural Compressor. As Figure 3 indicates, most users can expect to meet a relative 1 percent accuracy goal with a promising performance speedup. A full model list is here.

[Figure 3: Performance and accuracy of validated INT8 models]

Pruning

In addition to quantization, Intel Neural Compressor supports magnitude pruning, pattern lock, and gradient sensitivity pruning. Users can combine approaches, for example, pruning followed by post-training quantization, or pruning during quantization-aware training. Here's a summary of these approaches (a usage sketch follows the list):

 

  • Basic magnitude. The algorithm prunes the weights with the lowest absolute values at each layer to reach a given sparsity target.
  • Gradient sensitivity. The algorithm prunes the head, intermediate layers, and hidden states in a natural language processing (NLP) model according to an importance score calculated by following the paper FastFormers.
  • Pattern lock. The algorithm takes a sparse model as input and starts to fine-tune it. It locks the sparsity pattern by keeping the zero values in the weight tensors at zero after each weight update during training.
  • Pruning followed by post-training quantization. The algorithm executes unstructured pruning and then executes post-training quantization.
  • Pruning during quantization-aware training. The algorithm executes unstructured pruning during quantization-aware training.
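
As a rough sketch of how a pruning run is launched (again assuming the 1.x-era experimental API), the snippet below uses placeholder names for the YAML configuration, training loop, and evaluation function; the pruning algorithm and sparsity target themselves live in the YAML, and the attribute names (train_func, eval_func) may vary between releases.

    # Sketch: pruning toward a YAML-defined sparsity target (experimental 1.x API).
    from neural_compressor.experimental import Pruning, common

    prune = Pruning("prune_conf.yaml")        # YAML sets the pruning algorithm, sparsity target, schedule
    prune.model = common.Model(fp32_model)    # trained framework model to be pruned
    prune.train_func = fine_tune              # user-supplied training loop; pruning hooks run inside it
    prune.eval_func = evaluate_accuracy       # reports the accuracy of the pruned model
    pruned_model = prune.fit()                # returns the sparse model
    pruned_model.save("./pruned_model")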

Validated Pruned Models

Table 1 shows a subset of the performance and accuracy data for validated pruned models produced by Intel Neural Compressor. The full results are here.

[Table 1: Performance and accuracy of validated pruned models]

Knowledge Distillation

Intel Neural Compressor’s knowledge distillation component transfers knowledge from a teacher model to a student model. Figure 4 depicts the knowledge distillation workflow.

[Figure 4: Knowledge distillation workflow]
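
A minimal sketch of this flow, assuming the same 1.x-era experimental API, is shown below; the teacher and student model objects, the training loop, and the evaluation function are placeholders, and the exact attribute names may differ by release.

    # Sketch: distilling a teacher model into a smaller student (experimental 1.x API).
    from neural_compressor.experimental import Distillation, common

    distiller = Distillation("distill_conf.yaml")    # YAML configures the distillation loss and schedule
    distiller.model = common.Model(student_model)    # the smaller student to be trained
    distiller.teacher_model = common.Model(teacher_model)
    distiller.train_func = distillation_train_loop   # training loop that applies the distillation criterion
    distiller.eval_func = evaluate_accuracy
    distilled_student = distiller.fit()
    distilled_student.save("./distilled_model")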

Validated Distilled Models

Table 2 provides a subset of the performance and accuracy data for validated distilled models produced by Intel Neural Compressor. More distilled models are in progress.

[Table 2: Performance and accuracy of validated distilled models]

Backends

Intel Neural Compressor is built on popular frameworks, including Intel Optimized TensorFlow, stock TensorFlow, PyTorch, MXNet, and ONNX Runtime, and also includes a built-in acceleration library named Engine.

Engine

Engine is a high-performance, lightweight, open-source, domain-specific inference acceleration library for deep learning. The first supported model domain is NLP. Engine has accelerated many popular NLP models, such as BERT, DistilBERT, RoBERTa, and ALBERT, with plans to support more models from the Hugging Face AI community.

Figure 5 shows the architecture of the Engine acceleration library, including its two major components: the model converter and the model executor. The model converter parses a framework model, generates graph IR, performs graph optimization, and produces an optimized model. The model executor deploys the optimized model in a bare-metal runtime environment.

[Figure 5: Engine acceleration library architecture]

Productivity Improvements

Based on customer feedback and experience enabling approximately 200 models, Intel Neural Compressor can significantly improve development efficiency, increasing productivity by nearly 10 percent. The tool achieves this through a well-designed pipeline that avoids complex calibration methods, handcrafted quantization recipes, and blind tuning. Users need to write only a few lines of code to launch the quantization process and generate a quantized model. Figure 6 summarizes the resulting productivity improvements.

[Figure 6: Productivity improvements with Intel Neural Compressor]

Real-World Examples

Example 1. Particle Physics at CERN1

CERN, the European Organization for Nuclear Research, is one of the world's largest and most respected centers for scientific research. CERN’s openlab team is using a 3D Generative Adversarial Network (GAN) to simulate calorimeter detectors, creating particle showers and measuring the energy of the particles produced in collisions.

CERN used TensorFlow Lite INT8 as the reference implementation to compare with Intel Neural Compressor. The CERN team found that LeakyReLU wasn’t supported in TensorFlow Lite’s INT8 quantization, and that TensorFlow Lite also didn’t support transposed convolutional layers for up-sampling. After switching to Intel Neural Compressor, the team achieved up to a 1.8x performance speedup with less accuracy loss than with TensorFlow Lite. Figure 7 summarizes CERN’s results.

[Figure 7: CERN 3DGAN quantization results]

Example 2. NLP for Alibaba’s End-to-End AI Platform2

Transformer is a key model used in Alibaba’s end-to-end Machine Learning (ML) Platform for AI (PAI). The platform is widely used in real-world NLP tasks, serving millions of users through Alibaba’s online services. Low latency and high throughput are key to Transformer’s success, and 8-bit low precision is a promising technique to meet such requirements.

Intel DL Boost offers powerful capabilities for 8-bit low-precision inference on AI workloads. Using Intel Neural Compressor to access the capabilities of Intel DL Boost, Alibaba optimized 8-bit inference performance while significantly reducing accuracy loss. See Figure 8.

[Figure 8: Alibaba Transformer INT8 inference results]

Example 3. 3D Face Reconstruction at Tencent Games3

Tencent Games is working with Intel to build a new 3D digital face reconstruction solution. To deliver the performance and capabilities for this solution, Tencent Games is using 3rd Gen Intel Xeon Scalable processors’ built-in AI acceleration through Intel DL Boost. Together with Intel Neural Compressor, these technologies significantly improve inference efficiency while ensuring accuracy and precision.

By quantizing the Position Map Regression Network from FP32 inference down to INT8, Tencent Games improves inference efficiency and provides a practical solution for 3D digital face reconstruction. Figure 9 shows the increase in inference efficiency that Tencent Games achieved.

[Figure 9: Tencent Games 3D face reconstruction inference efficiency]

Ecosystem Adoption

Popular deep learning projects are actively exploiting the benefits of Intel Neural Compressor. Hugging Face, ONNX Model Zoo, and Tencent Cloud are three examples.

Hugging Face

Hugging Face is an open-source provider of NLP technologies. The Hugging Face community has announced the Optimum library, which provides performance optimization tools based on efficient AI hardware and collaboration with hardware partners. Hugging Face says that its goal is to turn ML engineers into ML optimization wizards.

                     

Hugging Face integrated Intel Neural Compressor into this library as a foundational component. Intel’s collaboration with Hugging Face yields hardware-specific optimized model configurations and artifacts, which will be available to the AI community via the Hugging Face Model Hub. Using Intel Neural Compressor, Optimum, and hardware-optimized models can speed the work of developing efficient production workloads, which represent much of the aggregate energy spent on ML.

Figure 10 shows how to easily quantize Transformers for Intel Xeon CPUs with Optimum.4 For more examples, please refer to the Hugging Face blog.

[Figure 10: Quantizing Transformers for Intel Xeon CPUs with Optimum]
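
Because the original code screenshot is not reproduced here, the hedged sketch below shows the same idea using the Neural Compressor API directly on a Hugging Face Transformers model, rather than the Optimum wrappers pictured in Figure 10; the checkpoint name, calibration dataloader, and evaluation function are illustrative placeholders.

    # Sketch: quantizing a Hugging Face Transformers model with Intel Neural Compressor.
    # (Figure 10 shows the equivalent flow through the Optimum library.)
    from transformers import AutoModelForSequenceClassification
    from neural_compressor.experimental import Quantization, common

    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased-finetuned-sst-2-english")  # illustrative checkpoint

    quantizer = Quantization("bert_quant_conf.yaml")  # accuracy goal and tuning strategy live in the YAML
    quantizer.model = common.Model(model)
    quantizer.calib_dataloader = sst2_calib_loader    # placeholder calibration dataloader
    quantizer.eval_func = glue_eval_func              # placeholder: returns task accuracy
    int8_model = quantizer.fit()
    int8_model.save("./int8_distilbert")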

ONNX Model Zoo

ONNX Model Zoo is a collection of pre-trained models in the ONNX format contributed by community members. Before Intel Neural Compressor, all the models in the zoo were FP32; none were quantized models that take advantage of Intel DL Boost, including the AVX-512 Vector Neural Network Instructions (VNNI). Intel Neural Compressor added quantization support to ONNX Model Zoo by uploading quantized models and providing instructions on how to quantize such models.
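
The same quantization interface also accepts an ONNX model, which is roughly how the zoo models were produced; in the sketch below the model path, calibration dataloader, and evaluation function are placeholders, and the ONNX Runtime backend would be selected in the YAML configuration.

    # Sketch: post-training quantization of an ONNX model (experimental 1.x API).
    from neural_compressor.experimental import Quantization, common

    quantizer = Quantization("onnx_quant_conf.yaml")       # YAML selects the ONNX Runtime backend and accuracy goal
    quantizer.model = common.Model("resnet50-v1-12.onnx")  # placeholder path to an FP32 zoo model
    quantizer.calib_dataloader = imagenet_calib_loader     # placeholder calibration data
    quantizer.eval_func = top1_accuracy                    # placeholder evaluation function
    q_model = quantizer.fit()
    q_model.save("./resnet50-v1-12-int8.onnx")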

                                                              

Table 3 shows the accuracy and performance improvements of the published quantized models. On average, the quantized models ran approximately twice as fast with minimal loss of accuracy.

Table 3. Accuracy and Performance Data for Quantized ONNX Model Zoo Models


To learn more about Intel Neural Compressor and download the source code, visit https://github.com/intel/neural-compressor

Tencent Cloud

Tencent Cloud is a secure, reliable, and high-performance cloud computing service provided by Tencent. The Intel® Neural Compressor (INC) image published on the Tencent Cloud marketplace allows a broad set of users to benefit from the performance acceleration delivered by Intel hardware features (Intel DL Boost VNNI and Intel AMX) on Intel® Xeon® Scalable processors. The performance speedups and accuracy shown in Figure 3 can be reproduced with this image on Intel® Xeon® Scalable processor instances in Tencent Cloud. The image is updated regularly to provide more examples and features.

Configuration Details

Validated Models of Intel® Neural Compressor on TensorFlow 2.5.0 and PyTorch 1.9.0+cpu Throughput Performance and Accuracy Loss on 3rd Generation Intel® Xeon® Processor Scalable Family

Test Configuration: Test by Intel as of 09/29/2021, Framework: TensorFlow v2.5.0 and PyTorch 1.9.0+cpu. Platform: Intel Xeon Platinum 8280 processor, Cascade Lake architecture. 1 node, 2 sockets, 28 cores/socket, 56 threads/socket. HT On. Intel Turbo Boost On. System DDR Memory Configuration: 12 slots / 16GB / 2933. BIOS: SE5C620.86B.02.01.0013.121520200651, OS: CentOS Linux 8.3, Kernel: 4.18.0-240.22.1.el8_3.x86_64, Datatype: INT8

CERN 3DGAN Model on TensorFlow 2.3.0 Throughput Performance on 3rd Generation Intel® Xeon® Processor Scalable Family

Test Configuration: Framework: TensorFlow v2.3. Platform: Intel Xeon Platinum 8280 processor, Cascade Lake architecture. 1 node, 2 sockets, 28 cores/socket, 56 threads/socket. HT On. Intel Turbo Boost On. System DDR Memory Configuration: 12 slots / 16GB / 2933. OS: CentOS Linux 7, Kernel: 3.10.0-957.el7.x86_64.

 

Alibaba PAI NLP Transformer Model on PyTorch 1.7.1 Throughput Performance on 3rd Generation Intel® Xeon® Processor Scalable Family

Baseline Configuration: Test by Intel as of 03/19/2021, 2-node, 2x Intel® Xeon® Platinum 8269C Processor, 26 cores, HT On, Turbo ON, Total Memory 192 GB (12 slots/ 16 GB/ 2933 MHz), BIOS: SE5C620.86B.02.01.0013.121520200651(0x4003003), CentOS 8.3, 4.18.0-240.1.1.el8_3.x86_64, gcc 8.3.1 compiler, Transformer Model, Deep Learning Framework: PyTorch 1.7.1, https://download.pytorch.org/whl/cpu/torch-1.7.1%2Bcpu-cp36-cp36m-linux_x86_64.whl, BS=1, Customer Data, 26 instances/2 sockets, Datatype: FP32/INT8

 

New Configuration: Test by Intel as of 03/19/2021, 2-node, 2x Intel® Xeon® Platinum 8369B Processor, 32 cores, HT On, Turbo ON, Total Memory 512 GB (16 slots / 32GB/ 3200 MHz), BIOS: WLYDCRB1.SYS.0020.P92.2103170501 (0xd000260), CentOS 8.3, 4.18.0-240.1.1.el8_3.x86_64, gcc 8.3.1 compiler, Transformer Model, Deep Learning Framework: PyTorch 1.7.1, https://download.pytorch.org/whl/cpu/torch-1.7.1%2Bcpu-cp36-cp36m-linux_x86_64.whl, BS=1, Customer Data, 32 instances/2 sockets, Datatype: FP32/INT8

Tencent 3D Digital Face Reconstruction Model on TensorFlow 2.4.0 Throughput Performance on 3rd Generation Intel® Xeon® Processor Scalable Family

Baseline Configuration: Test by Intel as of 3/19/2021, 2S Intel® Xeon® Platinum 82XX Processor (Cascade Lake), 24-core/48-thread, Turbo Boost on, Hyper-Threading on; memory: 12x16GB DDR4 2933; storage: Intel® SSD x1; NIC: Intel® Ethernet Network Adapter X722 x1; BIOS: SE5C620.86B.0D.01.0438.032620191658 (ucode: 0x5003003); OS: CentOS Linux 8.3; Kernel: 4.18.0-240.1.1.el8_3.x86_64; network model: PRN network; data set: Dummy and 300W_LP, 12 instances/2 sockets; deep learning framework: Intel® Optimization for TensorFlow 2.4.0; oneDNN: 1.6.4; LPOT: 1.1; model data format: FP32.

Test Configuration 1: Test by Intel as of 3/19/2021, 2S Intel® Xeon® Platinum 83XX Processor (Ice Lake), 36-core/72-thread, Turbo Boost on, Hyper-Threading on; memory: 16x16GB DDR4 3200; storage: Intel® SSD x1; NIC: Intel® Ethernet Controller X550T x1; BIOS: SE5C6200.86B.3020.P19.2103170131 (ucode: 0x8d05a260); OS: CentOS Linux 8.3; Kernel: 4.18.0-240.1.1.el8_3.x86_64; network model: PRN network; data set: Dummy and 300W_LP, 18 instances/2 sockets; deep learning framework: Intel® Optimization for TensorFlow 2.4.0; oneDNN: 1.6.4; LPOT: 1.1; model data format: FP32.

Test Configuration 2: Test by Intel as of 3/19/2021, 2S Intel® Xeon® Platinum 83XX Processor (Ice Lake), 36-core/72-thread, Turbo Boost on, Hyper-Threading on; memory: 16x16GB DDR4 3200; storage: Intel® SSD x1; NIC: Intel® Ethernet Controller X550T x1; BIOS: SE5C6200.86B.3020.P19.2103170131 (ucode: 0x8d05a260); OS: CentOS Linux 8.3; Kernel: 4.18.0-240.1.1.el8_3.x86_64; network model: PRN network; data set: Dummy and 300W_LP, 18 instances/2 sockets; deep learning framework: Intel® Optimization for TensorFlow 2.4.0; oneDNN: 1.6.4; LPOT: 1.1; model data format: INT8.

 

Quantized ONNX Model Zoo Models on ONNX-Runtime 1.8.0 Latency and Accuracy Loss on 3rd Generation Intel® Xeon® Processor Scalable Family

Test Configuration: Test by Intel as of 09/29/2021, Framework: ONNX-runtime 1.8.0. Platform: Intel Xeon Platinum 8280 processor, Cascade Lake architecture. 1 node, 2 sockets, 28 cores/socket, 56 threads/socket. HT On. Intel Turbo Boost On. System DDR Memory Configuration: 12 slots / 16GB / 2933. BIOS: SE5C620.86B.02.01.0013.121520200651, OS: CentOS Linux 8.3, Kernel: 4.18.0-240.22.1.el8_3.x86_64, Datatype: INT8


1Source: Reduced Precision Strategies for Deep Learning: 3DGAN Use Case, Florian Rehm, et al., 4th IML Machine Learning Workshop (October 21st, 2020) (https://indico.cern.ch/event/852553/contributions/4059283/attachments/2126838/3581708/Rehm_Florian-IML-Reduced_Precision.pdf)

2For further detail, please see https://www.intel.com/content/www/us/en/artificial-intelligence/posts/alibaba-lpot.html

3For further detail, please see https://www.intel.com/content/www/us/en/artificial-intelligence/posts/tencent-3d-digital-face-reconstruction.html

4Source: How to easily quantize Transformers for Intel Xeon CPUs with Optimum (https://huggingface.co/blog/hardware-partners-program#%F0%9F%92%A1-how-intel-is-solving-quantization-and-more-with-neural-compressor)

 

Disclaimers

Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex. Your costs and results may vary.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates.  No product or component can be absolutely secure.



Intel technologies may require enabled hardware, software, or service activation.

Intel doesn’t control or audit third-party data. Consult other sources to evaluate accuracy.

                                                                                                       
© Intel Corporation.  Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.  Other names and brands may be claimed as the property of others.

About the Author
Mary is the Community Manager for this site. She likes to bike and to do college and career coaching for high school students in her spare time.