Easily Optimize Deep Learning with 8-Bit Quantization

Stephanie_Maluso · 03-08-2022

Discover how to use the Neural Network Compression Framework of the OpenVINO™ toolkit for 8-bit quantization in PyTorch

Authors: Alexander Kozlov, Yury Gorbachev, Alexander Suslov, Vasily Shamporov, and Nikolay Lyalyushkin

If you come to my house, you might hear fairytales. All over the world, there are different voice activated assistants that can do extraordinary things. My daughter loves to get ours to read stories about wonderful worlds.

It’s almost like we are living in a sci-fi world ourselves. I am amazed to think how fast voice recognition has become mainstream. Twenty years ago, films with computers that understood speech seemed far-fetched. Today, we routinely use voice recognition to play music or set alarms. With the continued development of hardware and software, I can imagine that in three years’ time, adults might be able to enjoy entertaining conversations with their voice assistants.

There have been similar advances in computer vision, which is now being used for safety, security, and smart city applications. As with speech recognition, the applications can be more responsive if the deep learning inference is carried out on the edge device. The network connection to a data center introduces an unavoidable lag.

Edge devices have limited resources, though, so the deep learning models need to be optimized to get the best performance.

One approach is quantization, converting the 32-bit floating point numbers (FP32) used for parameter information to 8-bit integers (INT8). For a small loss in accuracy, there can be significant savings in memory and compute requirements.
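The mapping from FP32 to INT8 can be sketched in a few lines of plain Python. This is a generic affine (scale and zero point) scheme chosen for illustration, not the exact scheme NNCF applies; the helper names `quantize_int8` and `dequantize_int8` and the sample weights are hypothetical.

```python
def quantize_int8(values):
    """Map floats to signed 8-bit integers using one scale and one zero point."""
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # keep 0.0 exactly representable
    scale = (hi - lo) / 255 or 1.0        # 256 levels span 255 steps
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize_int8(q, scale, zero_point):
    """Recover approximate float values from the 8-bit integers."""
    return [(x - zero_point) * scale for x in q]


weights = [-0.62, -0.1, 0.0, 0.33, 1.27]   # made-up sample values
q, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print("largest round-trip error:", max_error)
```

Each value is stored in one byte instead of four, and for this sample the round-trip error stays within half a quantization step.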

With lower precision numbers, more of them can be processed simultaneously, increasing application performance. The theoretical maximum performance boost when quantizing from FP32 to INT8 is 4x.
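The 4x figure follows from simple arithmetic on value widths. The sketch below assumes a 512-bit vector register purely for illustration; the ratio is the same for any register width.

```python
REGISTER_BITS = 512  # assumption: width of one vector register


def lanes(register_bits, value_bits):
    """Number of values of the given width that fit in one register."""
    return register_bits // value_bits


fp32_lanes = lanes(REGISTER_BITS, 32)
int8_lanes = lanes(REGISTER_BITS, 8)
theoretical_speedup = int8_lanes / fp32_lanes

print(fp32_lanes, int8_lanes, theoretical_speedup)  # prints: 16 64 4.0
```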

INT8 is supported on current and recent Intel® CPUs and Intel® integrated GPUs, but is not supported by legacy hardware.

Overcoming the challenges of quantization

There are two challenges with quantization:

- How to do it easily. In the past, it has been a time-consuming process.
- How to maintain accuracy.

Both of these challenges are addressed by the Neural Network Compression Framework (NNCF). NNCF is a suite of advanced algorithms for optimizing machine learning and deep learning models for inference in the Intel® Distribution of OpenVINO™ toolkit. NNCF works with models from PyTorch and TensorFlow.

One of the main features of NNCF is 8-bit uniform quantization, which draws on recent academic research to produce accurate and fast models. The technique covered in this article is quantization-aware training (QAT). QAT simulates the quantization of weights and activations while the model is being trained, so that operations in the model can be executed as 8-bit operations at inference time. Fine-tuning is then used to recover the accuracy lost to quantization. QAT offers better accuracy and reliability than quantizing the model after it has been trained.
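To make "simulating quantization during training" concrete, here is a minimal sketch of a quantize-then-dequantize ("fake quantization") step with a straight-through gradient. It is a generic illustration under our own assumptions (symmetric signed 8-bit range, fixed scale, hypothetical function names), not NNCF's implementation.

```python
def fake_quantize(x, scale, qmin=-128, qmax=127):
    """Snap x to the 8-bit grid, then return it as a float again."""
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale


def fake_quantize_grad(x, scale, qmin=-128, qmax=127):
    """Straight-through estimator: gradient 1 inside the range, 0 where clipped."""
    return 1.0 if qmin * scale <= x <= qmax * scale else 0.0


scale = 0.02                                   # assumed quantization step
activations = [-3.0, -0.505, 0.013, 0.5, 2.9]  # made-up sample values
simulated = [fake_quantize(a, scale) for a in activations]
grads = [fake_quantize_grad(a, scale) for a in activations]

print("forward values:", simulated)
print("gradient mask: ", grads)
```

The forward pass sees the rounding and clipping error that INT8 inference will introduce, while the backward pass still delivers a usable gradient, which is what lets fine-tuning adapt the weights.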

Unlike other optimization tools, NNCF does not require users to change the model manually or learn how the quantization works. It is highly automated. You just need to wrap the model using NNCF specific calls and do the usual fine-tuning on the original training dataset.

How to do uniform 8-bit quantization using NNCF

For this step-by-step example, we will use the ResNet-18 image classification model for PyTorch from the Torchvision library. It has been pretrained using ImageNet and can be used in a wide range of applications. Aside from classification tasks (e.g. classifying dog species), it can be part of a pipeline for object detection, person identification, or image segmentation, for example.

You can download a Jupyter Notebook containing the following steps here.

Step 1: Install prerequisites

Create a separate Python virtual environment and install the following prerequisites into it:

$ pip install "nncf[torch]"
$ pip install openvino openvino-dev

Step 2: Import NNCF from your Python code

Import NNCF by adding the following Python instructions to your training program:

import torch
import torchvision
import nncf # Important - should be imported directly after torch
from nncf import NNCFConfig
from nncf.torch import create_compressed_model
from nncf.torch.initialization import register_default_init_args

Step 3: Prepare the model and data

We assume that the user has their own training pipeline for the original FP32 model with steps for model loading, data preparation, and the training loop.

model = torchvision.models.resnet18(pretrained=True) 
train_loader, val_loader = create_data_loaders(...) # placeholder for DataLoader

nncf_config_dict = {
    "input_info": {
        "sample_size": [1, 3, 64, 64]
    },
    "compression": {
        "algorithm": "quantization", # specify the algorithm here
    }
}

# Build the NNCF configuration from the dictionary that specifies compression
nncf_config = NNCFConfig.from_dict(nncf_config_dict)

# Provide data loaders for compression algorithm initialization, if necessary
nncf_config = register_default_init_args(nncf_config, train_loader)

# Apply the specified compression algorithms to the model
compression_ctrl, model = create_compressed_model(model, nncf_config)

The OpenVINO™ Inference Engine uses the quantization rules inserted in the model during training to run the model in INT8 at inference time.

The call to create_compressed_model inserts operations that simulate 8-bit quantization during training. This simulation lets the fine-tuning process adjust the model to recover the accuracy lost to quantization.
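If you prefer to keep the compression settings outside the training script, the same dictionary can be stored as JSON and loaded back before it is passed to NNCFConfig.from_dict. The sketch below uses only the standard library; the file name is hypothetical.

```python
import json
import os
import tempfile

nncf_config_dict = {
    "input_info": {"sample_size": [1, 3, 64, 64]},
    "compression": {"algorithm": "quantization"},
}

# Hypothetical file name in a temporary directory, for illustration only.
config_path = os.path.join(tempfile.mkdtemp(), "resnet18_int8_config.json")

# Save the settings so they can be versioned alongside the training code.
with open(config_path, "w") as f:
    json.dump(nncf_config_dict, f, indent=4)

# Load them back; the result is an ordinary dictionary that can be handed
# to NNCFConfig.from_dict(...) as in Step 3.
with open(config_path) as f:
    loaded_config_dict = json.load(f)

print(loaded_config_dict)
```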

Step 4: Fine tune the model as usual

Next, a regular fine-tuning process is used to improve accuracy. Normally, several epochs of tuning are required with a small learning rate, the same as is typically used at the end of the training of the original model. Here is a simple example of the fine-tuning code in our training program. This code is unchanged from the original pipeline; criterion and optimizer are the loss function and optimizer already defined there.

total_epochs = 5
for epoch in range(total_epochs): 
    for i, (images, target) in enumerate(train_loader):
        output = model(images)
        loss = criterion(output, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

No other changes in the training pipeline are required.

Step 5: Export model to ONNX

To use the PyTorch model in the OpenVINO Inference Engine, we first need to convert the model to ONNX.

We will use an NNCF helper function to export the quantized model to ONNX format. For image classification models, the API is simple. For some other models, you may need extra functionality for the export, for example, a dummy forward function.

When the fine-tuning finishes, we call this code to export the fine-tuned model to the ONNX format:

compression_ctrl.export_model('resnet18_int8.onnx')

Step 6: Export ONNX models to the OpenVINO™ Intermediate Representation (IR)

The OpenVINO Intermediate Representation (IR) is the file format used by the OpenVINO Inference Engine. We can now convert the ONNX model to the OpenVINO IR format by calling the OpenVINO™ Model Optimizer tool.

The two files that represent the model are saved to the current directory. Using the --mean_values and --scale_values arguments, we embed input normalization in the model: the mean values are subtracted from the input and the result is divided by the scale values (the standard deviations). These values were used to normalize the input during training and represent the per-channel mean and standard deviation of color intensity across all the training images.

Using these Model Optimizer options, there is no need to normalize the input data on deployment. The pre-processing will be part of the model.

$ mo \
--input_model resnet18_int8.onnx \
--input_shape "[1, 3, 64, 64]" \
--mean_values "[123.675, 116.28, 103.53]" \
--scale_values "[58.395, 57.12, 57.375]"
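The numbers passed to --mean_values and --scale_values match the commonly used ImageNet statistics rescaled from the [0, 1] range to the [0, 255] pixel range. The sketch below assumes the training pipeline used those standard torchvision-style values; it shows where the numbers come from and that normalizing raw pixels with them gives the same result as the usual training-time normalization.

```python
# Assumption: the widely used ImageNet statistics for inputs scaled to [0, 1].
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

# Rescale to the 0..255 pixel range expected by the deployed model.
mean_values = [round(m * 255, 3) for m in IMAGENET_MEAN]
scale_values = [round(s * 255, 3) for s in IMAGENET_STD]


def normalize_pixel(rgb, mean, scale):
    """Subtract the per-channel mean, then divide by the per-channel scale."""
    return [(c - m) / s for c, m, s in zip(rgb, mean, scale)]


pixel = [255, 128, 0]  # made-up raw RGB pixel
embedded = normalize_pixel(pixel, mean_values, scale_values)
# Training-style normalization: scale to [0, 1] first, then normalize.
reference = [(c / 255 - m) / s
             for c, m, s in zip(pixel, IMAGENET_MEAN, IMAGENET_STD)]

print(mean_values, scale_values)
```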

See the Model Optimizer Developer Guide for more information.

Measuring the performance with the OpenVINO toolkit

Now we have created an optimized model that will run with 8-bit precision in the OpenVINO Inference Engine.

As a last step, we will measure the inference performance of the original FP32 and new INT8 models. To do this, we use the Benchmark Tool, the inference performance measurement tool in the OpenVINO toolkit. It measures the inference performance using randomly generated data so that there is no overhead introduced by data loading. By default, the Benchmark Tool runs inference for 60 seconds in asynchronous mode on the CPU. It returns the inference speed as latency (milliseconds per image) and throughput (frames per second) values.

For more accurate results, we recommend running the Benchmark Tool (benchmark_app) in a terminal after closing other applications. Use:

$ benchmark_app -m model.xml -d CPU

to benchmark asynchronous inference on the CPU for one minute.

Change CPU to GPU to benchmark on an Intel GPU. Run:

$ benchmark_app --help

to see an overview of all command line options.

To test the model we exported in Step 6, we use:

$ benchmark_app -m resnet18_int8.xml -d CPU

You can benchmark the original FP32 or FP16 OpenVINO IR model the same way to compare the results.
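To keep the comparison consistent, the same command line can be generated for each model from a small script. This is only a sketch: benchmark_command is a hypothetical helper, and resnet18_fp32.xml is an assumed name for an FP32 IR file that this article does not create. The -m and -d options are the ones shown above.

```python
import shlex


def benchmark_command(model_xml, device="CPU"):
    """Build the benchmark_app argument list for one model and device."""
    return ["benchmark_app", "-m", model_xml, "-d", device]


# Assumption: the FP32 IR file name is hypothetical.
models = {"FP32": "resnet18_fp32.xml", "INT8": "resnet18_int8.xml"}
commands = {name: benchmark_command(path) for name, path in models.items()}

for name, cmd in commands.items():
    print(name, "->", shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```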

Using an Intel® Xeon® Platinum 8280 processor with Intel® Deep Learning Boost technology, the INT8 optimization achieves a 3.62x speedup (see Table 1). In a local setup using an 11th Gen Intel® Core™ i7-1165G7 processor with the same instruction set, the speedup was 3.63x. These numbers were measured using the OpenVINO benchmarking infrastructure with OpenVINO toolkit version 2021.4.2.

Hardware for inference	FP32	INT8	Speedup
Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz	1578 FPS	5714 FPS	3.62x
11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz	97 FPS	353 FPS	3.63x

Table 1: The speedup achieved by INT8 optimization using the Neural Network Compression Framework. FPS is frames per second.
Hardware configurations:
- Xeon 8280 based platform: 1-node, 2x Intel Xeon 8280 CPU on Intel reference platform with 384 GB (12 slots/ 32GB/ 2934) total DDR4 memory, Ubuntu 18.04.6 LTS, 5.0.0-23-generic.
- Core i7-1165G7 based platform: 1-node, 1x Intel Core i7-1165G7 CPU on Intel reference platform with 8 GB (1 slots/ 1GB/ 3200) total DDR4 memory, Microsoft Windows 10 Enterprise, 10.0.19042 N/A Build 19042.
Image Classification Inference: ResNet-18, BS=1, INT8 With OpenVINO 2021.4.2, test by Intel on 2/3/2022.
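The speedup column is the ratio of INT8 to FP32 throughput. The short sketch below recomputes it from the rounded FPS values in Table 1, so the last digit can differ slightly from the published figures, and compares it with the 4x theoretical maximum mentioned earlier.

```python
# Throughput values copied from Table 1 (frames per second, rounded).
results = {
    "Xeon Platinum 8280": {"fp32_fps": 1578, "int8_fps": 5714},
    "Core i7-1165G7": {"fp32_fps": 97, "int8_fps": 353},
}

THEORETICAL_MAX = 4.0  # FP32 -> INT8 upper bound stated in the article

speedups = {
    name: fps["int8_fps"] / fps["fp32_fps"] for name, fps in results.items()
}

for name, value in speedups.items():
    share = value / THEORETICAL_MAX
    print(f"{name}: {value:.2f}x ({share:.0%} of the theoretical maximum)")
```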

Conclusion

In this article, we demonstrated how to use NNCF 8-bit Quantization Aware Training to accelerate the inference of PyTorch models. As we have shown, the process is simple and does not require significant changes in the training code. The flow for the deployment with OpenVINO remains the same as for floating-point models.

In future articles, we will show how to use NNCF for model inference optimization within TensorFlow and will introduce more advanced optimization techniques that can also help to achieve a speedup.

Notices & Disclaimers

Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.

Your costs and results may vary.

Intel technologies may require enabled hardware, software or service activation.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

Results have been estimated or simulated.