Model Optimization Pipeline for Inference Speedup with OpenVINO™ Toolkit

MaryT_Intel · ‎08-23-2021

Key Takeaways

Learn how to get maximum performance with the OpenVINO™ toolkit on various types of hardware by applying generic optimization techniques.

Published: Jan 31, 2021

Overview

Deep Neural Network (DNN) model optimization is a major trend in the Artificial Intelligence (AI) domain, driven mostly by the need to deploy DNNs in real-world scenarios on resource-constrained hardware. This hardware is mainly represented by power-limited CPUs, embedded GPUs, or special DSP accelerators, and it hosts AI applications in resource-constrained domains such as the Internet of Things (IoT). Moreover, some platforms have all of these chips on board, which adds a further constraint on the deployment of a Deep Learning (DL) model: the model should work on every supported type of accelerator. This means that, from the variety of optimization methods, we can consider only those that satisfy this constraint. Let’s look at some examples.

Knowledge Distillation (KD) - this method transfers a learned data representation to a more lightweight or optimized model. Although it is hardware-agnostic, it has some drawbacks, e.g. the difficulty of finding the optimal lightweight student model for an arbitrary task and of controlling accuracy during distillation. However, it’s worth noting that this method is often used in conjunction with other optimization methods to boost the resulting model accuracy.

Neural Architecture Search (NAS) - this method is becoming more popular and aims to find the most suitable topology in terms of the accuracy-performance trade-off. It can take into account hardware features and various limitations, e.g. computational budget, accuracy drop, or support of INT8 inference. However, it still requires powerful training hardware and many GPU hours to find the optimal configuration. Moreover, the result is not guaranteed, especially from the accuracy standpoint. The most important drawback of NAS methods is that they require manual adaptation to each new domain and problem, as well as expert knowledge in model design. Thus, they cannot be fully automated.

Unstructured Pruning (Sparsity) – this method introduces zeros into the weights and activations in an unstructured fashion, so special hardware support is required to exploit the sparsity and obtain a real performance benefit. It therefore does not speed up general-purpose hardware and cannot be used as a cross-hardware optimization method.

Filter Pruning – this method mainly targets convolutions, the most computationally expensive operations in a convolutional DNN. The idea is close to the previous method, in that we zero weights of the model, but with a significant difference: we take the weight structure into account when setting weights to zero. For example, the Filter Pruning method zeroes whole convolutional filters, which correspond to channels in the output tensor, so that they can be removed from the network entirely after training. The essence of the method is to shrink the width of the model, i.e. the number of channels, thereby reducing the total number of calculations. Because the result is simply a smaller dense model, the method is hardware-agnostic.
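To make the "shrink the width" argument concrete, here is a minimal arithmetic sketch. The layer sizes are hypothetical and not taken from any model in this post. It counts multiply-accumulate operations (MACs) for two stacked convolutions before and after removing a quarter of the filters in the first one.

```python
def conv_macs(in_ch: int, out_ch: int, kernel: int, out_h: int, out_w: int) -> int:
    """Multiply-accumulate count of a dense 2D convolution (bias ignored)."""
    return in_ch * out_ch * kernel * kernel * out_h * out_w


# Hypothetical block: 64 -> 128 -> 128 channels, 3x3 kernels, 56x56 output maps.
before = conv_macs(64, 128, 3, 56, 56) + conv_macs(128, 128, 3, 56, 56)

# Remove 32 of the 128 filters in the first convolution. Its output now has
# 96 channels, so the second convolution also loses 32 input channels.
kept = 128 - 32
after = conv_macs(64, kept, 3, 56, 56) + conv_macs(kept, 128, 3, 56, 56)

reduction = 1 - after / before
print(f"MACs before: {before}, after: {after}, reduction: {reduction:.1%}")
```

Note that pruning one layer also makes the next layer cheaper, and the pruned model is still an ordinary dense network.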

Quantization – a well-known method that has been heavily exploited in many domains. It reduces the amount of memory needed to represent weights and activations, which lowers latency, and it increases the power efficiency of the hardware because lower-precision arithmetic needs fewer bits for the same calculations. The essence of the method is to approximate floating-point operations with their more efficient integer analogs. Currently, 8-bit quantization is the most popular method to accelerate DL models because it can substantially improve performance (theoretically, up to 4x relative to 32-bit floating point) while largely preserving accuracy. Most contemporary hardware now supports 8-bit calculations, from heavy discrete GPUs and CPUs to low-power accelerators for the edge. We can safely say that this method is also hardware-agnostic.
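The idea of approximating floating-point values with integers can be illustrated with a small, self-contained sketch. It uses a simple symmetric scheme with one scale per tensor. The weights and helper names are hypothetical, and this is not the OpenVINO implementation.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers that share a single scale factor."""
    qmax = 2 ** (num_bits - 1) - 1                  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers."""
    return [qi * scale for qi in q]


weights = [0.42, -1.27, 0.05, 0.9, -0.31]           # hypothetical FP32 weights
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("integers:", q, "scale:", scale)
print(f"max round-trip error {max_error:.6f}, bound scale / 2 = {scale / 2:.6f}")
print(f"storage: {len(weights) * 4} bytes as FP32 vs {len(weights)} bytes as INT8")
```

The round-trip error of any value inside the range is at most half a quantization step, which is why accuracy is usually preserved when the range is chosen well.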

Based on this exploration, we can conclude that NAS and Distillation methods produce completely new models, which may not be suitable if we want fully automated optimization of a DNN model in a low-code or no-code setup. The sparsity method requires hardware support for efficient execution. At the same time, Filter Pruning and 8-bit quantization scale well because they are hardware-agnostic and can be applied to arbitrary convolutional models automatically, i.e. without the user modifying the original model. This also makes them more attractive to adopt.

Optimization Pipeline

As mentioned above, the most widely applicable methods for cross-hardware optimization are Filter Pruning and 8-bit quantization. Importantly, the two methods are complementary and can be combined, because Filter Pruning changes the topology while quantization lowers the precision of computation. The main question is the order in which they should be applied. Since Filter Pruning is the more accuracy-sensitive method and can lead to a larger accuracy drop, we believe that it makes sense to apply it first. Another reason is that Filter Pruning removes whole filters and output channels, so it inevitably affects the ranges of possible values of weights and activations and thus the quantization parameters. Based on these assumptions, we propose the following model optimization pipeline in the OpenVINO™ toolkit ecosystem (see Fig. 1):
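The second argument for pruning first can be seen in a toy example. The sketch below uses hypothetical numbers and a single-output 1x1 convolution. It computes a symmetric per-tensor activation scale before and after two input channels are removed. The scale changes, so quantization parameters calibrated before pruning would no longer fit.

```python
def dot(ws, xs):
    """Output of a single-output 1x1 convolution at one spatial position."""
    return sum(w * x for w, x in zip(ws, xs))


def symmetric_scale(values, qmax=127):
    """Per-tensor scale that maps the largest magnitude to qmax."""
    return max(abs(v) for v in values) / qmax


# Hypothetical weights over four input channels and three calibration samples.
weights = [0.8, -0.1, 0.05, 0.6]
samples = [[1.0, 2.0, -1.0, 0.5], [0.5, -3.0, 2.0, 1.5], [2.0, 1.0, 0.0, -1.0]]

outputs_full = [dot(weights, x) for x in samples]

# Suppose filter pruning removed channels 1 and 2 in the preceding layer.
keep = [0, 3]
outputs_pruned = [
    dot([weights[i] for i in keep], [x[i] for i in keep]) for x in samples
]

scale_full = symmetric_scale(outputs_full)
scale_pruned = symmetric_scale(outputs_pruned)
print(f"activation scale before pruning: {scale_full:.5f}")
print(f"activation scale after pruning:  {scale_pruned:.5f}")
```

In a real model, fine-tuning after pruning shifts the weights as well, which is a further reason to calibrate quantization on the final pruned model.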

Figure 1. Optimization pipeline: (a) The model is wrapped by NNCF and the Filter Pruning algorithm is applied to it with fine-tuning in PyTorch; (b) The model with zero filters is exported to ONNX format and these filters are physically removed from the model; (c) The ONNX model is converted to OpenVINO™ Intermediate Representation; (d) The model in IR is quantized with the Post-Training Optimization Tool.

As the first step, we apply the Filter Pruning method implemented in the Neural Network Compression Framework (NNCF), a product associated with the OpenVINO™ toolkit that is aimed at in-framework model optimization with fine-tuning. Currently, NNCF has frontends for the PyTorch and TensorFlow frameworks. NNCF is fully open source and can be installed with pip. It is easy to integrate into custom training code, and it contains multiple examples of pruning widely used models on popular datasets. At this step, the weights of the pruned filters are set to zero.
After fine-tuning, the model is exported from PyTorch to the ONNX format, and the zero filters are physically removed from the model, reducing its computational complexity.
As the next step, the model is converted to the OpenVINO™ Intermediate Representation (IR), which also lets the user make sure that the model is supported by the OpenVINO™ toolkit.
As the last step, 8-bit post-training quantization is applied using the Post-Training Optimization Tool of the OpenVINO™ toolkit. This step is quite fast because no fine-tuning is involved. The only things required for quantization are the model in IR and a representative calibration dataset.
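The second step above, physically removing zeroed filters, can be sketched in a framework-free way. The example below is an illustration under simplifying assumptions (1x1 kernels, no biases, hypothetical weights) and not the code NNCF uses at export time. It drops all-zero filters from one layer together with the matching input channels of the next layer, then checks that the output is unchanged.

```python
def remove_zero_filters(w1, w2):
    """Drop all-zero output filters of layer 1 and the matching inputs of layer 2.

    w1 has shape [out1][in1] and w2 has shape [out2][out1]; kernels are 1x1
    and biases are omitted to keep the sketch short.
    """
    keep = [i for i, filt in enumerate(w1) if any(v != 0.0 for v in filt)]
    w1_small = [w1[i] for i in keep]
    w2_small = [[row[i] for i in keep] for row in w2]
    return w1_small, w2_small, keep


def forward(w, x):
    """Apply a 1x1 convolution (a matrix-vector product) at one position."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


# Hypothetical weights: the second filter of layer 1 was zeroed by pruning.
w1 = [[0.5, -0.2], [0.0, 0.0], [0.1, 0.3]]
w2 = [[1.0, 2.0, -1.0], [0.5, 0.5, 0.5]]
x = [1.0, 2.0]

w1_small, w2_small, keep = remove_zero_filters(w1, w2)
y_full = forward(w2, forward(w1, x))
y_small = forward(w2_small, forward(w1_small, x))
print("kept filters:", keep)
print("outputs match:", all(abs(a - b) < 1e-12 for a, b in zip(y_full, y_small)))
```

With biases or normalization layers, the corresponding per-channel parameters would have to be removed as well.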

After these steps we can achieve a significant inference speedup with the OpenVINO™ toolkit (e.g. a noticeable boost in performance for ResNet-50 on ImageNet) at the cost of some accuracy degradation which, as the results below demonstrate, is negligible in many cases.

Below we describe the optimization methods used in more detail.

Filter Pruning method in the NNCF

Our filter pruning algorithm consists of two steps: eliminating less important filters and then fine-tuning the model to recover the accuracy.

Currently, two techniques to select filters to prune are supported in NNCF:

Magnitude-based pruning. The method, described in this paper, assumes that filters with a smaller L_p norm have a relatively smaller impact on activations and hence on the final model predictions. Consequently, they can be removed without a large impact on accuracy.
Geometric median pruning. The method, described in the following paper, aims to prune filters that can be best decomposed into a combination of the remaining ones and thus substituted by them. Therefore, deleting these filters should not noticeably hurt model accuracy. It has been shown that, in general, this method performs slightly better than the magnitude-based approach.
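The two selection criteria above can rank the same filters quite differently. The sketch below uses hypothetical 2-element "filters" and is not NNCF's implementation. For the geometric median criterion it uses a common approximation: a filter whose summed distance to all other filters is smallest is treated as the most replaceable.

```python
import math


def magnitude_scores(filters):
    """L2 norm per filter; a smaller norm is treated as less important."""
    return [math.sqrt(sum(v * v for v in f)) for f in filters]


def geometric_median_scores(filters):
    """Summed distance to all other filters; a small sum marks a filter
    near the center of the layer that the remaining ones can replace."""
    return [sum(math.dist(f, g) for g in filters) for f in filters]


def select_to_prune(scores, num_to_prune):
    """Indices of the filters with the lowest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:num_to_prune])


# Hypothetical flattened filters: three near-duplicates, one tiny, one distinct.
filters = [[1.0, 0.0], [0.9, 0.1], [1.1, -0.1], [0.05, 0.05], [-1.0, 2.0]]

by_magnitude = select_to_prune(magnitude_scores(filters), 1)
by_geometric_median = select_to_prune(geometric_median_scores(filters), 1)
print("magnitude criterion prunes:", by_magnitude)
print("geometric median criterion prunes:", by_geometric_median)
```

The magnitude criterion picks the tiny filter, while the geometric median criterion picks one of the near-duplicates, even though its norm is not small.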

Both techniques can also be used with a progressive pruning ratio that increases during the pruning process. This helps to make the pruning process more stable while retaining model capacity and accuracy.
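A progressive pruning ratio can be sketched as a schedule that ramps from an initial ratio to the target over a number of epochs. The function below is a hypothetical example with assumed parameter names and values, and it is not necessarily the schedule NNCF uses. It shrinks the kept fraction of filters geometrically.

```python
def pruning_rate(epoch, init_rate=0.1, target_rate=0.5, ramp_epochs=20):
    """Hypothetical schedule: the kept fraction decays geometrically from
    (1 - init_rate) to (1 - target_rate) over ramp_epochs, then stays flat."""
    progress = min(max(epoch, 0), ramp_epochs) / ramp_epochs
    kept = (1 - init_rate) * ((1 - target_rate) / (1 - init_rate)) ** progress
    return 1 - kept


rates = [pruning_rate(e) for e in range(0, 25, 5)]
print([f"{r:.3f}" for r in rates])
```

Each step removes only a small number of additional filters, which gives fine-tuning a chance to recover accuracy before the next increase.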

Figure 2. Pruning of an elementwise addition of two convolution outputs. This operation imposes additional constraints on the pruned filters (output channels): 1) The number of filters should be equal in both Conv_1 and Conv_2; 2) The positions of the pruned filters should be the same so that they can be removed from the model.

An important part of filter pruning is preliminary model architecture analysis to group dependent channels that should be pruned together. For example, it is necessary, in the case of two convolution outputs summed up together, to prune the same filters in both convolutions. Otherwise, none of them can be pruned (see Fig. 2). In NNCF, dependent channels are grouped in case of elementwise operations and Convolution + Depth-wise Convolution combinations.
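The grouping step can be sketched with a small union-find structure. In the example below, the layer names, the scores, and the rule of summing per-channel scores across a group are all assumptions made for illustration. Convolutions whose outputs meet in an elementwise operation end up in one group and receive the same pruned channel indices.

```python
def group_dependent_convs(conv_names, elementwise_edges):
    """Union convolutions whose outputs meet in an elementwise operation."""
    parent = {name: name for name in conv_names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]     # path halving
            name = parent[name]
        return name

    for a, b in elementwise_edges:
        parent[find(a)] = find(b)

    groups = {}
    for name in conv_names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())


def shared_prune_indices(filter_scores, group, num_to_prune):
    """Choose one set of channel indices for every convolution in a group by
    summing per-channel importance (an assumed aggregation rule)."""
    num_channels = len(filter_scores[group[0]])
    combined = [sum(filter_scores[c][i] for c in group) for i in range(num_channels)]
    order = sorted(range(num_channels), key=lambda i: combined[i])
    return sorted(order[:num_to_prune])


# Hypothetical graph: Conv_1 and Conv_2 are summed; Conv_3 is independent.
convs = ["Conv_1", "Conv_2", "Conv_3"]
groups = group_dependent_convs(convs, [("Conv_1", "Conv_2")])

scores = {"Conv_1": [0.9, 0.1, 0.5, 0.2], "Conv_2": [0.8, 0.4, 0.1, 0.1]}
pruned = shared_prune_indices(scores, groups[0], num_to_prune=2)
print("groups:", groups)
print("channels pruned in both Conv_1 and Conv_2:", pruned)
```

Channel 2 is the weakest in Conv_2 on its own, but it is kept because it matters in Conv_1; only the joint score decides.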

8-bit Quantization

Figure 3. Quantization modes visualization: symmetric and asymmetric.

Currently, 8-bit quantization is the de facto standard for DL model optimization. As mentioned above, the OpenVINO™ toolkit provides 8-bit quantization through the Post-Training Optimization Tool (POT). POT transforms models into a representation that the OpenVINO™ toolkit runtime components can interpret as a fixed-point model and execute in low precision. This is achieved by inserting special FakeQuantize operations into the model. The transformation is done automatically, so the user does not need to modify the model. The FakeQuantize operation has rich semantics and can represent various quantization schemes, e.g. symmetric and asymmetric quantization (see Fig. 3), with per-channel or per-tensor parameters.
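To give a feel for what a FakeQuantize operation computes, here is a scalar sketch that follows the element-wise formula as we understand it from the OpenVINO operation specification. The ranges and level counts are hypothetical, and Python's built-in `round` (round-half-to-even) is an assumption about tie handling.

```python
def fake_quantize(x, input_low, input_high, output_low, output_high, levels=256):
    """Clamp x to the input range, snap it to one of `levels` evenly spaced
    steps, and map the step to the output range."""
    if x <= min(input_low, input_high):
        return output_low
    if x > max(input_low, input_high):
        return output_high
    step = round((x - input_low) / (input_high - input_low) * (levels - 1))
    return step / (levels - 1) * (output_high - output_low) + output_low


# Asymmetric range, e.g. for non-negative activations after ReLU.
asym = fake_quantize(1.004, 0.0, 2.55, 0.0, 2.55, levels=256)

# Symmetric range around zero; 255 levels keep zero exactly representable.
sym = fake_quantize(0.333, -1.27, 1.27, -1.27, 1.27, levels=255)

print(f"asymmetric: 1.004 -> {asym:.4f}")
print(f"symmetric:  0.333 -> {sym:.4f}")
```

Per-tensor quantization uses one set of low/high values for the whole tensor, while per-channel quantization would apply this function with separate ranges for each channel.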

For more details about model optimization with Post-Training Optimization Tool, please refer to the following resources:

Results

We applied the described optimization flow to a representative set of models to showcase the performance increase and model size reduction after optimization. In all experiments, we used the Geometric Median criterion from NNCF to select filters for pruning, then fine-tuned the models for a substantial number of epochs. For the ImageNet dataset, we fine-tuned for 100 epochs using the SGD optimizer with Nesterov momentum, starting from a learning rate of 0.1 and decaying it every 20 epochs.
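The learning-rate schedule described above can be written as a step function. The decay factor is not stated in this post, so the value `gamma=0.1` below is purely an assumption for illustration.

```python
def step_lr(epoch, base_lr=0.1, step_size=20, gamma=0.1):
    """Learning rate at a given epoch; gamma is an assumed decay factor."""
    return base_lr * gamma ** (epoch // step_size)


# Learning rate at the start of each 20-epoch stage of a 100-epoch run.
schedule = [(epoch, step_lr(epoch)) for epoch in range(0, 100, 20)]
for epoch, lr in schedule:
    print(f"epoch {epoch:3d}: lr = {lr:.0e}")
```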

After that, we converted the models to ONNX and then to OpenVINO™ IR and applied post-training quantization with POT. In all experiments we used the DefaultQuantization algorithm to get a fully quantized model and the maximum performance gain from quantization.

Figure 4 shows the performance gain after pruning and 8-bit quantization, while Table 1 shows the decrease in model size after applying both methods. This reduction is mostly due to storing the weights of quantized models in 8-bit representation, as described in one of our previous posts.

Figure 4. Performance results. All numbers were collected with OpenVINO Release 2021.1 on an Intel(R) Core(TM) i9-10920X CPU @ 3.50GHz.

Model (Dataset)	Average pruning rate (weights/FLOPs)	Accuracy drop for pruned + quantized model	OpenVINO IR size reduction of pruned + quantized model vs. original model in FLOAT16 precision
GoogLeNet (ImageNet)	47%/54%	1.11%	2.38x
ResNet-18 (ImageNet)	21%/24%	0.75%	2.27x
ResNet-34 (ImageNet)	29%/31%	0.75%	2.63x
ResNet-50 (ImageNet)	37%/44%	0.77%	2.51x
SSD-300 (Pascal VOC)	56%/57%	0.62%	3.61x
UNet (Mapillary)	49%/42%	1.12%	2.04x

Conclusion

We introduced a new optimization pipeline built on NNCF, the Post-Training Optimization Tool, and the Intel® Distribution of OpenVINO™ toolkit that sequentially applies Filter Pruning and INT8 quantization to produce highly optimized DL models. One important advantage of the proposed pipeline is that it is hardware-agnostic, i.e. it can be used effectively to optimize models for various types of DL hardware, such as CPUs, GPUs, or special DL accelerators. Even though the pipeline has two stages and requires two different tools, it can be automated because it can be applied to an arbitrary CNN and does not require the user to change the model structure. We showed that the proposed pipeline can substantially improve inference performance and reduce the size of the model.

If you have ideas on how we can improve the product, we welcome contributions to the open-source OpenVINO™ toolkit. Finally, join the conversation to discuss all things Deep Learning and OpenVINO™ toolkit in our community forum.
