Key Takeaways
 Learn how to get maximum performance with OpenVINO™ toolkit on various types of hardware by applying generic optimization techniques.
Published: Jan 31, 2021
Overview
Deep Neural Network (DNN) model optimization is a major trend in the Artificial Intelligence (AI) domain, driven mostly by the need to deploy DNNs in real-world scenarios on resource-constrained hardware. This hardware is mainly represented by power-limited CPUs, embedded GPUs, or special DSP accelerators, and is used to host AI applications within resource-constrained domains such as the Internet of Things (IoT). Moreover, some platforms may have all these chips onboard, which adds an additional constraint to the deployment of a Deep Learning (DL) model: the model should work on all the types of supported accelerators. It means that, from the variety of optimization methods, we can consider only those that satisfy this constraint. Let's look at some examples.
Knowledge Distillation (KD) – this method transfers the data representation learned by one model to a more lightweight or optimized model. Although it is hardware-agnostic, it has some drawbacks, e.g. the difficulty of finding the optimal lightweight student model for an arbitrary task, and of controlling the accuracy during distillation. However, it is worth noting that this method is often used in conjunction with other optimization methods to boost the accuracy of the resulting model.
Neural Architecture Search (NAS) – this method is becoming more popular and aims to find the most suitable topology in terms of the accuracy-performance trade-off. It can take into account hardware features and different limitations, e.g. computational budget, accuracy drop, or support of INT8 inference. However, it still requires powerful training hardware and many GPU hours to find the optimal configuration. Moreover, the result is not guaranteed, especially from the accuracy standpoint. The most important drawback of NAS methods is that they require manual adaptation to each new domain or problem, as well as expert knowledge in model design. Thus, they cannot be fully automated.
Unstructured Pruning (Sparsity) – this method introduces zeros into the weights and activations in an unstructured fashion; therefore, special hardware modifications are required to leverage this sparsity and get a real performance benefit. General-purpose hardware cannot exploit it, so it cannot be used as a cross-hardware optimization method.
Filter Pruning – this method mainly targets convolutions, the most computationally expensive operations in a convolutional DNN. The idea is close to the previous method in that we zero weights of the model, but with a significant difference: we take the weight structure into account when setting weights to zero. For example, the Filter Pruning method zeroes whole convolutional filters that correspond to channels in the output tensor, so that they can be removed from the network entirely after training. The essence of the method is to shrink the width of the model, i.e. the number of channels, thereby reducing the total number of calculations. This makes the method hardware-agnostic.
Quantization – a well-known method that has been heavily used in many domains. It reduces the amount of memory needed to represent weights and activations, hence reducing latency, and increases the power efficiency of the hardware by using lower precision and requiring fewer bits to do the same calculations. The essence of the method is to approximate floating-point operations by their more efficient integer analogs. Currently, 8-bit quantization is the most popular method to accelerate DL models because it can substantially improve performance (theoretically, up to 4x) while preserving accuracy. Most contemporary hardware now supports 8-bit calculations, from heavy discrete GPUs and CPUs to low-power accelerators for the edge. We can safely say that this method is also hardware-agnostic.
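To make the idea of approximating floating-point values by integers concrete, here is a minimal sketch of symmetric per-tensor 8-bit quantization in plain Python. The function names and the sample weights are hypothetical and chosen only for illustration; this is not the implementation used by OpenVINO™ toolkit.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers with one shared scale (symmetric scheme)."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


# Hypothetical weights, for illustration only.
weights = [0.82, -1.27, 0.05, 0.40, -0.33]
q_weights, scale = quantize_symmetric(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q_weights)  # [82, -127, 5, 40, -33]
print(max_error <= scale / 2 + 1e-12)  # rounding error is at most half a step
```

Each value now fits in 8 bits instead of 32, which is where the theoretical 4x factor comes from, at the price of a rounding error bounded by half a quantization step.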
Based on this exploration, we can conclude that NAS and Distillation methods produce completely new models, which may not be suitable if we consider a fully automated optimization of the DNN model in a low-code or no-code setup. As for the sparsity method, it requires hardware support for efficient execution. At the same time, Filter Pruning and 8-bit quantization scale well because they are hardware-agnostic and can be applied to arbitrary convolutional models automatically, i.e. without modification of the original model. The latter fact also makes these methods more attractive to adopt.
Optimization Pipeline
As mentioned above, the most widely applicable methods for cross-hardware optimization are Filter Pruning and 8-bit quantization. Importantly, the two methods can be applied together and independently of each other, because Filter Pruning changes the topology while quantization lowers the precision of computation. The main question is the order in which they should be applied. Considering that Filter Pruning is the more accuracy-sensitive method and can lead to a higher accuracy drop, we believe that it makes sense to apply it first. Another reason is that, since Filter Pruning removes whole filters and output channels, it inevitably affects the ranges of possible values of weights and activations, and thus has an impact on quantization parameters. Based on these assumptions, we propose the following model optimization pipeline in the OpenVINO™ toolkit ecosystem (see Fig. 1):
Figure 1. Optimization pipeline: (a) Model is wrapped by NNCF and the Filter Pruning algorithm is applied to it with fine-tuning in PyTorch; (b) Model with zero filters is exported to ONNX format and such filters are physically removed from the model; (c) ONNX model is converted to OpenVINO™ Intermediate Representation; (d) The model in IR is quantized with the Post-training Optimization Tool.
 As the first step, we apply the Filter Pruning method implemented in the Neural Network Compression Framework (NNCF), an OpenVINO™ toolkit associated product aimed at in-framework model optimization with fine-tuning. Currently, NNCF has frontends for the PyTorch and TensorFlow frameworks. NNCF is fully open source and can be installed using the pip tool. It is easy to integrate into custom training code, and it contains multiple examples of pruning widely used models on popular datasets. At this step, the weights of the pruned filters are set to zero.
 After fine-tuning, the model can be exported from PyTorch to the ONNX format, and the zero filters are physically removed from the model, reducing its computational complexity.
 As the next step, the model is converted to the OpenVINO™ Intermediate Representation (IR) so that the user can make sure that the model is supported by OpenVINO™ toolkit.
 As the last step, 8-bit post-training quantization is applied using the Post-training Optimization Tool of OpenVINO™ toolkit. This step is quite fast because no fine-tuning is applied here. The only things required for quantization are the model in IR and a representative calibration dataset.
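To see why physically removing zeroed filters reduces computation, consider the following back-of-the-envelope sketch. It counts multiply-accumulate operations (MACs) for two stacked 3x3 convolutions; the layer sizes and the 50% pruning fraction are hypothetical values chosen for illustration, not figures from the experiments in this post.

```python
def conv_macs(out_h, out_w, in_channels, out_channels, kernel_size):
    """Multiply-accumulate count of a dense 2D convolution (bias ignored)."""
    return out_h * out_w * in_channels * out_channels * kernel_size * kernel_size


def two_layer_macs(width, pruned_fraction=0.0):
    """MACs of two stacked 3x3 convs (64 -> width -> 64) on a 56x56 feature map.

    Pruning filters of the first conv removes its output channels, which are
    also the input channels of the second conv, so both layers get cheaper.
    """
    kept = round(width * (1.0 - pruned_fraction))
    first = conv_macs(56, 56, 64, kept, 3)
    second = conv_macs(56, 56, kept, 64, 3)
    return first + second


baseline = two_layer_macs(64)
pruned = two_layer_macs(64, pruned_fraction=0.5)
print(baseline, pruned, baseline / pruned)  # 231211008 115605504 2.0
```

In this toy setup, removing half of the filters in one layer halves the cost of that layer and of the layer that consumes its output, on any hardware that executes dense convolutions.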
After these steps, we can achieve a significant inference speedup with OpenVINO™ toolkit (e.g., a noticeable boost in performance for ResNet50 on ImageNet), at the cost of some accuracy degradation that, as demonstrated in the results below, is negligible in many cases.
Below we provide a more detailed description of the optimization methods used.
Filter Pruning Method in NNCF
Our filter pruning algorithm consists of two steps: eliminating less important filters and then fine-tuning the model to recover the accuracy.
Currently, two techniques to select filters to prune are supported in NNCF:
 Magnitude-based pruning. The method, described in this paper, assumes that filters with a smaller L_p norm have a relatively smaller impact on activations and hence on the final model predictions. Consequently, they can be removed without a high impact on the accuracy.
 Geometric median pruning. The method, described in the following paper, aims to prune filters that can be best decomposed into a combination of the remaining ones and, thus, substituted by them. Therefore, deleting these filters does not have a negative impact on model performance. It is shown that, in general, this method performs slightly better than the magnitude-based approach.
Both techniques can also be used with a progressive pruning ratio that increases during the pruning process. This helps to make the pruning process more stable while retaining model capacity and accuracy.
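The two selection criteria and the progressive ratio can be sketched in plain Python as follows. This is an illustrative approximation, not NNCF code: filters are flattened to short vectors, the geometric median criterion is approximated by the sum of distances from a filter to all others (the filter with the smallest sum is considered the most replaceable), and the linear ramp in `progressive_ratio` is a hypothetical schedule chosen for simplicity.

```python
import math


def l2_norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_by_magnitude(filters, num_to_prune):
    """Pick the filters with the smallest L2 norm."""
    order = sorted(range(len(filters)), key=lambda i: l2_norm(filters[i]))
    return sorted(order[:num_to_prune])


def select_by_geometric_median(filters, num_to_prune):
    """Pick the filters closest to all others (smallest total distance)."""
    def total_distance(i):
        return sum(distance(filters[i], f) for j, f in enumerate(filters) if j != i)

    order = sorted(range(len(filters)), key=total_distance)
    return sorted(order[:num_to_prune])


def progressive_ratio(epoch, ramp_epochs, target_ratio):
    """Hypothetical linear ramp of the pruning ratio up to the target."""
    return target_ratio * min(1.0, epoch / ramp_epochs)


# Three similar large filters and one small, distinct filter (toy data).
filters = [[2.0, 2.0], [2.1, 1.9], [1.9, 2.1], [0.1, -0.1]]

print(select_by_magnitude(filters, 1))         # [3]: the smallest filter
print(select_by_geometric_median(filters, 1))  # [0]: the most redundant filter
print(progressive_ratio(5, 10, 0.4))           # halfway through the ramp
```

On this toy data the two criteria disagree: the magnitude criterion removes the small filter, while the geometric median criterion removes a large filter that is nearly duplicated by its neighbors.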
Figure 2. Pruning of element-wise addition of two convolution outputs. This operation imposes additional constraints on the pruned filters (output channels): 1) The number of filters should be equal in both Conv_1 and Conv_2; 2) The positions of the pruned filters should be the same so that they can be removed from the model.
An important part of filter pruning is a preliminary analysis of the model architecture to group dependent channels that should be pruned together. For example, when the outputs of two convolutions are summed together, it is necessary to prune the same filters in both convolutions. Otherwise, none of them can be pruned (see Fig. 2). In NNCF, dependent channels are grouped in the case of element-wise operations and Convolution + Depthwise Convolution combinations.
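The constraint on element-wise addition can be illustrated with a small sketch that selects one shared set of filter indices for a group of dependent convolutions. Combining the per-layer importance scores by summation is an assumption made here for illustration; the source does not state how NNCF aggregates importance within a group, and the function name and norm values are hypothetical.

```python
def select_shared_filters(group_norms, num_to_prune):
    """Choose the same filter indices for every layer in a dependent group.

    group_norms holds one list of per-filter importance scores per layer;
    all lists must have the same length because the outputs are added.
    """
    num_filters = len(group_norms[0])
    if any(len(norms) != num_filters for norms in group_norms):
        raise ValueError("layers in a group must have the same number of filters")
    # Assumed aggregation: sum the scores of the same position across layers.
    combined = [sum(norms[i] for norms in group_norms) for i in range(num_filters)]
    order = sorted(range(num_filters), key=lambda i: combined[i])
    return sorted(order[:num_to_prune])


# Hypothetical per-filter norms for Conv_1 and Conv_2 from Fig. 2.
conv1_norms = [0.9, 0.1, 0.8, 0.7]
conv2_norms = [0.2, 0.3, 0.9, 0.1]

shared = select_shared_filters([conv1_norms, conv2_norms], num_to_prune=2)
print(shared)  # [1, 3]: the same positions are pruned in both convolutions
```

Pruning position 3 only in Conv_2, where it is least important, would leave a channel in the sum that still receives a contribution from Conv_1, so it could not be removed from the model.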
8-bit Quantization
Figure 3. Quantization modes visualization: symmetric and asymmetric.
Currently, 8-bit quantization is the de facto standard for DL model optimization. As mentioned, OpenVINO™ toolkit provides 8-bit quantization through the Post-training Optimization Tool (POT). The POT transforms models into a representation that the OpenVINO™ toolkit runtime components can interpret as a fixed-point model and execute in low precision. This is achieved by introducing special FakeQuantize operations into the model. The transformation is done automatically, so the user does not need to modify the model. The FakeQuantize operation has rich semantics and can represent various quantization schemes, e.g. symmetric and asymmetric quantization (see Fig. 3), and per-channel and per-tensor parameters.
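As a rough illustration of how one operation can express both schemes, the sketch below follows the element-wise formula given in the OpenVINO™ operation specification for FakeQuantize, applied to a single scalar. It is a simplified reading of that specification rather than the runtime implementation: broadcasting and per-channel limits are omitted, the example ranges are hypothetical, and Python's `round` resolves exact ties to the nearest even integer, which may differ from the runtime's tie-breaking.

```python
def fake_quantize(x, input_low, input_high, output_low, output_high, levels=256):
    """Clamp x to the input range, snap it to one of `levels` steps, rescale."""
    if x <= min(input_low, input_high):
        return output_low
    if x > max(input_low, input_high):
        return output_high
    step_index = round((x - input_low) / (input_high - input_low) * (levels - 1))
    return step_index / (levels - 1) * (output_high - output_low) + output_low


# Asymmetric range, e.g. for non-negative activations after ReLU.
print(fake_quantize(1.004, 0.0, 2.55, 0.0, 2.55))  # snapped to the 0.01 grid
print(fake_quantize(3.0, 0.0, 2.55, 0.0, 2.55))    # clamped to the upper limit

# Signed range around zero, as used by the symmetric scheme.
print(fake_quantize(0.503, -1.28, 1.27, -1.28, 1.27))
print(fake_quantize(-5.0, -1.28, 1.27, -1.28, 1.27))  # clamped to the lower limit
```

The only difference between the two calls is the choice of range limits, which is why the same operation covers symmetric and asymmetric modes.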
For more details about model optimization with the Post-training Optimization Tool, please refer to the following resources:
 Post-training optimization best practices
 Enhanced low-precision pipeline to accelerate inference with OpenVINO™ toolkit
Results
We applied the described optimization flow to a representative set of models to showcase the performance increase and model size reduction after optimization. In all experiments, we used the Geometric Median criterion from NNCF for pruning filter selection. We applied it to the models and fine-tuned them for a substantial number of epochs. For the ImageNet dataset, we fine-tuned for 100 epochs using the SGD optimizer with Nesterov momentum, starting from a learning rate of 0.1 and decaying it every 20 epochs.
After that, we converted the models to ONNX and then to OpenVINO™ IR, and applied post-training quantization with POT. In all experiments, we used the DefaultQuantization algorithm to get a fully quantized model and the maximum performance gain from quantization.
Figure 4 shows the performance gain after pruning and 8-bit quantization, while Table 1 shows the decrease in model size after applying both methods. This reduction is mostly caused by the fact that we store the weights of quantized models in an 8-bit representation, as we described in one of our previous posts.
Figure 4. Performance results. All the numbers were collected with OpenVINO Release 2021.1 on an Intel(R) Core(TM) i9-10920X CPU @ 3.50GHz.
Table 1. Model size reduction after pruning and quantization.

| Model (Dataset) | Average pruning rate (weights/FLOPs) | Accuracy drop for pruned + quantized model | OpenVINO IR size reduction of pruned + quantized model vs. original model in FLOAT16 precision |
| --- | --- | --- | --- |
| Googlenet (ImageNet) | 47%/54% | 1.11% | 2.38x |
| ResNet18 (ImageNet) | 21%/24% | 0.75% | 2.27x |
| ResNet34 (ImageNet) | 29%/31% | 0.75% | 2.63x |
| ResNet50 (ImageNet) | 37%/44% | 0.77% | 2.51x |
| SSD300 (Pascal VOC) | 56%/57% | 0.62% | 3.61x |
| UNet (Mapillary) | 49%/42% | 1.12% | 2.04x |
Conclusion
We introduced a new optimization pipeline with NNCF, the Post-training Optimization Tool, and the Intel® Distribution of OpenVINO™ toolkit that sequentially applies Filter Pruning and INT8 quantization to produce highly optimized DL models. One of the important advantages of the proposed pipeline is that it is hardware-agnostic, i.e. it can be used effectively to optimize models for various types of DL hardware, such as CPUs, GPUs, or special DL accelerators. Even though the pipeline has two stages and requires two different tools, it can be automated because it can be applied to an arbitrary CNN and does not require the user to change the model structure. We showed that, by applying the proposed pipeline, it is possible to substantially improve inference performance and reduce the size of the model.
If you have any ideas on how we can improve the product, we welcome contributions to the open-source OpenVINO™ toolkit. Finally, join the conversation to discuss all things Deep Learning and OpenVINO™ toolkit in our community forum.
Notices and Disclaimers
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software or service activation.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and noninfringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.