
Reduce Application Footprint with the Latest Features in Intel® Distribution of OpenVINO™ toolkit


Key Takeaways

  • Learn how to improve the performance of deep learning solutions while reducing the footprint of the final application to make it easier to redistribute.
  • Get started quickly with tools in the Intel® Distribution of OpenVINO™ toolkit, such as the Deployment Manager, to rapidly reduce application footprint.
  • Discover capabilities by reviewing model and application size benchmarks included in the blog.
 


 

Authors:

Yury Gorbachev, Senior Principal Engineer, Intel

Alexander Kozlov, Deep Learning R&D Engineer, Intel

Ilya Lavrenov, Software Engineer, Intel

Advances in AI and Deep Learning make it possible to solve complicated tasks, from intelligent video processing and text analysis to enhancing user interfaces with voice-activated functions, text correction, suggestions, and other features. The Intel® Distribution of OpenVINO™ toolkit speeds up AI workloads such as computer vision, audio, speech, and language processing, and is optimized for Intel's line-up of scalable CPUs, integrated GPUs, VPUs, and FPGAs.

However, the addition of AI functionality comes at a cost, not only in terms of the data science work needed to produce models, but also in increased processing requirements during inference and a larger application footprint, since the model must be redistributed along with the runtime binaries. In addition, depending on the use case, inference can require a substantial amount of memory to execute.

One of the objectives of the OpenVINO™ toolkit's design is to make Deep Learning inference as lightweight as possible. We are not only improving performance but also working on reducing the footprint of the final application to make it easier to redistribute. Let's dive into the recent changes we have made in this area.

Key Components of a Redistributable Inference Application

A Deep Learning application requires several key components to perform inference: the application logic itself, the Deep Learning model(s), and the runtime libraries. Models are typically trained in one of the popular frameworks, such as TensorFlow or PyTorch, and optimized using the OpenVINO™ toolkit's developer tools, such as the Model Optimizer and the Post-training Optimization Tool, while only the runtime libraries are copied from the OpenVINO™ toolkit distribution.

The relationship between the final application and the toolkit distribution is shown in the figure below.

[Figure: relationship between the final application and the Intel® Distribution of OpenVINO™ toolkit distribution]

It is important to mention the following:

  • The entire toolkit's package size is substantial (~199 MB) because of all the features it includes. However, at runtime only a small part of it is needed (the runtime libraries); the developer tools and bundled models do not need to be redistributed with your application.

  • The toolkit's runtime, the Inference Engine, is designed with a plugin approach. Depending on the deployment scenario, you might need just a subset of those plugins. For example, if you are not using preprocessing or heterogeneous execution, you can exclude those plugins. Moreover, if you use a model in the latest version of the Intermediate Representation (IR), you can exclude legacy support. To help you minimize your package size for deployment, the Intel® Distribution of OpenVINO™ toolkit comes with the Deployment Manager.
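
For instance, the Deployment Manager (a Python tool typically found under deployment_tools/tools/deployment_manager in the installation) can assemble a minimal package for a chosen target device. The sketch below drives it from Python; the installation path, output directory, and archive name are placeholders, and the exact flags can vary between releases, so check the tool's --help output.

    import subprocess
    from pathlib import Path

    # Placeholder paths: adjust to your OpenVINO installation and project layout.
    openvino_dir = Path("/opt/intel/openvino_2021")
    deployment_manager = (openvino_dir / "deployment_tools" / "tools"
                          / "deployment_manager" / "deployment_manager.py")

    # Collect only the runtime components needed for CPU inference into an archive.
    subprocess.run(
        ["python3", str(deployment_manager),
         "--targets", "cpu",                      # include only the CPU plugin and its dependencies
         "--output_dir", "./deployment_package",  # where to place the generated archive
         "--archive_name", "cpu_runtime"],        # name of the resulting archive
        check=True,
    )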

By following this approach, you can keep only the necessary components. However, the size of those necessary components matters as well, so we have been reducing it too.
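
To make this concrete, here is a minimal sketch of the application side of such a deployment, using the Inference Engine Python API from these releases. The IR file names are placeholders; in the final package, only your application code, the IR files, and the selected runtime libraries need to ship together.

    import numpy as np
    from openvino.inference_engine import IECore  # the only runtime dependency

    # Placeholder IR files produced earlier by the Model Optimizer.
    ie = IECore()
    net = ie.read_network(model="model.xml", weights="model.bin")
    exec_net = ie.load_network(network=net, device_name="CPU")

    # Discover the input name and shape from the network itself.
    input_name = next(iter(net.input_info))
    input_shape = net.input_info[input_name].input_data.shape

    # Run inference on dummy data; a real application would pass its own input here.
    dummy_input = np.random.rand(*input_shape).astype(np.float32)
    results = exec_net.infer({input_name: dummy_input})
    print({name: blob.shape for name, blob in results.items()})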

Reducing Model Size for Deployment

A deep learning model contains two portions: weights and activations. The size of the model is primarily dictated by the size of its weights, which are mandatory for the model to function. However, the precision of the weights can be lowered, which substantially reduces the model's size. Beginning with the Intel® Distribution of OpenVINO™ toolkit 2020.2 release, such improvements were introduced to the Intermediate Representation (IR), the internal format that represents the model's weights and activations, for all supported and distributed models.

Floating-point Model Representation

We observed that lowering the precision of model weights from float32 to float16 does not affect model accuracy. During inference, such weights can be cast back to the float32 data type on hardware that does not support float16 (e.g., 3rd generation Intel® Xeon® Scalable processors). Using this method, we can cut the model size in half without any visible impact on performance or accuracy.
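
Conceptually, the saving comes from how the weights are stored on disk versus how they are consumed at load time. The numpy sketch below only illustrates that idea (it is not the toolkit's internal code): weights are written in float16 and cast back to float32 when the target hardware computes in float32.

    import numpy as np

    # Illustrative weight tensor; a real model contains many such tensors.
    weights_fp32 = np.random.randn(256, 256).astype(np.float32)

    # Storing the weights in half precision halves their on-disk size.
    weights_fp16 = weights_fp32.astype(np.float16)
    print("fp32 bytes:", weights_fp32.nbytes, "fp16 bytes:", weights_fp16.nbytes)

    # At load time, hardware that computes in float32 simply casts the weights back.
    weights_restored = weights_fp16.astype(np.float32)

    # The round trip introduces only a small rounding error.
    print("max abs error:", np.abs(weights_fp32 - weights_restored).max())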

Quantized Model Representation

Quantized models can be compressed at a much higher rate. For example, quantizing weights to int8 precision allows a potential reduction of the model size by a factor of four; however, it requires additional modifications to the model representation, and these modifications were introduced in the latest release of the Intel® Distribution of OpenVINO™ toolkit.

As we wrote in another blog, "Enhanced Low-Precision Pipeline to Accelerate Inference with the Intel® Distribution of OpenVINO™ toolkit," the Intel® Distribution of OpenVINO™ toolkit represents quantized models in the Intermediate Representation format using the FakeQuantize operation.


Fig.1. An example of the quantized compressed model

This operation is expressive enough to map values between arbitrary input and output ranges. This means that, while producing the quantized model, we can quantize its weights ahead of time and store them in int8 precision, adjusting the operation's input quantization parameters accordingly. This mechanism has been implemented in the OpenVINO™ toolkit's Model Optimizer and Post-training Optimization Tool.

For backward compatibility, we inserted a “Convert” operation that performs the transformation of the weights back to floating-point precision at model load time. Combining the compression of quantized weights to int8 precision with the storage of other parameters in float16 allows us to achieve the aforementioned 4x model size reduction.
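
The numpy sketch below illustrates the idea behind this representation (it is not the exact transformation the Model Optimizer and Post-training Optimization Tool perform): weights are quantized to int8 once and stored, and a Convert-plus-scale step restores floating-point values at load time.

    import numpy as np

    levels = 256  # int8 quantization uses 256 levels
    weights_fp32 = np.random.randn(64, 64).astype(np.float32)

    # Pick a symmetric quantization range from the data; real tools may use
    # per-channel ranges and statistics collected during calibration.
    max_abs = np.abs(weights_fp32).max()
    scale = 2 * max_abs / (levels - 1)

    # Quantize once, offline, and store int8 values in the IR: 4x smaller than float32.
    weights_int8 = np.clip(np.round(weights_fp32 / scale), -128, 127).astype(np.int8)
    print("fp32 bytes:", weights_fp32.nbytes, "int8 bytes:", weights_int8.nbytes)

    # At load time, the "Convert" step restores floating-point weights, with the
    # FakeQuantize input parameters adjusted to account for the scale.
    weights_dequantized = weights_int8.astype(np.float32) * scale
    print("max abs error:", np.abs(weights_fp32 - weights_dequantized).max())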


Table 1: Comparison of float32 vs. compressed model sizes. (* Some layers of the model are kept in floating-point precision.)

Reductions in Runtime Components

While the model size is somewhat controllable during design, the runtime libraries are something you get from the distribution and have to pack into your application distributable as-is. Hence, we have been reducing the size of those libraries with our regular releases of the Intel® Distribution of OpenVINO™ toolkit.

One year ago, the OpenVINO™ toolkit's Inference Engine consisted of a single shared library that included all of the functionality, even though some of the building blocks may go unused in particular scenarios. For that reason, the library was split into multiple smaller libraries, each representing a dedicated building block that you can choose to use or not.


Fig.2. A simplified diagram of the Inference Engine runtime dependencies. Solid arrows represent strong (link-time) dependencies, while dotted arrows represent runtime dependencies. Blue ellipses mark mandatory components, orange ellipses mark runtime-loadable components, gray ellipses mark components whose usage depends on the use case, and white ellipses mark third-party dependencies.

  1. The Inference Engine library provides the core runtime functionality of the Inference Engine.
  2. The Inference Engine Transformations library contains optimization passes for the CNN graph.
  3. The Inference Engine Legacy library contains the old network representation and the compatibility code needed to convert from the new nGraph-based representation.
  4. The Inference Engine IR and ONNX readers are plugins that the Inference Engine Core library loads at runtime when the user passes IR or ONNX files.
  5. The Inference Engine Preprocessing library is now a plugin that the Inference Engine plugins load at runtime if the user configures preprocessing (e.g., color conversion or a resize algorithm); a sketch of such a configuration follows this list. Otherwise, the library is not needed and can be skipped when creating the deployment package.
  6. The Inference Engine Low Precision Transformations library is linked directly into plugins that support the int8 data type. For example, the FPGA, MYRIAD, and GNA plugins do not support the int8 flow and do not have to be linked against this library.
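
To illustrate point 5, the sketch below shows the kind of preprocessing configuration that pulls the Preprocessing library in at runtime; the model files are placeholders, and the exact Python API may differ slightly between releases. If your application resizes inputs and converts color formats itself and never sets these options, the Preprocessing library can be left out of the deployment package.

    from openvino.inference_engine import IECore, ResizeAlgorithm

    ie = IECore()
    net = ie.read_network(model="model.xml", weights="model.bin")  # placeholder IR files
    input_name = next(iter(net.input_info))

    # Asking the Inference Engine to resize inputs (and, similarly, to convert
    # color formats) means the Preprocessing plugin is loaded at runtime, so it
    # must be present in the deployment package.
    net.input_info[input_name].preprocess_info.resize_algorithm = ResizeAlgorithm.RESIZE_BILINEAR

    exec_net = ie.load_network(network=net, device_name="CPU")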

To execute the inference procedure optimally, the Inference Engine library implements basic threading routines as well as complex task schedulers using the TBB library under the hood. By default, the TBB binaries contain debug information, which adds extra overhead to the total Inference Engine runtime size, but that debug information is not needed for productized applications. Stripped TBB binaries have been included in the Intel® Distribution of OpenVINO™ toolkit starting from the 2020.2 release, so the TBB library now takes only 0.39 MB, roughly a 5x reduction compared to the original 2.12 MB with debug symbols.

The nGraph library is a key component of the OpenVINO™ toolkit and is responsible for model representation. This component makes it possible to create or modify networks at runtime. In the 2020.4 release, we separated the ONNX importer from the nGraph library, which resulted in a further size reduction.

The CPU plug-in (also known as the MKL-DNN plug-in) library is another key component of the OpenVINO™ toolkit and is responsible for model inference on Intel CPUs. It contains target-specific graph optimizations, layer implementations, threading, and memory allocation logic. For several compute-intensive routines, such as Convolution and FullyConnected, the CPU plug-in uses a fork of the oneDNN library as a third-party component, which is statically linked into the main CPU plug-in library. Another common operation in DL inference workloads is matrix multiplication, which was supported via a second third-party component, the Intel® Math Kernel Library (Intel® MKL). Intel MKL is fairly large, since it covers a wide range of mathematical problems. To save disk space, instead of using the full version of the MKL libraries, the OpenVINO™ toolkit redistributed custom dynamic libraries (called "mkltiny") with a reduced list of functions, built from the official functionality. Despite this, the Intel MKL dependency still took up a significant portion of the distribution (see Table 2).

In the 2020.2 release, the oneDNN fork was migrated to version 0.21.3 of the original repository. This version includes optimizations for the sgemm routine that allow us to achieve performance comparable to Intel MKL. In addition, several optimizations were implemented inside the plug-in, which finally allowed us to drop the Intel MKL dependency and rely fully on the plug-in and the oneDNN fork. As a result, we achieved roughly a 1.8x binary size reduction for the libraries responsible for CPU inference (see Table 2), while keeping the same functionality and the same (or sometimes even better) performance for all workloads in our validation and testing. However, to make sure this does not degrade a specific user's scenario, we provide the cmake option GEMM=MKL, which allows users to build the CPU plug-in from sources with the Intel MKL dependency.

Library name      OV 2020.1 release (MB)    OV 2021.1 release (MB)
mkltiny           29.1                      -
MKLDNNPlugin      25.0                      31.2

Table 2: CPU-specific runtime library sizes (Ubuntu 16.04, gcc 5.5.0)

 

To summarize, the table below outlines the minimal runtime sizes for several target devices (see Table 3):

Target device                 Inference Engine runtime (MB)
CPU                           41.73
GPU (w/o OpenCL library)      17.86
MYRIAD (w/o USB library)      13.41

Table 3: Minimal set of Inference Engine runtime libraries per target device, without the preprocessing library (Ubuntu 16.04, gcc 5.5.0)
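
A quick way to check how close your own package comes to these numbers is simply to sum the size of the files you plan to redistribute. The sketch below does that for a deployment directory; the path is a placeholder.

    import os

    def dir_size_mb(path):
        """Return the total size of all files under `path`, in megabytes."""
        total = 0
        for root, _dirs, files in os.walk(path):
            for name in files:
                total += os.path.getsize(os.path.join(root, name))
        return total / (1024 * 1024)

    # Placeholder: the directory produced by the Deployment Manager or assembled by hand.
    print(f"Deployment package size: {dir_size_mb('./deployment_package'):.2f} MB")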

Migration to Microsoft Visual Studio C++ compiler on Windows

For historical reasons, the Intel® C++ Compiler was used to compile OpenVINO binaries on the Microsoft Windows platform. While it gives good performance out of the box, it also increases binary size due to loop unrolling, multiple copies of the same code specialized for different input parameters, and so on. Moving to the Microsoft Visual C++ compiler required rewriting a few code fragments to keep the same performance level while reducing binary size. Since the 2021.1 release, OpenVINO binaries on Windows are compiled with the Microsoft Visual C++ compiler, which gives an additional 2.5x decrease in runtime size.

Custom Compiled Runtimes

While the OpenVINO™ toolkit distribution contains all the needed components, verified on target platforms, you can also build the runtime libraries from sources yourself. The open-source version allows you to compile the runtime with options that reduce its size even further.

You can use the ENABLE_LTO option, which enables Link Time Optimization (LTO) when building the Inference Engine on Unix-based systems. The default value is off because linking the runtime binaries takes significantly longer, but the OpenVINO™ toolkit release packages are built with this option enabled to minimize runtime size. In general, LTO gives about a 24% decrease in runtime size for the components where it is enabled.
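
As a rough sketch of such a custom build (the build directory layout and cmake invocation are assumptions for illustration; consult the repository's build documentation for your release), enabling LTO during the configure step might look like this:

    import subprocess

    # Fetch the open-source OpenVINO sources.
    subprocess.run(["git", "clone", "--recursive",
                    "https://github.com/openvinotoolkit/openvino.git"], check=True)

    # Configure an out-of-source build with Link Time Optimization enabled.
    # Other options discussed earlier, such as GEMM=MKL, are passed the same way.
    subprocess.run(["cmake", "-S", "openvino", "-B", "openvino-build",
                    "-DCMAKE_BUILD_TYPE=Release", "-DENABLE_LTO=ON"], check=True)

    # Build the runtime libraries.
    subprocess.run(["cmake", "--build", "openvino-build", "--parallel"], check=True)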

Conclusion

Advances in AI and Deep Learning make it possible to solve complicated tasks. However, the addition of AI functionality comes at a cost, not only in terms of the data science work needed to produce models, but also in increased processing requirements during inference and a larger application footprint, since the model must be redistributed along with the runtime binaries. In addition, depending on your use case, inference can require a substantial amount of memory to execute. In this blog, we shared how we not only improved performance but also reduced the footprint of the final application to make it easier to redistribute. Get the Intel® Distribution of OpenVINO™ toolkit today and start deploying high-performance, deep learning applications with write-once, deploy-anywhere efficiency. If you have ideas on how we can improve the product, we welcome contributions to the open-sourced OpenVINO™ toolkit. Finally, join the conversation to discuss all things Deep Learning and OpenVINO™ toolkit in our community forum.

 

Notices & Disclaimers

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.  

Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions.  Any change to any of those factors may cause the results to vary.  You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.   For more complete information visit www.intel.com/benchmarks.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available ​updates.  See backup for configuration details.  No product or component can be absolutely secure. 

Your costs and results may vary. 

Intel technologies may require enabled hardware, software or service activation.

© Intel Corporation.  Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.  Other names and brands may be claimed as the property of others.  ​

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. 
