
OpenVINO™ Execution Provider + Model Caching = Better First Inference Latency for your ONNX Models


Authors: Devang Aggarwal, N Maajid Khan

Choosing the right type of hardware for deep learning tasks is a critical step in the AI development workflow. Here at Intel® we provide developers, like yourself, with a variety of hardware options to meet your compute requirements. From Intel® CPUs to Intel® GPUs, there is a wide array of hardware platforms available to meet your needs. When it comes to inferencing on different hardware, the little things matter. For example, loading a deep learning model can be a lengthy process and can lead to a bad user experience on application startup.

Are there ways to achieve better first inference latency on such devices?

The short answer is yes, and one way is to address the model loading time. Model loading performs several time-consuming device-specific optimizations and network compilations, which can leave developers seeing a relatively high first inference latency. This problem can be solved through a mechanism called Model Caching. Model Caching addresses the model loading time by caching the compiled model in a cache directory; reusing cached networks can significantly reduce the network load time.

With this feature, if the device specified by LoadNetwork supports import/export network capability, a cached blob is automatically created inside the caching folder during the runtime.

Depending on your device, the total time for loading the network on application startup can be significantly reduced. Note that the very first LoadNetwork call, when the cache is not yet created, takes slightly longer because it must “export” the compiled blob into a cache file. On subsequent runs, however, the application can leverage the cached model by importing it directly at runtime.
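The workflow above can be sketched with the OpenVINO™ Runtime Python API. This is an illustrative example, not code from the article: the model path ("model.xml"), cache directory, and device name are placeholder assumptions, and the `caching_properties` helper is introduced here purely for clarity.

```python
# Hedged sketch of enabling OpenVINO model caching so that compiled
# network blobs are exported on the first run and imported afterwards.
# "model.xml", "./model_cache", and "GPU" are illustrative assumptions.

def caching_properties(cache_dir: str) -> dict:
    # CACHE_DIR is the OpenVINO property that turns on model caching:
    # the first compilation exports a blob into this directory, and
    # subsequent runs import it instead of recompiling.
    return {"CACHE_DIR": cache_dir}

def main():
    # Import kept local so the helper above remains usable even
    # where the `openvino` package is not installed.
    from openvino.runtime import Core

    core = Core()
    core.set_property(caching_properties("./model_cache"))

    # First call: compiles the network and writes the cached blob.
    # Later application startups: loads the blob directly, which is
    # where the first-inference-latency savings come from.
    compiled_model = core.compile_model("model.xml", device_name="GPU")
    return compiled_model

if __name__ == "__main__":
    main()
```

On the first startup the cache directory is populated; every startup after that skips the device-specific compilation step described above.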

Figure 1: Model Caching Workflow with OpenVINO™ Toolkit


Developers can now leverage model caching through the OpenVINO™ Execution Provider for ONNX Runtime, a product that accelerates inferencing of ONNX models using the ONNX Runtime APIs while using the OpenVINO™ toolkit as a backend. With the OpenVINO Execution Provider, ONNX Runtime delivers better inferencing performance on the same hardware compared to generic acceleration on Intel® CPU, GPU, and VPU.

Model Caching

The OpenVINO Execution Provider for ONNX Runtime speeds up model loading time on Intel® iGPU with the help of cl_cache, a model caching API. The cl_cache API caches binary representations of OpenCL kernels provided in text form by the application. Because the binary representations are stored, compilation is only required the first time, which improves performance. With this API, users can save and load cl_cache files directly; these files can then be loaded onto Intel® iGPU hardware as a device target for inferencing. Additionally, for Intel® Movidius™ Myriad™ X (VPU), the OpenVINO Execution Provider uses model caching with compiled blobs to speed up model loading time. Learn more about how to use the model caching feature with the OpenVINO Execution Provider here.
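A minimal sketch of using the OpenVINO Execution Provider from ONNX Runtime with caching enabled might look like the following. This assumes an ONNX Runtime build that includes the OpenVINO Execution Provider; the model path, cache directory, input shape, and the `provider_config` helper are illustrative assumptions, and the `device_type`/`cache_dir` option names follow the OpenVINO Execution Provider documentation.

```python
# Hedged sketch: running an ONNX model through the OpenVINO Execution
# Provider with a cache directory configured. "model.onnx",
# "./ov_cache", and the (1, 3, 224, 224) input shape are placeholders.

def provider_config(device_type: str, cache_dir: str):
    # ONNX Runtime accepts parallel lists: one provider name and one
    # matching dict of provider-specific options.
    providers = ["OpenVINOExecutionProvider"]
    provider_options = [{"device_type": device_type, "cache_dir": cache_dir}]
    return providers, provider_options

def main():
    # Local imports: these require numpy and an ONNX Runtime build
    # with the OpenVINO Execution Provider enabled.
    import numpy as np
    import onnxruntime as ort

    providers, provider_options = provider_config("GPU_FP32", "./ov_cache")
    session = ort.InferenceSession(
        "model.onnx",
        providers=providers,
        provider_options=provider_options,
    )

    # The first session run compiles the model and populates the cache
    # directory; later application startups import the cached artifacts,
    # reducing first inference latency.
    input_name = session.get_inputs()[0].name
    dummy = np.zeros((1, 3, 224, 224), dtype=np.float32)
    session.run(None, {input_name: dummy})

if __name__ == "__main__":
    main()
```

The same pattern applies across device targets: only the `device_type` value (for example a CPU or VPU variant) would change, while the caching behavior stays the same.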


With the help of the OpenVINO™ Execution Provider, AI developers can see a significant boost in performance for their deep learning models on Intel® hardware. Additionally, with the model caching mechanism, the OpenVINO™ Execution Provider can further reduce the first inference latency of deep learning models on Intel® CPU, iGPU, and Intel® Movidius™ Myriad™ X (VPU).

Additional Resources

OpenVINO Execution Provider Homepage
OpenVINO Execution Provider Model Caching Feature
OpenVINO Execution Provider PyPi
OpenVINO Execution Provider Docker Hub Image

Notices & Disclaimers

Performance varies by use, configuration and other factors. Learn more at .
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.

Your costs and results may vary.

Intel technologies may require enabled hardware, software or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

About the Author
Devang is an AI Product Manager at Intel. He is part of Internet of Things (IoT) Group, where his focus is driving OpenVINO™ Toolkit integrations into popular AI Frameworks like ONNX Runtime. He also works with CSPs to enable cloud developers to seamlessly go from cloud to edge.