ExecuTorch with OpenVINO Backend in 2026: New Capabilities and Updates

Stephanie_Maluso · 04-06-2026

Authors: Yamini Nimmagadda, Daniil Lyakhov, Surya Siddharth Pemmaraju, Samet Akcay, Dmitriy Pastushekov, Aamir Nazir, Mustafa Cavus

In our previous blog, we explored how the OpenVINO™ backend for ExecuTorch enables seamless inference across Intel’s heterogeneous architecture—CPU, GPU, and NPU. Since then, the ecosystem has evolved rapidly. We are excited to share the latest updates that make deploying high-performance AI on Intel-powered AI PCs and robotics platforms more robust and versatile than ever.

Hardening the Foundation: Systematic Validation and Enterprise Deployment

Reliability and seamless integration are the bedrocks of any AI framework. We have unified our core infrastructure updates to ensure that the OpenVINO™ backend is not only fast but also rock-solid for production environments.

Native Operator Validation at Scale: We have reached a significant milestone by enabling hundreds of native Op Tests and several models from torchvision and torchaudio directly within the upstream ExecuTorch repository. By leveraging the official ExecuTorch operator test suite, we ensure bit-exact functional parity for ATen operators offloaded to the OpenVINO backend. This rigorous validation spans fundamental tensor arithmetic to sophisticated non-linear activations, guaranteeing that the lowering process maintains strict mathematical consistency across all offloaded ATen operators.
CI-Driven Numerical Integrity: These tests are now integrated into the primary CI/CD workflow, creating a high-fidelity feedback loop. Any numerical regression or unexpected kernel behavior is caught at the PR stage, so that as both the ExecuTorch core and OpenVINO evolve, your model’s integrity remains uncompromised.
ABI Compatibility: To simplify the developer experience, we have transitioned our backend implementation to the OpenVINO C-API. This architectural shift significantly improves ABI (Application Binary Interface) compatibility, allowing ExecuTorch to be linked into complex C++ application environments without the friction of fragile dependency chains or symbol conflicts. Please refer to this PR for more details.

Advanced Model Compression via Neural Network Compression Framework

Efficiency on the AI PC is no longer just about basic quantization; it’s about sophisticated, data-aware compression. We have integrated the Model Compression API, powered by the Neural Network Compression Framework (NNCF), directly into the ExecuTorch workflow via the quantize_pt2e and compress_pt2e paths.

Unlike standard workflows, NNCF embeds a calibration loop inside the quantization pipeline. This allows each algorithm to refine the model using real data before the next optimization step. Key compression algorithms now supported include:

Data-Free & Activation-Aware Compression: We support AWQ (Activation-aware Weight Quantization) to find optimal per-channel scales based on activation distributions, and data-free modes that rely solely on pretrained weights.
Mixed Precision & Sensitivity Analysis: By assigning different bit-widths (e.g., INT4/INT8) to individual layers based on their sensitivity, we maximize compression while protecting accuracy. We utilize metrics like Hessian, Mean/Max Variance, and Mean Magnitude for these precision assignments.
SmoothQuant & BiasCorrection: To minimize precision loss in Transformers and CNNs, we use SmoothQuant to migrate quantization difficulty from activations to weights and BiasCorrection to align quantized output distributions with the original float model.
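
As a rough illustration of the SmoothQuant idea in the list above, the following pure-Python sketch rescales a toy activation matrix and weight matrix per input channel. It is not NNCF's implementation: the function names, the toy values, and the choice of alpha = 0.5 are assumptions made for illustration only.

```python
# Minimal SmoothQuant-style sketch (illustrative only, not NNCF's code).
# For Y = X @ W, each input channel j gets a scale
#   s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha)
# and we use X' = X / s and W' = s * W, so that X' @ W' equals X @ W.

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def smooth_scales(x, w, alpha=0.5):
    """Compute one smoothing scale per input channel."""
    scales = []
    for j in range(len(w)):
        act_max = max(abs(row[j]) for row in x)
        w_max = max(abs(v) for v in w[j])
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    return scales

def apply_smoothing(x, w, scales):
    """Divide activation columns and multiply weight rows by the scales."""
    x_s = [[v / scales[j] for j, v in enumerate(row)] for row in x]
    w_s = [[v * scales[j] for v in w[j]] for j in range(len(w))]
    return x_s, w_s

# Toy data: activation channel 1 has a much larger range than channel 0.
X = [[0.5, 40.0], [-1.0, -64.0]]
W = [[0.2, -0.4], [0.01, 0.04]]

scales = smooth_scales(X, W)
X_s, W_s = apply_smoothing(X, W, scales)
```

After smoothing, the outlier activation channel has a much smaller range while the matrix product is unchanged, which is what makes the activations easier to quantize.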

Figure 1 illustrates the NNCF Compression API workflow for ExecuTorch, showing how captured FX Graphs are transformed into optimized artifacts via two distinct paths: quantize_pt2e for SmoothQuant and Bias Correction, and compress_pt2e for advanced weight-reduction techniques like Mixed Precision, Scale Estimation, and AWQ.

Figure 1: NNCF Compression API scheme: quantize_pt2e / compress_pt2e
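
One way to picture the mixed-precision step is as a ranking problem: layers are scored for sensitivity, and only the least sensitive ones drop to the lower bit-width. The sketch below is a hypothetical, simplified stand-in. The layer names, sensitivity scores, and the `int4_ratio` parameter are invented for illustration, and NNCF's actual selection logic is more involved than counting layers.

```python
def assign_bit_widths(sensitivity, int4_ratio=0.75):
    """Give the least sensitive layers INT4 and keep the rest at INT8.

    sensitivity: dict mapping layer name -> score (higher = more sensitive).
    int4_ratio: fraction of layers allowed to drop to INT4 (hypothetical knob).
    """
    ranked = sorted(sensitivity, key=sensitivity.get)  # least sensitive first
    n_int4 = int(len(ranked) * int4_ratio)
    return {name: ("int4" if i < n_int4 else "int8")
            for i, name in enumerate(ranked)}

# Made-up scores; in practice these would come from a metric such as
# Hessian or mean magnitude computed on calibration data.
scores = {
    "attn.q_proj": 0.02,
    "attn.k_proj": 0.03,
    "mlp.up_proj": 0.10,
    "lm_head": 0.90,
}
plan = assign_bit_widths(scores, int4_ratio=0.75)
```

With these toy scores, the most sensitive layer stays at INT8 while the other three are assigned INT4.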

Powering the Future: New Models Enabled

Alongside core backend hardening and compression improvements, we have expanded model coverage with end-to-end examples that showcase both vision and generative workloads running efficiently on Intel AI PCs across CPU, GPU, and NPU.

Advanced Object Detection using Ultralytics YOLO26

YOLO26 continues the YOLO family’s focus on real-time object detection, offering an improved balance between accuracy, latency, and model scalability. While training and experimentation typically happen in PyTorch, production deployment, especially on edge devices and client platforms, requires a runtime that is portable, efficient, and hardware-aware.

YOLO26 is exported using ExecuTorch’s AOT pipeline. During export, the model graph is analyzed and partitioned, with supported operators lowered to OpenVINO. OpenVINO then applies graph compilation and hardware-specific optimizations. At runtime, applications load the exported ‘.pte’ model, prepare input tensors such as images or video frames, execute inference through ExecuTorch, and retrieve detection outputs including bounding boxes, class IDs, and confidence scores.
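
The last step of that runtime flow, turning raw outputs into usable detections, can be sketched in plain Python. The row layout used here, `[x1, y1, x2, y2, score, class_id]`, is an assumption for illustration; the actual output layout of an exported model depends on how it was exported, and the `Detection` type and `parse_detections` helper are hypothetical names.

```python
from typing import NamedTuple

class Detection(NamedTuple):
    box: tuple      # (x1, y1, x2, y2) in pixels
    score: float    # confidence score
    class_id: int   # index into the class-name list

def parse_detections(rows, score_threshold=0.25):
    """Convert raw rows into Detection records, dropping low-confidence ones."""
    detections = []
    for x1, y1, x2, y2, score, class_id in rows:
        if score >= score_threshold:
            detections.append(
                Detection((x1, y1, x2, y2), float(score), int(class_id))
            )
    # Highest-confidence detections first.
    return sorted(detections, key=lambda d: d.score, reverse=True)

# Made-up rows standing in for a model's output tensor.
raw = [
    [10.0, 20.0, 110.0, 220.0, 0.91, 0.0],
    [15.0, 25.0, 100.0, 200.0, 0.12, 0.0],
    [300.0, 40.0, 380.0, 120.0, 0.55, 2.0],
]
kept = parse_detections(raw)
```

The threshold of 0.25 is only an example value; an application would tune it for its own precision and recall needs.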

In the ExecuTorch GitHub repository, there is a demo available that describes, through simple steps, how to build ExecuTorch with the OpenVINO backend, export and optimize a model, and run it on the CPU, GPU, or NPU. Please follow these instructions to try YOLO26 + ExecuTorch yourself on Intel hardware with OpenVINO.

Figure 2: Object Detection using YOLO26

Fast Image Generation with Quantized Stable Diffusion

In our previous blog, we demonstrated image generation using the FP16 SimianLuo/LCM_Dreamshaper_v7 model. We now apply INT8 quantization with a hybrid scheme: weight and activation quantization for the UNet, and weight-only quantization for the text encoder and VAE. This strategy reflects the computational characteristics of the model. The UNet accounts for most of the inference workload and benefits most from INT8 weight and activation quantization, whereas the text encoder and VAE are more sensitive to static activation quantization, so weight-only quantization preserves their output quality. On an Intel® Core™ Ultra 7 356H CPU with the ExecuTorch OpenVINO backend, this quantization reduced inference latency by ~1.5x, model load time by ~1.3x, and model size by ~1.7x. Qualitative evaluation across diverse prompts indicated that INT8 outputs remain visually consistent with the FP16 originals, suggesting that LCM-based diffusion architectures can be quantized aggressively with NNCF with minimal perceptual degradation.

Figure 3: FP16 vs INT8 outputs for SimianLuo/LCM_Dreamshaper_v7, showing minimal visual difference with faster inference and smaller model size
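
To make the weight-only INT8 path more concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization and its round-trip error. It is a simplification under stated assumptions: production pipelines typically use per-channel scales, the helper names are hypothetical, and the toy weights are made up.

```python
def quantize_int8_symmetric(weights):
    """Per-tensor symmetric INT8: q = round(w / scale), scale = max|w| / 127."""
    # Fall back to a scale of 1.0 if every weight is zero.
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map INT8 codes back to approximate float weights."""
    return [v * scale for v in q]

weights = [0.8, -1.27, 0.003, 0.41, -0.65]  # toy values
q, scale = quantize_int8_symmetric(weights)
restored = dequantize(q, scale)

# Round-to-nearest bounds the per-weight error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in one byte instead of two, at the cost of an error of at most half a quantization step per weight.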

Deploying Qwen2.5-1.5B on Intel AI PCs

Qwen2.5 is an open-weight language model family spanning sizes from 0.5B to 72B parameters and designed for tasks ranging from conversational AI to code generation. The 1.5B variant strikes a practical balance between capability and footprint, making it a strong candidate for deployment directly on client hardware.

ExecuTorch with the OpenVINO backend brings Qwen2.5-1.5B to Intel AI PCs, with support for INT4 weight compression via NNCF's compress_pt2e API to further reduce memory usage without significant accuracy loss. This enables smooth, low-latency text generation across Intel CPU, GPU, and NPU.
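
A back-of-the-envelope calculation shows why INT4 weight compression matters at this model size. The numbers below are assumptions for illustration, not measurements from this post: a nominal 1.5e9 parameters, every weight compressed to 4 bits, and one FP16 scale per group of 128 weights. Real exports usually keep some layers at higher precision, so actual sizes will differ.

```python
def weight_bytes(num_params, bits_per_weight, group_size=None, scale_bits=16):
    """Approximate storage for weights plus optional per-group scales."""
    total_bits = num_params * bits_per_weight
    if group_size:
        # One scale value is stored for every `group_size` weights.
        total_bits += (num_params / group_size) * scale_bits
    return total_bits / 8

NUM_PARAMS = 1.5e9  # nominal parameter count (assumption)

fp16_gb = weight_bytes(NUM_PARAMS, 16) / 1e9
int4_gb = weight_bytes(NUM_PARAMS, 4, group_size=128) / 1e9

print(f"FP16: {fp16_gb:.2f} GB, INT4 (group 128): {int4_gb:.2f} GB")
# prints "FP16: 3.00 GB, INT4 (group 128): 0.77 GB"
```

Under these assumptions the weights shrink to roughly a quarter of their FP16 size, with the per-group scales adding only a few percent of overhead.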

Step-by-step instructions for exporting and running Qwen2.5-1.5B on Intel hardware are available here.

Figure 4: Output of the Qwen2.5-1.5B model

Accelerating Robotics with Physical AI Studio

The ExecuTorch + OpenVINO stack is not limited to vision and language workloads. A particularly compelling use case is deploying robotic manipulation policies on Intel-powered edge systems — where real-time inference latency directly determines whether a robot can operate safely and effectively.

Physical AI Studio is an end-to-end framework for training and deploying Vision-Language-Action (VLA) models for robotic imitation learning. It natively integrates with the ExecuTorch + OpenVINO backend, enabling robotics policies trained in PyTorch to be exported and deployed on Intel hardware through a single, unified pipeline.

Supported Policy Architectures: Physical AI Studio provides implementations of state-of-the-art imitation learning policies including ACT (Action Chunking with Transformers), Pi0, Pi0.5, SmolVLA, and GR00T N1. These policies learn complex manipulation behaviors, such as grasping, pushing, and multi-step assembly, from human demonstrations. All of these policies are intended to support ExecuTorch export with the OpenVINO delegate, making it possible to run the full train → export → deploy loop on Intel hardware.

One-Line Export, Any Backend: The export API is designed for simplicity. A trained policy can be exported to ExecuTorch with the OpenVINO delegate in a single call:

from physicalai.policies import ACT

policy = ACT.load_from_checkpoint("checkpoints/model.ckpt")
policy.to_executorch("./exports", delegate="openvino")

Under the hood, this traces the model through torch.export, partitions the computation graph using OpenVINO’s graph partitioner (powered by NNCF), and writes a .pte artifact ready for deployment. The same policy can just as easily be exported to standalone OpenVINO (.xml) or ONNX formats — all through a unified policy.export(path, backend=...) interface.

Unified Inference Across Backends: At deployment time, Physical AI Studio’s InferenceModel automatically detects the exported format and loads the appropriate runtime adapter — OpenVINO, ExecuTorch, or other backends — behind a consistent select_action() API. This means switching from an OpenVINO deployment on a server to an ExecuTorch deployment on an embedded Intel platform requires no code changes beyond pointing to a different export directory:

from physicalai.inference import InferenceModel

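policy = InferenceModel.load("./exports")  # auto-detects .pte or .xml
policy.reset()

# `env` stands in for your robot or simulator interface, and
# `observation` for the initial observation it provides.
done = False
while not done:
    action = policy.select_action(observation)
    observation, reward, done = env.step(action)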

Figure 5: Robot in action with Physical AI Studio
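
The format auto-detection described above can be pictured as a lookup on the exported file's extension. The sketch below is a hypothetical illustration of that idea, not Physical AI Studio's implementation: the `RUNTIME_BY_SUFFIX` mapping, the `detect_runtime` helper, and the runtime names are all assumptions.

```python
import tempfile
from pathlib import Path

# Hypothetical mapping from artifact extension to runtime name.
RUNTIME_BY_SUFFIX = {".pte": "executorch", ".xml": "openvino", ".onnx": "onnx"}

def detect_runtime(export_dir):
    """Return (runtime_name, artifact_path) for the first recognized file."""
    for path in sorted(Path(export_dir).iterdir()):
        runtime = RUNTIME_BY_SUFFIX.get(path.suffix.lower())
        if runtime is not None:
            return runtime, path
    raise FileNotFoundError(f"No supported model artifact in {export_dir}")

# Usage: create a throwaway export directory containing an empty .pte file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "policy.pte").write_bytes(b"")
    runtime, artifact = detect_runtime(tmp)
    detected = (runtime, artifact.name)
```

Keeping the dispatch in one small table is what lets calling code stay the same when the export directory changes.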

Numerical Consistency Across Backends: A critical requirement for robotics is that exported models behave like the original training model, because even a small numerical divergence in action predictions can cause a robot arm to miss a grasp or collide with obstacles. Physical AI Studio’s export pipeline produces numerically consistent outputs across all backends (OpenVINO, ExecuTorch portable, XNNPACK, and OpenVINO delegate), with cross-backend maximum absolute differences near machine epsilon. This is validated end-to-end in the framework’s backend comparison notebook, which trains a policy, exports to every supported backend, and performs pairwise numerical verification.
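
The pairwise verification described above boils down to comparing every pair of backend outputs by maximum absolute difference. Here is a minimal sketch of that check. The action vectors are made up, the helper names are hypothetical, and the tolerance of 1e-6 is an assumed threshold rather than the value used in the notebook.

```python
from itertools import combinations

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))

def pairwise_report(outputs):
    """Map each backend pair to the max absolute difference of their outputs."""
    return {(m, n): max_abs_diff(outputs[m], outputs[n])
            for m, n in combinations(sorted(outputs), 2)}

# Made-up action vectors standing in for real backend outputs.
outputs = {
    "pytorch": [0.12500000, -0.50000000, 0.75000000],
    "openvino": [0.12500001, -0.50000000, 0.74999999],
    "executorch": [0.12500000, -0.50000002, 0.75000000],
}

report = pairwise_report(outputs)
worst_pair, worst_diff = max(report.items(), key=lambda kv: kv[1])

TOLERANCE = 1e-6  # assumed threshold for this sketch
all_close = worst_diff <= TOLERANCE
```

Reporting the worst pair, rather than only a pass or fail flag, makes it easier to see which backend drifts when a regression appears.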

Conclusion

Over the past year, ExecuTorch with the OpenVINO™ backend has matured from a promising deployment path into a more production-ready stack for Intel AI PCs and edge systems. We have strengthened correctness and stability through large-scale native operator validation and ABI compatibility. On the optimization front, bringing NNCF into the quantize_pt2e and compress_pt2e workflows enables practical, data-aware compression—from SmoothQuant and BiasCorrection to mixed precision and AWQ—so models can hit tighter latency and memory budgets without sacrificing quality. And with new end-to-end examples spanning YOLO26, quantized LCM-based Stable Diffusion, Qwen2.5-1.5B (INT4), and robotics policies such as ACT, it’s now easier to take PyTorch models through export, compilation, and deployment across CPU, GPU, and NPU.

To get started, follow the examples and resources below, and consider contributing issues or PRs as you bring more workloads to ExecuTorch + OpenVINO on Intel hardware.

Additional Resources

ExecuTorch OpenVINO backend

ExecuTorch OpenVINO Tutorial

OpenVINO documentation

Convert and Optimize YOLO26 with OpenVINO™ Notebook

Intel® Core™ Ultra Processors (Series 3)

Acknowledgments:

Intel: Maksim Proshin, Maxim Vafin, Muthaiah Venkatachalam, Radwan Ibrahim, Ilia Efimov, Stefanka Kitanovska

Meta: Mergen Nachin, Anthony Shoumikhin, Digant Desai, Andrew Caples

Notices & Disclaimers

Performance varies by use, configuration, and other factors. Learn more on the Performance Index site.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation.

MEIRE · 05-19-2026

Very good