
From Models to Systems: Enabling Heterogeneous AI Inference with Open Orchestration


Author: Kiran Atmakuri, AI Product Manager, Intel Corporation 

 

AI is rapidly shifting from training-centric development to production-scale inference. As that shift accelerates, a big change is becoming clear:

AI inference is no longer just a model problem — it is increasingly a systems problem.

Modern AI applications are becoming more agentic, more multimodal, and more dynamic. They no longer stop at a single model response. Instead, they combine model inference with retrieval, tool execution, data processing, and orchestration. That evolution is driving a new serving paradigm: heterogeneous infrastructure, enabled by open software orchestration.

Today’s AI Agents Are Already Heterogeneous by Design

 

At a high level, agentic workloads have two broad components. The first is token generation, or inference, which typically runs on GPUs. The second is the set of actions around the model, which typically run on CPUs: retrieval, data processing, tool and API calls, code execution, validation, and orchestration.

For example, when a user asks an AI system to analyze quarterly earnings and build an investment strategy, the system may search the web, parse documents, query databases, run Python analysis, and then synthesize the results into a final response. That is not a single inference call. It is a multi-step system workflow.

This is why AI agents should be viewed as heterogeneous systems that naturally span CPUs and GPUs.
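To make that concrete, here is a minimal sketch of how such a workflow is typically structured. The endpoint URL, model name, and tool functions below are illustrative placeholders rather than any particular product's API; the point is simply that only one step in the flow is GPU-bound token generation, while everything else is ordinary CPU-side work.

```python
# Minimal sketch of an agentic workflow: CPU-side orchestration around GPU-hosted
# inference. The endpoint, model name, and tool functions are hypothetical.
import requests

INFERENCE_URL = "http://gpu-pool.example.internal/v1/chat/completions"  # placeholder endpoint

def generate(prompt: str) -> str:
    """Token generation: the only step served from the GPU tier."""
    resp = requests.post(INFERENCE_URL, json={
        "model": "example-llm",
        "messages": [{"role": "user", "content": prompt}],
    })
    return resp.json()["choices"][0]["message"]["content"]

def search_web(query: str) -> list[str]: ...      # CPU: retrieval
def parse_documents(urls: list[str]) -> str: ...  # CPU: data processing
def query_database(sql: str) -> list[dict]: ...   # CPU: tool / API call
def run_python(code: str) -> str: ...             # CPU: sandboxed code execution

def build_investment_strategy(question: str) -> str:
    # Every step below except generate() runs as ordinary CPU work.
    sources = search_web(question)
    filings = parse_documents(sources)
    figures = query_database("SELECT * FROM quarterly_earnings")
    analysis = run_python(f"analyze({figures!r})")
    # The final synthesis is a single inference call at the end of a multi-step workflow.
    return generate(f"Question: {question}\nFilings: {filings}\nAnalysis: {analysis}")
```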

Inference Itself Is Also Becoming Heterogeneous

Beyond the agent workflow, inference itself is becoming disaggregated.

Historically, inference was treated as a single end-to-end workload. Over time, that evolved into a more structured serving model. First, inference was separated into prefill and decode. Now, with multimodal models, it is increasingly separated into encode, prefill, and decode.

This matters because each stage has very different characteristics:

  • Encode converts image, video, or audio inputs into embeddings
  • Prefill processes context and generates the KV cache
  • Decode performs autoregressive token generation


These stages differ in compute intensity, memory behavior, latency sensitivity, and scaling requirements.

Once inference is broken into distinct stages, it creates an important opportunity: match each stage to the hardware best suited for it.

That is the core idea behind heterogeneous infrastructure.
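As a purely conceptual illustration, a stage-aware placement layer might look something like the sketch below. The pool names, the Stage enum, and the plan() function are assumptions made for this example, not any framework's actual interface; the takeaway is that once stages are explicit, each one can be routed to hardware with matching characteristics.

```python
# Conceptual sketch only: mapping disaggregated inference stages to hardware pools.
from dataclasses import dataclass
from enum import Enum

class Stage(Enum):
    ENCODE = "encode"    # multimodal inputs -> embeddings
    PREFILL = "prefill"  # process context, build the KV cache
    DECODE = "decode"    # autoregressive token generation

# Each stage is served by the pool whose characteristics fit it best (names are illustrative).
STAGE_TO_POOL = {
    Stage.ENCODE: "encode-pool",        # bursty and compute-bound, more latency-tolerant
    Stage.PREFILL: "gpu-prefill-pool",  # compute-bound, dominates time to first token
    Stage.DECODE: "gpu-decode-pool",    # memory-bandwidth-bound, dominates per-token latency
}

@dataclass
class Request:
    request_id: str
    has_media: bool

def plan(request: Request) -> list[tuple[Stage, str]]:
    """Return the ordered (stage, pool) placement for one request."""
    stages = [Stage.PREFILL, Stage.DECODE]
    if request.has_media:
        stages.insert(0, Stage.ENCODE)
    return [(stage, STAGE_TO_POOL[stage]) for stage in stages]

print(plan(Request("r1", has_media=True)))  # encode, prefill, decode, each with its pool
```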

Why Heterogeneous Infrastructure Matters

A one-size-fits-all serving model becomes less effective as inference grows more disaggregated.

Different stages of inference require different types of hardware. Some are more compute-intensive. Others are more memory-intensive. Some are best suited to accelerators, while others benefit from CPU-based coordination and control.

By matching the right hardware to the right stage, heterogeneous infrastructure can provide several benefits:

  • Improved utilization of available compute
  • Better latency characteristics, including time to first token (TTFT) and time per output token (TPOT); see the sketch below
  • More scalable deployment patterns
  • Potential TCO advantages over time
  • Greater flexibility in how systems are designed
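For readers less familiar with those two latency metrics, here is a back-of-the-envelope sketch of how they are commonly defined, assuming per-request timestamps are available; the function and variable names are illustrative.

```python
# Rough definitions of the two latency metrics; timestamps are in seconds.
def ttft(request_arrival_s: float, first_token_s: float) -> float:
    """Time To First Token: how long the user waits before output starts.
    Dominated by the prefill stage (plus encode, for multimodal inputs)."""
    return first_token_s - request_arrival_s

def tpot(first_token_s: float, last_token_s: float, num_output_tokens: int) -> float:
    """Time Per Output Token: average inter-token latency during streaming.
    Dominated by the decode stage."""
    return (last_token_s - first_token_s) / max(num_output_tokens - 1, 1)

# Example: first token at 0.8 s, then 256 tokens finishing at t = 9.0 s.
print(ttft(0.0, 0.8))        # 0.8 s TTFT
print(tpot(0.8, 9.0, 256))   # ~0.032 s per token TPOT
```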

In this world, success is less about any single chip and more about how the entire system works together.

The Challenge: Making Disaggregated Inference Practical

The opportunity is clear, but so is the challenge.

Once inference is disaggregated across multiple stages and multiple hardware types, orchestration and scheduling become much more complex. The system has to decide where each stage runs, how requests move between stages, and how the intermediate state is managed efficiently.

This is especially true for KV cache management. In disaggregated inference, the system must not only place compute effectively, but also manage how the KV cache is created, moved, stored, and reused across stages.
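To see why, consider a toy example of KV cache-aware routing: the router prefers the worker that already holds the largest share of a request's KV blocks, balanced against that worker's load, so less cache has to be recomputed or transferred. The scoring below is a simplified illustration, not how Dynamo or any specific framework implements routing.

```python
# Toy sketch of KV cache-aware routing: trade cache reuse against queue depth.
from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    cached_block_hashes: set[str] = field(default_factory=set)
    queue_depth: int = 0

def route(request_block_hashes: list[str], workers: list[Worker]) -> Worker:
    """Score = cached-block overlap minus a load penalty; pick the best worker."""
    def score(worker: Worker) -> float:
        overlap = sum(1 for h in request_block_hashes if h in worker.cached_block_hashes)
        return overlap - 0.5 * worker.queue_depth
    return max(workers, key=score)

workers = [
    Worker("decode-0", {"b1", "b2", "b3"}, queue_depth=4),
    Worker("decode-1", {"b1"}, queue_depth=1),
]
best = route(["b1", "b2", "b3", "b4"], workers)
print(best.name)  # decode-0: three cached blocks outweigh its deeper queue (3 - 2.0 > 1 - 0.5)
```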

This is not something developers want to manage manually at production scale. That is why the software layer becomes critical.

Heterogeneous Inference Serving with NVIDIA Dynamo

To help make heterogeneous inference practical, Intel is contributing to the open-source NVIDIA Dynamo inference framework.

Dynamo delivers system-level performance optimizations across multi-node inference environments. It is composed of modular components that can be combined or deployed independently. It provides a platform for coordinating disaggregated inference stages across heterogeneous infrastructure, helping address the operational complexity of scheduling, routing, and serving inference workflows across CPUs, GPUs, and accelerators.

Key capabilities include:

  • Support for vLLM, SGLang, and NVIDIA TensorRT-LLM inference backends, with serving across heterogeneous hardware environments
  • KV cache-aware request routing and scheduling via Dynamo Router
  • KV cache management across memory and storage tiers via the Dynamo KV Block Manager (KVBM)
  • Low-latency point-to-point KV cache transfer via the NVIDIA NIXL library
  • Topology-aware scaling and gang scheduling in Kubernetes environments via the NVIDIA Grove API
  • Production-ready deployment tools for meeting SLOs via Dynamo AI-Configurator and Dynamo Planner

Together, these capabilities help turn disaggregated inference from an architectural concept into a scalable deployment model for real-world environments.

Intel’s Role: Enabling Heterogeneous Serving in the Open

Intel’s contribution focuses on enabling heterogeneous serving through open source software.

This matters for two reasons:

  • First, it reduces adoption friction. Developers can work within an open framework rather than building custom infrastructure for every deployment.
  • Second, it expands customer choice. Open software makes it easier to mix and match infrastructure based on workload needs, instead of forcing all workloads into a single hardware pattern.

This interoperability is one of the core values of open orchestration.

Use Cases Driving Interest

We are currently evaluating several workload patterns in which heterogeneous infrastructure is particularly relevant.

[Figure: workload patterns under evaluation for heterogeneous serving]

 

Across all of these use cases, the pattern is the same: heterogeneous infrastructure expands the solution space.

Open frameworks like Dynamo, with Intel’s upstream enablement for Intel Xeon processors and Intel Data Center GPU platforms, help reduce deployment friction and make heterogeneous serving more usable in real-world environments.

AI Inference: From Chips to Systems

As AI moves into production, the focus is shifting from individual chips to complete systems. Customers increasingly care less about peak model performance in isolation and more about system-level outcomes such as latency, throughput, memory efficiency, scalability, operational simplicity, and total cost of ownership.

This is why open orchestration, disaggregated serving, and heterogeneous infrastructure are becoming more important. AI agents are already heterogeneous, and now inference itself is becoming heterogeneous as well. As inference evolves from a monolithic execution flow into distinct stages such as encode, prefill, and decode, the opportunity to match the right hardware to the right stage becomes increasingly valuable. That value, however, depends on making heterogeneous inference practical to deploy.

The future of AI inference will be increasingly open, software-orchestrated, and heterogeneous — shaped not by any single chip, but by how effectively the full system works together.