Author: Kiran Atmakuri
Co-Authors: Brian Liu, Suresh Nampalli, Pallavi Jaini, Libin Tang, and Leon Tran
Intel® Gaudi® AI Accelerators Now Support llm-d: Enhancing Inference Efficiency Through Prefill and Decode Decoupling
Introduction
As part of the industry’s shift toward more scalable and efficient LLM inference architectures, Intel® Gaudi® accelerators now support the llm-d stack through explicit decoupling of the Prefill (P) and Decode (D) stages. This decoupling improves consistency in inter-token latency, enables optimized resource allocation, and allows dynamic scalability of P and D nodes according to workload demands. Additionally, we demonstrate support for heterogeneous hardware architectures, enabling deployment of Prefill (P) and Decode (D) stages across different hardware platforms.
llm-d Stack on Intel Gaudi Accelerators
The llm-d stack, as outlined in the llm-d architecture documentation, builds upon the vLLM inference engine and introduces disaggregation of the Prefill (P) and Decode (D) stages. This enables decoupled scheduling and scaling of these stages across diverse hardware resources.
On Intel Gaudi accelerators, llm-d support is achieved by integrating with several key runtime components of the stack (a minimal client-side example follows the list):
- Inference Gateway: Acts as the entry point for incoming requests and handles token-level orchestration for multi-stage inference.
- Smart Scheduling: Dynamically routes prefill and decode requests to the appropriate hardware backends based on availability, queue depth, and resource profiles.
- Endpoint Picker: Selects the target Gaudi vLLM container for P or D execution, enabling workload distribution across nodes.
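To make the request path concrete, the sketch below sends a single streaming request to the Inference Gateway. The gateway URL, model name, and prompt are placeholders for illustration; the only assumption is that the gateway exposes an OpenAI-compatible completions endpoint, as vLLM-based deployments typically do, and that routing to Prefill and Decode backends happens entirely behind it.

```python
import requests

# Hypothetical gateway address; in a real llm-d deployment this would be the
# Inference Gateway service exposed by the cluster.
GATEWAY_URL = "http://llm-d-gateway.example.local/v1/completions"

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    "prompt": "Summarize the benefits of prefill/decode disaggregation.",
    "max_tokens": 128,
    "stream": True,  # streaming lets the client observe per-token (decode) latency
}

# The gateway decides, per request, which Prefill and Decode backends serve it;
# the client only ever talks to this single endpoint.
with requests.post(GATEWAY_URL, json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            print(line.decode("utf-8"))
```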
The Gaudi vLLM container itself builds on the open-source vLLM engine, with support for:
- LMCache: A memory-optimized key-value (KV) cache layer used on Gaudi to persist the KV cache computed during Prefill and reuse it during Decode. This is critical for avoiding redundant computation and reducing latency in token generation.
- KV Connector: Enables cross-node communication of KV cache state between the Prefill and Decode stages, supporting both intra- and inter-generation caching when running on multiple Gaudi devices or in hybrid configurations (a simplified handoff sketch follows this list).
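To illustrate why the persisted KV cache matters, here is a minimal, self-contained sketch of the Prefill-to-Decode handoff. It is not the LMCache or KV Connector API; the array shapes, helper functions, and the in-process "transfer" are stand-ins that only show the contract: Prefill produces the prompt's KV cache once, and Decode extends it token by token instead of recomputing the prompt.

```python
import numpy as np

# Toy dimensions for illustration only.
NUM_LAYERS, NUM_HEADS, HEAD_DIM = 4, 8, 64

def prefill(prompt_tokens):
    """Stand-in for the Prefill stage: produce one (keys, values) pair per layer
    covering the full prompt. A real Prefill worker computes these with the model
    on the accelerator; here we just fabricate arrays of the right shape."""
    seq_len = len(prompt_tokens)
    kv_cache = {
        layer: (
            np.zeros((seq_len, NUM_HEADS, HEAD_DIM), dtype=np.float16),  # keys
            np.zeros((seq_len, NUM_HEADS, HEAD_DIM), dtype=np.float16),  # values
        )
        for layer in range(NUM_LAYERS)
    }
    first_token = "<token-0>"  # Prefill also emits the first output token
    return first_token, kv_cache

def decode(first_token, kv_cache, max_new_tokens=4):
    """Stand-in for the Decode stage: consume the transferred KV cache and append
    one new KV entry per generated token instead of re-running the prompt."""
    generated = [first_token]
    for step in range(1, max_new_tokens):
        for layer, (keys, values) in kv_cache.items():
            new_k = np.zeros((1, NUM_HEADS, HEAD_DIM), dtype=np.float16)
            new_v = np.zeros((1, NUM_HEADS, HEAD_DIM), dtype=np.float16)
            kv_cache[layer] = (np.concatenate([keys, new_k]),
                               np.concatenate([values, new_v]))
        generated.append(f"<token-{step}>")
    return generated

# In llm-d, the KV Connector ships the cache between nodes; here the "transfer"
# is simply a function return value.
token, cache = prefill(list(range(59)))
print(decode(token, cache))
```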
By integrating with these components, Intel Gaudi accelerators can now leverage the strengths of the llm-d and vLLM stacks, combining the dynamic request orchestration of llm-d with the efficient execution and memory management provided by vLLM.
Here is an illustration of the llm-d stack on Intel Gaudi accelerators:
Prefill and Decode Decoupling
To validate the core concept of distributed inference via llm-d, we used the Llama 3.3 70B and Llama 3.1 8B models to compare inter-token latency (ITL) behavior between traditional coupled inference and a decoupled setup.
In the baseline (non-llm-d) configuration, Prefill and Decode stages are executed sequentially on the same hardware. In contrast, the llm-d approach separates these stages and routes them independently using the llm-d stack.
Our findings show that with decoupling:
- ITL values were lower and more consistent across different queries-per-second (QPS) levels.
- Decode behavior was more predictable, especially under concurrent request loads.
Figure 1. ITL improvements with llm-d Prefill and Decode disaggregation for Llama 3.3 70B. Lower is better.
Figure 2. ITL improvements with llm-d Prefill and Decode disaggregation for Llama 3.1 8B. Lower is better.
Figures 1 and 2 show that the ITL in the case of traditional inference (orange line) is inconsistent across varying query loads, while the llm-d implementation (blue line) provides more uniform latency. These results confirm that distributed inference using llm-d helps maintain predictable ITL, significantly reducing variability caused by shared compute contention between Prefill and Decode phases.
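For reference, ITL is measured on the client side from per-token arrival times in the streamed response. The short sketch below shows one way to compute mean ITL and its spread; the timestamps are made-up values for illustration only.

```python
from statistics import mean, stdev

# Hypothetical arrival times (seconds) of streamed output tokens for one request.
token_arrival_times = [0.00, 0.035, 0.071, 0.104, 0.180, 0.212, 0.247]

# Inter-token latency: gap between consecutive token arrivals.
itl = [t1 - t0 for t0, t1 in zip(token_arrival_times, token_arrival_times[1:])]

print(f"mean ITL:  {mean(itl) * 1000:.1f} ms")
print(f"ITL stdev: {stdev(itl) * 1000:.1f} ms")  # spread reflects (in)consistency
```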
Adaptive Scheduling and Scaling of Prefill and Decode Nodes
We used the Llama 3.1 70B model to explore how Prefill and Decode nodes can be independently scheduled and scaled within the llm-d architecture on Intel Gaudi accelerators. Decoupling these stages enables fine-grained control over compute allocation based on workload characteristics and runtime conditions.
In our setup:
- Prefill (P) nodes scaled dynamically based on prompt size, token count, and caching behavior.
- Decode (D) nodes scaled based on the number of concurrent decode streams and the target response latency.
This flexibility allows Gaudi accelerators to handle heterogeneous load patterns. For example, long-context prompts can trigger scaling of Prefill independently, without impacting low-latency Decode operations.
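As a rough illustration of such a policy (not the actual llm-d autoscaler), the sketch below derives Prefill and Decode replica counts from pending prompt tokens, concurrent decode streams, and a latency budget. The per-node capacity constants and thresholds are arbitrary placeholders; real values depend on the model, batch size, and Gaudi generation in use.

```python
from dataclasses import dataclass
import math

@dataclass
class WorkloadSnapshot:
    pending_prompt_tokens: int   # tokens waiting for Prefill
    active_decode_streams: int   # requests currently generating tokens
    target_itl_ms: float         # latency budget for Decode

# Placeholder per-node capacities; assumptions for illustration only.
PREFILL_TOKENS_PER_NODE = 64_000
DECODE_STREAMS_PER_NODE = 64

def desired_replicas(w: WorkloadSnapshot) -> tuple[int, int]:
    """Return (prefill_nodes, decode_nodes) for the current workload."""
    prefill_nodes = max(1, math.ceil(w.pending_prompt_tokens / PREFILL_TOKENS_PER_NODE))
    decode_nodes = max(1, math.ceil(w.active_decode_streams / DECODE_STREAMS_PER_NODE))
    # Add Decode headroom when the latency budget is tight.
    if w.target_itl_ms < 30:
        decode_nodes += 1
    return prefill_nodes, decode_nodes

# Example: 256 concurrent requests with 8K-token prompts and a 25 ms ITL target.
print(desired_replicas(WorkloadSnapshot(256 * 8192, 256, 25)))  # -> (33, 5)
```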
A key advantage of llm-d is its stage-level scheduling abstraction, which enables intelligent job placement. We leveraged this to route requests dynamically, as sketched after the list below:
- For a single-user query with a small input (5 tokens in our test case), the system ran both stages on a single node to reduce overhead.
- For a single-user query with a larger input (59 tokens in our test case), the llm-d scheduler used separate Prefill and Decode nodes to process the request.
- For 256 concurrent requests with 8K-token inputs, the llm-d scheduler distributed the work across the Prefill and Decode nodes, guided by queue depth, KV cache state, and utilization metrics.
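The sketch below mirrors this routing behavior with a hypothetical policy function. The 32-token threshold and queue-depth tie-breaking are illustrative stand-ins for the scheduler's heuristics, which in practice also weigh KV cache state and node utilization.

```python
def route_request(prompt_tokens: int,
                  prefill_queue_depths: list[int],
                  decode_queue_depths: list[int],
                  short_prompt_threshold: int = 32) -> dict:
    """Decide where to run the Prefill and Decode stages of one request."""
    if prompt_tokens <= short_prompt_threshold:
        # Short prompts (e.g. the 5-token case): run both stages on one node
        # and skip the cross-node KV transfer entirely.
        node = min(range(len(decode_queue_depths)), key=decode_queue_depths.__getitem__)
        return {"prefill": f"decode-{node}", "decode": f"decode-{node}"}

    # Longer prompts (e.g. 59 tokens, or 256 x 8K-token requests): pick the
    # least-loaded Prefill and Decode nodes independently.
    p = min(range(len(prefill_queue_depths)), key=prefill_queue_depths.__getitem__)
    d = min(range(len(decode_queue_depths)), key=decode_queue_depths.__getitem__)
    return {"prefill": f"prefill-{p}", "decode": f"decode-{d}"}

print(route_request(5, [0, 2], [1, 0]))   # both stages on decode-1
print(route_request(59, [0, 2], [1, 0]))  # prefill-0 + decode-1
```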
Figure 3. Based on prompt length (5 tokens), the system automatically schedules Prefill and Decode on a single node.
Figure 4. Based on prompt length (59 input tokens), the system schedules the request across separate Prefill and Decode nodes.
For 256 concurrent requests with 8K-token inputs, the Grafana dashboards in Figure 5 capture real-time metrics for Prefill (dashed lines) and Decode (solid lines).
Figure 5. Request processing and KV cache usage across llm-d Prefill and Decode nodes for Llama 3.1 8B (top left: no queuing for Prefill and Decode; top right: two Decode nodes actively processing based on the request queue; bottom: efficient KV cache utilization across two Prefill and two Decode nodes).
Heterogeneous Hardware Integration
To demonstrate cross-accelerator compatibility within the llm-d stack, we validated a heterogeneous inference configuration using the Llama3.1 8B model. In this setup:
- The Prefill (P) stage was executed on Nvidia GPUs.
- The Decode (D) stage ran on Intel® Gaudi® 2 and Intel Gaudi 3 accelerator nodes.
This configuration was enabled by llm-d’s architecture and its support for the KV Connector, allowing the Prefill and Decode stages to interoperate across different accelerator types without modification to the model or runtime.
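Conceptually, this works because the KV cache crosses the node boundary in a device-neutral format: the producer copies it off the Nvidia GPU, and the consumer uploads it into Gaudi memory. The sketch below illustrates that contract with a hypothetical serialize/deserialize pair over NumPy arrays; it is not the KV Connector implementation, only the shape of the exchange that lets a GPU-side Prefill feed a Gaudi-side Decode.

```python
import io
import numpy as np

def serialize_kv(kv_cache: dict[int, tuple[np.ndarray, np.ndarray]]) -> bytes:
    """Pack per-layer (keys, values) arrays into a device-neutral byte stream.
    On the producer side these arrays would first be copied off the Nvidia GPU."""
    buf = io.BytesIO()
    flat = {}
    for layer, (k, v) in kv_cache.items():
        flat[f"k{layer}"] = k
        flat[f"v{layer}"] = v
    np.savez(buf, **flat)
    return buf.getvalue()

def deserialize_kv(blob: bytes) -> dict[int, tuple[np.ndarray, np.ndarray]]:
    """Unpack the byte stream; the consumer (a Gaudi Decode node) would then
    upload the arrays into its own device memory."""
    data = np.load(io.BytesIO(blob))
    layers = sorted({int(name[1:]) for name in data.files})
    return {layer: (data[f"k{layer}"], data[f"v{layer}"]) for layer in layers}

# Round-trip check with toy data.
cache = {0: (np.ones((59, 8, 64), np.float16), np.ones((59, 8, 64), np.float16))}
restored = deserialize_kv(serialize_kv(cache))
assert np.array_equal(cache[0][0], restored[0][0])
```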
The ability to mix Gaudi and Nvidia accelerators in the same inference pipeline offers flexibility and cost efficiency. Organizations can:
- Leverage existing GPU infrastructure for parts of the workload.
- Incrementally adopt Gaudi accelerators to reduce overall inference TCO.
- Optimize resource allocation based on workload type and hardware availability.
This seamless integration path allows for hybrid deployments that scale efficiently while preserving prior hardware investments.
Figure 6. llm-d infrastructure with heterogeneous accelerators (top left), Prefill processing on Nvidia GPU (bottom left), Decode processing on Gaudi 3 (top right), benchmark test running (bottom right).
Figure 7. Prefill running on Nvidia GPU (top left), Prefill processing on Nvidia GPU (bottom left), Decode running on Gaudi 3 (top right), benchmark test running (bottom right).
Conclusion
With the integration of the llm-d stack, Intel Gaudi accelerators now support decoupled execution of the Prefill and Decode stages, enabling more flexible and efficient deployment of large language models. By leveraging components such as the Inference Gateway, LMCache, and the KV Connector within the vLLM-based llm-d runtime, Gaudi supports distributed inference patterns that improve consistency in token generation workloads.
We validated this across a range of scenarios:
- Consistent ITL behavior with decoupled execution on Llama 3.3 70B and Llama 3.1 8B
- Independent scaling of Prefill and Decode nodes using Llama 3.1 70B across Intel Gaudi 2 and Intel Gaudi 3 accelerators
- Heterogeneous deployments, running Prefill on Nvidia GPU and Decode on Gaudi, to highlight interoperability and cost-efficiency
We are actively working to further optimize llm-d performance on Gaudi, including improvements to KV cache management. We will publish these improvements soon.