Lowering Multimodal Inference Cost with Heterogeneous E/PD Disaggregation

Authors: Kiran Atmakuri, Pallavi Jaini, Sergey Plotnikov, Daniel Socek, Intel Corporation

 

Executive summary

Multimodal VLM serving is not a single homogeneous workload. A typical request combines vision encode, long-context LLM prefill, and token-by-token decode, and each stage stresses hardware differently across compute, memory capacity, and memory bandwidth.

This technical brief evaluates whether separating the vision encode stage from the prefill/decode path can improve latency and normalized cost efficiency using NVIDIA Dynamo. In the evaluated workload, adding one Intel® Arc™ Pro graphics B60 GPU for encode to an NVIDIA H200 GPU performing prefill/decode increases normalized hardware cost by only 2.5%, while improving median TTFT and P99 TTFT behavior relative to the aggregated H200 baseline. Scaling to four B60 encode workers (4E/1PD) increases normalized cost by 10% and delivers further TTFT gains, demonstrating that encode capacity can be scaled independently to meet workload demand.

Key result summary

| Metric | 1E / 4E B60-H200 E/PD vs. H200 aggregated baseline |
| --- | --- |
| Normalized hardware cost | +2.5% / +10% |
| Median TTFT | ~32% lower (1E) / ~76% lower (4E) |
| P99 TTFT | ~44% lower (1E) / ~72% lower (4E) |

The central message is technical but commercially relevant: a small amount of purpose-fit encode capacity can reduce queueing and stage interference, allowing the H200 GPU to focus on the prefill/decode path where its HBM capacity and bandwidth are most valuable.

Figure 1. Relative impact of B60-H200 E/PD compared with the H200 aggregated baseline. Lower latency values are better; hardware cost is normalized.

1. Introduction: Multimodal inference is not one workload

Multimodal inference is becoming a core part of enterprise AI. Vision-language models are increasingly used for document analysis, medical imaging, retail search, robotics, content moderation, video understanding, and autonomous systems. These applications do not process only text; a single request may include many images, scanned document pages, screenshots, or frames sampled from video.

This changes the economics of inference. A text-only request may include a few hundred or a few thousand input tokens. A multimodal request can expand into tens of thousands of effective tokens after images are encoded and passed into the language model. This creates more upfront compute, larger KV cache pressure, longer time-to-first-token, and more complex GPU utilization patterns.

Heterogeneous E/PD disaggregation addresses this by offloading the encode work from the GPUs that handle prefill and decode.

2. Why VLM requests stress inference systems

Vision-language model requests are fundamentally different from text-only requests because visual inputs are transformed into embeddings that increase the effective input sequence processed by the LLM. Multi-page documents, image batches, and video frames can quickly turn a short user prompt into a long-context inference request.

For the workload evaluated in this brief, each request includes 128 input text tokens, 20 images at 480p resolution, and 256 output tokens. After vision encoding, the request reaches roughly 8,300 effective LLM input tokens. That length materially affects prefill time, KV cache footprint, and latency under load.

Importantly, our analysis indicates that modern vision towers are relatively lightweight and are not themselves the dominant computational bottleneck. The primary cost arises from subsequent language model processing of the visual tokens generated by the encoder. When visual encoding shares the same accelerator resources as language-model prefill and decode (PD), it competes with the stage that already dominates inference latency and throughput, further exacerbating system bottlenecks.

At higher request rates, the system can experience queueing, KV cache pressure, and head-of-line blocking between image-heavy requests and active decode operations. This is why multimodal serving often benefits from an architecture different from traditional text-only serving.

| Workload attribute | Value |
| --- | --- |
| Input text tokens | 128 |
| Input images | 20 |
| Image resolution | 480p, 480 x 854 |
| Output tokens | 256 |
| Effective LLM input after vision encoding | ~8k tokens |
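The effective input length can be approximated with a few lines of arithmetic. The sketch below is an estimate, not the model's actual preprocessing: it assumes each visual token covers a 32 x 32 pixel region and that each image dimension is rounded to the nearest multiple of that size. Both the `pixels_per_token_side` parameter and the rounding rule are assumptions made for illustration; the real token count depends on the model's image processor.

```python
def estimate_vision_tokens(height: int, width: int, pixels_per_token_side: int = 32) -> int:
    """Approximate the number of visual tokens produced for one image.

    Assumption (not from the source): one token per square region of
    `pixels_per_token_side` pixels, with each dimension rounded to the
    nearest whole number of regions.
    """
    rows = max(1, round(height / pixels_per_token_side))
    cols = max(1, round(width / pixels_per_token_side))
    return rows * cols


def estimate_llm_input_tokens(text_tokens: int, images: int, height: int, width: int) -> int:
    """Text tokens plus the estimated visual tokens for all images."""
    return text_tokens + images * estimate_vision_tokens(height, width)


# Workload from the table above: 128 text tokens, 20 images at 480 x 854.
per_image = estimate_vision_tokens(480, 854)
total = estimate_llm_input_tokens(128, 20, 480, 854)
print(f"{per_image} tokens per image, {total} effective input tokens")
```

Under these assumptions the estimate lands close to the roughly 8k tokens reported for the workload. The remaining gap would plausibly come from details this sketch ignores, such as per-image delimiter tokens or the exact resize rules.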

3. Encode, Prefill, Decode: three stages, three bottlenecks

A multimodal inference request can be divided into three major stages. The stages are dependent on one another, but they do not have identical resource requirements.

Stage 1: Encode

The encode stage converts raw images into vision embeddings. Images are divided into patches, projected into embeddings, and processed by the vision encoder before being passed into the language model. Encode is compute-oriented, but it is typically much lighter than the full LLM prefill and decode path. It does not maintain large KV caches and places lower demands on memory capacity and bandwidth, making it a strong candidate for deployment on cost-efficient accelerators.

Stage 2: Prefill

The prefill stage processes the full input sequence, including text tokens and the vision tokens produced by the encoder. It builds the KV cache that decode will use to generate output tokens. For multimodal requests, prefill can become particularly expensive because large numbers of vision tokens are incorporated into the input sequence, substantially increasing the amount of language-model computation required.

Stage 3: Decode

The decode stage generates output tokens one step at a time. Each newly generated token attends to the KV cache created during prefill. Decode is often memory-bandwidth-sensitive because every generated token needs access to the KV cache.

| Stage | Primary pressure | Best-fit hardware characteristic |
| --- | --- | --- |
| Encode | Vision processing / compute | Lower-cost GPU with sufficient compute and memory |
| Prefill | Long-context compute and KV creation | High-performance GPU |
| Decode | KV cache reads and memory bandwidth | HBM-rich GPU |
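To make the KV cache pressure concrete, the following sketch estimates the cache size for a single request using the standard per-token formula (one key and one value vector per layer). The layer count, KV head count, head dimension, and 2-byte element size are placeholder values chosen for illustration; they are assumptions, not the published configuration of the evaluated model.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size for `tokens` tokens.

    The leading factor of 2 accounts for storing both a key and a value
    vector per layer, per KV head, per token.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Placeholder architecture values (assumptions, for illustration only).
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)
# Roughly 8,300 input tokens plus 256 generated tokens, as in the workload.
per_request = kv_cache_bytes(8300 + 256, LAYERS, KV_HEADS, HEAD_DIM)

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{per_request / 2**30:.2f} GiB per request at full length")
```

Because each decode step attends over the whole cache, a footprint of this order per active request is what makes decode sensitive to memory bandwidth and capacity rather than raw compute.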

4. Aggregated vs. disaggregated serving

In a traditional aggregated deployment, all stages run on the same GPU. The same H200 GPU handles image encode, LLM prefill, and token decode. This is operationally simple, but it forces the encode stage to consume resources on the same high-end GPU needed for prefill and decode.

In an E/PD disaggregated deployment, encode is separated from prefill/decode. In the evaluated heterogeneous configuration, the B60-class GPU handles encode while the H200 GPU handles prefill and decode. The benefit is not only a small normalized cost increase; it is better stage isolation and better use of the high-end GPU. By separating encode from language-model execution, the system reduces contention between image encoding and the language-model stages: processing of vision tokens during long-context prefill, and token generation. This allows the H200-class GPU to spend more time on the stages that dominate multimodal inference latency.

Aggregated serving: all stages on one H200 GPU

Figure 2. Aggregated serving places all stages on one high-end GPU

 

Heterogeneous E/PD: encode separated from prefill/decode

Figure 3. Heterogeneous E/PD separates encode from prefill/decode.

5. Why Dynamo enables this architecture

NVIDIA Dynamo provides the serving framework needed to make this architecture practical. Dynamo supports disaggregated serving, where different inference stages can run on different workers. In an E/PD configuration, encode workers can be separated from prefill/decode workers; the encode workers produce vision embeddings that are transferred to the worker responsible for LLM execution.
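The worker topology can be illustrated with a toy pipeline. The sketch below does not use Dynamo's API; it is a minimal `asyncio` model, with hypothetical names such as `encode_worker` and `pd_worker`, in which several encode workers pull requests from a shared queue and hand their results to a single prefill/decode worker. The sleeps are stand-ins for real compute and embedding transfer.

```python
import asyncio


async def encode_worker(name: str, requests: asyncio.Queue, embeddings: asyncio.Queue) -> None:
    """Pull raw requests, 'encode' their images, and forward the result."""
    while True:
        req = await requests.get()
        if req is None:  # sentinel: no more work for this worker
            break
        await asyncio.sleep(0)  # stand-in for vision-encoder compute
        await embeddings.put({"id": req["id"], "num_images": req["num_images"],
                              "encoded_by": name})


async def pd_worker(embeddings: asyncio.Queue, expected: int) -> list:
    """Consume embeddings and run 'prefill + decode' for each request."""
    done = []
    for _ in range(expected):
        item = await embeddings.get()
        await asyncio.sleep(0)  # stand-in for LLM prefill and decode
        done.append(item["id"])
    return done


async def serve(num_requests: int = 8, num_encoders: int = 4) -> list:
    requests: asyncio.Queue = asyncio.Queue()
    embeddings: asyncio.Queue = asyncio.Queue()
    for i in range(num_requests):
        requests.put_nowait({"id": i, "num_images": 20})
    for _ in range(num_encoders):  # one stop sentinel per encode worker
        requests.put_nowait(None)

    encoders = [asyncio.create_task(encode_worker(f"enc{i}", requests, embeddings))
                for i in range(num_encoders)]
    pd = asyncio.create_task(pd_worker(embeddings, num_requests))
    await asyncio.gather(*encoders)
    return await pd


completed = asyncio.run(serve())
print(sorted(completed))
```

Changing `num_encoders` alters only the encode tier, which mirrors the independent-scaling property described here: the prefill/decode worker is untouched when encode capacity is added.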

| Capability | Why it matters |
| --- | --- |
| Stage-level hardware matching | Encode can run on cost-efficient GPU resources, while prefill and decode remain on HBM-rich GPUs. |
| Independent scaling | Encode workers can be scaled separately from prefill/decode workers based on workload characteristics. |
| Improved high-end GPU utilization | Premium GPUs are reserved for the stages where they provide the most value. |

6. Benchmark setup

The comparison intentionally focuses on a small set of configurations to keep the result easy to interpret: an aggregated H200 baseline and two heterogeneous B60-H200 E/PD configurations, one with a single encode worker (1E) and one with four (4E).

| Configuration | Encode | Prefill / Decode | Description |
| --- | --- | --- | --- |
| Aggregated: H200 TP1 | H200 | H200 | All stages run on one H200 GPU. |
| Disaggregated + heterogeneous: 1E B60-H200 E/PD | 1x B60 | 1x H200 | One B60 handles encode; the H200 handles prefill/decode. |
| Disaggregated + heterogeneous: 4E B60-H200 E/PD | 4x B60 | 1x H200 | Four B60 GPUs handle encode; the H200 handles prefill/decode. |

Question isolated by the benchmark

What happens when a small, purpose-fit encode GPU is added while preserving the H200 GPU for prefill and decode?

7. Results: throughput, TTFT, ITL

Both heterogeneous B60-H200 configurations improved latency behavior while maintaining comparable throughput. The most important effect is in TTFT, where isolating encode reduces stage interference and queueing pressure on the H200 prefill/decode worker. Scaling to 4 encode workers (4E/1PD) provides further TTFT reduction at modest additional normalized cost.

 

| Metric | H200 aggregated TP1 | 1E B60-H200 E/PD | 4E B60-H200 E/PD | Relative change (1E vs. baseline) |
| --- | --- | --- | --- | --- |
| Peak request throughput | 0.82 req/s | 0.88 req/s | 0.94 req/s | 7.3% higher |
| Median TTFT @ 1.0 RPS | 22 s | 15 s | 5.2 s | 32% lower |
| P99 TTFT @ 1.0 RPS | 50 s | 28 s | 14 s | 44% lower |
| Median ITL / TPOT proxy @ 1.0 RPS | 44 ms | 36 ms | 36 ms | Comparable |

The 4E configuration delivers the lowest TTFT at all request rates.
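The relative changes quoted in this brief follow directly from the measured values. As a quick check, the sketch below recomputes them from the table; the helper names `pct_lower` and `pct_higher` are introduced here for illustration.

```python
def pct_lower(baseline: float, value: float) -> float:
    """Percentage reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline * 100


def pct_higher(baseline: float, value: float) -> float:
    """Percentage increase of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100


# Values at 1.0 RPS, taken from the results table (seconds).
ttft = {
    "Median TTFT": {"baseline": 22.0, "1E": 15.0, "4E": 5.2},
    "P99 TTFT": {"baseline": 50.0, "1E": 28.0, "4E": 14.0},
}

for metric, row in ttft.items():
    for config in ("1E", "4E"):
        change = pct_lower(row["baseline"], row[config])
        print(f"{metric}, {config}: {change:.1f}% lower")

# Peak request throughput (req/s): baseline vs. 1E.
throughput_gain_1e = pct_higher(0.82, 0.88)
print(f"Peak throughput, 1E: {throughput_gain_1e:.1f}% higher")
```

The rounded results match the ~32% / ~76% median and ~44% / ~72% P99 reductions in the key result summary, and the 7.3% throughput gain for 1E.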

 

Figure 4. E/PD disaggregation performance curves for Qwen3-VL-32B-FP8 (128 input / 256 output tokens, 20 images per request, 480p). H200 aggregated baseline vs. 1E and 4E heterogeneous B60-H200 configurations.

8. Normalized TCO model

The TCO discussion intentionally avoids publishing exact hardware prices. Real customer pricing varies by procurement model, region, volume, support structure, and timing. Instead, the analysis uses normalized cost units.

| Hardware class | Normalized cost unit¹ |
| --- | --- |
| H200 prefill/decode GPU | 1.000 |
| B60-class encode GPU | 0.025 |

 

Under this model, adding one B60-class encode GPU increases normalized hardware cost from 1.000 to 1.025 (+2.5%). Scaling to four B60 encode workers increases it to 1.100 (+10%), while delivering further TTFT improvements.

 

| Configuration | Hardware | Normalized cost¹ |
| --- | --- | --- |
| H200 aggregated TP1 | 1x H200 | 1.000 |
| 1E B60-H200 E/PD | 1x B60 + 1x H200 | 1.025 |
| 4E B60-H200 E/PD | 4x B60 + 1x H200 | 1.100 |

TCO interpretation

For +2.5% normalized hardware cost, the 1E configuration lowers median TTFT by ~32% and P99 TTFT by ~44% at 1.0 RPS. The business value comes from improving SLO-qualified performance per normalized hardware cost unit, not simply from adding another GPU.
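The normalized cost model is simple enough to express directly. The sketch below rebuilds the configuration costs from the two unit costs and then divides peak request throughput by cost. That ratio is only one possible proxy; this brief does not define a formula for SLO-qualified performance per cost unit, so treating throughput per cost unit as a stand-in is an assumption of this example.

```python
H200_COST = 1.000  # normalized cost unit for the prefill/decode GPU
B60_COST = 0.025   # normalized cost unit for one encode GPU


def normalized_cost(num_b60: int, num_h200: int = 1) -> float:
    """Total normalized hardware cost of a configuration."""
    return num_h200 * H200_COST + num_b60 * B60_COST


# (number of B60 encode GPUs, peak request throughput in req/s) from the results table.
configs = {
    "H200 aggregated TP1": (0, 0.82),
    "1E B60-H200 E/PD": (1, 0.88),
    "4E B60-H200 E/PD": (4, 0.94),
}

summary = {}
for name, (num_b60, peak_rps) in configs.items():
    cost = normalized_cost(num_b60)
    summary[name] = (cost, peak_rps / cost)
    print(f"{name}: cost={cost:.3f}, peak req/s per cost unit={peak_rps / cost:.3f}")
```

Peak throughput per cost unit does not capture the latency gains, which are the main effect reported here, so a TTFT-based SLO metric would be needed for a fuller comparison.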

9. When this architecture works best

Heterogeneous E/PD disaggregation is most attractive when the workload has image-heavy or video-heavy requests, long effective input context under high request rates, or a need to scale encode independently.

| Condition | Why it favors heterogeneous E/PD |
| --- | --- |
| Image-heavy or video-heavy requests | The more visual input per request, the more valuable it becomes to isolate encode from prefill/decode. |
| High request rates with moderate-to-large inputs | Combined load and input size increase prefill/decode pressure, driving TTFT growth and KV cache evictions that significantly hurt performance. |
| High-end GPUs are constrained | H200 GPUs can be reserved for prefill/decode rather than spending time on encode. |
| TTFT matters | The evaluated configuration materially improved P99 TTFT. |
| Encode demand scales separately | Additional encode workers can be added without scaling the entire prefill/decode tier. |

10. Implementation recipe

To reproduce the results or explore the setup further, here are the key resources used in this post:


Workload generator:  https://github.com/ai-dynamo/aiperf

Recipe: https://github.com/ai-dynamo/dynamo/tree/main/recipes/qwen3-vl-32b-fp8

Dynamo docs on EPD/multimodal inference for more details: https://docs.nvidia.com/dynamo/user-guides/multimodal/encoder-disaggregation

 

11. Conclusion

Multimodal inference should not be treated as a single monolithic workload. Vision encode, prefill, and decode each place different demands on the system. Running all three stages on the same high-end GPU is simple, but it can use expensive resources inefficiently and lead to rising latency under sustained multimodal load, particularly when non-trivial visual context increases prefill and decode pressure.

Heterogeneous E/PD disaggregation provides a more balanced approach. By adding one B60-class GPU as a dedicated encode resource (1E), the normalized hardware cost increases by only 2.5% while delivering materially lower TTFT. Scaling to four encode workers (4E) increases the normalized cost by 10% and reduces the median TTFT by ~76% relative to the aggregated baseline, demonstrating that encode capacity can be scaled independently to match workload demand.

Core takeaway

A small amount of purpose-fit encode capacity can improve latency behavior across the entire system while preserving HBM-rich GPUs for prefill and decode, where they deliver the most value.

 

As VLMs become larger and multimodal requests become more image- and video-dense, especially at high request rates, this pattern becomes increasingly important. Dynamo makes this architecture practical by enabling encode and prefill/decode to be deployed as separate stages, allowing teams to match each stage of inference to the right hardware and improve SLO-qualified performance per normalized cost unit.

 

¹ Sources: https://www.dihuni.com/product/nvidia-h200-nvl-gpu-141gb-900-21010-0040-000-pny-sku-nvh200nvltcgpu-kit/ and https://www.newegg.com/arkn-8357-00128-arc-pro-b60-24gb-graphics/p/N82E16814983001

Intel, the Intel logo, and Arc are trademarks of Intel Corporation or its subsidiaries.