Authors: Kiran Atmakuri, Pallavi Jaini, Sergey Plotnikov, Daniel Socek, Intel Corporation
Executive summary
Multimodal VLM serving is not a single homogeneous workload. A typical request combines vision encode, long-context LLM prefill, and token-by-token decode, and each stage stresses hardware differently across compute, memory capacity, and memory bandwidth.
This technical brief evaluates whether separating the vision encode stage from the prefill/decode path can improve latency and normalized cost efficiency using NVIDIA Dynamo. In the evaluated workload, adding one Intel® Arc™ Pro graphics B60 GPU for encode to an NVIDIA H200 GPU performing prefill/decode increases normalized hardware cost by only 2.5%, while improving median TTFT and P99 TTFT behavior relative to the aggregated H200 baseline. Scaling to four B60 encode workers (4E/1PD) increases normalized cost by 10% and delivers further TTFT gains, demonstrating that encode capacity can be scaled independently to meet workload demand.
Key result summary
Metric | B60-H200 E/PD (1E / 4E) vs. H200 aggregated baseline |
Normalized hardware cost | +2.5% / +10% |
Median TTFT | ~32% lower (1E) / ~76% lower (4E) |
P99 TTFT | ~44% lower (1E) / ~72% lower (4E) |
The central message is technical but commercially relevant: a small amount of purpose-fit encode capacity can reduce queueing and stage interference, allowing the H200 GPU to focus on the prefill/decode path where its HBM capacity and bandwidth are most valuable.
Figure 1. Relative impact of B60-H200 E/PD compared with the H200 aggregated baseline. Lower latency values are better; hardware cost is normalized.
1. Introduction: Multimodal inference is not one workload
Multimodal inference is becoming a core part of enterprise AI. Vision-language models are increasingly used for document analysis, medical imaging, retail search, robotics, content moderation, video understanding, and autonomous systems. These applications do not process only text; a single request may include many images, scanned document pages, screenshots, or frames sampled from video.
This changes the economics of inference. A text-only request may include a few hundred or a few thousand input tokens. A multimodal request can expand into tens of thousands of effective tokens after images are encoded and passed into the language model. This creates more upfront compute, larger KV cache pressure, longer time-to-first-token, and more complex GPU utilization patterns.
Heterogeneous E/PD (encode / prefill-decode) disaggregation offloads the encoding work from the GPUs that handle prefill and decode.
2. Why VLM requests stress inference systems
Vision-language model requests are fundamentally different from text-only requests because visual inputs are transformed into embeddings that increase the effective input sequence processed by the LLM. Multi-page documents, image batches, and video frames can quickly turn a short user prompt into a long-context inference request.
For the workload evaluated in this brief, each request includes 128 input text tokens, 20 images at 480p resolution, and 256 output tokens. After vision encoding, the request reaches roughly 8,300 effective LLM input tokens. That length materially affects prefill time, KV cache footprint, and latency under load.
Importantly, our analysis indicates that modern vision towers are relatively lightweight and are not themselves the dominant computational bottleneck. The primary cost arises from subsequent language model processing of the visual tokens generated by the encoder. When visual encoding shares the same accelerator resources as language-model prefill and decode (PD), it competes with the stage that already dominates inference latency and throughput, further exacerbating system bottlenecks.
At higher request rates, the system can experience queueing, KV cache pressure, and head-of-line blocking between image-heavy requests and active decode operations. This is why multimodal serving often benefits from an architecture different from traditional text-only serving.
Workload attribute | Value |
Input text tokens | 128 |
Input images | 20 |
Image resolution | 480p, 480 x 854 |
Output tokens | 256 |
Effective LLM input after vision encoding | ~8k tokens |
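As a rough consistency check on the workload table, the per-image token count implied by the reported totals can be backed out with simple arithmetic. This is only a sketch: it assumes the ~8,300 figure is the sum of the text tokens and the vision tokens, and it ignores any chat-template or separator tokens, so the result is approximate.

```python
# Back out the approximate number of vision tokens per image from the
# workload figures quoted in the text. The 8300 value is itself approximate.
TEXT_TOKENS = 128
NUM_IMAGES = 20
EFFECTIVE_INPUT_TOKENS = 8300


def implied_tokens_per_image(effective_tokens, text_tokens, num_images):
    """Vision tokens per image, assuming every non-text token comes from an image."""
    return (effective_tokens - text_tokens) / num_images


tokens_per_image = implied_tokens_per_image(
    EFFECTIVE_INPUT_TOKENS, TEXT_TOKENS, NUM_IMAGES
)
vision_share = (EFFECTIVE_INPUT_TOKENS - TEXT_TOKENS) / EFFECTIVE_INPUT_TOKENS

print(f"~{tokens_per_image:.0f} vision tokens per image")
print(f"~{vision_share:.1%} of the LLM input consists of vision tokens")
```

Under these assumptions each 480p image contributes on the order of 400 tokens, and vision tokens make up nearly all of the sequence the LLM has to prefill.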
3. Encode, Prefill, Decode: three stages, three bottlenecks
A multimodal inference request can be divided into three major stages. The stages are dependent on one another, but they do not have identical resource requirements.
Stage 1: Encode
The encode stage converts raw images into vision embeddings. Images are divided into patches, projected into embeddings, and processed by the vision encoder before being passed into the language model. Encode is compute-oriented, but it is typically much lighter than the full LLM prefill and decode path. It does not maintain large KV caches and places lower demands on memory capacity and bandwidth, making it a strong candidate for deployment on cost-efficient accelerators.
Stage 2: Prefill
The prefill stage processes the full input sequence, including text tokens and the vision tokens produced by the encoder. It builds the KV cache that decode will use to generate output tokens. For multimodal requests, prefill can become particularly expensive because large numbers of vision tokens are incorporated into the input sequence, substantially increasing the amount of language-model computation required.
Stage 3: Decode
The decode stage generates output tokens one step at a time. Each newly generated token attends to the KV cache created during prefill. Decode is often memory-bandwidth-sensitive because every generated token needs access to the KV cache.
Stage | Primary pressure | Best-fit hardware characteristic |
Encode | Vision processing / compute | Lower-cost GPU with sufficient compute and memory |
Prefill | Long-context compute and KV creation | High-performance GPU |
Decode | KV cache reads and memory bandwidth | HBM-rich GPU |
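To make the KV cache pressure concrete, the standard per-token KV footprint for a transformer with grouped-query attention can be estimated as below. This is a sketch under stated assumptions: the layer count, KV head count, head dimension, and 2-byte KV element size are illustrative placeholders, not values taken from this brief or from the evaluated model's configuration. Substitute the real model configuration before drawing conclusions.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Estimate KV cache size for one sequence.

    The factor of 2 accounts for storing one key vector and one value
    vector per token, per layer, per KV head.
    """
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_element


# ASSUMPTION: hypothetical model shape, chosen only for illustration.
ASSUMED_MODEL = dict(
    num_layers=64,
    num_kv_heads=8,
    head_dim=128,
    bytes_per_element=2,  # e.g. a 16-bit KV cache; an FP8 KV cache would use 1
)

PROMPT_TOKENS = 8300   # approximate effective input from the workload
OUTPUT_TOKENS = 256    # output length from the workload

after_prefill = kv_cache_bytes(PROMPT_TOKENS, **ASSUMED_MODEL)
after_decode = kv_cache_bytes(PROMPT_TOKENS + OUTPUT_TOKENS, **ASSUMED_MODEL)

GIB = 1024 ** 3
print(f"KV cache after prefill: {after_prefill / GIB:.2f} GiB per request")
print(f"KV cache after decode:  {after_decode / GIB:.2f} GiB per request")
```

With these placeholder values a single request holds roughly 2 GiB of KV cache, which illustrates why prefill and decode favor HBM-rich GPUs while encode, which keeps no large KV cache, does not.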
4. Aggregated vs. disaggregated serving
In a traditional aggregated deployment, all stages run on the same GPU. The same H200 GPU handles image encode, LLM prefill, and token decode. This is operationally simple, but it forces the encode stage to consume resources on the same high-end GPU needed for prefill and decode.
In an E/PD disaggregated deployment, encode is separated from prefill/decode. In the evaluated heterogeneous configuration, a B60-class GPU handles encode while the H200 GPU handles prefill and decode. The benefit is not only better performance per normalized cost unit; it is also better stage isolation and better use of the high-end GPU. By separating encode from language-model execution, the system reduces contention between image processing (vision token computation) on one side and long-context prefill and token generation on the other. This allows the H200-class GPU to spend more time on the stages that dominate multimodal inference latency.
Aggregated serving: all stages on one H200 GPU
Figure 2. Aggregated serving places all stages on one high-end GPU.
Heterogeneous E/PD: encode separated from prefill/decode
Figure 3. Heterogeneous E/PD separates encode from prefill/decode.
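The queueing effect of stage isolation can be illustrated with a deliberately simple FIFO model. This is a toy sketch, not a model of Dynamo or of the benchmarked system: the function name, the stage times, and the arrival interval are all hypothetical, decode and batching are ignored, and the offloaded encode is assumed to be slower than encode on the high-end GPU.

```python
import heapq


def simulate_ttft(n_requests, interval_ms, encode_ms, prefill_ms, encode_workers=0):
    """Return per-request TTFT in ms for evenly spaced arrivals.

    encode_workers == 0 models aggregated serving: encode and prefill run
    back to back on one shared server. Otherwise encode runs on a pool of
    separate workers and only prefill occupies the shared server.
    """
    ttfts = []
    pd_free = 0                       # time at which the prefill server is idle
    enc_free = [0] * encode_workers   # min-heap of encode worker idle times
    for i in range(n_requests):
        arrival = i * interval_ms
        if encode_workers == 0:
            start = max(arrival, pd_free)
            pd_free = start + encode_ms + prefill_ms
        else:
            worker_free = heapq.heappop(enc_free)
            encoded_at = max(arrival, worker_free) + encode_ms
            heapq.heappush(enc_free, encoded_at)
            pd_free = max(encoded_at, pd_free) + prefill_ms
        ttfts.append(pd_free - arrival)
    return ttfts


# HYPOTHETICAL timings: one request per second, 900 ms prefill,
# 300 ms encode on the shared GPU versus 800 ms on a slower encode GPU.
aggregated = simulate_ttft(10, 1000, encode_ms=300, prefill_ms=900)
disaggregated = simulate_ttft(10, 1000, encode_ms=800, prefill_ms=900, encode_workers=1)

print("aggregated   :", aggregated)
print("disaggregated:", disaggregated)
```

In this toy setup the aggregated server needs 1200 ms per request but receives one every 1000 ms, so TTFT climbs from 1200 ms to 3000 ms over ten requests. The disaggregated pipeline pays a higher fixed encode cost, yet neither stage is oversubscribed, so TTFT stays flat at 1700 ms.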
5. Why Dynamo enables this architecture
NVIDIA Dynamo provides the serving framework needed to make this architecture practical. Dynamo supports disaggregated serving, where different inference stages can run on different workers. In an E/PD configuration, encode workers can be separated from prefill/decode workers; the encode workers produce vision embeddings that are transferred to the worker responsible for LLM execution.
Capability | Why it matters |
Stage-level hardware matching | Encode can run on cost-efficient GPU resources, while prefill and decode remain on HBM-rich GPUs. |
Independent scaling | Encode workers can be scaled separately from prefill/decode workers based on workload characteristics. |
Improved high-end GPU utilization | Premium GPUs are reserved for the stages where they provide the most value. |
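Independent scaling turns encode sizing into a small capacity-planning exercise. The helper below is a back-of-the-envelope sketch, not part of Dynamo: the function name, the per-request encode time, and the utilization target are hypothetical and would need to be measured for a real deployment.

```python
import math


def encode_workers_needed(request_rate_rps, encode_seconds_per_request,
                          target_utilization=0.7):
    """Estimate how many encode workers keep utilization at or below a target.

    Offered load is rate x service time; dividing by the target utilization
    leaves headroom so that bursts do not immediately queue.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    offered_load = request_rate_rps * encode_seconds_per_request
    return max(1, math.ceil(offered_load / target_utilization))


# HYPOTHETICAL inputs: 1 request/s, 2 s of encode work per 20-image request.
workers = encode_workers_needed(request_rate_rps=1.0, encode_seconds_per_request=2.0)
print(f"Encode workers needed: {workers}")
```

Because the prefill/decode tier does not appear in this calculation, encode capacity can be revised as image counts or request rates change without resizing the H200 tier.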
6. Benchmark setup
The comparison intentionally focuses on three configurations to keep the result easy to interpret: an aggregated H200 baseline and two heterogeneous B60-H200 E/PD configurations, one with a single encode worker (1E) and one with four (4E).
Configuration | Encode | Prefill / Decode | Description |
Aggregated: H200 TP1 | H200 | H200 | All stages run on one H200 GPU. |
Disaggregated + heterogeneous B60-H200 E/PD | 1x B60 | 1x H200 | One B60 handles encode; H200 handles prefill/decode. |
Disaggregated + heterogeneous 4E B60-H200 E/PD | 4x B60 | 1x H200 | Four B60s handle encode; H200 handles prefill/decode. |
Question isolated by the benchmark: What happens when a small, purpose-fit encode GPU is added while preserving the H200 GPU for prefill and decode?
7. Results: throughput, TTFT, ITL
Both heterogeneous B60-H200 configurations improved latency behavior while maintaining comparable throughput. The most important effect is in TTFT, where isolating encode reduces stage interference and queueing pressure on the H200 prefill/decode worker. Scaling to 4 encode workers (4E/1PD) provides further TTFT reduction at modest additional normalized cost.
Metric | H200 aggregated TP1 | 1E B60-H200 E/PD | 4E B60-H200 E/PD | Relative change (1E vs. baseline) |
Peak request throughput | 0.82 req/s | 0.88 req/s | 0.94 req/s | 7.3% higher |
Median TTFT @ 1.0 RPS | 22 s | 15 s | 5.2 s | 32% lower |
P99 TTFT @ 1.0 RPS | 50 s | 28 s | 14 s | 44% lower |
Median ITL / TPOT proxy @ 1.0 RPS | 44 ms | 36 ms | 36 ms | Comparable |
The 4E configuration delivers the lowest TTFT at all request rates.
Figure 4. E/PD Disaggregation performance curves for Qwen3-VL-32B-FP8 (128 input / 256 output tokens, 20 images per request, 480p). H200 aggregated baseline vs. 1E and 4E heterogeneous B60-H200 configurations.
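The relative changes quoted in the results table can be recomputed directly from its absolute values. The snippet below is a small sketch that only uses the numbers from the table; the dictionary keys and the helper name are illustrative.

```python
# Absolute values copied from the results table (TTFT and ITL at 1.0 RPS).
BASELINE = {"peak_rps": 0.82, "median_ttft_s": 22.0, "p99_ttft_s": 50.0, "median_itl_ms": 44.0}
ONE_E = {"peak_rps": 0.88, "median_ttft_s": 15.0, "p99_ttft_s": 28.0, "median_itl_ms": 36.0}
FOUR_E = {"peak_rps": 0.94, "median_ttft_s": 5.2, "p99_ttft_s": 14.0, "median_itl_ms": 36.0}


def pct_change(new, base):
    """Signed percentage change of new relative to base."""
    return 100.0 * (new - base) / base


for metric, base in BASELINE.items():
    one_e = pct_change(ONE_E[metric], base)
    four_e = pct_change(FOUR_E[metric], base)
    print(f"{metric:>14}: 1E {one_e:+6.1f}%   4E {four_e:+6.1f}%")
```

Negative values mean lower latency. The median and P99 TTFT changes reproduce the roughly 32% / 44% (1E) and 76% / 72% (4E) reductions reported in the key result summary, and the ITL values work out to about 18% lower, which the table summarizes as comparable.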
8. Normalized TCO model
The TCO discussion intentionally avoids publishing exact hardware prices. Real customer pricing varies by procurement model, region, volume, support structure, and timing. Instead, the analysis uses normalized cost units.
Hardware class | Normalized cost unit¹ |
H200 prefill/decode GPU | 1.000 |
B60-class encode GPU | 0.025 |
Under this model, adding one B60-class encode GPU increases normalized hardware cost from 1.000 to 1.025 (+2.5%). Scaling to four B60 encode workers increases it to 1.100 (+10%), while delivering further TTFT improvements.
Configuration | Hardware | Normalized cost¹ |
H200 aggregated TP1 | 1x H200 | 1.000 |
1E B60-H200 E/PD | 1x B60 + 1x H200 | 1.025 |
4E B60-H200 E/PD | 4x B60 + 1x H200 | 1.100 |
TCO interpretation: For +2.5% normalized hardware cost, the 1E configuration improves median TTFT by ~32% and P99 TTFT by ~44%. The business value comes from improving SLO-qualified performance per normalized hardware cost unit, not simply from adding another GPU.
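The normalized cost model is simple enough to express in a few lines. The sketch below uses only the cost units and benchmark values stated in this brief; the function name and the "peak throughput per cost unit" ratio are illustrative and are not a full SLO-qualified metric.

```python
# Normalized cost units from the TCO table.
H200_COST = 1.000
B60_COST = 0.025


def normalized_cost(num_b60, num_h200=1):
    """Normalized hardware cost of a configuration, in H200-equivalent units."""
    return num_h200 * H200_COST + num_b60 * B60_COST


# (number of B60 encode GPUs, peak req/s, median TTFT in s at 1.0 RPS)
CONFIGS = {
    "H200 aggregated TP1": (0, 0.82, 22.0),
    "1E B60-H200 E/PD": (1, 0.88, 15.0),
    "4E B60-H200 E/PD": (4, 0.94, 5.2),
}

for name, (num_b60, peak_rps, median_ttft_s) in CONFIGS.items():
    cost = normalized_cost(num_b60)
    print(
        f"{name:<20} cost={cost:.3f}  "
        f"peak req/s per cost unit={peak_rps / cost:.3f}  "
        f"median TTFT={median_ttft_s} s"
    )
```

Peak throughput per cost unit moves only slightly across the three configurations, while median TTFT drops sharply, which is consistent with the brief's framing that the value lies in latency at a given cost rather than in raw throughput.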
9. When this architecture works best
Heterogeneous E/PD disaggregation is most attractive when the workload has image-heavy or video-heavy requests, long effective input context under high request rates, or a need to scale encode independently.
Condition | Why it favors heterogeneous E/PD |
Image-heavy or video-heavy requests | The more visual input per request, the more valuable it becomes to isolate encode from prefill/decode. |
High request rates with moderate-to-large inputs | Combined load and input size increase prefill/decode pressure, driving TTFT growth and KV cache evictions that significantly hurt performance. |
High-end GPUs are constrained | H200 GPUs can be reserved for prefill/decode rather than spending time on encode. |
TTFT matters | The evaluated configuration materially improved P99 TTFT. |
Encode demand scales separately | Additional encode workers can be added without scaling the entire prefill/decode tier. |
10. Implementation recipe
To reproduce the results or explore the setup further, here are the key resources used in this post:
Workload generator: https://github.com/ai-dynamo/aiperf
Recipe: https://github.com/ai-dynamo/dynamo/tree/main/recipes/qwen3-vl-32b-fp8
Dynamo documentation on E/PD and multimodal inference: https://docs.nvidia.com/dynamo/user-guides/multimodal/encoder-disaggregation
11. Conclusion
Multimodal inference should not be treated as a single monolithic workload. Vision encode, prefill, and decode each place different demands on the system. Running all three stages on the same high-end GPU is simple, but it can use expensive resources inefficiently and lead to rising latency under sustained multimodal load, particularly when non-trivial visual context increases prefill and decode pressure.
Heterogeneous E/PD disaggregation provides a more balanced approach. By adding one B60-class GPU as a dedicated encode resource (1E), the normalized hardware cost increases by only 2.5% while delivering materially lower TTFT. Scaling to four encode workers (4E) increases the normalized cost by 10% and reduces the median TTFT by ~76% relative to the aggregated baseline, demonstrating that encode capacity can be scaled independently to match workload demand.
Core takeaway: A small amount of purpose-fit encode capacity can improve latency behavior across the entire system while preserving HBM-rich GPUs for prefill and decode, where they deliver the most value.
As VLMs become larger and multimodal requests become more image- and video-dense, especially at high request rates, this pattern becomes increasingly important. Dynamo makes this architecture practical by enabling encode and prefill/decode to be deployed as separate stages, allowing teams to match each stage of inference to the right hardware and improve SLO-qualified performance per normalized cost unit.
¹ Sources: https://www.dihuni.com/product/nvidia-h200-nvl-gpu-141gb-900-21010-0040-000-pny-sku-nvh200nvltcgpu-kit/ and https://www.newegg.com/arkn-8357-00128-arc-pro-b60-24gb-graphics/p/N82E16814983001
Intel, the Intel logo, and Arc are trademarks of Intel Corporation or its subsidiaries.