Authored by: Jakub Piasecki and Michał Prostko
Multi-node deployments using Intel® AI for Enterprise RAG
As enterprises increasingly adopt generative AI to unlock insights from proprietary data, the need for scalable, efficient, and hardware-aware infrastructure becomes critical. Intel® AI for Enterprise RAG addresses this challenge by offering a modular, production-ready framework for Retrieval-Augmented Generation (RAG), a technique that enhances LLM responses with domain-specific knowledge retrieved from enterprise data sources.
Built to run on modern Kubernetes clusters and optimized for Intel® Xeon® platforms as well as Intel® Gaudi® AI accelerators, Intel® AI for Enterprise RAG uses advanced pod scheduling, resource isolation, and dynamic scaling to deliver efficient inference across diverse environments. This article explores how the solution scales across multiple nodes, intelligently discovers hardware topology, and uses a Node Resource Interface (NRI) plugin to isolate resources for predictable and efficient AI workloads.
Scaling the Solution Across Multiple Nodes
When deploying the solution in a multi-node Kubernetes environment, one of the first challenges is ensuring that pods are scheduled on the most appropriate nodes - especially given customers' diverse infrastructure setups. To address this, we’ve implemented a node affinity mechanism that dynamically evaluates each node’s capabilities.
Node Topology Discovery and NUMA-Aware Scheduling
This feature performs node topology discovery for each node in the cluster to gather the following information:
- numa_nodes: Number of NUMA nodes present on the Kubernetes node.
- cpus_per_numa_node: Number of CPU cores available per NUMA node.
- amx_supported: Indicates if AMX (Advanced Matrix Extensions) is supported. This is typically true for platforms like 4th Gen Intel® Xeon® Scalable Processor (formerly Sapphire Rapids) or newer.
- numa_balanced: Indicates if NUMA nodes have balanced CPU cores and memory distribution. A balanced topology improves performance and scheduling efficiency.
- max_balloons_vllm: Maximum number of vLLM pods ("balloons") that can be scheduled on the node.
- max_balloons_reranker: Maximum number of reranker pods ("balloons") that can be scheduled on the node.
- gaudi_available: Indicates if the node contains Intel® Gaudi® AI accelerators. This is determined by checking if the “habana-device-plugin” pod is running on the node.
Based on this information, the system automatically creates taints and labels on each node to determine eligibility for inference workloads. Pods that serve LLMs, such as vLLM or TorchServe, will prioritize scheduling on nodes with AMX support, whereas other components may be deployed on less specialized nodes.
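For illustration, a pod that should land on AMX-capable nodes could express that preference through standard Kubernetes node affinity. The sketch below is illustrative only: the label key (amx-supported) and the image name are assumptions, since the product derives its actual label and taint names from the topology discovery described above.

apiVersion: v1
kind: Pod
metadata:
  name: vllm-example
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: amx-supported        # assumed label created during topology discovery
                operator: In
                values: ["true"]
  containers:
    - name: vllm
      image: vllm-openai:latest           # placeholder image name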
Kubernetes Resources
By default, Kubernetes allows multiple pods to share the same physical CPU core. For CPU-intensive workloads, however, it is recommended to allocate dedicated CPU cores to avoid resource contention. Similarly, memory-intensive tasks should be confined to a single NUMA node to prevent performance degradation. Since inferencing demands both high CPU and memory resources, running such workloads efficiently on Kubernetes can be challenging. We address this by leveraging the Node Resource Interface (NRI) balloons plugin. NRI enables advanced Kubernetes resource management by isolating workloads at the CPU level and aligning them with the underlying hardware topology.
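For comparison, vanilla Kubernetes grants exclusive cores only to Guaranteed-QoS pods when the kubelet runs with the static CPU manager policy: the container must request whole CPUs with requests equal to limits. The sketch below shows such a pod; the image name and resource sizes are illustrative assumptions, not the product's defaults.

apiVersion: v1
kind: Pod
metadata:
  name: vllm-guaranteed
spec:
  containers:
    - name: vllm
      image: vllm-openai:latest        # placeholder image name
      resources:
        requests:
          cpu: "32"                    # whole CPUs, requests == limits => Guaranteed QoS
          memory: 64Gi
        limits:
          cpu: "32"
          memory: 64Gi

The NRI balloons plugin described in the next section builds on this idea by also aligning the dedicated CPUs with the node's NUMA topology.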
The Balloons Policy
At the core of this strategy is the balloons policy. Balloons are logical CPU groupings that isolate workloads from one another. Each balloon is assigned to a specific class of workloads (e.g., compute-heavy inference pods), ensuring that critical tasks are not impacted by noisy neighbors.
When a pod is scheduled, the NRI plugin assigns it to a balloon based on annotations or runtime configuration. This guarantees that the pod runs only on a dedicated set of CPU cores, improving latency, throughput, and predictability. It also ensures that no other pod can consume the balloon's allocated CPUs, providing strict isolation and preventing resource contention.
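As a sketch of the annotation-based assignment, the upstream NRI balloons plugin documents an annotation key of the form balloon.balloons.resource-policy.nri.io; treat the exact key and values below as assumptions to verify against the plugin version deployed with the product.

apiVersion: v1
kind: Pod
metadata:
  name: vllm-0
  annotations:
    balloon.balloons.resource-policy.nri.io: vllm-balloon   # request placement in the vllm-balloon type
spec:
  containers:
    - name: vllm
      image: vllm-openai:latest                              # placeholder image name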
Resource Isolation Features
- Node topology discovery and scheduling: This process automatically inspects each cluster node to determine its CPU layout, NUMA configuration, CPU generation, and other hardware features.
- NUMA-aware placement: vLLM pods are spread evenly across NUMA nodes, and each pod is confined to a single NUMA node to avoid cross-node memory access.
- Sibling CPU allocation: CPUs not used by vLLM pods are available for other workloads.
- Dynamic scaling with HPA: When the balloons.enabled flag is set in config.yaml, the Horizontal Pod Autoscaler’s maxReplicas value is automatically adjusted to match the calculated maxBalloonShape (see the configuration sketch after this list).
- Per-node BalloonsPolicy configuration: Each node in the cluster has its own BalloonsPolicy object, which defines resource allocation rules for different pod types. These policies are created automatically and tailored to the node's topology and capabilities.
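A hypothetical excerpt from config.yaml illustrating the flag mentioned above; only balloons.enabled is taken from this article, and the surrounding structure is an assumption.

balloons:
  enabled: true   # when true, HPA maxReplicas is aligned with the calculated maxBalloonShape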
Below is an example snippet of a BalloonsPolicy for a node, describing the configuration for vLLM workloads.
spec:
  agent:
    nodeResourceTopology: true
  allocatorTopologyBalancing: true
  balloonTypes:
    - name: vllm-balloon
      allocatorPriority: high
      allocatorTopologyBalancing: true
      loads:
        - llm-inference
      matchExpressions:
        - key: name
          operator: In
          values:
            - vllm
            - edp-vllm
      maxBalloons: 2
      maxCPUs: 32
      minCPUs: 32
      preferIsolCpus: false
      preferNewBalloons: true
Explanation of Key Fields
- nodeResourceTopology: Enables topology-aware scheduling.
- allocatorTopologyBalancing: Ensures balanced NUMA placement.
- balloonTypes: Defines isolated CPU allocations for specific workloads.
- name: Identifies the balloon type.
- allocatorPriority: Sets the priority with which CPUs are allocated to this balloon type; higher-priority types are served first.
- loads: References load classes defined in another section of the balloons policy.
- matchExpressions: Matches pods by label (name: vllm, edp-vllm).
- maxBalloons: Limits the number of vLLM pods per node.
- maxCPUs / minCPUs: Allocates exactly 32 CPUs per balloon.
- preferIsolCpus: Disables preference for isolated CPUs.
- preferNewBalloons: Prefers creating new balloons for clean isolation.
Observability and Telemetry
To support observability and troubleshooting at scale, Intel® AI for Enterprise RAG includes a robust telemetry stack integrated with Grafana dashboards. These dashboards provide visibility into system behavior, resource usage, and workload distribution across nodes, helping teams identify bottlenecks and performance anomalies.
In addition to monitoring, the solution leverages the Horizontal Pod Autoscaler (HPA) in combination with Prometheus metrics to react dynamically to system load. When bottlenecks are detected - such as increased CPU usage or latency - HPA can automatically increase the number of replicas for key components, ensuring sustained performance under growing demand.
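As a minimal sketch of this mechanism, an autoscaling/v2 HorizontalPodAutoscaler can scale a vLLM deployment on a Prometheus-backed metric. The metric name, target value, and replica bounds below are illustrative assumptions rather than the product's shipped configuration.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm
  minReplicas: 1
  maxReplicas: 2                                  # with balloons enabled, aligned to the node's balloon capacity
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_request_latency_seconds      # assumed Prometheus-derived custom metric
        target:
          type: AverageValue
          averageValue: "2"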
Below is an example of an HPA dashboard that displays changes in the number of replicas over time, along with thresholds and the measured metric values.
This observability and automation framework has been instrumental in validating the solution’s scalability. As we stress-tested the product on a multi-node cluster, we observed utilization increasing automatically across multiple nodes, confirming that the combination of autoscaling, intelligent scheduling, resource isolation, and NUMA-aware placement delivers efficient inference workloads.
Conclusion
This combination of topology discovery, balloon-based isolation, and dynamic scaling makes the Intel® AI for Enterprise RAG solution adaptable across diverse infrastructure setups. By leveraging hardware capabilities and Kubernetes-native mechanisms, it provides a robust foundation for deploying scalable, efficient, and enterprise-grade generative AI applications.
Notices & Disclaimers
Performance varies by use, configuration and other factors. Learn more on the Performance Index site.
Intel technologies may require enabled hardware, software or service activation.
No product or component can be absolutely secure.
Your costs and results may vary.
Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.