
Scaling Intel® AI for Enterprise RAG Performance: 64-Core vs 96-Core Intel® Xeon®


Authored by: Igor Konopko (Intel) 

Executive Summary

This evaluation shows materially higher concurrency and improved latency scaling when moving from a 64-core to a 96-core Intel® Xeon® configuration for Intel® AI for Enterprise RAG inference. The 96-core SKU doubles SLA-compliant concurrency for Llama-AWQ and Mistral-AWQ (32 → 64 users) across all workloads and increases Qwen-AWQ SLA concurrency by 33–50% (workload dependent) versus the 64-core system.

Introduction

Enterprise RAG

Retrieval-Augmented Generation (RAG) represents a transformative approach to artificial intelligence that combines the power of large language models with real-time information retrieval capabilities. Enterprise RAG systems extend this concept to meet the demanding requirements of business-critical applications, providing production-ready solutions that can handle enterprise-scale workloads with the reliability, security, and performance standards that organizations require.

The Intel® AI for Enterprise RAG solution delivers a comprehensive RAG pipeline. This end-to-end solution integrates embedding models, vector databases, reranking capabilities, and large language models into a unified, scalable architecture designed to address key enterprise challenges: maintaining low latency under high concurrent user loads, ensuring consistent response quality, and providing measurable service level agreements that meet business requirements.

Purpose of the Evaluation

In a previous study, we analyzed the 64-core Intel® Xeon® 6 (Xeon 6767P) platform. This follow-up builds on that work to answer a focused question: What practical capacity and latency gains are realized by migrating AWQ-quantized LLM inference from a dual-socket Granite Rapids 64-core configuration to a higher-density 96-core configuration (Intel® Xeon® 6972P)?

Testing Methodology: Comprehensive Performance Evaluation

The methodology, data preparation, authentication model, concurrency harness, and retrieval/rerank pipeline remain unchanged from the original study. Only the hardware configuration and the number of vLLM replicas differ between the two test scenarios.
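
To give a sense of the harness's shape, the sketch below fires a fixed number of simultaneous requests and collects end-to-end latencies. It is a minimal illustration only: the endpoint URL, model identifier, and payload schema are assumptions made for the example, not the actual interface of the Intel® AI for Enterprise RAG deployment.

```python
# Minimal concurrency-harness sketch (illustrative; not the study's test code).
# The endpoint URL, model id, and payload shape are assumptions for the example.
import asyncio
import time

import httpx

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # hypothetical URL

async def one_request(client: httpx.AsyncClient, prompt: str) -> float:
    payload = {
        "model": "example-awq-model",  # hypothetical model id
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    start = time.perf_counter()
    resp = await client.post(ENDPOINT, json=payload, timeout=120.0)
    resp.raise_for_status()
    return time.perf_counter() - start  # end-to-end latency in seconds

async def run_level(concurrency: int, prompt: str) -> list[float]:
    # Emulate `concurrency` simultaneous users at a single load level.
    async with httpx.AsyncClient() as client:
        tasks = [one_request(client, prompt) for _ in range(concurrency)]
        return await asyncio.gather(*tasks)

if __name__ == "__main__":
    latencies = sorted(asyncio.run(run_level(32, "What is RAG?")))
    print(f"p50 latency: {latencies[len(latencies) // 2]:.2f}s")
```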

Tested AWQ models

As in the previous study, this benchmark evaluates three distinct large language model configurations, this time in their AWQ (Activation-aware Weight Quantization) versions only: Llama-AWQ, Qwen-AWQ, and Mistral-AWQ. Each is optimized for different enterprise deployment scenarios and regional market requirements.
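
For readers unfamiliar with AWQ serving, the snippet below shows one way to load an AWQ-quantized checkpoint with vLLM's offline API. The model identifier is a placeholder, not one of the evaluated checkpoints, and the exact serving flags used in this study are not reproduced here.

```python
# Illustrative sketch: loading an AWQ-quantized checkpoint with vLLM's
# offline API. The model id is a placeholder, not an evaluated checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # placeholder AWQ model
    quantization="awq",  # 4-bit AWQ weights; activations stay at higher precision
)

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Summarize retrieval-augmented generation."], params)
print(outputs[0].outputs[0].text)
```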

Hardware and Software Configuration

Intel® Xeon® 6767P (64-Core) - Baseline Configuration
  • Processors: 2× Intel® Xeon® 6767P (64 physical cores per socket, 350W TDP per socket, SNC OFF)
  • Memory: 512 GB DDR5-6400 (16×32GB modules)
  • vLLM Replicas: 2 pods
  • Network: 4× BCM57412 NetXtreme-E 10Gb RDMA Ethernet Controllers
  • Storage: 447.1GB HFS480G3H2X069N + 447.1GB Dell BOSS-N1
  • BIOS: Version 1.2.6, microcode 0x10003a2
Intel® Xeon® 6972P (96-Core) - Enhanced Configuration
  • Processors: 2× Intel® Xeon® 6972P (96 physical cores per socket, 500W TDP per socket, SNC OFF)
  • Memory: 1536 GB DDR5-6400 (24×64GB modules)
  • vLLM Replicas: 4 pods (doubled for increased throughput)
  • Network: 2× Ethernet Controller X710 for 10GBASE-T + 1× I210 Gigabit
  • Storage: 894.3GB SAMSUNG MZ1L2960HCJR SSD
  • BIOS: BHSDCRB1.IPC.3544.P60.2504160256, microcode 0x10003c1
Shared Software Stack Configuration
  • Operating System: Ubuntu 24.04.2 LTS
  • Embedding Service: TorchServe 0.12.0, 4 pod replicas, model: BAAI/bge-base-en-v1.5
  • Vector Database: Redis 7.4.0-v2, 1M vectors
  • Retriever: 1 pod replica, k=5
  • Reranker: TorchServe 0.12.0, 2 pod replicas, top_n=1, model: BAAI/bge-reranker-base
  • LLM Service: vLLM 0.9.2, BF16 precision
  • Application: Intel® AI for Enterprise RAG 1.4.0
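
To make the retriever and reranker settings above concrete, here is a minimal retrieve-then-rerank sketch using the listed models with k=5 candidates and top_n=1. It substitutes an in-memory cosine search for the Redis vector database and the sentence-transformers library for TorchServe, so it illustrates the data flow rather than the deployed services.

```python
# Retrieve-then-rerank sketch mirroring the configuration above (k=5
# candidates, top_n=1 kept after reranking). An in-memory cosine search
# stands in for Redis, and sentence-transformers stands in for TorchServe.
from sentence_transformers import CrossEncoder, SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")
reranker = CrossEncoder("BAAI/bge-reranker-base")

docs = [
    "RAG combines document retrieval with LLM generation.",
    "Intel Xeon 6 processors target data-center inference.",
    "Vector databases store and search dense embeddings.",
]

def retrieve_and_rerank(query: str, k: int = 5, top_n: int = 1) -> list[str]:
    # Dense retrieval: cosine similarity via normalized embeddings.
    q = embedder.encode(query, normalize_embeddings=True)
    d = embedder.encode(docs, normalize_embeddings=True)
    candidates = [docs[i] for i in (d @ q).argsort()[::-1][:k]]
    # Cross-encoder reranking of the retrieved candidates; keep top_n.
    scores = reranker.predict([(query, c) for c in candidates])
    return [candidates[i] for i in scores.argsort()[::-1][:top_n]]

print(retrieve_and_rerank("What does RAG do?"))
```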

Performance Analysis: 64-Core vs 96-Core Comparison

The following performance data shows how the end-to-end performance of the Intel® AI for Enterprise RAG pipeline behaves across multiple LLM models, varying workload patterns, and concurrency levels. Time to First Token (TTFT), reported in seconds, is the end-to-end latency spanning all RAG components (embedding, retrieval, reranking, and LLM generation); Time Per Output Token (TPOT) is reported in milliseconds. Lower TTFT improves perceived responsiveness; lower TPOT improves sustained token throughput.
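
For clarity on how these two metrics can be derived from a single streamed response, the sketch below times the first token and the inter-token gaps. It assumes an OpenAI-compatible streaming endpoint, which is an assumption for illustration rather than a documented detail of the pipeline's interface.

```python
# Sketch: deriving TTFT (seconds) and TPOT (milliseconds) from one streamed
# completion. An OpenAI-compatible streaming endpoint is assumed here for
# illustration; the study's actual harness and interface may differ.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # hypothetical

def measure(prompt: str, max_tokens: int = 128) -> tuple[float, float]:
    start = time.perf_counter()
    token_times = []
    stream = client.chat.completions.create(
        model="example-awq-model",  # placeholder model id
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            token_times.append(time.perf_counter())
    assert token_times, "no tokens streamed"
    ttft = token_times[0] - start  # includes embedding, retrieval, rerank, LLM
    # TPOT: mean inter-token gap after the first token, in milliseconds.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot_ms = 1000 * sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, tpot_ms

ttft, tpot = measure("Explain TTFT versus TPOT.")
print(f"TTFT={ttft:.2f}s TPOT={tpot:.1f}ms")
```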

Workload: 128 Input / 128 Output
Optimized for short-form Q&A and quick responses


Figure 1. Workload: 128 Input / 128 Output performance numbers

Workload: 256 Input / 256 Output
Medium-length queries with balanced response requirements


Figure 2. Workload: 256 Input / 256 Output performance numbers

Workload: 256 Input / 512 Output
Medium queries requiring comprehensive responses


Figure 3. Workload: 256 Input / 512 Output performance numbers

Workload: 256 Input / 1024 Output
Extended response generation for detailed analysis and explanations


Figure 4. Workload: 256 Input / 1024 Output performance numbers

Service Level Agreement (SLA) Capacity Analysis

We employ the same SLA thresholds as the previous study, based on extensive user experience research (a compliance-check sketch follows the list):

  • Time to First Token (TTFT): < 3 seconds (measures the latency between query submission and the appearance of the first response token)
  • Time Per Output Token (TPOT): < 100 milliseconds (quantifies the rate at which subsequent tokens are generated after the initial response begins)
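
Operationally, finding the maximum SLA-compliant concurrency reduces to picking the highest load level whose measured latencies satisfy both thresholds, as in this small sketch (the per-level numbers are placeholders, not measured results):

```python
# Sketch: pick the highest load level whose measured latencies satisfy both
# SLA thresholds. The per-level numbers below are placeholders, not results.
SLA_TTFT_S = 3.0     # Time to First Token threshold, seconds
SLA_TPOT_MS = 100.0  # Time Per Output Token threshold, milliseconds

# {concurrent_users: (ttft_seconds, tpot_milliseconds)} -- placeholder data
results = {
    8: (0.9, 45.0),
    16: (1.4, 62.0),
    32: (2.1, 88.0),
    64: (3.8, 120.0),
}

def max_sla_concurrency(results: dict[int, tuple[float, float]]) -> int:
    compliant = [
        users
        for users, (ttft, tpot) in results.items()
        if ttft < SLA_TTFT_S and tpot < SLA_TPOT_MS
    ]
    return max(compliant, default=0)

print(max_sla_concurrency(results))  # -> 32 for the placeholder data
```
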
Maximum Concurrent Users Under SLA

The table below reports the maximum sustained concurrency that simultaneously satisfies both SLA thresholds.


Figure 5. Maximum concurrent users under SLA

Key Performance Insights

The 96-core configuration demonstrates its most significant advantages under high concurrent user loads:

  • 64+ Concurrent Users: The 96-core system stays SLA-compliant at 64 concurrent users for Llama-AWQ and Mistral-AWQ, a load at which the 64-core system exceeds the latency thresholds
  • TPOT Improvements: Better token generation performance at high concurrency levels
  • Stability Under Load: More consistent performance characteristics as user count increases
Performance Gains by Model
  • Llama-AWQ: 100% capacity increase (32 → 64 concurrent users) across all workloads
  • Qwen-AWQ: 33-50% capacity increase (12-16 → 16-24 concurrent users) depending on workload complexity
  • Mistral-AWQ: 100% capacity increase (32 → 64 concurrent users) across all workloads

Cost-Benefit Analysis

Hardware Investment Comparison:
  • Core Count: 50% increase (64 → 96 cores)
  • Capacity Gain: 100% user capacity increase for Llama-AWQ and Mistral-AWQ models
Enterprise Value Proposition:
  • User Density: Doubles the concurrent user capacity with moderate additional hardware investment
  • Performance per Dollar: Enhanced efficiency through better per-core utilization (quantified in the sketch after this list)
  • Scaling Economics: Better cost-effectiveness for high-capacity requirements
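
To ground these claims, here is the per-core arithmetic using the figures reported above; it is a small illustrative calculation, not an additional measurement:

```python
# Per-core arithmetic from the reported figures (dual-socket totals):
# 2 x 64 = 128 cores vs 2 x 96 = 192 cores; Llama-AWQ/Mistral-AWQ
# SLA-compliant concurrency of 32 vs 64 users.
baseline_cores, enhanced_cores = 2 * 64, 2 * 96  # 128 vs 192 total cores
baseline_users, enhanced_users = 32, 64          # SLA-compliant users

core_increase = enhanced_cores / baseline_cores - 1      # 0.50 -> +50%
capacity_increase = enhanced_users / baseline_users - 1  # 1.00 -> +100%

print(f"+{core_increase:.0%} cores, +{capacity_increase:.0%} SLA users")
print(f"users/core: {baseline_users / baseline_cores:.2f} -> "
      f"{enhanced_users / enhanced_cores:.2f}")  # 0.25 -> 0.33
```

On these numbers, SLA-compliant user density rises from 0.25 to roughly 0.33 users per core for Llama-AWQ and Mistral-AWQ, about a one-third improvement per core.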

Deployment Recommendations

96-Core Optimal Use Cases:
  • High-Capacity Deployments: Organizations planning to support 50+ concurrent users
  • Multi-Model Environments: Enterprises requiring multiple AWQ models simultaneously
  • Peak Load Management: Applications with significant usage spikes requiring burst capacity
  • Consolidation Strategies: Data centers seeking to consolidate multiple 64-core deployments
64-Core Sufficient Scenarios:
  • Moderate Capacity Requirements: Deployments targeting fewer than 32 concurrent users
  • Budget-Conscious Implementations: Organizations prioritizing minimal initial investment
  • Pilot Deployments: Initial RAG implementations with planned future scaling

Conclusion

The Intel® Xeon® 6972P (96-core) configuration delivers substantial performance improvements over the 64-core baseline, doubling SLA-compliant concurrency for Llama-AWQ and Mistral-AWQ while materially enhancing Qwen-AWQ scalability under load. The incremental investment in cores, memory, and compute resources yields a strong return on investment (ROI) for enterprises requiring sustained high-capacity RAG deployments.

Key Recommendations:
  • Deploy 96-core systems for enterprise implementations targeting 40+ concurrent users
  • Leverage enhanced capacity for consolidation strategies and multi-model environments
  • Plan infrastructure scaling proactively based on user growth projections and peak load requirements
  • Consider regional model optimization when selecting between 64-core and 96-core configurations

The 96-core platform provides immediate capacity uplift and establishes a scalable foundation for future multi-model expansion and evolving workload complexity.

Notices & Disclaimers

Performance varies by use, configuration and other factors. Learn more on the Performance Index site.

Intel technologies may require enabled hardware, software or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.