
Scaling Intel® AI for Enterprise RAG Performance: 64-Core vs 96-Core Intel® Xeon®


Authored by: Igor Konopko (Intel) 

Executive Summary

This evaluation shows materially higher concurrency and improved latency scaling when moving from a 64-core to a 96-core Intel® Xeon® configuration for Intel® AI for Enterprise RAG inference. The 96-core SKU doubles SLA-compliant concurrency for Llama-AWQ and Mistral-AWQ (32 → 64 users) across all workloads and increases Qwen-AWQ SLA concurrency by 33–50% (workload dependent) versus the 64-core system.

Introduction

Enterprise RAG

Retrieval-Augmented Generation (RAG) represents a transformative approach to artificial intelligence that combines the power of large language models with real-time information retrieval capabilities. Enterprise RAG systems extend this concept to meet the demanding requirements of business-critical applications, providing production-ready solutions that can handle enterprise-scale workloads with the reliability, security, and performance standards that organizations require.

The Intel® AI for Enterprise RAG solution delivers a comprehensive RAG pipeline. This end-to-end solution integrates embedding models, vector databases, reranking capabilities, and large language models into a unified, scalable architecture designed to address key enterprise challenges: maintaining low latency under high concurrent user loads, ensuring consistent response quality, and providing measurable service level agreements that meet business requirements.

Purpose of the Evaluation

In a previous study, we analyzed the 64-core Intel® Xeon® 6 (Xeon 6767P) platform. This follow-up builds on that work to answer a focused question: What practical capacity and latency gains are realized by migrating AWQ-quantized LLM inference from a dual-socket Granite Rapids 64-core configuration to a higher-density 96-core configuration (Intel® Xeon® 6972P)?

Testing Methodology: Comprehensive Performance Evaluation

The methodology, data preparation, authentication model, concurrency harness, and retrieval/rerank pipeline remain unchanged from the original study. Only the hardware configuration and the number of vLLM replicas differ between the two test scenarios.
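
To give a sense of the harness's shape, the sketch below fires a fixed number of simultaneous requests and collects end-to-end latencies. It is a minimal illustration only: the endpoint URL, model identifier, and payload schema are assumptions made for the example, not the actual interface of the Intel® AI for Enterprise RAG deployment.

```python
# Minimal concurrency-harness sketch (illustrative; not the study's test code).
# The endpoint URL, model id, and payload shape are assumptions for the example.
import asyncio
import time

import httpx

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # hypothetical URL

async def one_request(client: httpx.AsyncClient, prompt: str) -> float:
    payload = {
        "model": "example-awq-model",  # hypothetical model id
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    start = time.perf_counter()
    resp = await client.post(ENDPOINT, json=payload, timeout=120.0)
    resp.raise_for_status()
    return time.perf_counter() - start  # end-to-end latency in seconds

async def run_level(concurrency: int, prompt: str) -> list[float]:
    # Emulate `concurrency` simultaneous users at a single load level.
    async with httpx.AsyncClient() as client:
        tasks = [one_request(client, prompt) for _ in range(concurrency)]
        return await asyncio.gather(*tasks)

if __name__ == "__main__":
    latencies = sorted(asyncio.run(run_level(32, "What is RAG?")))
    print(f"p50 latency: {latencies[len(latencies) // 2]:.2f}s")
```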

Tested AWQ models

As in the previous study, this benchmark evaluates three distinct large language model configurations, this time in their AWQ (Activation-aware Weight Quantization) versions only: Llama-AWQ, Qwen-AWQ, and Mistral-AWQ. Each is optimized for different enterprise deployment scenarios and regional market requirements.
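
For readers unfamiliar with AWQ serving, the snippet below shows one way to load an AWQ-quantized checkpoint with vLLM's offline API. The model identifier is a placeholder, not one of the evaluated checkpoints, and the exact serving flags used in this study are not reproduced here.

```python
# Illustrative sketch: loading an AWQ-quantized checkpoint with vLLM's
# offline API. The model id is a placeholder, not an evaluated checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # placeholder AWQ model
    quantization="awq",  # 4-bit AWQ weights; activations stay at higher precision
)

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Summarize retrieval-augmented generation."], params)
print(outputs[0].outputs[0].text)
```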

Hardware and Software Configuration

Intel® Xeon® 6767P (64-Core) - Baseline Configuration
  • Processors: 2× Intel® Xeon® 6767P (64 physical cores per socket, 350W TDP per socket, SNC OFF)
  • Memory: 512 GB DDR5-6400 (16×32GB modules)
  • vLLM Replicas: 2 pods
  • Network: 4× BCM57412 NetXtreme-E 10Gb RDMA Ethernet Controllers
  • Storage: 447.1GB HFS480G3H2X069N + 447.1GB Dell BOSS-N1
  • BIOS: Version 1.2.6, microcode 0x10003a2
Intel® Xeon® 6972P (96-Core) - Enhanced Configuration
  • Processors: 2× Intel® Xeon® 6972P (96 physical cores per socket, 500W TDP per socket, SNC OFF)
  • Memory: 1536 GB DDR5-6400 (24×64GB modules)
  • vLLM Replicas: 4 pods (doubled for increased throughput)
  • Network: 2× Ethernet Controller X710 for 10GBASE-T + 1× I210 Gigabit
  • Storage: 894.3GB SAMSUNG MZ1L2960HCJR SSD
  • BIOS: BHSDCRB1.IPC.3544.P60.2504160256, microcode 0x10003c1
Shared Software Stack Configuration
  • Operating System: Ubuntu 24.04.2 LTS
  • Embedding Service: TorchServe 0.12.0, 4 pod replicas, model: BAAI/bge-base-en-v1.5
  • Vector Database: Redis 7.4.0-v2, 1M vectors
  • Retriever: 1 pod replica, k=5
  • Reranker: TorchServe 0.12.0, 2 pod replicas, top_n=1, model: BAAI/bge-reranker-base
  • LLM Service: vLLM 0.9.2, BF16 precision
  • Application: Intel® AI for Enterprise RAG 1.4.0
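
To make the retriever and reranker settings above concrete, here is a minimal retrieve-then-rerank sketch using the listed models with k=5 candidates and top_n=1. It substitutes an in-memory cosine search for the Redis vector database and the sentence-transformers library for TorchServe, so it illustrates the data flow rather than the deployed services.

```python
# Retrieve-then-rerank sketch mirroring the configuration above (k=5
# candidates, top_n=1 kept after reranking). An in-memory cosine search
# stands in for Redis, and sentence-transformers stands in for TorchServe.
from sentence_transformers import CrossEncoder, SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")
reranker = CrossEncoder("BAAI/bge-reranker-base")

docs = [
    "RAG combines document retrieval with LLM generation.",
    "Intel Xeon 6 processors target data-center inference.",
    "Vector databases store and search dense embeddings.",
]

def retrieve_and_rerank(query: str, k: int = 5, top_n: int = 1) -> list[str]:
    # Dense retrieval: cosine similarity via normalized embeddings.
    q = embedder.encode(query, normalize_embeddings=True)
    d = embedder.encode(docs, normalize_embeddings=True)
    candidates = [docs[i] for i in (d @ q).argsort()[::-1][:k]]
    # Cross-encoder reranking of the retrieved candidates; keep top_n.
    scores = reranker.predict([(query, c) for c in candidates])
    return [candidates[i] for i in scores.argsort()[::-1][:top_n]]

print(retrieve_and_rerank("What does RAG do?"))
```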

Performance Analysis: 64-Core vs 96-Core Comparison

The following performance data shows how the end-to-end performance of the Intel® AI for Enterprise RAG pipeline behaves across multiple LLM models, varying workload patterns, and concurrency levels. Time to First Token (TTFT), reported in seconds, is the end-to-end latency spanning all RAG components (embedding, retrieval, reranking, and LLM generation); Time Per Output Token (TPOT) is reported in milliseconds. Lower TTFT improves perceived responsiveness; lower TPOT improves sustained token throughput.
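
For clarity on how these two metrics can be derived from a single streamed response, the sketch below times the first token and the inter-token gaps. It assumes an OpenAI-compatible streaming endpoint, which is an assumption for illustration rather than a documented detail of the pipeline's interface.

```python
# Sketch: deriving TTFT (seconds) and TPOT (milliseconds) from one streamed
# completion. An OpenAI-compatible streaming endpoint is assumed here for
# illustration; the study's actual harness and interface may differ.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # hypothetical

def measure(prompt: str, max_tokens: int = 128) -> tuple[float, float]:
    start = time.perf_counter()
    token_times = []
    stream = client.chat.completions.create(
        model="example-awq-model",  # placeholder model id
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            token_times.append(time.perf_counter())
    assert token_times, "no tokens streamed"
    ttft = token_times[0] - start  # includes embedding, retrieval, rerank, LLM
    # TPOT: mean inter-token gap after the first token, in milliseconds.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot_ms = 1000 * sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, tpot_ms

ttft, tpot = measure("Explain TTFT versus TPOT.")
print(f"TTFT={ttft:.2f}s TPOT={tpot:.1f}ms")
```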

Workload: 128 Input / 128 Output
Optimized for short-form Q&A and quick responses


Figure 1. Workload: 128 Input / 128 Output performance numbers

Workload: 256 Input / 256 Output
Medium-length queries with balanced response requirements


Figure 2. Workload: 256 Input / 256 Output performance numbers

Workload: 256 Input / 512 Output
Medium queries requiring comprehensive responses


Figure 3. Workload: 256 Input / 512 Output performance numbers

Workload: 256 Input / 1024 Output
Extended response generation for detailed analysis and explanations


Figure 4. Workload: 256 Input / 1024 Output performance numbers

Service Level Agreement (SLA) Capacity Analysis

We employ the same SLA thresholds as the previous study, based on extensive user experience research (a compliance-check sketch follows the list):

  • Time to First Token (TTFT): < 3 seconds (measures the latency between query submission and the appearance of the first response token)
  • Time Per Output Token (TPOT): < 100 milliseconds (quantifies the rate at which subsequent tokens are generated after the initial response begins)
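
Operationally, finding the maximum SLA-compliant concurrency reduces to picking the highest load level whose measured latencies satisfy both thresholds, as in this small sketch (the per-level numbers are placeholders, not measured results):

```python
# Sketch: pick the highest load level whose measured latencies satisfy both
# SLA thresholds. The per-level numbers below are placeholders, not results.
SLA_TTFT_S = 3.0     # Time to First Token threshold, seconds
SLA_TPOT_MS = 100.0  # Time Per Output Token threshold, milliseconds

# {concurrent_users: (ttft_seconds, tpot_milliseconds)} -- placeholder data
results = {
    8: (0.9, 45.0),
    16: (1.4, 62.0),
    32: (2.1, 88.0),
    64: (3.8, 120.0),
}

def max_sla_concurrency(results: dict[int, tuple[float, float]]) -> int:
    compliant = [
        users
        for users, (ttft, tpot) in results.items()
        if ttft < SLA_TTFT_S and tpot < SLA_TPOT_MS
    ]
    return max(compliant, default=0)

print(max_sla_concurrency(results))  # -> 32 for the placeholder data
```
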
Maximum Concurrent Users Under SLA

The table below reports the maximum sustained concurrency that simultaneously satisfies both SLA thresholds.


Figure 5. Maximum concurrent users under SLA

Key Performance Insights

The 96-core configuration demonstrates its most significant advantages under high concurrent user loads:

  • 64+ Concurrent Users: The 96-core system stays SLA-compliant at 64 concurrent users for Llama-AWQ and Mistral-AWQ, a load at which the 64-core system exceeds the latency thresholds
  • TPOT Improvements: Better token generation performance at high concurrency levels
  • Stability Under Load: More consistent performance characteristics as user count increases
Performance Gains by Model
  • Llama-AWQ: 100% capacity increase (32 → 64 concurrent users) across all workloads
  • Qwen-AWQ: 33-50% capacity increase (12-16 → 16-24 concurrent users) depending on workload complexity
  • Mistral-AWQ: 100% capacity increase (32 → 64 concurrent users) across all workloads

Cost-Benefit Analysis

Hardware Investment Comparison:
  • Core Count: 50% increase (64 → 96 cores)
  • Capacity Gain: 100% user capacity increase for Llama-AWQ and Mistral-AWQ models
Enterprise Value Proposition:
  • User Density: Doubles the concurrent user capacity with moderate additional hardware investment
  • Performance per Dollar: Enhanced efficiency through better per-core utilization (quantified in the sketch after this list)
  • Scaling Economics: Better cost-effectiveness for high-capacity requirements
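
To ground these claims, here is the per-core arithmetic using the figures reported above; it is a small illustrative calculation, not an additional measurement:

```python
# Per-core arithmetic from the reported figures (dual-socket totals):
# 2 x 64 = 128 cores vs 2 x 96 = 192 cores; Llama-AWQ/Mistral-AWQ
# SLA-compliant concurrency of 32 vs 64 users.
baseline_cores, enhanced_cores = 2 * 64, 2 * 96  # 128 vs 192 total cores
baseline_users, enhanced_users = 32, 64          # SLA-compliant users

core_increase = enhanced_cores / baseline_cores - 1      # 0.50 -> +50%
capacity_increase = enhanced_users / baseline_users - 1  # 1.00 -> +100%

print(f"+{core_increase:.0%} cores, +{capacity_increase:.0%} SLA users")
print(f"users/core: {baseline_users / baseline_cores:.2f} -> "
      f"{enhanced_users / enhanced_cores:.2f}")  # 0.25 -> 0.33
```

On these numbers, SLA-compliant user density rises from 0.25 to roughly 0.33 users per core for Llama-AWQ and Mistral-AWQ, about a one-third improvement per core.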

Deployment Recommendations

96-Core Optimal Use Cases:
  • High-Capacity Deployments: Organizations planning to support 50+ concurrent users
  • Multi-Model Environments: Enterprises requiring multiple AWQ models simultaneously
  • Peak Load Management: Applications with significant usage spikes requiring burst capacity
  • Consolidation Strategies: Data centers seeking to consolidate multiple 64-core deployments
64-Core Sufficient Scenarios:
  • Moderate Capacity Requirements: Deployments targeting fewer than 32 concurrent users
  • Budget-Conscious Implementations: Organizations prioritizing minimal initial investment
  • Pilot Deployments: Initial RAG implementations with planned future scaling

Conclusion

The Intel® Xeon® 6972P (96-core) configuration delivers substantial performance improvements over the 64-core baseline, doubling SLA-compliant concurrency for Llama-AWQ and Mistral-AWQ while materially enhancing Qwen-AWQ scalability under load. The incremental investment in cores, memory, and compute resources yields a strong return on investment (ROI) for enterprises requiring sustained high-capacity RAG deployments.

Key Recommendations:
  • Deploy 96-core systems for enterprise implementations targeting 40+ concurrent users
  • Leverage enhanced capacity for consolidation strategies and multi-model environments
  • Plan infrastructure scaling proactively based on user growth projections and peak load requirements
  • Consider regional model optimization when selecting between 64-core and 96-core configurations

The 96-core platform provides immediate capacity uplift and establishes a scalable foundation for future multi-model expansion and evolving workload complexity.

Notices & Disclaimers

Performance varies by use, configuration and other factors. Learn more on the Performance Index site.

Intel technologies may require enabled hardware, software or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.