Comprehensive Analysis: Intel® AI for Enterprise RAG Performance

Authored by: Igor Konopko (Intel) 

Executive Summary

This comprehensive analysis demonstrates that systems with two 64-core Intel® Xeon® processors can effectively support enterprise-scale RAG deployments, handling up to 32 concurrent users with optimized configurations that meet the target SLAs. These results validate Intel® Xeon® as a viable foundation for production RAG systems, offering operational simplicity and cost-effective scaling for enterprise workloads by eliminating GPU dependencies and reducing infrastructure complexity.

Introduction

Enterprise RAG

Retrieval-Augmented Generation (RAG) represents a transformative approach to artificial intelligence that combines the power of large language models with real-time information retrieval capabilities. Enterprise RAG systems extend this concept to meet the demanding requirements of business-critical applications, providing production-ready solutions that can handle enterprise-scale workloads with the reliability, security, and performance standards that organizations require.

The Intel® AI for Enterprise RAG solution delivers a comprehensive RAG pipeline. This end-to-end solution integrates embedding models, vector databases, reranking capabilities, and large language models into a unified, scalable architecture designed to address key enterprise challenges: maintaining low latency under high concurrent user loads, ensuring consistent response quality, and providing measurable service level agreements that meet business requirements.

Purpose of the Evaluation

This analysis presents a comprehensive performance evaluation of the Intel® AI for Enterprise RAG solution on Intel® Xeon®. It also compares AWQ (Activation-aware Weight Quantization) and standard BF16 model implementations, giving enterprises detailed insight into the performance trade-offs and optimization benefits available through advanced quantization techniques.

Testing Methodology: Comprehensive Performance Evaluation

Testing Architecture and Procedure

The Intel® AI for Enterprise RAG benchmark employs the end-to-end ChatQA testing suite, specifically designed to simulate real-world enterprise usage patterns. The methodology implements a multi-layered approach that evaluates the entire RAG pipeline under controlled, reproducible conditions across both AWQ-quantized and standard model implementations.

Data Preparation and Vector Database Setup

The testing begins with comprehensive data ingestion, populating the vector database with approximately 1 million vectors. This dataset combines ~55,000 vectors derived from real contextual documents with additional Wikipedia content to reach a database size capable of stressing the vector database's search algorithms. The benchmark uses questions generated from PubMed medical literature (the pubmed23n0001 dataset), ensuring domain-relevant, realistic query patterns across all model variants.
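To make the setup concrete, here is a minimal sketch of what such an ingestion step can look like with redis-py against a Redis vector index. The index name, key prefix, and helper are illustrative, not the benchmark's actual ingestion code; the 768-dimensional vectors match the BAAI/bge-base-en-v1.5 embedding model used in this stack.

```python
import numpy as np
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

r = redis.Redis(host="localhost", port=6379)

DIM = 768  # embedding width of BAAI/bge-base-en-v1.5

# Hypothetical HNSW index over hashes whose keys start with "doc:".
r.ft("rag_idx").create_index(
    fields=[
        TextField("content"),
        VectorField("embedding", "HNSW",
                    {"TYPE": "FLOAT32", "DIM": DIM, "DISTANCE_METRIC": "COSINE"}),
    ],
    definition=IndexDefinition(prefix=["doc:"], index_type=IndexType.HASH),
)

def ingest_chunk(doc_id: str, text: str, vector: np.ndarray) -> None:
    """Store one document chunk and its embedding as a Redis hash."""
    r.hset(f"doc:{doc_id}", mapping={
        "content": text,
        "embedding": vector.astype(np.float32).tobytes(),
    })
```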

Concurrent Load Simulation

The testing framework employs Python-based benchmark tools that simulate concurrent user loads ranging from 12 to 128 parallel connections. Each connection maintains independent session state and executes queries with varying input token lengths (128-256 tokens) and expected output lengths (128-1024 tokens), providing comprehensive coverage of enterprise usage patterns for both quantized and BF16 models. This approach simulates realistic authentication scenarios where multiple concurrent users access the system over HTTPS (just as they would through the UI), ensuring consistent security overhead across all model configurations.
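As an illustration, a minimal load-generator sketch in the same spirit might look like the following; the endpoint URL and payload shape are hypothetical placeholders, not the actual benchmark tool.

```python
import asyncio
import random
import time

import aiohttp

ENDPOINT = "https://rag.example.local/v1/chatqna"  # hypothetical endpoint

async def user_session(queries: list[str]) -> list[float]:
    """One simulated user with independent session state, issuing
    queries sequentially over HTTPS and recording total latency."""
    latencies = []
    async with aiohttp.ClientSession() as session:
        for q in queries:
            payload = {"text": q, "max_tokens": random.choice([128, 256, 512, 1024])}
            start = time.perf_counter()
            async with session.post(ENDPOINT, json=payload) as resp:
                await resp.read()  # drain the streamed response
            latencies.append(time.perf_counter() - start)
    return latencies

async def run_load(concurrency: int, queries: list[str]) -> list[list[float]]:
    """Run `concurrency` independent user sessions in parallel."""
    return await asyncio.gather(
        *(user_session(queries) for _ in range(concurrency))
    )

# Example: 32 concurrent users, each asking three questions.
# per_user_latencies = asyncio.run(run_load(32, ["What is RAG?"] * 3))
```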

Performance Metrics Collection

The benchmark captures two critical performance indicators across all model variants:

  • Time to First Token (TTFT): measures the latency between query submission and the appearance of the first response token. It is the end-to-end latency, covering every RAG component: embedding, retrieval, reranking, and the LLM. In enterprise contexts, this metric directly correlates with user perception of system responsiveness. The benchmark methodology establishes a target TTFT of under 3 seconds, which research indicates is the threshold beyond which users perceive noticeable delays in interactive applications.
  • Time Per Output Token (TPOT): quantifies the rate at which subsequent tokens are generated after the initial response begins. This metric is critical for maintaining user engagement during longer responses and directly impacts the perceived "intelligence" and fluency of the system. The SLA target of under 100 milliseconds per token ensures smooth, natural-feeling response streaming. A sketch of deriving both metrics from a token stream follows this list.
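Both metrics can be derived from the timestamps of a streamed response. A minimal sketch, assuming the client records when the request was sent and when each token arrived:

```python
import statistics

def ttft_tpot(request_sent: float, token_times: list[float]) -> tuple[float, float]:
    """Return (TTFT in seconds, TPOT in milliseconds) for one response.

    TTFT is the gap between sending the request and the first token;
    TPOT averages the gaps between consecutive tokens after that.
    """
    ttft = token_times[0] - request_sent
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    tpot_ms = 1000.0 * statistics.mean(gaps) if gaps else 0.0
    return ttft, tpot_ms

# Example with illustrative timestamps (seconds):
# ttft, tpot = ttft_tpot(0.0, [1.8, 1.86, 1.93, 1.99])  # -> 1.8 s, ~63 ms
```
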
Test Configuration Parameters

The benchmark evaluates four input/output token combinations across all models to simulate different ChatQnA use cases (expressed as a parameter sweep in the sketch after this list):

  • 128 input tokens / 128 output tokens: Short query, short response scenario
  • 256 input tokens / 256 output tokens: Medium query, medium response scenario
  • 256 input tokens / 512 output tokens: Medium query, extended response scenario
  • 256 input tokens / 1024 output tokens: Medium query, long response scenario
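In the benchmark tooling these combinations amount to a simple parameter sweep. A hypothetical representation follows; the field names are illustrative, and the exact concurrency steps within the 12-128 range are an assumption.

```python
# Illustrative workload matrix; not the benchmark tool's actual schema.
WORKLOADS = [
    {"input_tokens": 128, "output_tokens": 128},   # short query, short response
    {"input_tokens": 256, "output_tokens": 256},   # medium query, medium response
    {"input_tokens": 256, "output_tokens": 512},   # medium query, extended response
    {"input_tokens": 256, "output_tokens": 1024},  # medium query, long response
]

# Assumed concurrency sweep within the stated 12-128 connection range.
CONCURRENCY_LEVELS = [12, 16, 24, 32, 48, 64, 96, 128]
```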

Hardware and Software Configuration

Hardware Infrastructure
  • Processors: 2× Intel® Xeon® 6767P (64 physical cores per socket, 350W TDP per socket, SNC OFF)
  • Memory: 512GB DDR5-6400 (16×32GB modules at 6400MT/s)
  • Storage: 447.1GB HFS480G3H2X069N + 447.1GB Dell BOSS-N1
  • Network: 4× BCM57412 NetXtreme-E 10Gb RDMA Ethernet Controllers
  • Features: Hyper-Threading enabled, Turbo Boost enabled
  • BIOS: Version 1.2.6, microcode 0x10003a2

Software Stack Configuration
  • Operating System: Ubuntu 24.04.2 LTS (kernel 6.8.0-64-generic)
  • Embedding Service: TorchServe 0.12.0, 4 pod replicas, model: BAAI/bge-base-en-v1.5
  • Vector Database: Redis 7.4.0-v2, 1M vectors
  • Retriever: 1 pod replica, k=5
  • Reranker: TorchServe 0.12.0, 2 pod replicas, top_n=1, model: BAAI/bge-reranker-base
  • LLM Service: vLLM 0.9.2, 2 pod replicas, BF16 precision
  • Application: Intel® AI for Enterprise RAG 1.4.0
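To make the division of labor concrete, here is a minimal sketch of the query path implied by this configuration. The `embed`, `rerank`, and `llm_generate` callables are placeholders standing in for the TorchServe and vLLM service calls, and `rag_idx` reuses the hypothetical index from the ingestion sketch above.

```python
import numpy as np
from redis.commands.search.query import Query

def answer(question: str, r, embed, rerank, llm_generate,
           k: int = 5, top_n: int = 1) -> str:
    """Embed the query, retrieve k=5 candidates from Redis,
    rerank down to top_n=1, and generate the final response."""
    qvec = embed(question).astype(np.float32)
    knn = (
        Query(f"*=>[KNN {k} @embedding $vec AS score]")
        .sort_by("score")
        .return_fields("content", "score")
        .dialect(2)
    )
    hits = r.ft("rag_idx").search(knn, query_params={"vec": qvec.tobytes()})
    contexts = [doc.content for doc in hits.docs]
    best = rerank(question, contexts)[:top_n]  # keep the single best passage
    prompt = f"Context: {' '.join(best)}\n\nQuestion: {question}"
    return llm_generate(prompt)
```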

Model Descriptions and Target Markets

The benchmark evaluates three distinct large language model configurations, each optimized for different enterprise deployment scenarios and regional market requirements:

Llama: The Llama model series, developed by Meta, is one of the most widely adopted open-source LLM architectures in North and Latin American enterprise environments. Its broad ecosystem support and proven deployment history make it a preferred choice for organizations requiring reliable, well-documented AI solutions with strong compliance alignment. In this evaluation, meta-llama/Llama-3.1-8B-Instruct and its quantized counterpart hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 were tested.

Qwen: Qwen, developed by Alibaba Cloud, brings specialized capabilities for the Asia-Pacific region with particular strength in Chinese, Japanese, Korean, and Southeast Asian language markets. It is the largest model in this evaluation (14B parameters). We chose to include the AWQ variant Qwen/Qwen3-14B-AWQ as it best fits the intended profile for this setup.

Mistral: Mistral AI's architecture reflects European priorities in efficient model design, combining competitive performance with reduced computational requirements aligned to data sovereignty and energy objectives. It appeals to EU enterprises prioritizing GDPR compliance, data localization, and regionally governed AI operations. These optimizations make it well-suited for deployments where efficiency, cost control, and regulatory alignment are critical. In this evaluation, mistralai/Mistral-7B-Instruct-v0.3 and its AWQ variant solidrust/Mistral-7B-Instruct-v0.3-AWQ were tested.

Model Architecture Analysis: AWQ vs Standard Implementations

The benchmark evaluates both AWQ and standard BF16 implementations of three distinct large language model architectures, providing comprehensive insights into quantization benefits and performance trade-offs.

AWQ quantization represents a breakthrough in model optimization that significantly enhances performance on Intel® Xeon® architectures. Unlike traditional quantization methods that uniformly reduce precision across all model weights, AWQ employs activation-aware techniques that preserve critical model components while optimizing others.

Benefits of AWQ on Intel® Xeon®:
  • Memory Efficiency: Reduces model memory footprint by 2-4x, enabling larger models to run on standard enterprise hardware
  • Precision Preservation: Maintains model accuracy while delivering substantial performance improvements (typically 95%+ accuracy retention)
  • Cache Optimization: Improved cache locality reduces memory bandwidth requirements, critical for multi-user scenarios
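For illustration, loading an AWQ checkpoint differs from the BF16 baseline only in the model path and quantization setting. A minimal sketch using vLLM's offline Python API (the deployment above serves the models through vLLM pods instead):

```python
from vllm import LLM, SamplingParams

# BF16 baseline:
# llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", dtype="bfloat16")

# AWQ-quantized variant: INT4 weights shrink the memory footprint,
# which is where the cache-locality and bandwidth gains come from.
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
    quantization="awq",
)

outputs = llm.generate(
    ["What is retrieval-augmented generation?"],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```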

Comprehensive Performance Results Analysis

The following performance data shows how the end-to-end performance of the Intel® AI for Enterprise RAG pipeline behaves for multiple LLM models across varying workload patterns and concurrency levels. Time to First Token (TTFT) is reported in seconds and, as above, is the end-to-end latency including all RAG components (embedding, retrieval, reranking, and LLM); Time Per Output Token (TPOT) is reported in milliseconds. Lower TTFT improves perceived responsiveness; lower TPOT improves sustained token throughput.

Workload: 128 Input / 128 Output
Optimized for short-form Q&A and quick responses

blogpost1_1.png

Figure 1. Workload: 128 Input / 128 Output performance numbers

Workload: 256 Input / 256 Output
Medium-length queries with balanced response requirements

blogpost1_2.png

Figure 2. Workload: 256 Input / 256 Output performance numbers

Workload: 256 Input / 512 Output
Medium queries requiring comprehensive responses

blogpost1_3.png

Figure 3. Workload: 256 Input / 512 Output performance numbers

Workload: 256 Input / 1024 Output
Extended response generation for detailed analysis and explanations

blogpost1_4.png

Figure 4. Workload: 256 Input / 1024 Output performance numbers

Service Level Agreement (SLA) Analysis

Understanding SLAs

Service Level Agreements (SLAs) are formal commitments between service providers and customers that define specific, measurable performance standards.

Why SLAs Matter for Enterprise AI:
  • User Satisfaction: Quantitative thresholds directly correlate with user satisfaction and productivity
  • Predictability: Consistent performance under varying load conditions is often more valuable than peak performance
  • Business Planning: SLAs provide the foundation for capacity planning, infrastructure budgeting, and resource allocation decisions
  • Accountability: They translate technical performance into business-meaningful metrics

SLA Criteria for RAG Solutions

The benchmark establishes end-to-end SLA thresholds based on extensive user experience research:

  • Time to First Token (TTFT): < 3 seconds
  • Time Per Output Token (TPOT): < 100 milliseconds
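Given per-concurrency measurements, checking compliance reduces to finding the highest load level at which both thresholds still hold. A minimal sketch, with an illustrative `results` mapping rather than measured data:

```python
TTFT_SLA_S = 3.0     # Time to First Token threshold, seconds
TPOT_SLA_MS = 100.0  # Time Per Output Token threshold, milliseconds

def max_users_under_sla(results: dict[int, tuple[float, float]]) -> int:
    """`results` maps concurrency -> (TTFT seconds, TPOT milliseconds);
    returns the highest concurrency meeting both SLAs (0 if none)."""
    passing = [users for users, (ttft, tpot) in results.items()
               if ttft < TTFT_SLA_S and tpot < TPOT_SLA_MS]
    return max(passing, default=0)

# Illustrative values only:
# max_users_under_sla({12: (0.9, 58.0), 16: (1.1, 67.0),
#                      24: (1.6, 82.0), 32: (2.3, 96.0), 48: (3.4, 131.0)})  # -> 32
```
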
Comprehensive SLA Compliance Comparison

The following table shows the maximum number of concurrent users each model configuration can support while maintaining SLA compliance (TTFT < 3s, TPOT < 100ms).

blogpost1_5.png

Figure 5. Maximum concurrent users under SLA

Performance Summary

Enterprise-Ready Results:
  • 32 concurrent users supported on a single dual-socket Intel® Xeon® node
  • Significant latency improvements with AWQ quantization
  • 2x user capacity scaling compared to standard model implementations
  • CPU-only deployment eliminates GPU infrastructure dependencies

Model Performance Ranking:
  • Llama-AWQ & Mistral-AWQ: Consistent 32-user capacity across workloads
  • Standard Mistral: 24-32 users depending on workload complexity
  • Standard Llama: 16-24 users with higher resource requirements
  • Qwen-AWQ: 12-16 users, specialized for Asia-Pacific markets

Conclusions and Key Takeaways

This comprehensive analysis establishes Intel® Xeon® processors as a robust foundation for Intel® AI for Enterprise RAG deployments. The combination of CPU-only architecture and AWQ quantization delivers enterprise-grade performance while maintaining operational simplicity:

  1. Intel® Xeon® Scales RAG Concurrency: A dual-socket node (2× Intel® Xeon® 6767P, 64 cores each) supports up to 32 concurrent users within SLA thresholds, which is sufficient for many small and medium-sized companies or early production deployments.
  2. AWQ Quantization Is Foundational: Activation-aware quantization delivers significant TPOT reductions and up to 2× SLA-capable concurrency compared to standard models. These efficiency gains make CPU-only deployments economically and operationally viable.
  3. Latency Dynamics: TTFT increases modestly with concurrency; TPOT is the first SLA breaker. AWQ shifts the TPOT inflection point outward, preserving conversational fluidity at higher loads.
  4. Holistic Pipeline Matters: Gains from AWQ amplify when embedding, retrieval, and reranking are jointly optimized—preventing upstream bottlenecks from eroding quantization benefits.
  5. Regional Strategy: Intel® AI for Enterprise RAG supports and provides acceptable performance for a set of models commonly used in different geographic regions.
  6. Operational Simplicity: CPU-only deployment simplifies the deployment model by removing the need for external accelerators.
  7. Business Impact: Lower per-session resource consumption reduces TCO, enabling reinvestment in redundancy or security enhancements without compromising user experience.

Next Steps for Enterprise Implementation

For Immediate Deployment:
  • Start with Llama-AWQ or Mistral-AWQ for maximum user capacity (32 concurrent users)
  • Plan infrastructure around dual-socket Intel® Xeon® 6767P configuration
  • Implement comprehensive monitoring for TTFT and TPOT metrics

For Regional Deployments:
  • Consider Qwen-AWQ for Asia-Pacific markets despite lower capacity (12-16 users)
  • Evaluate Mistral for European deployments requiring data sovereignty
  • Plan horizontal scaling for larger user populations beyond 32 concurrent sessions

Notices & Disclaimers

Performance varies by use, configuration and other factors. Learn more on the Performance Index site.

Intel technologies may require enabled hardware, software or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

Intel does not control or audit third-party data.  You should consult other sources to evaluate accuracy.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.