
Accelerating vLLM Inference: Intel® Xeon® 6 Processor Advantage over AMD EPYC


Authors:

Saumya Buliya, Software Engineer, HCL Technologies

Abirami Prabhakaran, Principal Engineer, Intel

Diwei Sun, AI Frameworks Engineer, Intel

Introduction

As AI workloads scale across data centers, organizations increasingly explore CPU-based inference as a cost-effective alternative to GPU-heavy infrastructure. The vLLM framework, with its optimized CPU backend, is emerging as a powerful solution for efficiently serving large language models (LLMs).

Conversational chatbots are AI-powered software programs that simulate human conversation through large language models (LLMs), understand context, and engage with users in real time. They are commonly used in customer service, virtual assistance, and other business applications to enhance user experience and streamline operations. In these applications, one of the most important SLAs is Time per Output Token (TPOT), the latency to generate each new token in a response.

This blog presents a performance comparison of vLLM inference serving for conversational chatbot use cases on the Intel Xeon 6 processor (6767P, 64c) versus AMD EPYC processors (9755, 128c and 9965, 192c). Using the latest vLLM architecture, hardware, and software optimizations, the results demonstrate a clear architectural advantage for Intel in output throughput and concurrency under a TPOT SLA of under 100 ms, making Intel® Xeon® 6 processors a compelling choice for real-time, multi-user LLM deployments.

Why vLLM for CPU Inference

vLLM v0.9.1 and beyond introduce several optimizations tailored for CPU-based inference. These features enable vLLM to deliver high throughput and low latency on standard CPU hardware, making it ideal for serving models like Meta’s Llama 3.1-8B Instruct.

  • PagedAttention: Reduces memory fragmentation and improves cache locality.
  • Token-aware scheduling: Dynamically batches sequences of similar lengths.
  • Asynchronous execution: Minimizes idle CPU time.
  • Tensor Parallelism (TP) and Chunked Prefill: Enhance scalability and memory efficiency.
  • V1 Engine plus IPEX: Boosts performance with Intel-specific optimizations. (AMD’s zenDNN 5.1.0 did not support the vLLM V1 attention backend at the time of testing.)
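As a concrete illustration, a CPU serving instance with these features enabled can be brought up through vLLM's offline Python API roughly as follows. This is a configuration sketch, not a reproduction of the benchmark setup: the environment-variable and argument names follow vLLM's documented CPU backend, but the specific values (cache size, core range, model ID) are illustrative.

```python
import os

# KV-cache space in GiB reserved by vLLM's CPU backend; illustrative value.
os.environ["VLLM_CPU_KVCACHE_SPACE"] = "40"
# Bind OpenMP worker threads to physical cores (example core range).
os.environ["VLLM_CPU_OMP_THREADS_BIND"] = "0-63"

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dtype="bfloat16",             # BF16 compute engages Intel AMX tile units
    tensor_parallel_size=2,       # e.g., one rank per socket on a 2S system
    enable_chunked_prefill=True,  # interleave prefill with decode steps
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16)))
```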

Benchmark Results

Output Token Throughput and Concurrent User Prompts at TPOT SLA:

vLLM 0.9.1 includes a benchmark tool for measuring online serving inference performance. Using this benchmark, we measured the output token throughput (tokens/sec) at ‘x’ concurrent user prompts while meeting a Time per Output Token (TPOT) SLA of 100 ms, sweeping ‘x’ as a parameter to find the maximum concurrency that satisfies the SLA. To maximize the number of prompts served within the TPOT SLA, vLLM’s Tensor Parallelism feature was used.
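For reference, serving benchmarks of this kind derive TPOT from end-to-end latency, time to first token (TTFT), and the number of generated tokens. A minimal sketch of that calculation and the SLA check (the function names are ours, not vLLM's):

```python
from statistics import mean

def tpot_ms(e2e_ms: float, ttft_ms: float, output_tokens: int) -> float:
    """Time per Output Token: decode latency spread over the tokens
    generated after the first one."""
    return (e2e_ms - ttft_ms) / (output_tokens - 1)

def meets_sla(requests, sla_ms: float = 100.0) -> bool:
    """requests: iterable of (e2e_ms, ttft_ms, output_tokens) tuples.
    The SLA is judged on the mean TPOT across all requests."""
    return mean(tpot_ms(e, t, n) for e, t, n in requests) <= sla_ms
```

For example, a request that takes 5,100 ms end to end with a 100 ms TTFT and 251 output tokens has a TPOT of (5100 − 100) / 250 = 20 ms, comfortably inside the 100 ms SLA.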

As the results below show, Intel Xeon 6 processors with fewer cores consistently outperformed higher-core-count AMD EPYC processors across all input/output token combinations at the TPOT SLA. Maximizing both the number of concurrent prompts supported and the throughput (tokens/sec) achieved, the Intel Xeon 6767P (64c) delivered up to 1.4x higher performance and 2.8x higher performance per core than the AMD EPYC 9755 (128c) across a mix of chatbot use cases. More impressively, the same Xeon 6767P (64c), with two-thirds fewer cores, delivered up to 2.7x higher performance and 8.2x higher performance per core than the AMD EPYC 9965 (192c) across the same mix of use cases. This highlights the architectural advances of Intel Xeon 6 processors: Intel® Advanced Matrix Extensions (Intel® AMX), the higher memory bandwidth offered by MRDIMMs, and the latest software-stack optimizations.
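The per-core figures follow directly from the throughput ratios and the core counts; a quick arithmetic check (the published 8.2x per-core figure comes from unrounded measurements, so the rounded inputs below land slightly lower):

```python
def per_core_ratio(perf_ratio: float, cores_a: int, cores_b: int) -> float:
    """Scale an A-vs-B throughput ratio by the core-count ratio (B over A)."""
    return perf_ratio * cores_b / cores_a

print(per_core_ratio(1.4, 64, 128))  # 6767P vs. 9755: 1.4 x (128/64) = 2.8
print(per_core_ratio(2.7, 64, 192))  # 6767P vs. 9965: 2.7 x (192/64) = 8.1
```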

As seen below, the Intel Xeon 6 processor consistently outperformed the AMD EPYC processors in all tested prompt/token scenarios at the TPOT SLA, despite having fewer cores. Notably, the AMD EPYC 9965 (192c) delivered lower vLLM performance under the TPOT SLA than the AMD EPYC 9755 (128c).


Figure 1: Intel 6767P delivers 1.4x higher performance and 2.8x higher performance/core vs. AMD EPYC 9755 and delivers 2.7x higher performance and 8.2x higher performance/core vs. AMD EPYC 9965.


Figure 2: Intel 6767P delivers 1.3x higher performance and 2.6x higher performance/core vs. AMD EPYC 9755 and delivers 2.6x higher performance and 7.9x higher performance/core vs. AMD EPYC 9965.


Figure 3: Intel 6767P delivers 1.07x higher performance and 2.1x higher performance/core vs. AMD EPYC 9755 and delivers 1.8x higher performance and 5.4x higher performance/core vs. AMD EPYC 9965.

These results illustrate the architectural strengths of Intel Xeon 6 CPUs, including:

  • Intel® Advanced Matrix Extensions (Intel® AMX) delivering AI acceleration built into every core.
  • Multiplexed Rank DIMMs (MRDIMMs) delivering improved memory bandwidth for better memory access patterns.
  • Sub-NUMA clustering for efficient socket utilization.
  • Enhanced software optimizations from a vast open software ecosystem.

These features contribute to lower latency and higher scalability, especially in short and mid-length prompt scenarios.
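Whether a given Linux host exposes AMX can be checked from the CPU flags the kernel reports; a small sketch (`amx_tile` is the flag name the Linux kernel uses for AMX tile support):

```python
def has_amx(cpuinfo_text: str) -> bool:
    """Return True if a 'flags' line in /proc/cpuinfo-style text
    advertises AMX tile support."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return "amx_tile" in line.split()
    return False

# On Linux:
# with open("/proc/cpuinfo") as f:
#     print(has_amx(f.read()))
```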

Conclusion

Intel’s Xeon 6767P (64c) platform delivers superior vLLM inference-serving performance with half the cores per socket of the AMD EPYC 9755 (128c) and one-third the cores per socket of the AMD EPYC 9965 (192c). By handling up to 1.4x higher token throughput and up to 1.8x more concurrent prompts than the AMD EPYC 9755, and up to 2.7x higher token throughput and up to 3.2x more concurrent prompts than the AMD EPYC 9965, the latest-generation Intel Xeon 6 CPUs can provide significantly higher performance for real-time, multi-user deployments that demand low latency and high scalability.

For organizations seeking efficient, scalable, and cost-effective LLM serving, Intel Xeon 6 platforms offer a compelling solution.

Learn more at intel.com/xeon6.

 

About vLLM v0.9.1:

The release used at the time of benchmarking includes:

  • Enhanced support for CPU inference
  • Improved scheduling and memory management
  • Compatibility with Llama 3.1-8B Instruct
  • Optimizations for Intel’s IPEX and Torch 2.7.0

These updates make vLLM a robust framework for production-grade LLM serving on the latest Intel Xeon 6 CPUs.

Configuration:

Xeon 6767P: 1-node, 2x Intel(R) Xeon(R) 6767P, 64 cores, 350W TDP, SNC On, HT On, Turbo On, Total Memory 1024GB (16x64GB DDR5 8800MT/s [8000MT/s]), microcode 0x1000360, 2x BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller, 1x 1.7T Micron_7450_MTFDKCC1T9TFR, Ubuntu 24.04.2 LTS, 6.8.0-51-generic. Using physical cores only. Test by Intel as of August 2025.

AMD 9755: 1-node, 2x AMD EPYC 9755 128-Core Processor, 500W TDP, NPS=4, SMT On, Boost On, Total Memory 1536GB (24x64GB DDR5 6400 MT/s [6000 MT/s]), microcode 0xb002116, 2x Ethernet Controller X710 for 10GBASE-T, 2x Ethernet Controller E810-C for QSFP, 1x 5.8T INTEL SSDPE2KE064T8, 1x 1.7T Micron_7450_MTFDKBG1T9TFR, Ubuntu 24.04 LTS, 6.8.0-64-generic. Using physical cores only. Test by Intel as of August 2025.

AMD 9965: 1-node, 2x AMD EPYC 9965 192-Core Processor, 500W TDP, NPS=4, SMT On, Boost On, Total Memory 1536GB (24x64GB DDR5 6400 MT/s [6000 MT/s]), microcode 0xb101047, 2x MT2910 Family, 2x BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller, 1x 1.7T Micron_7450_MTFDKBG1T9TFR, Ubuntu 24.04 LTS, 6.8.0-47-generic. Using physical cores only. Test by Intel as of August 2025.

Software:

Use Case: Conversational Chatbot using Llama 3.1-8B Instruct

Framework: vLLM v0.9.1 with Tensor Parallelism, Prefix Caching, Chunked Prefill, Torch version: 2.7.0+cpu, IPEX version: 2.7.0+cpu

Prompt Sizes common for this use-case: Input tokens - 128, 256, 1024; Output tokens - 256, 512, 1024
SLA Target: 100ms TPOT
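The input/output sizes above form a simple sweep grid; for each grid point, concurrency is increased until mean TPOT exceeds the 100 ms target. A hedged outline of that sweep (the doubling step and the `measure_tpot_ms` callback are our simplifications, standing in for the actual vLLM benchmark invocation):

```python
from itertools import product

INPUT_TOKENS = (128, 256, 1024)
OUTPUT_TOKENS = (256, 512, 1024)
SLA_MS = 100.0

def max_concurrency(measure_tpot_ms, start=1, step=2):
    """Grow concurrency until mean TPOT breaches the SLA; return the
    last concurrency level that still met it."""
    best, x = 0, start
    while measure_tpot_ms(x) <= SLA_MS:
        best, x = x, x * step
    return best

for n_in, n_out in product(INPUT_TOKENS, OUTPUT_TOKENS):
    pass  # invoke the serving benchmark for this (n_in, n_out) point
```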

Notices and Disclaimers

Performance varies by use, configuration, and other factors. Learn more on the Performance Index site.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available ​updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software, or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.