
A Practical Guide to CPU-Optimized LLM Deployment on Intel® Xeon® 6 Processors on AWS.

Zahid
Employee

Deploying Large Language Models on Intel® Xeon® 6 using vLLM: A GPU‑Free, Production‑Ready Guide via AWS Marketplace 

For years, deploying large language models has been closely tied to expensive, power‑intensive GPU infrastructure, creating cost barriers and operational dependencies that many organizations struggle to overcome. As demand for AI accelerates, companies are facing rising hardware costs, constrained GPU availability, and the inability to scale LLM capabilities across on‑premises, hybrid, and edge environments. Modern CPU‑optimized inference frameworks—such as vLLM with BF16 support, improved parallelism, and advanced caching strategies—are breaking this dependency by delivering competitive performance on widely available Intel Xeon servers. This shift empowers businesses to leverage existing CPU estates, reduce total cost of ownership, and democratize AI deployment without sacrificing performance or scalability. 

 With Intel® Xeon® 6 processors and a vLLM‑optimized deployment, a broad range of sub‑20B‑parameter text‑generation models can be served with high throughput and low latency - delivering fully CPU‑based, production‑grade performance without reliance on GPUs. 

This article documents the solution: architecture, optimizations, deployment steps, and advanced features like tool calling, chunked prefill, and NUMA‑aware auto parallelism for maximum throughput. 

 

Why CPU‑Only LLM Inference Is Now Practical 

Modern CPUs - especially Intel® Xeon® 6 - include hardware and ISA features that accelerate AI workloads: 

  • Intel® Advanced Matrix Extensions (Intel® AMX) accelerate matrix math 
  • Intel® Deep Learning Boost (DL Boost) accelerates vector operations 
  • AVX‑512 enables wide vector compute 
  • E‑cores and P‑cores allow scalable parallelism 

These hardware accelerators are automatically leveraged through modern AI frameworks. Libraries such as oneDNN and Intel® Extension for PyTorch* (IPEX) integrate these low‑level CPU capabilities directly into high‑level frameworks like PyTorch 2.0 and vLLM. As a result, developers do not need to write AMX, AVX‑512, or DL Boost instructions themselves—the frameworks transparently invoke optimized kernels under the hood. 

Paired with the vLLM inference engine and PyTorch 2.0, these CPU‑optimized software layers make it possible to achieve competitive LLM inference performance without GPUs, while keeping the developer experience simple and familiar. 
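As a quick sanity check, these ISA features can be confirmed from the operating system before deploying anything. The short Python sketch below (a Linux-only illustration, not part of the Marketplace stack) reads /proc/cpuinfo and reports whether the AMX and AVX-512 flags that oneDNN and IPEX rely on are exposed by the host:

# Illustrative check (Linux x86 only): list the ISA features that oneDNN,
# IPEX, and vLLM use when selecting optimized CPU kernels.
FEATURES = {
    "avx512f": "AVX-512 foundation",
    "avx512_vnni": "AVX-512 VNNI (DL Boost, int8)",
    "avx512_bf16": "AVX-512 BF16",
    "amx_tile": "Intel AMX tile registers",
    "amx_bf16": "Intel AMX BF16 matrix math",
}

flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break

for flag, description in FEATURES.items():
    status = "available" if flag in flags else "not detected"
    print(f"{flag:12s} {description}: {status}")

On an r8i (Intel® Xeon® 6) instance all five flags should report as available; if a flag is missing, the frameworks fall back to slower kernels.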

 

Deployment Overview: 

This guide outlines the deployment of a production‑ready, CPU‑only inference stack. The Llama 3.1 8B Instruct model is used as the reference implementation; however, equivalent solutions are available for additional models offered through AWS Marketplace: 

Intel® AI for Enterprise Inference - Llama-3.1-8B-Instruct 

Intel® AI for Enterprise Inference - Mistral-7B-Instruct-v0.3 

Intel® AI for Enterprise Inference - Qwen3-14B 

The deployment includes: 

Llama 3.1 8B Instruct — an 8B parameter instruction‑tuned model for conversational agents, summarization, QA, code assistance, multilingual dialogue and tool‑assisted conversations. 

vLLM CPU‑optimized Docker image with: 

  • PagedAttention: 

PagedAttention stores KV cache in fixed‑size blocks and uses a block‑table indirection layer, eliminating fragmentation and allowing near‑zero memory waste. This is critical on CPUs where DRAM bandwidth and allocator overhead are major bottlenecks, enabling larger effective batch sizes and stable throughput. 

  • Tensor / pipeline parallelism: 

Tensor parallelism shards weight matrices across workers, reducing per‑worker memory footprint and improving cache locality, while pipeline parallelism distributes layers across workers to align with CPU sockets/NUMA domains. These strategies help scale LLM inference across many CPU cores efficiently without saturating memory bandwidth. 

  • Multiprocessing backend (mp): 

vLLM’s native mp backend provides lightweight, single‑node distributed execution without Ray, ideal for CPU containers. It allows multiple workers to run in parallel across cores/sockets while avoiding multi‑node orchestration overhead, improving initialization times and overall CPU throughput. 

  • Long context support (32K tokens): 

Long context operation stresses KV‑cache memory rather than compute; vLLM’s paged KV design and configurable --max-model-len ensure efficient memory allocation for large contexts. Combined with CPU KV‑cache controls (e.g., VLLM_CPU_KVCACHE_SPACE), CPUs can handle long prompts without excessive preemption or recomputation. A back‑of‑the‑envelope sizing example follows this list. 

  • Chunked prefill and optimized batching: 

Chunked prefill breaks long prompts into smaller segments and mixes them with decode tokens, balancing compute‑heavy prefill with memory‑bound decode. This maximizes CPU utilization, reduces inter‑token latency, and improves throughput by scheduling work within the max_num_batched_tokens budget. 

  • Intel® Xeon® 6 optimizations including AMX activation, DL Boost, and vLLM CPU kernel flags (e.g., VLLM_CPU_SGL_KERNEL=1, VLLM_CPU_KVCACHE_SPACE=40). 
  • Tool calling with llama3_json parser support for external tool invocation. 
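To put the long‑context and KV‑cache settings in perspective, here is a rough sizing sketch in Python. It assumes the published Llama 3.1 8B shape (32 transformer layers, 8 KV heads under grouped‑query attention, head dimension 128) and BF16 cache entries; the numbers are estimates rather than measurements:

# Rough KV-cache sizing for Llama 3.1 8B in BF16 (estimate, not a measurement).
layers, kv_heads, head_dim = 32, 8, 128      # published model shape (GQA)
bytes_per_element = 2                        # bfloat16

per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_element  # K and V
per_32k_context_gib = per_token_bytes * 32_768 / 2**30

print(f"KV cache per token      : {per_token_bytes / 1024:.0f} KiB")    # ~128 KiB
print(f"One full 32K context    : {per_32k_context_gib:.1f} GiB")       # ~4 GiB
print(f"Contexts in 40 GB cache : {40 / per_32k_context_gib:.0f}")      # ~10

By this estimate, the default VLLM_CPU_KVCACHE_SPACE=40 leaves room for roughly ten simultaneous full 32K‑token contexts, or far more typical shorter requests, which is why long prompts can be served without aggressive preemption.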

The solution is packaged with a CloudFormation template for plug‑and‑play deployment. As part of this deployment, the stack automatically provisions an OpenAI‑compatible inference endpoint powered by vLLM. This means applications that already integrate with the OpenAI API (e.g., using /v1/chat/completions, /v1/completions, or /v1/embeddings) can connect to this endpoint without any code changes - simply update the base URL and provide a dummy API key. The solution configures networking, autoscaling, security groups, and the vLLM server so that developers can deploy models and immediately interact with them using standard OpenAI client libraries. 

 

Architecture at a Glance 

[Figure: architecture diagram of the CPU‑only vLLM deployment]

 

This deployment detects NUMA topology and automatically chooses tensor or pipeline parallelism to maximize locality and throughput. End users benefit from consistently optimized performance, as all platform‑level enhancements and tuning are fully integrated into the deployed solution. 

 

Cost & Sizing Guidance 

Two key advantages of this solution are cost predictability and flexibility. 

  • Software Cost: The AWS Marketplace subscriptions for these Intel® AI solutions are free of charge. 
  • Infrastructure Cost: Only the underlying Amazon EC2 consumption is billed, with no additional charges for the inference software stack. 

Sizing Recommendations: The solution is optimized for the Amazon EC2 r8i instance family (Intel® Xeon® 6), which offers the high memory bandwidth crucial for LLM inference. The Marketplace solution supports specific instance sizes tailored to these workloads. 

 

Workload Type | Recommended Instance Example | Notes
Production (Performance) | r8i.24xlarge (Default) | Recommended baseline for optimal throughput and latency.
High Scale / Heavy Load | r8i.48xlarge or r8i.metal-48xl | For massive concurrency or larger batch sizes.
Balanced / Development | r8i.8xlarge, r8i.12xlarge, or r8i-flex variants | Cost-effective entry points for lower traffic or functional testing.

 

Note: The template allows selection from r8i.8xlarge up to r8i.metal-96xl (including flexible r8i-flex options). 

Tip: Spot Instances can reduce costs by up to 90% compared to On-Demand pricing. CPU-based inference workloads are particularly well-suited for Spot Instances due to typically lower interruption rates and broader instance availability. 

 

NUMA‑Aware Parallelism: Automatic Tensor/Pipeline Optimization 

During EC2 bootstrap, the template runs lscpu to detect sockets and NUMA nodes, then chooses an optimal sharding strategy. 

Example mapping: 

  • NUMA nodes per socket 2 or 4 → Tensor Parallelism (balanced cores per NUMA) 
  • NUMA nodes per socket 3 or 6 → Pipeline Parallelism (better for uneven NUMA groups) 
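For illustration, the selection logic might look like the simplified Python sketch below. This is not the actual bootstrap script shipped with the template, only a reconstruction of the mapping described above:

# Simplified illustration of NUMA-aware parallelism selection
# (not the real bootstrap script).
import subprocess

def detect_topology():
    """Parse `lscpu` output into socket and NUMA node counts."""
    out = subprocess.run(["lscpu"], capture_output=True, text=True, check=True).stdout
    info = dict(line.split(":", 1) for line in out.splitlines() if ":" in line)
    return int(info["Socket(s)"].strip()), int(info["NUMA node(s)"].strip())

sockets, numa_nodes = detect_topology()
numa_per_socket = max(numa_nodes // max(sockets, 1), 1)

# 2 or 4 NUMA nodes per socket -> shard weights (tensor parallelism);
# 3 or 6 -> distribute layers across NUMA domains (pipeline parallelism).
if numa_per_socket in (2, 4):
    parallelism = ["--tensor-parallel-size", str(numa_nodes)]
else:
    parallelism = ["--pipeline-parallel-size", str(numa_nodes)]

print("Selected vLLM parallelism flags:", " ".join(parallelism))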

Benefits: 

  • Minimal cross‑NUMA traffic 
  • Maximum memory locality and cache utilization 
  • Higher throughput under load 

This is applied automatically — no manual tuning required. 

 

Built‑in Performance Enhancements 

vLLM runtime flags used by the deployment (examples): 

--dtype bfloat16 
--max-model-len 32768 
--enable-chunked-prefill 
--distributed-executor-backend mp 
--block-size 128 
--max-num-batched-tokens 2048 
--max-num-seqs 256 

Intel‑specific environment variables: 

Environment Variable | Recommended Value | Description
VLLM_CPU_SGL_KERNEL | 1 | Enables specialized kernels for low-latency tasks such as real-time serving. Requires a CPU with the AMX instruction set, BFloat16 model weights, and specific weight shapes.
VLLM_CPU_KVCACHE_SPACE | 40 | Specifies how much system memory (in GB) to reserve for the model’s KV cache; higher values allow more parallel requests and longer context windows.
VLLM_RPC_TIMEOUT | 100000 | Sets the maximum allowed timeout (in milliseconds) for remote-procedure calls between vLLM components, preventing worker stalls under heavy CPU load.
VLLM_ALLOW_LONG_MAX_MODEL_LEN | 1 | Allows longer-than-default maximum model sequence lengths, useful for long-context workloads on high-memory Xeon servers.
VLLM_ENGINE_ITERATION_TIMEOUT_S | 120 | Defines how long (in seconds) the engine waits for an iteration to complete before treating it as stalled; helpful for stabilizing large CPU-bound workloads.
VLLM_CPU_NUM_OF_RESERVED_CPU | 0 | Reserves a specific number of CPU cores out of vLLM’s thread pool so they remain free for system tasks or co-resident services.

 

These settings increase throughput, reduce time‑to‑first‑token, and provide a larger KV cache for long contexts. 
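To show how the runtime flags and environment variables above fit together, here is an illustrative launcher in Python. The Marketplace stack performs this wiring automatically during bootstrap, so this sketch is only for readers who want to reproduce the configuration on their own hosts; the model name, host, and port are placeholders:

import os
import subprocess

# Illustrative only: the CloudFormation stack configures and launches the server itself.
env = dict(
    os.environ,
    VLLM_CPU_SGL_KERNEL="1",
    VLLM_CPU_KVCACHE_SPACE="40",
    VLLM_RPC_TIMEOUT="100000",
    VLLM_ALLOW_LONG_MAX_MODEL_LEN="1",
    VLLM_ENGINE_ITERATION_TIMEOUT_S="120",
    VLLM_CPU_NUM_OF_RESERVED_CPU="0",
)

cmd = [
    "vllm", "serve", "meta-llama/Llama-3.1-8B-Instruct",
    "--dtype", "bfloat16",
    "--max-model-len", "32768",
    "--enable-chunked-prefill",
    "--distributed-executor-backend", "mp",
    "--block-size", "128",
    "--max-num-batched-tokens", "2048",
    "--max-num-seqs", "256",
    "--host", "0.0.0.0",
    "--port", "8000",
]

# Blocks while the OpenAI-compatible server runs on port 8000.
subprocess.run(cmd, env=env, check=True)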

 

Deploying the Stack (Step‑by‑Step) 

This guide follows the standard AWS Marketplace deployment flow for the Llama 3.1 8B Instruct listing. 

1. Discover & Subscribe:

  • Navigate to the Intel® AI for Enterprise Inference - Llama-3.1-8B-Instruct listing on AWS Marketplace. 
  • Click Continue to Subscribe and accept the terms. 
  • Once subscribed, click Continue to Configuration. Select CloudFormation Template as the fulfillment option and then choose the appropriate AWS Region. 
  • Click Continue to Launch, select Launch CloudFormation in the action dropdown, and click Launch. 

2. Create the CloudFormation Stack 

  • The template URL will be pre-filled in the CloudFormation console. Click Next. 
  • Configure Parameters: 
  • Instance Type: Select an Intel® Xeon® 6 optimized instance (default: r8i.24xlarge). 
  • Network: Provide a SubnetId and SecurityGroupId. Note: The Security Group must permit inbound TCP traffic on port 8000.
  • Hugging Face Token: Enter a Hugging Face User Access Token to authorize the model download during deployment. 
  • Proceed through the wizard, acknowledge capabilities, and click Submit. 
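For teams that prefer scripted rollouts over the console wizard, the same stack can be created programmatically. A minimal boto3 sketch follows; the Region, stack name, and parameter keys are illustrative, so substitute the template URL from the Marketplace launch page and the exact parameter keys shown on the CloudFormation "Specify stack details" screen:

import boto3

cf = boto3.client("cloudformation", region_name="us-east-1")  # use your chosen Region

# Parameter keys below are illustrative; match them to the keys defined in the
# Marketplace template before running.
cf.create_stack(
    StackName="intel-llama31-8b-vllm",
    TemplateURL="https://<template-url-from-marketplace>",
    Parameters=[
        {"ParameterKey": "InstanceType", "ParameterValue": "r8i.24xlarge"},
        {"ParameterKey": "SubnetId", "ParameterValue": "subnet-xxxxxxxx"},
        {"ParameterKey": "SecurityGroupId", "ParameterValue": "sg-xxxxxxxx"},
        {"ParameterKey": "HuggingFaceToken", "ParameterValue": "<hf_token>"},
    ],
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
)

# Block until the stack reaches CREATE_COMPLETE.
cf.get_waiter("stack_create_complete").wait(StackName="intel-llama31-8b-vllm")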

3. Monitor & Access 

  • Wait for the stack status to reach CREATE_COMPLETE. 
  • Go to the Resources tab or the EC2 Console to find the instance’s Public IP. 
  • The model endpoint becomes available at: http://<EC2_PUBLIC_IP>:8000/v1/chat/completions 

Query the Model (curl example) 

curl -X POST http://<EC2_PUBLIC_IP>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
--data '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{ "role": "user", "content": "What is the capital of France?" }
]
}'

Example Python Client 

The example below sets the base URL and API key, then streams a chat completion from the model endpoint. 

from openai import OpenAI

# Point the standard OpenAI client at the vLLM endpoint; any non-empty API key works.
base_url = "http://<EC2_PUBLIC_IP>:8000/v1/"
OPENAI_API_KEY = "dummy"
client = OpenAI(api_key=OPENAI_API_KEY, base_url=base_url)

completion = client.chat.completions.create(
  model="meta-llama/Llama-3.1-8B-Instruct",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  stream=True
)

# Print each streamed chunk of text as it arrives.
for chunk in completion:
  print(chunk.choices[0].delta.content or "", end="", flush=True)

[Figure: example output from the Python client]

This code connects to the self‑hosted, OpenAI‑compatible server running on the EC2 instance and sends a chat completion request to the Llama‑3.1‑8B‑Instruct model. The response is streamed back chunk by chunk, and each incremental piece of text is printed as it arrives. 
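Because the deployment also enables tool calling with the llama3_json parser, the same OpenAI‑compatible interface accepts structured tool definitions. The sketch below uses a hypothetical get_weather function purely for illustration; any tool schema following the OpenAI function‑calling format works the same way:

from openai import OpenAI

client = OpenAI(api_key="dummy", base_url="http://<EC2_PUBLIC_IP>:8000/v1/")

# Hypothetical tool definition in the standard OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is the weather in Paris right now?"}],
    tools=tools,
    tool_choice="auto",
)

# If the model chose to call the tool, the server returns the parsed call instead of plain text.
message = response.choices[0].message
if message.tool_calls:
    for call in message.tool_calls:
        print(call.function.name, call.function.arguments)
else:
    print(message.content)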

 

Final Thoughts & Call to Action 

Running large language models on CPUs at scale was once impractical; however, with Intel® Xeon® 6 acceleration, vLLM optimizations, NUMA‑aware parallelism, extended‑context support, chunked prefill, tool‑calling capabilities, and CloudFormation‑based automation, the Llama 3.1 8B Instruct model can now be deployed in a powerful, cost‑efficient, and production‑ready manner - entirely on CPU infrastructure and without the need for GPUs. 

 

Get Started Now 

Ready to deploy? Visit the AWS Marketplace and launch the stack from the listings referenced above. 

 

References: