Efficient AI Inference on CPUs with OpenVINO

EhssanKhan · ‎05-19-2026

Deploying machine learning models to production often requires compiling or exporting them to formats different from those used during development. In many environments, such as embedded systems, installing a full Python stack is not an option. In other cases, we may need to run our model on different hardware or in a different programming language. One approach that addresses these challenges is OpenVINO™, and in this post, I walk through exporting a model to the OpenVINO Intermediate Representation (IR) and deploying it with OpenVINO GenAI to unlock Intel Xeon optimizations.

I use an AWS instance running on Intel® Xeon® 6 processors with Performance cores optimized for compute-intensive workloads and share the performance benchmarks on the Phi-4-mini-instruct (with 3.8B parameters) and gpt-oss-20b so you can see exactly what each layer of the stack buys you. The Mixture of Experts (MoE) model gpt-oss-20b has 21B total parameters and about 3.6B active parameters per token. How does the latency and throughput of our chosen sparse MoE model compare to the dense model of a similar active parameter count? You can find the answer and my configurations at the end of this blog.

As the benchmarks show, Intel Xeon processors with Intel® Advanced Matrix Extensions (Intel® AMX) deliver strong AI inference performance. For many applications, CPU inference with OpenVINO is all you need to satisfy your service level objectives (SLOs), especially when you can leverage existing CPU capacity without provisioning dedicated GPUs.

Exporting to OpenVINO IR and Deployment

While OpenVINO began its life as a performance-critical C++ engine for computer vision, it has evolved into a versatile backbone for the modern AI stack. Its C++ foundation remains a key advantage for edge and client-side deployment, enabling direct access to Intel CPU, GPU, and NPU optimizations for high-performance inference.

With the introduction of the GenAI API, the OpenVINO™ toolkit has lowered the barrier to entry for generative workloads. It now offers a streamlined path for developers to integrate LLMs and image generation models into their applications, providing a unified interface that balances the power of C++ with the accessibility of Python.

Exporting with Optimum Intel

Optimum Intel is the result of collaboration between Hugging Face and Intel to optimize the Transformers and Diffusers models on Intel hardware. It enables both weight-only quantization (e.g., 4-bit Activation-aware Weight Quantization (AWQ)), and static quantization (e.g., 8-bit weights and activations with calibration). By default, on Intel CPU and Intel GPU, OpenVINO runtime provides dynamic quantization of activations of 4/8-bit quantized MatMuls.

Below, I apply AWQ to Phi-4-mini-instruct for illustration purposes. A pre-optimized version of this model is already available in OpenVINO’s Hugging Face repository, which should be your first stop when looking for OpenVINO models. Exporting to OpenVINO IR can take several minutes, depending on your hardware and chosen model.

Scale estimation is an additional accuracy-recovery step that runs alongside AWQ and is worth enabling when you have a calibration dataset available. For a deeper dive, see the model optimization guide.

Install the necessary Python packages in a virtual environment and use the command line interface to apply the desired quantizations and export the model:

pip install "optimum[openvino]" openvino-tokenizers datasets openvino-genai

optimum-cli export openvino \
-m microsoft/Phi-4-mini-instruct \
--weight-format int4 \
--awq \
--scale-estimation \
--group-size 64 \
--dataset wikitext2 \
./phi-4-mini-instruct

For a no-code alternative, OpenVINO provides two Hugging Face spaces for exporting and NNCF quantization of models to OpenVINO IR.

An OpenVINO computation graph is represented with two files that you find in your output directory: an .xml file that describes the model topology and a .bin file that contains the weights and binary data.

Figure 1: Clipped view of a computation graph in Netron.

Deploying with OpenVINO GenAI

With the exported model in hand, deployment is straightforward using the OpenVINO GenAI API:

import openvino_genai

OV_FILES_PATH = "./phi-4-mini-instruct"
pipe = openvino_genai.LLMPipeline(OV_FILES_PATH, "CPU")
print(pipe.generate("What is OpenVINO?", max_length=200))

While deploying is possible with Optimum Intel, OpenVINO GenAI offers a smaller footprint, fewer dependencies, and better performance optimization options, particularly for C++ applications.

Performance Benchmarks

Below are the results of benchmarking Phi-4-mini-instruct and gpt-oss-20b on an Intel Xeon 6 instance with 48vCPUs(1). All runs sweep max concurrency from 1 to 32 using a shared prompt workload(2). I report three metrics: Time to First Token (TTFT), Time Per Output Token (TPOT), and output throughput.

I used OpenVINO Model Server (OVMS) to serve the Phi and gpt-oss models at INT8 and INT4 precisions. I also used vLLM’s benchmark tool for performance measurements so you can easily compare the results against a baseline, if you desire. Just keep in mind that gpt-oss uses MXFP4, a group-quantized floating-point format, for MoE weights while retaining bfloat16 for attention and other layers. As a result, vLLM’s CPU backend does not currently support it out of the box. The good news? These models are already available in OpenVINO IR format through OpenVINO’s model hub, so you can serve them on your CPU using OVMS right away. Here’s how you can serve a gpt-oss model with OVMS:

docker run -d \
--user $(id -u):$(id -g) \
-p 8000:8000 \
-v ~/models:/models:rw \
openvino/model_server:weekly \
--source_model OpenVINO/gpt-oss-20b-int8-ov \
--tool_parser gptoss \
--reasoning_parser gptoss \
--model_repository_path /models \
--model_name gpt-oss-20b-int8-ov \
--task text_generation \
--rest_port 8000 \
--target_device CPU \
--cache_size 0

Check out these vLLM and OVMS resources for more information.

Phi-4-mini-instruct Results

Figure 2 For the Phi model, choose OVMS INT4 when optimizing for single-user latency.

For a 128/128 input/output configuration, OVMS INT4 delivers the lowest TPOT latency for a single request,1.8x faster than OVMS INT8. This makes INT4 the clear choice for latency-sensitive, single-user workloads.

As concurrency increases, the batched decode MatMul workload transitions from memory-bound to compute-bound. Once memory bandwidth is no longer the bottleneck, INT4’s size advantage diminishes, and its dequantization overhead becomes a relative cost. By concurrency 32, INT4 and INT8 TPOT converge to near parity, down from the 1.8x gap at concurrency 1. On the prefill side, INT4 TTFT grows faster than INT8, as the dequantization overhead penalizes INT4’s compute-bound prefill at higher batch sizes.

Both Phi-4-mini-instruct configurations remain within the SLOs TTFT < 3 s, and TPOT < 100 ms through 32 concurrencies.

gpt-oss-20b Results

Figure 3: For gpt-oss-20b on CPU, INT4 is preferable across the board—it wins on both latency and throughput. The dashed line represents the 100 ms TPOT SLO constraint.

Prefill MatMuls are always compute-heavy. For gpt-oss, the forward pass per iteration is more bandwidth-sensitive because of the larger volume of weight data that must stream through memory. The two competing effects of INT4 are the smaller weights leading to faster weight loading and the larger dequantization overhead. The winner is determined by the magnitude of these two factors for each model.

Similar to the Phi model, at concurrency 1, OVMS INT4 delivers the lowest TPOT with a 1.6x improvement over INT8. As with Phi, INT4 maintains a throughput lead at all concurrency levels for gpt-oss: 1.6x at concurrency 1, narrowing to 1.1x at concurrency 8 as the workload shifts toward compute saturation. Even at that point, the model’s larger weight volume keeps INT4’s bandwidth advantage intact, so INT4 remains the faster option across the board. The volume of weight data also drives higher per-token latency, pushing TPOT past the target SLO at concurrency 8 earlier than with Phi on our 24-core instance. Deployments targeting such SLOs at higher concurrency would benefit from a larger core count or multi-instance configuration.

Dense vs. Sparse MoE: The Head-to-Head

Phi-4-mini-instruct (3.8B dense) and gpt-oss-20b (3.6B active parameters, 21B total) offer a natural comparison: similar active parameter counts, different architectures. The performance gap is narrowest at concurrency 1, where Phi INT4 achieves 36 tokens/s vs. gpt-oss INT4’s 29 tokens/s. As the table below shows, the dense model delivers 2.4–2.5x higher throughput at concurrency 8.

Metric (concurrency=8)	Phi INT4	gpt-oss INT4	Phi INT8	gpt-oss INT8
Output throughput (tokens/s)	161.0	66.6	152.4	61.4
TPOT (ms)	47.7	115.6	50.7	125.6
TTFT (ms)	302.1	691.0	274.2	733.7

Figure 4: Despite similar active parameter counts, the full parameter set of the MoE must reside in memory. As concurrency increases, requests can activate different experts, leading to higher aggregate memory traffic than in the dense model.

Scaling Efficiency

How efficiently does throughput scale as we add concurrent requests? Under ideal (linear) scaling, doubling concurrency doubles throughput. Scaling efficiency is defined as: (concurrent throughput / single-request throughput) / concurrency x 100%.

At concurrency 8:

Configuration	Scaling Efficiency
Phi / OVMS INT8	94.6%
Phi / OVMS INT4	55.8%
gpt-oss / OVMS INT8	41.7%
gpt-oss / OVMS INT4	28.5%

Two notes are in order. First, INT8 scales 1.5–1.7x more efficiently than INT4 for both models: clean mapping of INT8 weights to AMX tiles avoids the dequantization overhead that increasingly contends for compute as concurrency rises. Second, the dense Phi model scales roughly 2x more efficiently than the MoE gpt-oss at the same precision—its smaller total weight footprint leaves more memory bandwidth headroom per concurrent request.

Despite scaling less efficiently, INT4 delivers higher absolute throughput and lower TPOT than INT8 at every concurrency level for both models. The efficiency gap reflects INT4’s much higher single-request starting point, not a throughput deficit. Phi also continues to scale well beyond concurrency 8: at concurrency 32, INT8 holds 75.0% efficiency and INT4 holds 43.3%, both still within SLO targets. The optimal quantization level is not universal and depends on the model’s architecture and memory footprint. Profiling on your target hardware is essential.

Conclusion

These benchmarks show that Intel Xeon 6 processors with OpenVINO can serve production LLM workloads efficiently without dedicated GPU infrastructure. For teams already running on 4th Gen Intel Xeon systems or newer, this is the inference performance you can unlock today without procuring new hardware. Explore pre-optimized models on Hugging Face, or export your model to OpenVINO IR with a single optimum-cli command, and deploy with OVMS in minutes.

Discover how Intel Xeon 6 processors can transform your AI performance, and learn more about what you can accomplish with OpenVINO.

Acknowledgements

The author would like to thank Abi Prabhakaran, Adrian Boguszewski, Dariusz Trawinski, Lisa Brock, Louie Tsai, Muthaiah Venkatachalam, Rachel Novak, and Zhuo Wu for their input on this post.

Learn More

Footnotes:

(1) Configurations: 1-instance Amazon EC2 r8id.12xlarge, 1x Intel(R) Xeon(R) 6975P-C, 24 cores, HT On, Turbo On, 384 GB total memory, BIOS 1.0, microcode 0x10003e0, 1x Elastic Network Adapter, 1x 500G Amazon Elastic Block Store, 1x 2.6T Amazon EC2 NVMe Instance Storage, Ubuntu 24.04.3 LTS, 6.14.0-1018-aws, PyTorch 2.10.0+cpu, OpenVINO GenAI 2026.0.0.0, OpenVINO Model Server 2026.0.0.4d3933c5c,
OpenVINO backend 2026.0.0.0rc3, vLLM 0.16.0rc2.dev472+gee59a7c61, Test by Intel as of February 2026.

(2) Benchmark parameters: Input/output length: 128/128, dataset: random, request rate: inf, tp: 1, temperature: 0.

Notices and Disclaimers

Performance varies by use, configuration, and other factors. Learn more on the Performance Index site.
Performance results are based on testing as of the dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software, or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

u8 · ‎05-29-2026

informative

thank you