Small language models (SLMs) have gained significant traction recently thanks to their efficiency and strong performance compared to their larger counterparts. This is especially important in situations such as edge servers or offline agents, where deploying a large language model (LLM) on GPUs isn’t feasible.
In this blog, I show you how to run responsive, CPU-only applications using a quantized SLM in the GPT-Generated Unified Format (GGUF). For serving models, I use llama.cpp, a popular C/C++ LLM inference framework with Python bindings, optimized with single instruction, multiple data (SIMD) instructions for CPU performance. Along the way, we identify the quantizations that produce the highest throughput and parallel efficiency for a given base model, task, and hardware.
Building llama.cpp
Prerequisite. I used a Google Cloud instance powered by Intel® Xeon® 6 processors with Performance-cores (P-cores), which are optimized for compute-intensive workloads. If you wish to reproduce the results, go to console.cloud.google.com, set up your billing, and head to “Compute Engine.” Then click “Create instance” to configure your virtual machine (VM) and increase the boot disk size from the left-hand panel; you can find my configuration at the end of this blog.(1)
To begin, let's review the steps to build llama.cpp locally.
Step 1. Open a terminal, then update and upgrade the system packages.
| sudo apt update && sudo apt upgrade -y |
Step 2. Install required tools and libraries.
| sudo apt install git g++ cmake ninja-build libcurl4-openssl-dev -y |
Step 3. Download the Intel® oneAPI Base Toolkit offline installer and execute it by modifying the lines below for your file name. This toolkit includes libraries for developing high-performance, data-centric applications across diverse architectures, featuring an industry-leading C++ compiler.
| chmod +x intel-oneapi-base-toolkit-<version>.sh
sudo ./intel-oneapi-base-toolkit-<version>.sh |
Step 4. Initialize the environment variables needed for Intel oneAPI, ensuring compilers and libraries are correctly configured for your development session. Note: You need to repeat this step for every new session.
| source /opt/intel/oneapi/setvars.sh |
Step 5. Clone the llama.cpp repository.
| git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp |
Step 6. Set up the project using Intel’s compilers and math libraries for optimized performance.
| cmake -B build -G Ninja -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_NATIVE=ON
cmake --build build --config Release |
Preparing a GGUF Model
Llama.cpp expects models in .gguf format. This binary file format bundles model weights, tokenizer, architecture, vocabulary size, and other metadata. You can convert/quantize your model weights into GGUF using the ggml-org/gguf-my-repo tool, or download models with GGUF files from Hugging Face. Alternatively, you may use llama.cpp for model conversion as follows.
Step 1. Create and activate a Python virtual environment.
| sudo apt install python3.13-venv -y
python3 -m venv env
source env/bin/activate |
Step 2. Go to the llama.cpp directory and install the requirements.
| pip install -r ./requirements/requirements-convert_hf_to_gguf.txt
pip install transformers torch sentencepiece |
Step 3. Download your model from Hugging Face. Note: if your model is gated, you first need to accept its terms on the Hugging Face model card. Then, run hf auth login and provide your Hugging Face read token before downloading the model.
| hf download <Huggingface_model_name> |
By default, files are cached in Hugging Face’s local cache (~/.cache/huggingface). You can change that behavior by using the --local-dir flag.
Step 4. Create a models directory to organize the GGUF files and run the convert_hf_to_gguf.py Python script.
| mkdir -p ~/models
python3 convert_hf_to_gguf.py <path_to_huggingface_model> --outfile ~/models/<model_name>.gguf |
Choosing the Best Quantization for Your Model, Task, and Hardware
Quantization can be viewed as lossy compression: it is a technique for reducing the precision of model weights (and sometimes activations) and, in turn, model size to speed up inference. Llama.cpp provides tools both for quantizing models and for measuring the accuracy loss of different quantizations in terms of perplexity (ppl). Here, we show how to quantize a model using llama.cpp and refer you to the project's README for more details about the available parameters and arguments.
To quantize a GGUF model from the previous section into Q8_0 precision, execute the following:
| cd ~/llama.cpp/build/bin/
./llama-quantize ~/models/<model_name>.gguf ~/models/<model_name>-Q8_0.gguf Q8_0 |
Use the --help flag to see the list of all quantizations supported by your llama.cpp build. When you quantize a model to lower precision, it gets smaller, and you might expect faster inference at the expense of higher perplexity. But is that really the case? And if you have several quantized models with acceptable accuracy, how do you determine which one runs “best” on your hardware? These are the questions we explore in this section and the next using another llama.cpp tool called llama-batched-bench.
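To get a feel for how much each precision shrinks a model before running any benchmarks, a back-of-the-envelope size estimate helps. The sketch below uses approximate effective bits-per-weight values for llama.cpp's formats (the K-quants mix several block types, so their exact footprint varies per model; treat these numbers and the 3.2B parameter count as illustrative assumptions):
| # Rough GGUF size estimate from approximate bits per weight (bpw).
# Q8_0 stores blocks of 32 int8 weights plus a 16-bit scale (~8.5 bpw);
# the K-quant bpw values below are typical effective figures, not exact.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "BF16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.8,  # assumption: typical effective bpw for this mixed quant
    "Q2_K": 2.9,    # assumption: typical effective bpw for this mixed quant
}

n_params = 3.2e9  # hypothetical parameter count, roughly a 3B-class model

for quant, bpw in BITS_PER_WEIGHT.items():
    size_gb = n_params * bpw / 8 / 1e9
    print(f"{quant:7s} ~{size_gb:4.1f} GB") |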
To start our experiments, let’s download and quantize the three models Llama-3.2-3B-Instruct, gemma-3-4b-it, and Qwen3-8B at precisions F16, BF16, Q8_0, Q4_K_M, and Q2_K.
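Rather than invoking llama-quantize by hand for every combination, a small driver script can sweep the whole grid. This is a minimal sketch assuming the paths used earlier in this blog (~/models for GGUF files, the build/bin directory for binaries) and that each model has already been converted to a 16-bit GGUF with a matching file name:
| import subprocess
from pathlib import Path

# Assumed locations from the earlier steps; adjust to your own layout.
QUANTIZE_BIN = Path.home() / "llama.cpp/build/bin/llama-quantize"
MODELS_DIR = Path.home() / "models"

models = ["Llama-3.2-3B-Instruct", "gemma-3-4b-it", "Qwen3-8B"]  # 16-bit GGUF base names
precisions = ["Q8_0", "Q4_K_M", "Q2_K"]  # F16/BF16 come directly from conversion

for model in models:
    src = MODELS_DIR / f"{model}.gguf"
    for precision in precisions:
        dst = MODELS_DIR / f"{model}-{precision}.gguf"
        if dst.exists():
            continue  # skip combinations that were already produced
        print(f"Quantizing {src.name} -> {dst.name}")
        subprocess.run([str(QUANTIZE_BIN), str(src), str(dst), precision], check=True) |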
Make sure you have sourced Intel oneAPI environment variables in your session as in Step 4 in Building llama.cpp, and then run the benchmark script as follows:
| cd ~/llama.cpp/build/bin
./llama-batched-bench -m ~/models/<model_name>.gguf -c 0 -npp 256 -ntg 256 -npl 1,4,8,16 -t $(nproc) |
This will generate a table of measurements including S_PP (prompt processing speed, in tokens/sec), S_TG (token generation speed, in tokens/sec), and S (total speed, in tokens/sec).
The parameter c specifies the context size, and the argument 0 means “loaded from model.” Setting t to nproc lets you use all your CPU threads. The values of the parameters npl (number of parallel prompts), npp (prompt length in tokens), and ntg (number of tokens to generate per prompt) depend on your specific use case. For instance, for text summarization, you can try 1024 and 256 for npp and ntg, respectively.
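If you want to tabulate or plot these measurements rather than read them off the terminal, you can scrape the table that llama-batched-bench prints. The sketch below assumes the tool emits a pipe-delimited results table (header row, separator row, then one row per npl value), which is what recent llama.cpp builds produce; check the column names against your build before relying on it:
| import csv
import sys

def parse_batched_bench(text: str):
    """Turn llama-batched-bench's pipe-delimited results table into dicts."""
    rows, header = [], None
    for line in text.splitlines():
        if not line.strip().startswith("|"):
            continue  # skip log lines; only table rows start with a pipe
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if header is None:
            header = cells  # first pipe row is the header (PP, TG, ..., S t/s)
        elif not all(set(c) <= set("-: ") for c in cells):  # ignore the "---" separator row
            rows.append(dict(zip(header, cells)))
    return rows

if __name__ == "__main__":
    rows = parse_batched_bench(sys.stdin.read())
    if rows:
        writer = csv.DictWriter(sys.stdout, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows) |
Piping the benchmark output through this script (with parse_bench.py being a hypothetical filename for it) yields a CSV you can load into your plotting tool of choice.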
The plots below show the results of benchmarks with both npp and ntg set to 256 on my 48-vCPU machine, without hyperthreading.
Figure 1: For small batches (≤4), F16 and BF16 perform similarly, with Q2_K faster; for larger batches, Q4_K_M is fastest on 3B/4B models, while 8B peaks with Q8_0.
For batch sizes ≤ 4, F16 and BF16 exhibit nearly identical throughput, while Q2_K achieves superior performance. At larger batch sizes, Q4_K_M delivers the highest throughput for 3B and 4B models, whereas the 8B model reaches its peak throughput with Q8_0 precision.
To understand why performance shifts between quantization levels as parallel prompts increase, it helps to think in terms of two core inference constraints: (1) how fast the system can move model weights through the memory hierarchy (DRAM → cache → cores), and (2) how much math throughput the CPU can sustain once those weights are available.
At low parallelism, decoding often behaves like a weight-streaming workload: the CPU repeatedly pulls large weight matrices from DRAM, and overall throughput is frequently limited by memory traffic and memory latency rather than raw compute. In that regime, smaller weight formats reduce the number of bytes read per token and can increase throughput. Compressing weights down to very small formats (e.g., 2-bit) can substantially reduce memory traffic compared to 8-bit and 16-bit weights; even though these formats require additional unpacking and per-block scaling, the reduction in memory movement can outweigh the extra compute at low parallelism. FP16 and BF16 both store 16 bits per weight, so when the limiting factor is dominated by moving those 16-bit weights around, they often look similar (although differences in kernel paths and ISA support can still produce gaps in practice).
As you increase the number of parallel prompts, the workload becomes more compute-heavy because each weight value is reused across more sequences within the same matrix multiply. This increases arithmetic intensity and shifts the bottleneck from “fetching weights” toward “executing the matmuls.” In that compute-bound regime, the overheads of aggressive compression become more important. Quantized kernels must unpack low-bit values and apply per-block scales (and related bookkeeping) to use 2-bit/4-bit data in the dot-product/GEMM path. At high parallelism, that dequantization work can compete with the main matmul for execution resources, limiting or reversing the benefit of extreme compression (e.g., Q2_K). By contrast, INT8-style quantization (e.g., Q8_0) typically aligns better with x86 acceleration: when llama.cpp uses INT8-friendly paths (e.g., Intel® AVX-512 Vector Neural Network Instructions (VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) INT8 where available), the core dot-product can run very efficiently, and the relative “unpack/scaling” penalty is smaller.
The larger 8B model performs more total matrix-multiply work per generated token than 3B/4B models, which pushes it into the compute-limited regime sooner as parallelism rises. In that setting, efficient INT8 execution can outpace the bandwidth savings of 4-bit or 2-bit formats whose extra dequantization becomes a bottleneck. This is why the 8B model prefers Q8_0 for higher-parallelism setups, whereas lower-bit formats win at lower parallelism.
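This shift can be made concrete with a back-of-the-envelope arithmetic-intensity estimate. The sketch below is deliberately simplified (it assumes roughly 2 FLOPs per weight per token, counts only weight traffic, and ignores the KV cache, activations, and dequantization cost); the 3.2B parameter count and the bits-per-weight values are illustrative assumptions:
| # Rough FLOPs-per-byte estimate for the decode phase at different batch sizes.
n_params = 3.2e9  # hypothetical ~3B-parameter dense model

def arithmetic_intensity(bits_per_weight: float, batch: int) -> float:
    flops = 2.0 * n_params * batch                   # matmul work for `batch` parallel sequences
    weight_bytes = n_params * bits_per_weight / 8.0  # weights streamed once, reused across the batch
    return flops / weight_bytes

for bits, label in [(16.0, "F16/BF16"), (8.5, "Q8_0"), (4.8, "Q4_K_M"), (2.9, "Q2_K")]:
    cols = ", ".join(f"npl={b}: {arithmetic_intensity(bits, b):5.1f}" for b in (1, 4, 8, 16))
    print(f"{label:9s} FLOPs/byte -> {cols}") |
At npl=1 every format sits at a low FLOPs-per-byte ratio, so fewer bits per weight translates almost directly into more tokens per second; as npl grows, the ratio rises linearly and kernel efficiency (including the cost of dequantizing low-bit blocks) increasingly decides the winner.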
Intel Xeon 6 processor’s scalable memory architecture (offering up to 12 channels) provides the bandwidth to support these quantizations without bottlenecks, particularly important as you scale to larger models or higher batch sizes. While our Google Cloud configuration uses standard DDR5, on-premises deployments can leverage Intel Xeon 6 processors with P-cores that support MRDIMMs; these can deliver over 37% more memory bandwidth than standard DDR5, further accelerating large-scale AI and HPC inference workloads.
Figure 2: Prompt processing speed remains largely constant across parallel prompts for all models. Q4_K_M delivers highest speeds on 3B/4B models. The 8B model exhibits the most variation, with Q8_0 outperforming other quantizations.
Inference is a balance between the prompt-processing (prefill) and token-generation (decode) phases. Prompt processing is the compute-heavy step that ingests the prompt sequence to build the KV cache, which is then used by the decode phase. Unlike token generation (which progresses token-by-token), prefill processes many prompt tokens in a burst (e.g., a 256-token prompt) and performs large matmuls across all layers. Even with a single prompt, these operations are typically large enough to keep the CPU’s compute pipelines busy, so increasing the number of parallel prompts often does not make prefill “faster per prompt.” Instead, adding more parallel prompts primarily increases total work, which tends to increase prefill duration and/or lead to throughput plateaus once the system’s compute and memory resources are saturated.
Because prefill is compute-intensive, highly compressed formats can introduce additional unpacking and per-block scaling overhead during these large matmuls. That overhead can extend the prefill phase, delaying the transition to steady-state decode and reducing end-to-end effective throughput at higher parallelism—consistent with the behavior observed in the prompt-processing vs. parallel-prompt graphs.
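To put rough numbers on the imbalance, a single 256-token prefill performs on the order of 256 times the matmul work of one decode step, using the same ~2 FLOPs-per-weight-per-token simplification as above (attention's quadratic term and the KV cache are ignored, and the parameter count is again an assumption):
| n_params = 3.2e9     # hypothetical ~3B-parameter model
prompt_tokens = 256   # matches the npp used in these benchmarks

prefill_flops = 2.0 * n_params * prompt_tokens  # whole prompt ingested in one burst
decode_flops = 2.0 * n_params                   # one generated token per decode step

print(f"prefill ~{prefill_flops / 1e12:.1f} TFLOPs, decode step ~{decode_flops / 1e9:.1f} GFLOPs") |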
Total throughput is the ratio of total tokens (prompt plus generated) to total time. Since I set npp = ntg in my experiments, total throughput works out to the harmonic mean of the prompt processing and token generation speeds:

S = 2 / (1/S_PP + 1/S_TG)
But as Figure 2 shows, prompt processing speed remains roughly constant across parallel prompts, hence total throughput S is primarily a function of token generation speed S_TG. When S_PP ≫ S_TG, we have S ≈ 2S_TG, and this is why the shape of some of the total throughput curves (Figure 1) closely mirrors the token generation curves (Figure 3).
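A quick numeric illustration with hypothetical (not measured) speeds shows how close the approximation gets:
| S_PP, S_TG = 2000.0, 150.0            # hypothetical speeds in tokens/sec
S = 2.0 / (1.0 / S_PP + 1.0 / S_TG)    # harmonic mean, valid because npp == ntg
print(round(S, 1), 2 * S_TG)           # ~279.1 vs 300.0: S approaches 2*S_TG as S_PP grows |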
Figure 3: Token generation speed scales strongly with parallel prompts across all quantizations. Since prompt processing speed remains constant (Figure 2), token generation becomes the primary driver of total throughput (Figure 1).
Intel AVX-512 support in Intel Xeon 6 processors enables llama.cpp’s SIMD optimizations, which are particularly effective for matrix operations in quantized inference. Notice how Q8_0 and Q4_K_M quantizations—which map well to SIMD operations—outperform F16/BF16 at scale. For large batch sizes, use quantizations that map cleanly to the powerful Intel® Advanced Vector Extensions 512 (Intel® AVX-512) and Intel® Advanced Matrix Extensions (Intel® AMX) accelerators to avoid wasted cycles on dequantization.
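llama.cpp reports the instruction sets it detects in its system_info line at startup, but you can also check what the VM exposes directly from /proc/cpuinfo. A minimal sketch (Linux-only; flag names follow the kernel's conventions):
| # List the relevant x86 ISA feature flags exposed by the (virtual) CPU on Linux.
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
            break  # all logical CPUs report the same flags

for feature in ("avx512f", "avx512_vnni", "avx_vnni", "amx_tile", "amx_int8", "amx_bf16"):
    print(f"{feature:12s} {'yes' if feature in flags else 'no'}") |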
Speedup and Parallel Efficiency
Next, we explore speedup and parallel efficiency. Speedup is the time taken to run on k threads divided by the time taken with n threads, for n > k. A speedup of 2.0, for example, means that the program ran 2 times faster than the baseline. Examining the speedup as the number of threads varies reveals the scalability of our experiment. Parallel efficiency is the observed speedup divided by the ideal speedup, i.e., the ratio of thread counts n/k. Let us fix the batch size to 8 and execute the following commands for our Q8_0 and Q4_K_M models.
| for threads in 4 8 12 16 20 24; do
    ./llama-batched-bench -m ~/models/<model_name>.gguf -c 0 -npp 256 -ntg 256 -npl 8 -t $threads >> ~/outputs/<file_name>.txt
done |
This allows us to use 4 threads as the baseline for computing speedup and parallel efficiency. Below are the results for throughput speedup and parallel efficiency.
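To turn the measured throughputs into the speedup and efficiency figures reported below, you can post-process the total throughput values with a few lines of Python. The numbers here are placeholders; substitute the totals from your own runs:
| # Total throughput (S, tokens/sec) per thread count -- placeholder values,
# to be replaced with the "S t/s" column from your llama-batched-bench output.
throughput = {4: 100.0, 8: 190.0, 12: 265.0, 16: 330.0, 20: 390.0, 24: 436.0}

baseline_threads = 4
baseline = throughput[baseline_threads]

for threads in sorted(throughput):
    speedup = throughput[threads] / baseline     # fixed workload: time ratio equals throughput ratio
    efficiency = speedup / (threads / baseline_threads)
    print(f"{threads:2d} threads: speedup {speedup:.2f}x, parallel efficiency {efficiency:.0%}") |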
| Model | Q8_0 Speedup (Efficiency) | Q4_K_M Speedup (Efficiency) |
| Llama-3.2-3B-Instruct | 4.36x (73%) | 4.15x (69%) |
| gemma-3-4b-it | 4.03x (67%) | 3.63x (61%) |
| Qwen3-8B | 3.22x (54%) | 3.65x (61%) |
Table 1: Total throughput speedup (with parallel efficiency in parentheses) from 4 to 24 threads. Smaller models scale better with threading than larger models.
Figure 4: Incremental speedup gains shrink at higher thread counts, as indicated by the concavity of the speedup curves.
The 73% parallel efficiency of the 3B model from 4 to 24 threads shows that Intel Xeon 6 cores sustain strong throughput scaling, allowing more SLM inferences per CPU and improving serving cost efficiency.
Figure 5: For throughput, the 3B/4B models maintain higher efficiency under Q8_0, while the 8B model benefits more from Q4_K_M quantization.
Prompt processing parallelizes better than token generation across all models. It achieves up to 5.13x speedup, compared with token generation’s 4.10x, highlighting the Intel Xeon 6 processor’s P-core architecture and its ability to sustain highly parallel workloads.
Figure 6: For prompt processing, all three models (3B/4B/8B) scale more effectively under Q8_0 than Q4_K_M, with speedup increasing nearly linearly up to 16 threads.
Summary
When choosing the best configuration for serving a model, it’s important to consider both throughput and thread scaling. In practice, the optimal setup depends on your deployment: use the quantization that maximizes raw throughput for low-core setups, and the one with better parallel efficiency for highly threaded scenarios. Balancing these metrics ensures the highest effective throughput per CPU, reducing cost per inference.
Serving an SLM-Powered Agent with llama.cpp
Now that we’ve explored how to determine the optimal SLM and thread count for our task, let’s see how we can serve a simple agent locally using llama.cpp and the OpenAI Agents SDK.
Step 1. Start the llama.cpp server.
| ./llama-server -m ~/models/<model_name>.gguf -t 12 --port 8080 |
Step 2. Open a new terminal and install the OpenAI Agents library.
| pip install openai-agents |
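Optionally, before wiring up the agent, you can sanity-check that the server's OpenAI-compatible endpoint is responding by calling it with the plain openai client (installed as a dependency of openai-agents). The model name is an arbitrary label here, since llama-server answers with whichever GGUF it loaded:
| from openai import OpenAI

# llama-server exposes an OpenAI-compatible API under /v1 on the port chosen in Step 1.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="any string")

response = client.chat.completions.create(
    model="SLM",  # arbitrary; the server serves the loaded GGUF model regardless
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(response.choices[0].message.content) |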
Step 3. Save the following Python code as demo.py and run it with python3 demo.py.
| from agents import Agent, Runner, OpenAIChatCompletionsModel
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="any string")
local_model = OpenAIChatCompletionsModel(model="SLM", openai_client=client)

agent = Agent(name="Assistant", instructions="You are a helpful assistant", model=local_model)

result = Runner.run_sync(agent, "Write a haiku about recursion in programming.")
print(result.final_output) |
You should get an output like this:
Code calls itself on,
A mirrored, nested dance flows,
Logic finds its way.
Conclusion
Intel Xeon 6 processors provide the perfect foundation for SLM deployment, combining Intel AVX-512 and Intel AMX acceleration, scalable threading, and memory bandwidth to deliver responsive AI experiences without GPU requirements. As our benchmarks show, careful selection of model size, quantization, and thread count can achieve high throughput on CPU alone.
Discover how Intel Xeon 6 processors can transform your AI performance.
Acknowledgments
The author thanks Benjamin Odom, Abirami Prabhakaran, and Sheik Mohamed Imran for helpful discussions and feedback on an earlier draft.
(1) Configurations: 1-instance c4-highmem-48-lssd: Intel Xeon 6985P-C 48 vCPUs, HT Off, 372 GB total memory, Google Compute Engine Virtual, Debian 13, 6.12.57-1-cloud-amd64 (x86_64), llama.cpp release b7054, Intel® oneAPI Base Toolkit version 2025.3.0.375, Tested by Intel as of November 2025.
Notices and Disclaimers
Performance varies by use, configuration, and other factors. Learn more on the Performance Index site.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software, or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.