cblas_dgemm_batch slower than sequential cblas_dgemm loop for small variable-size GEMMs

miglia — Mon, 24 Nov 2025 15:36:20 GMT

I have a workload that requires many small DGEMM operations, and I'm trying to determine if cblas_dgemm_batch can improve performance over a simple loop of cblas_dgemm calls.

Workload:
I process ~400,000 batches. Each batch contains multiple small DGEMM operations that I either execute sequentially (cblas_dgemm) or via cblas_dgemm_batch.

Batch sizes (operations per batch):
- Median: 13 ops
- Mean: 49 ops
- Range: 1 to 14,543 ops
- Most batches are small: 25% have ≤5 ops, 42% have ≤20 ops

GEMM dimensions are small and vary within each batch:

Dimension	Min	Max	Median	Mean
m	1	1419	161	268
n	1	80	2	7
k	1	80	1	3

The comparison:

// Option 1: Sequential loop
for (int i = 0; i < N; i++) {
    cblas_dgemm(CblasColMajor, transa[i], transb[i],
                m[i], n[i], k[i],
                alpha[i], A[i], lda[i],
                B[i], ldb[i],
                beta[i], C[i], ldc[i]);
}

// Option 2: Batch call (N groups, 1 operation per group)
std::vector<MKL_INT> group_size(N, 1);
cblas_dgemm_batch(CblasColMajor, transa.data(), transb.data(),
                  m.data(), n.data(), k.data(),
                  alpha.data(), A_ptrs.data(), lda.data(),
                  B_ptrs.data(), ldb.data(),
                  beta.data(), C_ptrs.data(), ldc.data(),
                  N, group_size.data());

Since each GEMM has different dimensions, I use N groups with 1 operation each.
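
For context: cblas_dgemm_batch takes transa, transb, m, n, k, alpha, beta, lda, ldb and ldc once per group, and the A/B/C pointers once per operation, so operations whose parameters all match could share a group instead of each getting its own. Below is a minimal sketch of that bucketing step only, with no MKL calls. GemmShape, GroupedBatch and group_by_shape are hypothetical names made up for illustration, the transpose flags are stored as plain ints, and the sketch assumes the operations in a batch are independent so that reordering them is safe. Whether this helps for the workload above is untested.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical record of the per-group parameters of one GEMM.
struct GemmShape {
    int transa, transb;      // stand-ins for the CBLAS_TRANSPOSE values
    long m, n, k;
    long lda, ldb, ldc;
    double alpha, beta;      // assumed not to be NaN (needed for ordering)

    bool operator<(const GemmShape& o) const {
        return std::tie(transa, transb, m, n, k, lda, ldb, ldc, alpha, beta) <
               std::tie(o.transa, o.transb, o.m, o.n, o.k,
                        o.lda, o.ldb, o.ldc, o.alpha, o.beta);
    }
};

// Hypothetical result: one shape and one size per group, plus the
// original operation indices laid out group by group.
struct GroupedBatch {
    std::vector<GemmShape> shapes;
    std::vector<long> group_size;
    std::vector<std::size_t> order;
};

GroupedBatch group_by_shape(const std::vector<GemmShape>& ops) {
    // Bucket operation indices by identical parameters.
    std::map<GemmShape, std::vector<std::size_t>> buckets;
    for (std::size_t i = 0; i < ops.size(); ++i) {
        buckets[ops[i]].push_back(i);
    }

    GroupedBatch out;
    for (const auto& kv : buckets) {
        out.shapes.push_back(kv.first);
        out.group_size.push_back(static_cast<long>(kv.second.size()));
        out.order.insert(out.order.end(), kv.second.begin(), kv.second.end());
    }
    assert(out.order.size() == ops.size());  // every operation placed once
    return out;
}
```

To use the result, the per-group arrays (m, n, k, and so on) would be filled from shapes, the A/B/C pointer arrays reordered according to order, and shapes.size() passed as the group count together with group_size.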

Environment:
- CPU: Intel Xeon Platinum 8470
- MKL: 2023.2.0

Results (total GEMM time across all ~400K batches):

Threads	Sequential	Batch	Note
1	72.5 s	93.6 s	Batch is 29% slower
2	70.4 s	106.1 s	Batch is 51% slower
4	179 s	120 s	Both much slower than 1 thread
8	266 s	195 s	Both much slower than 1 thread
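
The post does not say how these totals were collected. One common way to accumulate such a total is to wrap each batch in a wall-clock timer and sum the results, as in the sketch below. timed_seconds is a hypothetical helper, not part of MKL, and the sketch assumes that only the GEMM calls themselves sit inside the timed callable.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: runs `work` once and returns the elapsed
// wall-clock time in seconds, measured with a monotonic clock.
template <typename F>
double timed_seconds(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

A per-variant total could then be built up with something like total += timed_seconds([&] { /* Option 1 or Option 2 for one batch */ }); across all batches.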

Questions:
1) At 1-2 threads, why is cblas_dgemm_batch slower than a sequential loop?
2) Why does performance degrade so much with more threads?
3) For this type of workload (many small GEMMs with variable dimensions), is there a recommended approach?
