Intel® oneAPI Math Kernel Library

cblas_dgemm_batch slower than sequential cblas_dgemm loop for small variable-size GEMMs

miglia (Beginner) · 141 views

I have a workload that requires many small DGEMM operations, and I'm trying to determine if cblas_dgemm_batch can improve performance over a simple loop of cblas_dgemm calls.

 

Workload:
I process ~400,000 batches. Each batch contains multiple small DGEMM operations that I execute either sequentially with cblas_dgemm calls or with a single cblas_dgemm_batch call.

Batch sizes (operations per batch):
- Median: 13 ops
- Mean: 49 ops
- Range: 1 to 14,543 ops
- Most batches are small: 25% have ≤5 ops, 42% have ≤20 ops

GEMM dimensions are small and vary within each batch:

|   | Min | Max  | Median | Mean |
|---|-----|------|--------|------|
| m | 1   | 1419 | 161    | 268  |
| n | 1   | 80   | 2      | 7    |
| k | 1   | 80   | 1      | 3    |

 

The comparison:

// Option 1: Sequential loop
for (int i = 0; i < N; i++) {
    cblas_dgemm(CblasColMajor, transa[i], transb[i],
                m[i], n[i], k[i],
                alpha[i], A[i], lda[i], B[i], ldb[i],
                beta[i], C[i], ldc[i]);
}
// Option 2: Batch call (N groups, 1 operation per group)
std::vector<MKL_INT> group_size(N, 1);

cblas_dgemm_batch(CblasColMajor,
                  transa.data(), transb.data(),
                  m.data(), n.data(), k.data(),
                  alpha.data(),
                  A_ptrs.data(), lda.data(),
                  B_ptrs.data(), ldb.data(),
                  beta.data(),
                  C_ptrs.data(), ldc.data(),
                  N, group_size.data());

Since each GEMM has different dimensions, I use N groups with 1 operation each.

Environment:
- CPU: Intel Xeon Platinum 8470
- MKL: 2023.2.0

Results (total GEMM time across all ~400K batches):

| Threads | Sequential | Batch   | Note                            |
|---------|------------|---------|---------------------------------|
| 1       | 72.5 s     | 93.6 s  | Batch is 29% slower             |
| 2       | 70.4 s     | 106.1 s | Batch is 51% slower             |
| 4       | 179 s      | 120 s   | Both much slower than 1 thread  |
| 8       | 266 s      | 195 s   | Both much slower than 1 thread  |

 

Questions:
1) At 1-2 threads, why is cblas_dgemm_batch slower than a sequential loop?
2) Why does performance degrade so much with more threads?
3) For this type of workload (many small GEMMs with variable dimensions), is there a recommended approach?

 

0 Replies