I have a workload that requires many small DGEMM operations, and I'm trying to determine if cblas_dgemm_batch can improve performance over a simple loop of cblas_dgemm calls.
Workload:
I process ~400,000 batches. Each batch contains multiple small DGEMM operations that I execute either sequentially (a loop of cblas_dgemm calls) or with a single cblas_dgemm_batch call.
Batch sizes (operations per batch):
- Median: 13 ops
- Mean: 49 ops
- Range: 1 to 14,543 ops
- Most batches are small: 25% have ≤5 ops, 42% have ≤20 ops
GEMM dimensions are small and vary within each batch:
| | Min | Max | Median | Mean |
| m | 1 | 1419 | 161 | 268 |
| n | 1 | 80 | 2 | 7 |
| k | 1 | 80 | 1 | 3 |
The comparison:
// Option 1: Sequential loop
for (int i = 0; i < N; i++) {
    cblas_dgemm(CblasColMajor, transa[i], transb[i], m[i], n[i], k[i],
                alpha[i], A[i], lda[i], B[i], ldb[i], beta[i], C[i], ldc[i]);
}

// Option 2: Batch call (N groups, 1 operation per group)
std::vector<MKL_INT> group_size(N, 1);
cblas_dgemm_batch(CblasColMajor,
                  transa.data(), transb.data(),
                  m.data(), n.data(), k.data(),
                  alpha.data(),
                  A_ptrs.data(), lda.data(),
                  B_ptrs.data(), ldb.data(),
                  beta.data(),
                  C_ptrs.data(), ldc.data(),
                  N, group_size.data());

Since each GEMM has different dimensions, I use N groups with 1 operation each.
Environment:
- CPU: Intel Xeon Platinum 8470
- MKL: 2023.2.0
Results (total GEMM time across all ~400K batches):
| Threads | Sequential | Batch | Note |
| 1 | 72.5 s | 93.6 s | Batch is 29% slower |
| 2 | 70.4 s | 106.1 s | Batch is 51% slower |
| 4 | 179 s | 120 s | Both much slower than 1 thread |
| 8 | 266 s | 195 s | Both much slower than 1 thread |
Questions:
1) At 1-2 threads, why is cblas_dgemm_batch slower than a sequential loop?
2) Why does performance degrade so much with more threads?
3) For this type of workload (many small GEMMs with variable dimensions), is there a recommended approach?