GEMM shapes:
- small-oc-big-ic
  - mb1ic7168oc704
  - mb4ic7168oc704
- small-ic-big-oc
  - mb1ic352oc7168
  - mb4ic352oc7168
Dtypes: INT8
Host:
GNR-AP with MCR DIMMs (128 cores per socket, SNC3 on with 43/43/42 cores per sub-NUMA cluster; MCR at 8800 MT/s, ~1400 GB/s across 2 sockets)
- Note: we run the above GEMMs on a single sub-NUMA cluster only, i.e., 43 cores and ~233 GB/s of memory bandwidth.
Problem description:
We cannot reach the maximum memory bandwidth (expected with MCR: ~1400 GB/s across 2 sockets, ~233 GB/s per sub-NUMA cluster) when running the small-ic-big-oc or small-oc-big-ic GEMMs while mb is also small.
For small-oc-big-ic, we already parallelize over block_mb-block_ic-block_oc; all block sizes are 32.
For small-ic-big-oc, we already parallelize over block_mb-block_oc; all block sizes are 32.
Weights are int8, and we use int8 intrinsics for the computation.
Activations are bf16 and are quantized per row to int8 at runtime.
With the above optimizations we only reach about half of the memory bandwidth (half of ~233 GB/s), yet with mb=1 or mb=4 these GEMMs should be memory bound and should therefore be able to use the full ~233 GB/s.
Are there any further optimizations we should consider? Thanks.
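For reference, the small-ic-big-oc case roughly follows the loop structure below. This is only a simplified sketch, not our exact code: the OpenMP scheme, the names, and the scalar micro-kernel (which stands in for the int8 intrinsics) are illustrative.

```cpp
// Simplified sketch: parallelize over (mb, oc) blocks with block size 32,
// int8 weights and per-row-quantized int8 activations, int32 accumulation.
// The scalar dot product stands in for the real int8-intrinsics kernel.
#include <cstdint>

constexpr int BLK = 32;

static int32_t dot_s8(const int8_t* a, const int8_t* w, int ic) {
  int32_t sum = 0;
  for (int k = 0; k < ic; ++k) sum += int32_t(a[k]) * int32_t(w[k]);
  return sum;
}

void gemm_s8_small_ic_big_oc(const int8_t* act,  // [mb][ic], row-major
                             const int8_t* wei,  // [oc][ic], row-major
                             int32_t* out,       // [mb][oc]
                             int mb, int ic, int oc) {
  #pragma omp parallel for collapse(2) schedule(static)
  for (int mb0 = 0; mb0 < mb; mb0 += BLK) {
    for (int oc0 = 0; oc0 < oc; oc0 += BLK) {
      const int mb1 = (mb0 + BLK < mb) ? mb0 + BLK : mb;
      const int oc1 = (oc0 + BLK < oc) ? oc0 + BLK : oc;
      for (int m = mb0; m < mb1; ++m)
        for (int o = oc0; o < oc1; ++o)
          out[m * oc + o] = dot_s8(act + m * ic, wei + o * ic, ic);
    }
  }
}
```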
Hi,
Thanks for sharing the detailed info about your GEMM optimization challenge on Xeon with INT8 data. Given that you are getting only about half of the expected memory bandwidth (~115 GB/s instead of ~233 GB/s), here are some thoughts:
Key Optimization Areas to Consider:
Memory Access and Bandwidth:
Ensure memory is allocated and accessed in a NUMA-aware fashion to avoid cross-node penalties.
Use aggressive software prefetching to hide latency and improve cache utilization (a small sketch of both points follows this group).
Consider reorganizing data layouts to improve spatial locality.
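As a hedged illustration of the first two points (the names, page size, and prefetch distance are assumptions, not code from any particular library): first-touch initialization keeps the weight pages on the sub-NUMA node whose threads will read them, and _mm_prefetch can be issued a few cache lines ahead of the streaming weight pointer.

```cpp
// Illustrative only: first-touch NUMA placement plus software prefetch.
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Call on a freshly allocated weight buffer BEFORE filling it with data:
// each OpenMP thread touches the pages it will later read, so Linux places
// those pages on that thread's (sub-)NUMA node.
void first_touch(int8_t* buf, size_t bytes) {
  #pragma omp parallel for schedule(static)
  for (size_t i = 0; i < bytes; i += 4096)   // 4 KiB pages assumed
    buf[i] = 0;
}

// Inside the streaming inner loop, prefetch ahead of the weight pointer;
// the distance (512 bytes here) is a tuning knob, not a recommendation.
inline void prefetch_weights(const int8_t* w) {
  _mm_prefetch(reinterpret_cast<const char*>(w) + 512, _MM_HINT_T0);
}
```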
Blocking and Parallelization:
Revisit your blocking sizes: while 32 is a common choice, tuning them for your specific cache sizes and workload might yield gains.
Evaluate thread balancing, especially since the minibatch is small (mb=1 or 4), which can limit parallel efficiency; see the sketch after this group.
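For example, with mb=1 essentially all parallelism comes from the oc dimension, so how evenly the oc blocks divide over the 43 cores of one sub-NUMA cluster matters. A rough, purely illustrative way to compare candidate block sizes by their static-schedule imbalance:

```cpp
// Illustrative only: estimate static-schedule load imbalance for candidate
// oc block sizes (assumes equal cost per block) and print the idle fraction.
#include <cstdio>

double imbalance(int oc, int blk, int nthreads) {
  int nblocks = (oc + blk - 1) / blk;
  int rounds  = (nblocks + nthreads - 1) / nthreads;
  // Fraction of thread-rounds that sit idle.
  return double(rounds * nthreads - nblocks) / double(rounds * nthreads);
}

int main() {
  // mb = 1, oc = 7168, 43 cores in one sub-NUMA cluster.
  for (int blk = 32; blk <= 256; blk += 16)
    std::printf("blk=%3d  idle=%4.1f%%\n", blk, 100.0 * imbalance(7168, blk, 43));
  // e.g. blk=32 -> 224 blocks, ~13% idle; blk=176 -> 41 blocks, ~5% idle.
}
```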
Compute Kernel Efficiency:
Make sure your INT8 intrinsics use AVX-512 VNNI or equivalent instructions to maximize throughput; a minimal VNNI micro-kernel is sketched below.
Check that multiplies and accumulates are fused; on the INT8 path that means VNNI's vpdpbusd rather than separate multiply/add sequences (the analogue of FMA on the float path).
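A minimal sketch of such a VNNI micro-kernel for one output element (illustrative; note that _mm512_dpbusd_epi32 takes u8 × s8 operands, so the signedness convention of your quantized activations has to be handled to match):

```cpp
// Illustrative VNNI micro-kernel: one dot product over ic, accumulating
// int8 products into int32 lanes with vpdpbusd. Needs AVX-512 VNNI
// (compile with -mavx512f -mavx512vnni or equivalent).
#include <immintrin.h>
#include <cstdint>

int32_t dot_u8s8_vnni(const uint8_t* a,  // activation row, unsigned-int8 encoding assumed
                      const int8_t*  w,  // weight row, signed int8
                      int ic) {
  __m512i acc = _mm512_setzero_si512();
  int k = 0;
  for (; k + 64 <= ic; k += 64) {
    __m512i va = _mm512_loadu_si512(a + k);
    __m512i vw = _mm512_loadu_si512(w + k);
    acc = _mm512_dpbusd_epi32(acc, va, vw);  // four u8*s8 products per int32 lane
  }
  int32_t sum = _mm512_reduce_add_epi32(acc);
  for (; k < ic; ++k) sum += int32_t(a[k]) * int32_t(w[k]);  // scalar tail
  return sum;
}
```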
Quantization Overheads:
If runtime per-row quantization to INT8 is costly or causes non-uniform memory access, consider pre-quantizing or using coarser-granularity (e.g., per-tensor) quantization; the per-row path is sketched below.
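For reference, per-row dynamic quantization of one bf16 activation row looks roughly like this (illustrative; bf16 values are treated as raw uint16_t payloads, and the function names are made up):

```cpp
// Illustrative per-row quantization: bf16 (as uint16_t payloads) -> int8
// with a per-row scale of amax/127.
#include <cstdint>
#include <cstring>
#include <cmath>
#include <algorithm>

static inline float bf16_to_f32(uint16_t h) {
  uint32_t bits = uint32_t(h) << 16;   // bf16 is the upper half of an f32
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

void quantize_row(const uint16_t* row_bf16, int8_t* row_s8, float* scale, int ic) {
  float amax = 0.f;
  for (int k = 0; k < ic; ++k)
    amax = std::max(amax, std::fabs(bf16_to_f32(row_bf16[k])));
  *scale = (amax > 0.f) ? amax / 127.f : 1.f;
  const float inv = 1.f / *scale;
  for (int k = 0; k < ic; ++k)
    row_s8[k] = static_cast<int8_t>(std::lround(bf16_to_f32(row_bf16[k]) * inv));
}
```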
Profiling:
Tools like Intel VTune can reveal whether you are memory bound or compute bound, and show cache usage patterns.
@novahayes12 Spoken like a true ChatGPT answer.
Doc (not an Intel employee or contractor)
[W10 is this generation's XP]
Thanks for the detailed info.
With small mb and small ic/oc sizes, limited memory-level parallelism or underutilized vector units could be the bottleneck. Make sure memory is NUMA-local and thread pinning is correct (a small placement check is sketched below), and that AVX-512 or AMX (if supported) is fully used. Also consider fusing GEMMs or tuning block sizes for better cache and bandwidth usage.
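For instance, a quick placement check before timing the GEMM (assumes libnuma and OpenMP, purely illustrative; with SNC3 each sub-NUMA cluster shows up as its own NUMA node):

```cpp
// Illustrative only: verify every worker thread runs on the intended
// sub-NUMA node. Link with -lnuma and compile with -fopenmp.
#include <sched.h>
#include <numa.h>
#include <omp.h>
#include <cstdio>

void check_thread_placement(int expected_node) {
  if (numa_available() < 0) { std::printf("libnuma not available\n"); return; }
  #pragma omp parallel
  {
    int cpu  = sched_getcpu();
    int node = numa_node_of_cpu(cpu);
    if (node != expected_node)
      std::printf("thread %d is on cpu %d (node %d), expected node %d\n",
                  omp_get_thread_num(), cpu, node, expected_node);
  }
}
```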
Best regards,
Nova Hayes
