Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

How to optimize GEMMs with small-ic-big-oc or small-oc-big-ic when mb is also small (INT8 on Xeon)

JN-G-0
Employee

GEMM shapes:

  • small-oc-big-ic
    • mb1ic7168oc704
    • mb4ic7168oc704
  • small-ic-big-oc
    • mb1ic352oc7168
    • mb4ic352oc7168

Dtypes: INT8

Host:

GNR-AP with MCR DIMMs (128 cores per socket, SNC3 enabled, 43-43-42 cores per sub-NUMA domain; MCR: 8800 MT/s, ~1400 GB/s across 2 sockets)

  • Note: only one sub-NUMA domain is used to run the GEMMs above, i.e., 43 cores and ~233 GB/s of memory bandwidth

 

Problem description:

We cannot reach the maximum memory bandwidth (expected with MCR: ~1400 GB/s across 2 sockets, i.e., ~233 GB/s per sub-NUMA domain) when running GEMMs with small-ic/big-oc or small-oc/big-ic shapes and a small mb.

 

For small-oc-big-ic, we already parallelize over block_mb-block_ic-block_oc; the block size is 32 for all dimensions.

For small-ic-big-oc, we already parallelize over block_mb-block_oc; the block size is 32 for all dimensions (a simplified sketch of this loop structure is shown below).
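
Roughly, the small-ic-big-oc case looks like the following simplified sketch (layouts and the scalar inner loop are placeholders only; the real kernel uses int8 intrinsics and packed weights):

// Simplified sketch of the small-ic-big-oc parallelization described above.
// C[MB x OC] += A[MB x IC] (int8) * B[OC x IC] (int8), accumulated in int32.
#include <algorithm>
#include <cstdint>

constexpr int BLK = 32;  // block size for mb and oc, as described above

void gemm_small_ic_big_oc(const int8_t* A, const int8_t* B, int32_t* C,
                          int MB, int IC, int OC) {
    // Parallelize over (mb-block, oc-block); every task reads the full IC.
    #pragma omp parallel for collapse(2) schedule(static)
    for (int mb0 = 0; mb0 < MB; mb0 += BLK) {
        for (int oc0 = 0; oc0 < OC; oc0 += BLK) {
            for (int m = mb0; m < std::min(mb0 + BLK, MB); ++m) {
                for (int o = oc0; o < std::min(oc0 + BLK, OC); ++o) {
                    int32_t acc = 0;
                    for (int k = 0; k < IC; ++k)
                        acc += int32_t(A[m * IC + k]) * int32_t(B[o * IC + k]);
                    C[m * OC + o] += acc;
                }
            }
        }
    }
}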

 

Weights are int8, and we use int8 intrinsics for the computation.

Activations are bf16 and are quantized per row to int8 at runtime (a scalar sketch of this step follows).
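
For reference, the per-row quantization is essentially the following scalar sketch (assuming a symmetric absmax scheme; our production path is vectorized and may differ in details):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstring>

// bf16 is stored as uint16_t: the upper 16 bits of the float32 representation.
static inline float bf16_to_f32(uint16_t v) {
    uint32_t bits = uint32_t(v) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Quantize one activation row: scale = max|x| / 127, x_q = round(x / scale).
// The returned scale is used later to rescale the int32 GEMM output.
float quantize_row_int8(const uint16_t* row_bf16, int8_t* row_i8, int ic) {
    float amax = 0.f;
    for (int k = 0; k < ic; ++k)
        amax = std::max(amax, std::fabs(bf16_to_f32(row_bf16[k])));
    const float scale = (amax > 0.f) ? amax / 127.f : 1.f;
    for (int k = 0; k < ic; ++k) {
        float q = bf16_to_f32(row_bf16[k]) / scale;
        row_i8[k] = int8_t(std::lrintf(std::min(127.f, std::max(-127.f, q))));
    }
    return scale;
}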

 

With the above optimizations we only reach about half of the memory bandwidth (half of ~233 GB/s). Since these GEMMs should be memory bound at mb=1 or mb=4, we would expect them to use the full ~233 GB/s.

 

Are there any further optimizations we should consider? Thanks.

novahayes12
Beginner

Hi,

Thanks for sharing the detailed info about your GEMM optimization challenge on Xeon with INT8 data. Given that you’re getting only about half the expected memory bandwidth (~115 GB/s instead of ~233 GB/s), here are some thoughts:

Key Optimization Areas to Consider:

  1. Memory Access and Bandwidth:

    • Ensure memory is allocated and accessed in a NUMA-aware fashion to avoid cross-node penalties (see the first-touch sketch after this list).

    • Use aggressive prefetching to hide latency and improve cache utilization.

    • Consider reorganizing data layouts to improve spatial locality.

  2. Blocking and Parallelization:

    • Revisit your blocking sizes—while 32 is a common choice, tuning it for your specific cache sizes and workload might yield gains.

    • Evaluate thread balancing, especially since minibatch sizes are small (mb=1 or 4), which can impact parallel efficiency.

  3. Compute Kernel Efficiency:

    • Make sure your INT8 intrinsics use AVX-512 VNNI or equivalent instructions to maximize throughput (a combined VNNI-and-prefetch sketch appears after this list).

    • Check if fused multiply-add (FMA) instructions are properly leveraged.

  4. Quantization Overheads:

    • If runtime per-row quantization to INT8 is costly or causing non-uniform memory access, consider pre-quantizing or coarser granularity quantization.

  5. Profiling:

    • Tools like Intel VTune can reveal whether you are memory bound or compute bound, and show cache usage patterns.
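
On point 1, a common pattern is first-touch initialization: fault the pages in from the same threads (and therefore the same sub-NUMA node) that will later compute on them. A rough sketch, assuming the buffer is later partitioned with the same static schedule:

#include <cstddef>
#include <cstdint>
#include <cstdlib>

// First-touch allocation: each page is touched by the thread that will use it,
// so the OS places that page on the thread's local (sub-)NUMA node.
// Note: std::aligned_alloc requires bytes to be a multiple of the alignment.
int8_t* numa_first_touch_alloc(size_t bytes) {
    int8_t* p = static_cast<int8_t*>(std::aligned_alloc(4096, bytes));
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < bytes; i += 4096)  // touch one byte per 4 KiB page
        p[i] = 0;
    return p;
}

Running the process under numactl --cpunodebind / --membind for the chosen sub-NUMA node is a simpler alternative if everything should stay on one domain anyway.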
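
On points 2 and 3, for illustration only, a minimal sketch of an AVX-512 VNNI inner loop with software prefetch is below. The packed weight layout, function name, and the assumption that activations are quantized to uint8 are mine, not your actual kernel; vpdpbusd multiplies unsigned by signed bytes, so signed int8 activations need a +128 shift plus a compensation term.

#include <immintrin.h>
#include <cstdint>
#include <cstring>

// Dot products of one u8 activation row against 16 s8 output channels whose
// weights are packed VNNI-style as [ic/4][16 channels][4 ic values].
// Assumes ic % 4 == 0 (true for ic = 352 and ic = 7168).
__m512i dot_u8s8_16oc(const uint8_t* act, const int8_t* wei_packed, int ic) {
    __m512i acc = _mm512_setzero_si512();
    for (int k = 0; k < ic; k += 4) {
        // Broadcast 4 consecutive activation bytes to all 16 int32 lanes.
        int32_t a4;
        std::memcpy(&a4, act + k, sizeof(a4));
        __m512i a = _mm512_set1_epi32(a4);
        // 16 channels x 4 ic values of packed int8 weights.
        __m512i w = _mm512_loadu_si512(wei_packed + k * 16);
        // Prefetch the weight block several iterations ahead.
        _mm_prefetch(reinterpret_cast<const char*>(wei_packed + (k + 64) * 16),
                     _MM_HINT_T0);
        // acc[i] += sum_{j=0..3} u8(act[k+j]) * s8(w[i][j])
        acc = _mm512_dpbusd_epi32(acc, a, w);
    }
    return acc;  // 16 int32 partial sums, one per output channel
}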

 

AlHill
Super User

@novahayes12   Spoken like a true chatgpt answer.

 

Doc (not an Intel employee or contractor)
[W10 is this generation's XP]

novahayes12
Beginner

Thanks for the detailed info.

With small mb and ic/oc sizes, limited memory-level parallelism or underutilized vector units could be the bottleneck. Make sure memory is NUMA-local, thread pinning is correct, and AVX-512 or AMX (if supported) is fully used (a small pinning sketch follows). Also consider fusing GEMMs or tuning block sizes for better cache and bandwidth usage.
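
For pinning, a minimal sketch of binding OpenMP threads to the cores of one sub-NUMA domain is below (the assumption that domain 0 owns core IDs 0..N-1 is mine; check lscpu or numactl --hardware for the real mapping):

#include <omp.h>
#include <pthread.h>
#include <sched.h>

// Pin each OpenMP thread to one core of the first sub-NUMA domain.
// Linux/glibc: pthread_setaffinity_np needs _GNU_SOURCE (defined by default with g++).
void pin_threads_to_first_domain(int cores_in_domain) {
    #pragma omp parallel
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(omp_get_thread_num() % cores_in_domain, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
    }
}

Setting OMP_PLACES=cores and OMP_PROC_BIND=close (plus numactl --membind for memory) achieves much the same without code changes.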

Best regards,
Nova Hayes
