Intel® oneAPI Math Kernel Library

How to optimize GEMMs with small-ic-big-oc or small-oc-big-ic when mb is also small (INT8 on Xeon)

JN-G-0
Employee

GEMM shapes:

  • small-oc-big-ic
    • mb1ic7168oc704
    • mb4ic7168oc704
  • small-ic-big-oc
    • mb1ic352oc7168
    • mb4ic352oc7168

Dtypes: INT8

Host:

GNR-AP with MCR DIMMs (128 cores per socket; SNC3 enabled, giving 43/43/42 cores per sub-NUMA domain; MCR at 8800 MT/s, ~1400 GB/s across 2 sockets)

  • Note: we run the above GEMMs on only one sub-NUMA domain, i.e., 43 cores and ~233 GB/s of memory bandwidth


Problem description:

We cannot reach the maximum memory bandwidth (expected: ~1400 GB/s across 2 sockets with MCR, i.e., ~233 GB/s per sub-NUMA domain) when running GEMMs with small-ic-big-oc or small-oc-big-ic shapes while mb is also small.


For small-oc-big-ic, we already parallelize over block_mb-block_ic-block_oc; all block sizes are 32.

For small-ic-big-oc, we already parallelize over block_mb-block_oc; all block sizes are 32. A simplified sketch of the blocking is shown below.
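For illustration only, here is a minimal sketch of this kind of blocking for the small-oc-big-ic case (the scalar inner loop and all names are placeholders; our real kernel uses int8 intrinsics):

```cpp
#include <cstdint>

constexpr int BLK = 32;  // block size for mb, ic, and oc, as described above

// small-oc-big-ic: parallelize over (mb, ic, oc) blocks. Because ic blocks
// are split across threads, partial sums into C need combining; an atomic is
// used here for simplicity (per-thread accumulators would be used in practice).
void gemm_s8_blocked(const int8_t *A, const int8_t *B, int32_t *C,
                     int MB, int IC, int OC) {
    #pragma omp parallel for collapse(3)
    for (int mb = 0; mb < MB; mb += BLK)
        for (int ic = 0; ic < IC; ic += BLK)
            for (int oc = 0; oc < OC; oc += BLK)
                // scalar stand-in for a 32x32 int8 micro-kernel
                for (int m = mb; m < mb + BLK && m < MB; ++m)
                    for (int o = oc; o < oc + BLK && o < OC; ++o) {
                        int32_t acc = 0;
                        for (int k = ic; k < ic + BLK && k < IC; ++k)
                            acc += (int32_t)A[m * IC + k] * (int32_t)B[k * OC + o];
                        #pragma omp atomic
                        C[m * OC + o] += acc;
                    }
}
```

For small-ic-big-oc the same skeleton applies with the ic loop kept sequential (ic is small), i.e., collapse(2) over the mb and oc blocks only.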


The weights are int8, and we use int8 intrinsics for the computation.

The activations are bf16 and are quantized per row to int8 at runtime.
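A minimal sketch of that per-row runtime quantization (helper names and the bf16 handling are illustrative, not our exact code):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstring>

// bf16 is the upper 16 bits of an IEEE float32
static float bf16_to_f32(uint16_t v) {
    uint32_t bits = (uint32_t)v << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Per-row symmetric quantization: scale = max|x| / 127, q = round(x / scale)
void quantize_rows(const uint16_t *act_bf16, int8_t *act_s8, float *scales,
                   int rows, int cols) {
    for (int r = 0; r < rows; ++r) {
        float amax = 0.f;
        for (int c = 0; c < cols; ++c)
            amax = std::max(amax, std::fabs(bf16_to_f32(act_bf16[r * cols + c])));
        const float scale = amax > 0.f ? amax / 127.f : 1.f;
        scales[r] = scale;  // kept to dequantize the int32 GEMM output later
        for (int c = 0; c < cols; ++c)
            act_s8[r * cols + c] =
                (int8_t)std::lrintf(bf16_to_f32(act_bf16[r * cols + c]) / scale);
    }
}
```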


With the above optimizations we reach only about half of the memory bandwidth (half of ~233 GB/s). Since at mb=1 or mb=4 these GEMMs should be memory bound, we would expect them to use the full ~233 GB/s; see the back-of-the-envelope numbers below.
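For reference, the roofline estimate we compare against, assuming traffic is dominated by the int8 weights (ic*oc bytes per GEMM) and using the ~233 GB/s per-sub-NUMA figure quoted above:

```cpp
#include <cstdio>

int main() {
    const double bw = 233e9;  // bytes/s, one sub-NUMA domain
    const long long shapes[2][2] = {{7168, 704}, {352, 7168}};  // {ic, oc}
    for (const auto &s : shapes) {
        double bytes = (double)s[0] * s[1];  // int8 weights: 1 byte/element
        printf("ic=%lld oc=%lld: %.1f MB of weights -> >= %.1f us per GEMM\n",
               s[0], s[1], bytes / 1e6, bytes / bw * 1e6);
    }
    return 0;
}
```

At half bandwidth, the measured time per GEMM is roughly twice these lower bounds.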


Are there any further optimizations we should consider? Thanks.

1 Reply
Ruqiu_C_Intel
Moderator

oneMKL doesn't support int8 GEMM on CPU through the C interface yet. For more on the oneMKL C interface, see the Developer Reference for Intel® oneAPI Math Kernel Library - C.


Are you talking about the oneMKL SYCL interface? And have you tried the Intel oneDNN library for int8 GEMM? See oneDNN/src/cpu/gemm/s8x8s32/ref_gemm_s8x8s32.cpp at main · uxlfoundation/oneDNN.
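For example, an int8 matmul through oneDNN might look like the following sketch (assuming the oneDNN v3.x C++ API; the shape is mb1ic7168oc704 from the question):

```cpp
#include <dnnl.hpp>
#include <unordered_map>
#include <vector>

int main() {
    using namespace dnnl;
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    const memory::dim M = 1, K = 7168, N = 704;  // mb, ic, oc

    // s8 activations x s8 weights -> s32 accumulators, row-major layouts
    memory::desc src_md({M, K}, memory::data_type::s8, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::s8, memory::format_tag::ab);
    memory::desc dst_md({M, N}, memory::data_type::s32, memory::format_tag::ab);

    matmul::primitive_desc pd(eng, src_md, wei_md, dst_md);
    matmul mm(pd);

    std::vector<int8_t> src(M * K, 1), wei(K * N, 1);
    std::vector<int32_t> dst(M * N, 0);
    memory src_m(src_md, eng, src.data());
    memory wei_m(wei_md, eng, wei.data());
    memory dst_m(dst_md, eng, dst.data());

    mm.execute(strm, {{DNNL_ARG_SRC, src_m},
                      {DNNL_ARG_WEIGHTS, wei_m},
                      {DNNL_ARG_DST, dst_m}});
    strm.wait();
    return 0;
}
```

oneDNN dispatches to the best available int8 ISA (e.g., VNNI or AMX) at runtime.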
