How to optimize GEMMs with small-ic-big-oc or small-oc-big-ic when mb is also small (INT8 on Xeon)

JN-G-0
Employee

GEMM shapes:

  • small-oc-big-ic
    • mb1ic7168oc704
    • mb4ic7168oc704
  • small-ic-big-oc
    • mb1ic352oc7168
    • mb4ic352oc7168

Dtypes: INT8

Host:

GNR-AP with MCR (128 cores per socket, SNC3 on with 43-43-42 cores per sub-NUMA domain, MCR memory at 8800 MT/s, ~1400 GB/s across 2 sockets)

  • Note: only one sub-NUMA domain is used to run the GEMMs above, i.e., only 43 cores and ~233 GB/s of memory bandwidth

 

Problem description:

We cannot reach the maximum memory bandwidth (expected: ~233 GB/s per sub-NUMA domain, out of ~1400 GB/s across 2 sockets with MCR) when running GEMMs with small-ic-big-oc or small-oc-big-ic shapes when mb is also small.

 

For small-oc-big-ic, we already parallelize across block_mb-block_ic-block_oc, with a block size of 32 in every dimension.

For small-ic-big-oc, we already parallelize across block_mb-block_oc, with a block size of 32 in every dimension.
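To make the parallelization above concrete, here is a rough sketch that counts how many block-level tasks each scheme produces and how evenly they spread over 43 cores. It assumes equal-cost tasks handed out evenly in rounds, which the post does not state; `tasks_and_balance` and `split_ic` are hypothetical names used only for illustration.

```python
from math import ceil

BLOCK = 32   # block size from the post, same in every dimension
CORES = 43   # cores in the one sub-NUMA domain used, per the post

def tasks_and_balance(mb, ic, oc, split_ic):
    """Return (task count, load-balance ratio in (0, 1]).

    Assumption: every task costs the same and tasks are dealt out
    evenly, so the busiest core runs ceil(tasks / CORES) of them.
    """
    tasks = ceil(mb / BLOCK) * ceil(oc / BLOCK)
    if split_ic:
        # small-oc-big-ic case: ic is also split across tasks
        tasks *= ceil(ic / BLOCK)
    rounds = ceil(tasks / CORES)
    return tasks, tasks / (rounds * CORES)

shapes = [
    ("mb1ic7168oc704", 1, 7168, 704, True),
    ("mb4ic7168oc704", 4, 7168, 704, True),
    ("mb1ic352oc7168", 1, 352, 7168, False),
    ("mb4ic352oc7168", 4, 352, 7168, False),
]

for name, mb, ic, oc, split_ic in shapes:
    tasks, balance = tasks_and_balance(mb, ic, oc, split_ic)
    print(f"{name}: {tasks} tasks, balance {balance:.3f}")
```

Under these assumptions the small-ic-big-oc shapes give 224 tasks for 43 cores, so the busiest core does 6 tasks while the average is about 5.2 (roughly 87% balance). The small-oc-big-ic shapes give many more tasks, but splitting ic means partial sums have to be reduced afterwards, a cost this sketch does not model.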

 

The weights are INT8, and we use INT8 intrinsics for the computation.

The activations are BF16 and are quantized per row to INT8 at runtime.
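For readers unfamiliar with per-row runtime quantization, a minimal reference sketch follows. It assumes symmetric quantization with one scale per row (max |x| mapped to 127), which the post does not confirm, and it uses plain Python floats in place of BF16. `quantize_rows` is a hypothetical helper, not the kernel used here.

```python
def quantize_rows(rows):
    """Quantize each row to INT8 with its own scale.

    Assumption: symmetric scheme, scale = max(|x|) / 127 per row.
    Returns (quantized rows, per-row scales) so that x ~= q * scale.
    """
    q_rows, scales = [], []
    for row in rows:
        amax = max((abs(x) for x in row), default=0.0)
        # An all-zero row gets scale 1.0 to avoid dividing by zero.
        scale = amax / 127.0 if amax > 0.0 else 1.0
        # Round to nearest and clamp to the symmetric INT8 range.
        q = [max(-127, min(127, round(x / scale))) for x in row]
        q_rows.append(q)
        scales.append(scale)
    return q_rows, scales

q, s = quantize_rows([[4.0, -1.0, 0.0]])
print(q, s)
```

With mb1 or mb4 this step touches only mb * ic activation values, so its data volume is small next to the weight matrix, but it still runs before every GEMM call.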

 

With the optimizations above we reach only about half of the memory bandwidth (half of ~233 GB/s). With mb1 or mb4 we expect these GEMMs to be memory-bound, so they should be able to use the full ~233 GB/s.
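The memory-bound expectation can be checked with a back-of-the-envelope estimate. The sketch below counts the bytes that must be read (INT8 weights plus BF16 activations) and divides by the ~233 GB/s figure to get a lower bound on run time. It ignores output traffic, since the output dtype is not stated, and `min_time_us` is a hypothetical helper name.

```python
BANDWIDTH = 233e9  # bytes/s, the ~233 GB/s per sub-NUMA figure from the post

def min_time_us(mb, ic, oc):
    """Return (bytes read, bandwidth-bound time in microseconds)."""
    weight_bytes = ic * oc          # INT8 weights, 1 byte each
    activation_bytes = 2 * mb * ic  # BF16 activations, 2 bytes each
    total = weight_bytes + activation_bytes
    return total, total / BANDWIDTH * 1e6

def ops_per_byte(mb, ic, oc):
    """Arithmetic intensity: multiply-adds counted as 2 ops each."""
    total, _ = min_time_us(mb, ic, oc)
    return 2 * mb * ic * oc / total

for mb, ic, oc in [(1, 7168, 704), (4, 7168, 704),
                   (1, 352, 7168), (4, 352, 7168)]:
    total, t_us = min_time_us(mb, ic, oc)
    print(f"mb{mb}ic{ic}oc{oc}: {total} bytes, "
          f">= {t_us:.1f} us, {ops_per_byte(mb, ic, oc):.1f} ops/byte")
```

Under these assumptions the weights are about 5.0 MB and 2.5 MB, so the bandwidth-bound time per GEMM is only on the order of 10 to 22 microseconds. At that scale, fixed per-call costs such as thread synchronization and activation quantization may take a noticeable share of the measured time, which is worth ruling out when comparing achieved bandwidth against the ~233 GB/s target.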

 

Are there any further optimizations we should consider? Thanks.
