GEMM shapes:
- small-oc-big-ic
- mb1ic7168oc704
- mb4ic7168oc704
- small-ic-big-oc
- mb1ic352oc7168
- mb4ic352oc7168
Dtypes: INT8
Host:
GNR-AP with MCR (128 cores per socket, SNC3 on, 43-43-42 cores per sub-NUMA domain, MCR: 8800 MT/s, 1400 GB/s across 2 sockets)
- Note: only one sub-NUMA domain is used to run the GEMMs above, i.e., only 43 cores and ~233 GB/s of memory bandwidth
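For reference, the bandwidth figure and the shapes above imply a lower bound on the runtime of each GEMM. The sketch below is only back-of-the-envelope arithmetic: it assumes the int8 weights dominate memory traffic (1 byte per element, each read once) and that the quoted 1400 GB/s splits evenly over 2 sockets x 3 sub-NUMA domains. The function names are made up for illustration.

```python
# Figures quoted in the post; the even split across domains is an assumption.
TOTAL_BW_GBPS = 1400.0      # two sockets with MCR
SOCKETS = 2
SUBNUMA_PER_SOCKET = 3      # snc3

SHAPES = {                  # name -> (mb, ic, oc)
    "mb1ic7168oc704": (1, 7168, 704),
    "mb4ic7168oc704": (4, 7168, 704),
    "mb1ic352oc7168": (1, 352, 7168),
    "mb4ic352oc7168": (4, 352, 7168),
}


def per_subnuma_bw_gbps():
    """Bandwidth share of one sub-NUMA domain, assuming an even split."""
    return TOTAL_BW_GBPS / (SOCKETS * SUBNUMA_PER_SOCKET)


def weight_bytes(ic, oc):
    """Size of the int8 weight matrix in bytes (1 byte per element)."""
    return ic * oc


def min_time_us(ic, oc, bw_gbps):
    """Time needed just to stream the weights once at bw_gbps."""
    return weight_bytes(ic, oc) / (bw_gbps * 1e9) * 1e6


def achieved_bw_gbps(ic, oc, measured_us):
    """Effective weight bandwidth for a measured runtime in microseconds."""
    return weight_bytes(ic, oc) / (measured_us * 1e-6) / 1e9


bw = per_subnuma_bw_gbps()
for name, (mb, ic, oc) in SHAPES.items():
    print(f"{name}: {weight_bytes(ic, oc)} weight bytes, "
          f">= {min_time_us(ic, oc, bw):.2f} us at {bw:.1f} GB/s")
```

Under these assumptions the lower bounds come out at roughly 11 to 22 microseconds per GEMM, so any fixed per-call overhead counts directly against the measured bandwidth.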
Problem description:
We cannot reach the maximum memory bandwidth (expected with MCR: 1400 GB/s across 2 sockets, ~233 GB/s per sub-NUMA domain) when running small-ic-big-oc or small-oc-big-ic GEMMs with a small mb.
For small-oc-big-ic, we already parallelize over block_mb, block_ic, and block_oc; all block sizes are 32.
For small-ic-big-oc, we already parallelize over block_mb and block_oc; all block sizes are 32.
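To make the two parallelization schemes concrete, the sketch below counts the independent blocks each scheme produces with block size 32 and estimates how evenly they spread over 43 cores. It assumes equal-cost blocks and a static, even assignment to cores, which may not match the actual scheduler; the helper names are hypothetical.

```python
BLOCK = 32   # block size for mb, ic and oc, as stated in the post
CORES = 43   # cores in one sub-NUMA domain


def num_blocks(dim, block=BLOCK):
    """Number of blocks needed to cover dim (ceiling division)."""
    return -(-dim // block)


def tasks_small_oc_big_ic(mb, ic, oc):
    """Parallel over block_mb x block_ic x block_oc."""
    return num_blocks(mb) * num_blocks(ic) * num_blocks(oc)


def tasks_small_ic_big_oc(mb, ic, oc):
    """Parallel over block_mb x block_oc."""
    return num_blocks(mb) * num_blocks(oc)


def balance(tasks, cores=CORES):
    """Fraction of core-time doing useful work if the busiest core
    receives ceil(tasks / cores) equal-cost blocks."""
    busiest = -(-tasks // cores)
    return tasks / (busiest * cores)


t1 = tasks_small_oc_big_ic(1, 7168, 704)
t2 = tasks_small_ic_big_oc(1, 352, 7168)
print(f"small-oc-big-ic: {t1} blocks, balance {balance(t1):.3f}")
print(f"small-ic-big-oc: {t2} blocks, balance {balance(t2):.3f}")
```

With mb1 or mb4 there is a single mb block, so small-ic-big-oc yields 224 blocks; under the stated assumptions some cores get 6 blocks and others 5. Splitting over block_ic additionally requires combining partial sums across blocks, which this count does not model.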
Weights are int8, and we use int8 intrinsics for the computation.
Activations are bf16 and are quantized to int8 per row at runtime.
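As an illustration of per-row runtime quantization, here is a minimal sketch. It assumes symmetric quantization with one scale per row (scale = max|x| / 127); the post does not say whether the actual kernel is symmetric or uses zero points. Plain Python floats stand in for bf16 values.

```python
def quantize_row(row):
    """Symmetric per-row quantization to the int8 range [-127, 127].

    Returns (quantized values, scale) such that value ~= q * scale.
    Assumption: symmetric scheme, one scale per row, no zero point.
    """
    amax = max((abs(v) for v in row), default=0.0)
    if amax == 0.0:
        # All-zero (or empty) row: any scale works, use 1.0.
        return [0] * len(row), 1.0
    scale = amax / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in row]
    return q, scale


def quantize_per_row(activations):
    """Quantize each row of an mb x ic activation matrix independently."""
    return [quantize_row(row) for row in activations]


acts = [[0.5, -1.0, 0.25], [0.0, 0.0, 0.0]]
for q, scale in quantize_per_row(acts):
    print(q, scale)
```

With mb1 or mb4 this step touches only 1 or 4 rows of length ic, so it is small next to the weight traffic, though it still runs before the int8 GEMM can start.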
With the optimizations above, we reach only about half of the memory bandwidth (half of ~233 GB/s). With mb1 or mb4 we expect these GEMMs to be memory bound, so they should be able to use the full ~233 GB/s.
Are there any further optimizations we should consider? Thanks.
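The memory-bound expectation above can be checked with a rough arithmetic-intensity estimate. This sketch counts 2 operations per multiply-accumulate and only the int8 weight and bf16 activation reads; output traffic is left out because the output dtype is not stated in the post.

```python
def arithmetic_intensity(mb, ic, oc):
    """Operations per byte read, counting int8 weights (1 byte each)
    and bf16 activations (2 bytes each). Output writes are ignored."""
    ops = 2 * mb * ic * oc                 # one multiply + one add per MAC
    bytes_read = ic * oc + 2 * mb * ic     # weights + activations
    return ops / bytes_read


for mb, ic, oc in [(1, 7168, 704), (4, 7168, 704),
                   (1, 352, 7168), (4, 352, 7168)]:
    print(f"mb{mb}ic{ic}oc{oc}: "
          f"{arithmetic_intensity(mb, ic, oc):.2f} ops/byte")
```

Under these assumptions the intensity is about 2 ops/byte for mb1 and about 8 ops/byte for mb4, which is consistent with expecting the weight stream, not compute, to set the runtime.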