cblas_sgemm performance bug with AVX512

zbjornson · 09-08-2020

Hello,

I believe there is a performance bug in cblas_sgemm in MKL 2020 v2 and v3 on Intel AVX512 processors.

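#include <cstddef>
#include "mkl.h"
int main() {
  size_t m = 250000;
  size_t nk = 6;
  // Value-initialize the buffers so sgemm does not read indeterminate floats.
  float* data = new float[m * nk]();
  float* other = new float[nk * nk]();
  float* dest = new float[m * nk]();
  // dest (m x nk, ldc = m) = data^T * other^T, all in column-major terms.
  cblas_sgemm(CblasColMajor, CblasTrans, CblasTrans, m, nk, nk, 1.0f, data, nk, other, nk, 0.0f, dest, m);
  delete[] data;
  delete[] other;
  delete[] dest;
}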
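
For anyone checking results against the call above, here is a minimal reference sketch of what that cblas_sgemm invocation computes, specialized to n = k = nk, alpha = 1 and beta = 0 as in the repro. The function name reference_sgemm_tt is hypothetical and not part of MKL; it is a plain triple loop for checking results, not a tuned kernel.

```cpp
#include <cstddef>

// Hypothetical reference for:
//   cblas_sgemm(CblasColMajor, CblasTrans, CblasTrans,
//               m, nk, nk, 1.0f, data, nk, other, nk, 0.0f, dest, m);
// With both operands transposed and lda = ldb = nk:
//   op(A)(i, p) = data[p + i * nk]
//   op(B)(p, j) = other[j + p * nk]
//   C(i, j)     = dest[i + j * m]   (column-major, ldc = m)
void reference_sgemm_tt(std::size_t m, std::size_t nk,
                        const float* data, const float* other, float* dest) {
  for (std::size_t j = 0; j < nk; ++j) {
    for (std::size_t i = 0; i < m; ++i) {
      float acc = 0.0f;
      for (std::size_t p = 0; p < nk; ++p) {
        acc += data[p + i * nk] * other[j + p * nk];
      }
      dest[i + j * m] = acc;  // beta = 0, so dest is overwritten
    }
  }
}
```

Put differently, data and other can be read as row-major m x nk and nk x nk matrices, and dest receives their product in column-major order.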

Run on a Xeon Cascade Lake:

with MKL_ENABLE_INSTRUCTIONS=AVX2: 398 ops/sec
with MKL_ENABLE_INSTRUCTIONS=AVX512: 41 ops/sec - this should be >= AVX2
with default dispatching: 41 ops/sec

Run on an AMD EPYC Rome:

with default dispatching: 243 ops/sec

The defect only manifests for nk < 8.
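
The post does not show how the ops/sec figures were measured. One way to get comparable numbers is a simple timing loop like the sketch below. The helper name ops_per_sec and the one-second default are assumptions for illustration, not part of the original benchmark.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: calls `op` repeatedly for at least `min_seconds`
// of wall-clock time and returns the number of completed calls per second.
template <typename Op>
double ops_per_sec(Op&& op, double min_seconds = 1.0) {
  using clock = std::chrono::steady_clock;
  std::size_t calls = 0;
  double elapsed = 0.0;
  const auto start = clock::now();
  do {
    op();
    ++calls;
    elapsed = std::chrono::duration<double>(clock::now() - start).count();
    // Also keep looping while the clock has not advanced, to avoid dividing by zero.
  } while (elapsed < min_seconds || elapsed <= 0.0);
  return static_cast<double>(calls) / elapsed;
}
```

To time the repro, wrap the cblas_sgemm call from the snippet above in a lambda and pass it as op, then run the binary once per MKL_ENABLE_INSTRUCTIONS setting.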

Thank you,
Zach

Gennady_F_Intel · 09-09-2020

Is this on Linux?

Did you try MKL 2020.0?

zbjornson · 09-10-2020

This is on Linux, yes.

The same issue happens with 2020.0.

Here's the full build line I'm using:

g++ -I/opt/intel/mkl/include/ -DMKL_ILP64 -L/opt/intel/mkl/lib/intel64 -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl test.cpp -o test.o

Output:

$ MKL_VERBOSE=1 MKL_ENABLE_INSTRUCTIONS=AVX2 time ./test.o
MKL_VERBOSE Intel(R) MKL 2020.0 Product build 20191122 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.80GHz ilp64 sequential
MKL_VERBOSE SGEMM(T,T,250000,6,6,0x7fff9a6baba8,0x7ffb82e42010,6,0x556212ef8f20,6,0x7fff9a6babb0,0x7ffb82889010,250000) 6.34ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
0.10user 0.00system 0:00.10elapsed 100%CPU (0avgtext+0avgdata 16884maxresident)k
0inputs+0outputs (0major+3368minor)pagefaults 0swaps

$ MKL_VERBOSE=1 MKL_ENABLE_INSTRUCTIONS=AVX512 time ./test.o
MKL_VERBOSE Intel(R) MKL 2020.0 Product build 20191122 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.80GHz ilp64 sequential
MKL_VERBOSE SGEMM(T,T,250000,6,6,0x7ffd14db70e8,0x7f398b0ec010,6,0x560b61254f20,6,0x7ffd14db70f0,0x7f398ab33010,250000) 27.96ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
0.63user 0.00system 0:00.63elapsed 100%CPU (0avgtext+0avgdata 16960maxresident)k
0inputs+0outputs (0major+3368minor)pagefaults 0swaps

Gennady_F_Intel · 09-11-2020

Ok, I see, thanks.

I think that for tall-and-skinny matrices like this, there is no opportunity to use the wide (512-bit) registers for vectorization.

As nk grows, the performance of the AVX-512 code path improves and eventually exceeds that of the AVX2 code.
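
As a rough illustration of the register-width argument, the sketch below computes what fraction of vector lanes would hold useful data if a kernel packed the nk dimension into vector registers. That vectorization strategy is an assumption made for the arithmetic only. The thread does not say how MKL's kernels are actually laid out, and both function names are hypothetical.

```cpp
#include <cstddef>

// Number of 32-bit float lanes in a vector register of the given width.
constexpr std::size_t float_lanes(std::size_t register_bits) {
  return register_bits / 32;
}

// Fraction of lanes doing useful work when n floats (n > 0) are packed into
// as many registers of the given width as needed (assumed layout, see above).
constexpr double lane_utilization(std::size_t n, std::size_t register_bits) {
  return static_cast<double>(n) /
         static_cast<double>(((n + float_lanes(register_bits) - 1) /
                              float_lanes(register_bits)) *
                             float_lanes(register_bits));
}
```

Under that assumption, nk = 6 fills 6 of 8 lanes (0.75) in a 256-bit register but only 6 of 16 lanes (0.375) in a 512-bit one, while nk = 16 fills a 512-bit register completely. Lower lane utilization on its own would not account for AVX-512 being several times slower than AVX2, as measured above.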

Gennady_F_Intel · 09-24-2020

This issue is being closed and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.
