Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

cblas_sgemm performance bug with AVX512

zbjornson
Beginner

Hello,

I believe there is a performance bug in cblas_sgemm in MKL 2020 v2 and v3 on Intel AVX512 processors.

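#include <cstddef>
#include "mkl.h"
int main() {
  size_t m = 250000;  // tall dimension of the result
  size_t nk = 6;      // used as both n and k (the skinny dimension)
  float* data = new float[m * nk];    // A: stored nk x m, column-major, lda = nk
  float* other = new float[nk * nk];  // B: stored nk x nk, column-major, ldb = nk
  float* dest = new float[m * nk];    // C: m x nk, column-major, ldc = m
  // dest = 1.0 * data^T * other^T + 0.0 * dest
  cblas_sgemm(CblasColMajor, CblasTrans, CblasTrans, m, nk, nk, 1.0f, data, nk, other, nk, 0.0f, dest, m);
}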
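
For readers checking the layout arguments in the call above, here is a minimal sketch of what it computes, written as a naive triple loop. It follows the standard BLAS conventions for column-major storage with both operands transposed. The helper name reference_sgemm_tt is hypothetical and is not part of MKL; it is only meant for checking results on small inputs, not for performance comparison.

```cpp
#include <cstddef>

// Naive reference for
//   cblas_sgemm(CblasColMajor, CblasTrans, CblasTrans,
//               m, n, k, 1.0f, a, k, b, n, 0.0f, c, m);
// that is, C (m x n) = A^T * B^T, where A is stored column-major as k x m
// (lda = k) and B is stored column-major as n x k (ldb = n).
// reference_sgemm_tt is a hypothetical helper, not an MKL function.
void reference_sgemm_tt(std::size_t m, std::size_t n, std::size_t k,
                        const float* a, const float* b, float* c) {
  for (std::size_t j = 0; j < n; ++j) {
    for (std::size_t i = 0; i < m; ++i) {
      float acc = 0.0f;
      for (std::size_t p = 0; p < k; ++p) {
        // op(A)(i, p) = A(p, i) = a[p + i * k]
        // op(B)(p, j) = B(j, p) = b[j + p * n]
        acc += a[p + i * k] * b[j + p * n];
      }
      c[i + j * m] = acc;  // column-major C with ldc = m
    }
  }
}
```

In the reproduction, m = 250000 and n = k = 6, so each output element is a dot product of only six terms.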


Run on a Xeon Cascade Lake:

  • with MKL_ENABLE_INSTRUCTIONS=AVX2: 398 ops/sec
  • with MKL_ENABLE_INSTRUCTIONS=AVX512: 41 ops/sec - this should be >= AVX2
  • with default dispatching: 41 ops/sec

Run on an AMD EPYC Rome:

  • with default dispatching: 243 ops/sec

The defect only manifests for nk < 8.
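
The post does not show how the ops/sec figures were measured. One way such numbers could be obtained is sketched below, assuming a simple wall-clock loop with a single untimed warm-up call. The helper ops_per_second is hypothetical; the actual harness behind the numbers above may differ.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: calls `op` once untimed (warm-up), then `iterations`
// more times under a steady clock, and returns calls per second.
template <typename Op>
double ops_per_second(Op&& op, std::size_t iterations) {
  op();  // warm-up call, excluded from the measurement
  const auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < iterations; ++i) {
    op();
  }
  const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return static_cast<double>(iterations) / elapsed.count();
}
```

With the reproduction above, op would be a lambda that wraps the cblas_sgemm call, and the program would be run once per MKL_ENABLE_INSTRUCTIONS setting.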

Thank you,
Zach

Gennady_F_Intel
Moderator

Is this on Linux?

Did you try MKL 2020.0?



zbjornson
Beginner

This is on Linux, yes.

The same issue happens with 2020.0.

Here's the full build line I'm using:

g++ -I/opt/intel/mkl/include/ -DMKL_ILP64 -L/opt/intel/mkl/lib/intel64 -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl test.cpp -o test.o

Output:

$ MKL_VERBOSE=1 MKL_ENABLE_INSTRUCTIONS=AVX2 time ./test.o
MKL_VERBOSE Intel(R) MKL 2020.0 Product build 20191122 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.80GHz ilp64 sequential
MKL_VERBOSE SGEMM(T,T,250000,6,6,0x7fff9a6baba8,0x7ffb82e42010,6,0x556212ef8f20,6,0x7fff9a6babb0,0x7ffb82889010,250000) 6.34ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
0.10user 0.00system 0:00.10elapsed 100%CPU (0avgtext+0avgdata 16884maxresident)k
0inputs+0outputs (0major+3368minor)pagefaults 0swaps
$ MKL_VERBOSE=1 MKL_ENABLE_INSTRUCTIONS=AVX512 time ./test.o
MKL_VERBOSE Intel(R) MKL 2020.0 Product build 20191122 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.80GHz ilp64 sequential
MKL_VERBOSE SGEMM(T,T,250000,6,6,0x7ffd14db70e8,0x7f398b0ec010,6,0x560b61254f20,6,0x7ffd14db70f0,0x7f398ab33010,250000) 27.96ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
0.63user 0.00system 0:00.63elapsed 100%CPU (0avgtext+0avgdata 16960maxresident)k
0inputs+0outputs (0major+3368minor)pagefaults 0swaps
Gennady_F_Intel
Moderator

Ok, I see, thanks.

I think that for such tall and skinny matrices there is little opportunity to use the wide (512-bit) registers for vectorization.

As nk grows, the performance of the AVX-512 code path improves and eventually exceeds that of the AVX2 code.
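
As a rough illustration of this reasoning: a 256-bit register holds 8 floats and a 512-bit register holds 16. The sketch below computes the fraction of lanes doing useful work under the assumption that a kernel vectorizes along the short nk dimension and pads the last vector. The thread does not confirm how MKL's kernels are actually structured, so this is only arithmetic, and lane_utilization is a hypothetical helper.

```cpp
#include <cstddef>

// Fraction of SIMD lanes doing useful work when a run of `n` floats is
// processed in vectors of `lanes` floats, with the last vector padded.
// Assumption (not confirmed in the thread): vectorization is along this run.
inline double lane_utilization(std::size_t n, std::size_t lanes) {
  const std::size_t vectors = (n + lanes - 1) / lanes;  // ceil(n / lanes)
  return static_cast<double>(n) / static_cast<double>(vectors * lanes);
}
```

Under that assumption, nk = 6 fills 6 of 8 lanes (0.75) with 256-bit vectors and 6 of 16 lanes (0.375) with 512-bit vectors. Lower utilization on its own would not account for AVX-512 being about ten times slower than AVX2 in the numbers reported above.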


Gennady_F_Intel
Moderator

This issue is being closed and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.


