zbjornson
Beginner
235 Views

cblas_sgemm performance bug with AVX512

Hello,

I believe there is a performance bug in cblas_sgemm in MKL 2020 v2 and v3 on Intel AVX512 processors.

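#include <cstddef>
#include "mkl.h"
int main() {
  size_t m = 250000;  // rows of the tall, skinny result
  size_t nk = 6;      // used for both n and k
  // Value-initialize the inputs so the call does not read indeterminate values.
  float* data = new float[m * nk]();    // stored nk x m, column-major (lda = nk)
  float* other = new float[nk * nk]();  // stored nk x nk (ldb = nk)
  float* dest = new float[m * nk];      // m x nk result (ldc = m)
  // dest = data^T * other^T
  cblas_sgemm(CblasColMajor, CblasTrans, CblasTrans, m, nk, nk, 1.0f, data, nk, other, nk, 0.0f, dest, m);
  delete[] data;
  delete[] other;
  delete[] dest;
}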
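
To make the storage layout in the call above explicit, here is a minimal reference sketch of the same operation (column-major, both operands transposed) in plain C++. The name `ref_sgemm_tt` is hypothetical and not part of MKL. It is meant only for checking results on small inputs, not as a statement of how MKL implements the routine.

```cpp
#include <cassert>
#include <cstddef>

// Reference for C = alpha * A^T * B^T + beta * C with column-major storage,
// following the argument order of the cblas_sgemm call in the reproducer.
// a is stored k x m (lda >= k), b is stored n x k (ldb >= n),
// c is m x n (ldc >= m).
void ref_sgemm_tt(std::size_t m, std::size_t n, std::size_t k, float alpha,
                  const float* a, std::size_t lda,
                  const float* b, std::size_t ldb,
                  float beta, float* c, std::size_t ldc) {
  assert(lda >= k && ldb >= n && ldc >= m);
  for (std::size_t j = 0; j < n; ++j) {
    for (std::size_t i = 0; i < m; ++i) {
      float acc = 0.0f;
      for (std::size_t p = 0; p < k; ++p) {
        // op(A)(i,p) = A(p,i) = a[p + i*lda]
        // op(B)(p,j) = B(j,p) = b[j + p*ldb]
        acc += a[p + i * lda] * b[j + p * ldb];
      }
      float& out = c[i + j * ldc];
      // As in BLAS, C is not read when beta is zero.
      out = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * out;
    }
  }
}
```

With the reproducer's sizes this would be called as `ref_sgemm_tt(m, nk, nk, 1.0f, data, nk, other, nk, 0.0f, dest, m)`.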


Run on a Xeon Cascade Lake:

  • with MKL_ENABLE_INSTRUCTIONS=AVX2: 398 ops/sec
  • with MKL_ENABLE_INSTRUCTIONS=AVX512: 41 ops/sec - this should be >= AVX2
  • with default dispatching: 41 ops/sec

Run on an AMD EPYC Rome:

  • with default dispatching: 243 ops/sec

The defect only manifests for nk < 8.
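
The post does not say how the ops/sec figures were measured. One possible way to obtain such a number is a small timing loop like the sketch below. `ops_per_second` is a hypothetical helper, and the 0.5 second default is an arbitrary choice.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Calls `work` repeatedly for at least `min_seconds` of wall-clock time
// and returns the number of completed calls per second.
template <typename F>
double ops_per_second(F&& work, double min_seconds = 0.5) {
  assert(min_seconds > 0.0);
  using clock = std::chrono::steady_clock;
  work();  // warm-up call, not timed
  std::size_t calls = 0;
  double elapsed = 0.0;
  const auto start = clock::now();
  do {
    work();
    ++calls;
    elapsed = std::chrono::duration<double>(clock::now() - start).count();
  } while (elapsed < min_seconds);
  return static_cast<double>(calls) / elapsed;
}
```

In the reproducer, the callable would be a lambda wrapping the cblas_sgemm call with the same arguments, and the program would be run once per MKL_ENABLE_INSTRUCTIONS setting.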

Thank you,
Zach

4 Replies
Gennady_F_Intel
Moderator
225 Views

Is that Linux OS?

Did you try the MKL 2020.0 version?



zbjornson
Beginner
211 Views

This is on Linux, yes.

The same issue happens with 2020.0.

Here's the full build line I'm using:

g++ -I/opt/intel/mkl/include/ -DMKL_ILP64 -L/opt/intel/mkl/lib/intel64 -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl test.cpp -o test.o

Output:

$ MKL_VERBOSE=1 MKL_ENABLE_INSTRUCTIONS=AVX2 time ./test.o
MKL_VERBOSE Intel(R) MKL 2020.0 Product build 20191122 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.80GHz ilp64 sequential
MKL_VERBOSE SGEMM(T,T,250000,6,6,0x7fff9a6baba8,0x7ffb82e42010,6,0x556212ef8f20,6,0x7fff9a6babb0,0x7ffb82889010,250000) 6.34ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
0.10user 0.00system 0:00.10elapsed 100%CPU (0avgtext+0avgdata 16884maxresident)k
0inputs+0outputs (0major+3368minor)pagefaults 0swaps
$ MKL_VERBOSE=1 MKL_ENABLE_INSTRUCTIONS=AVX512 time ./test.o
MKL_VERBOSE Intel(R) MKL 2020.0 Product build 20191122 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.80GHz ilp64 sequential
MKL_VERBOSE SGEMM(T,T,250000,6,6,0x7ffd14db70e8,0x7f398b0ec010,6,0x560b61254f20,6,0x7ffd14db70f0,0x7f398ab33010,250000) 27.96ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
0.63user 0.00system 0:00.63elapsed 100%CPU (0avgtext+0avgdata 16960maxresident)k
0inputs+0outputs (0major+3368minor)pagefaults 0swaps
Gennady_F_Intel
Moderator
203 Views

Ok, I see, thanks.

I think that for such tall and skinny matrices, there is no opportunity to use the wide (512-bit) registers for vectorization.

As nk gets larger, the performance of the AVX-512 code path improves and will eventually exceed that of the AVX2 code path.
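
One way to read this reply is in terms of how many float lanes of a register are filled by a run of nk contiguous values. The sketch below computes that fraction. It assumes the kernel vectorizes along the nk dimension, which the thread does not confirm (a kernel could also vectorize along m), and `lane_utilization` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>

// Fraction of float lanes doing useful work when `len` contiguous floats
// are processed in registers holding `lanes` floats each; the last
// register is only partly filled when len is not a multiple of lanes.
double lane_utilization(std::size_t len, std::size_t lanes) {
  assert(len > 0 && lanes > 0);
  const std::size_t registers = (len + lanes - 1) / lanes;  // ceil(len / lanes)
  return static_cast<double>(len) / static_cast<double>(registers * lanes);
}
```

For nk = 6 this gives 0.75 with 8 lanes (256-bit) and 0.375 with 16 lanes (512-bit). That difference alone does not account for the roughly 10x gap reported above.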


Gennady_F_Intel
Moderator
176 Views

This issue is being closed, and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.


