topic Re:cblas_sgemm performance bug with AVX512 in Intel® oneAPI Math Kernel Library

cblas_sgemm performance bug with AVX512

zbjornson — Tue, 08 Sep 2020 17:48:18 GMT

Hello,

I believe there is a performance bug in cblas_sgemm in MKL 2020 v2 and v3 on Intel AVX512 processors.

#include <cstddef>
#include "mkl.h"

int main() {
  size_t m = 250000;
  size_t nk = 6;
  float* data = new float[m * nk];
  float* other = new float[nk * nk];
  float* dest = new float[m * nk];
  cblas_sgemm(CblasColMajor, CblasTrans, CblasTrans,
              m, nk, nk, 1.0f, data, nk, other, nk, 0.0f, dest, m);
}

Run on a Xeon Cascade Lake:

with MKL_ENABLE_INSTRUCTIONS=AVX2: 398 ops/sec
with MKL_ENABLE_INSTRUCTIONS=AVX512: 41 ops/sec - this should be >= AVX2
with default dispatching: 41 ops/sec

Run on an AMD EPYC Rome:

with default dispatching: 243 ops/sec

The defect only manifests for nk < 8.
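
The snippet above makes a single call, and the post does not show how the ops/sec figures were measured. A minimal sketch of one way to get such a number follows. The names `ops_per_sec` and `naive_sgemm_tt` are hypothetical helpers, not part of MKL; the naive kernel only mirrors the argument layout of the `cblas_sgemm` call so the harness can be compiled without MKL, and it says nothing about MKL's own speed.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical stand-in for the call under test (NOT MKL).
// Computes C = A^T * B^T in column-major storage, matching the post's call:
//   A is stored k x m (lda = k), B is stored n x k (ldb = n),
//   C is m x n (ldc = m).
void naive_sgemm_tt(std::size_t m, std::size_t n, std::size_t k,
                    const float* a, const float* b, float* c) {
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                // op(A)(i,p) = A(p,i) = a[p + i*k]
                // op(B)(p,j) = B(j,p) = b[j + p*n]
                acc += a[p + i * k] * b[j + p * n];
            }
            c[i + j * m] = acc;
        }
    }
}

// Hypothetical helper: run `fn` `warmup` times untimed, then `reps` times
// timed, and return calls per second.
template <typename Fn>
double ops_per_sec(Fn&& fn, int warmup, int reps) {
    assert(reps > 0);
    for (int i = 0; i < warmup; ++i) fn();
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) fn();
    const auto t1 = std::chrono::steady_clock::now();
    const double secs = std::chrono::duration<double>(t1 - t0).count();
    return reps / secs;
}
```

With the three buffers filled beforehand, `ops_per_sec([&] { naive_sgemm_tt(m, nk, nk, data, other, dest); }, 3, 100)` gives a baseline; with MKL linked, the lambda body would instead be the `cblas_sgemm` call from the snippet above. The warm-up calls keep one-time setup out of the timed region.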

Thank you,
Zach

Re:cblas_sgemm performance bug with AVX512

Gennady_F_Intel — Wed, 09 Sep 2020 07:32:33 GMT

Is this on Linux?

Did you try the MKL 2020.0 version?

Re: Re:cblas_sgemm performance bug with AVX512

zbjornson — Thu, 10 Sep 2020 17:29:16 GMT

This is on Linux, yes.

The same issue happens with 2020.0.

Here's the full build line I'm using:

g++ -I/opt/intel/mkl/include/ -DMKL_ILP64 -L/opt/intel/mkl/lib/intel64 -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl test.cpp -o test.o

Output:

$ MKL_VERBOSE=1 MKL_ENABLE_INSTRUCTIONS=AVX2 time ./test.o
MKL_VERBOSE Intel(R) MKL 2020.0 Product build 20191122 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.80GHz ilp64 sequential
MKL_VERBOSE SGEMM(T,T,250000,6,6,0x7fff9a6baba8,0x7ffb82e42010,6,0x556212ef8f20,6,0x7fff9a6babb0,0x7ffb82889010,250000) 6.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
0.10user 0.00system 0:00.10elapsed 100%CPU (0avgtext+0avgdata 16884maxresident)k
0inputs+0outputs (0major+3368minor)pagefaults 0swaps

$ MKL_VERBOSE=1 MKL_ENABLE_INSTRUCTIONS=AVX512 time ./test.o
MKL_VERBOSE Intel(R) MKL 2020.0 Product build 20191122 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.80GHz ilp64 sequential
MKL_VERBOSE SGEMM(T,T,250000,6,6,0x7ffd14db70e8,0x7f398b0ec010,6,0x560b61254f20,6,0x7ffd14db70f0,0x7f398ab33010,250000) 27.96ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
0.63user 0.00system 0:00.63elapsed 100%CPU (0avgtext+0avgdata 16960maxresident)k
0inputs+0outputs (0major+3368minor)pagefaults 0swaps
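
To put the two MKL_VERBOSE timings on a common scale, the per-call time can be converted to GFLOP/s using the usual 2*m*n*k operation count for GEMM. This is a rough sketch; `gemm_flops` and `gflops` are hypothetical helper names, and each timing here is from a single call, so one-time setup cost may be included.

```cpp
#include <cassert>
#include <cstdint>

// Floating-point operations in one GEMM: each of the m*n outputs
// needs k multiplies and k adds.
constexpr double gemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
           static_cast<double>(k);
}

// Convert a per-call time in milliseconds (as printed by MKL_VERBOSE)
// into GFLOP/s.
inline double gflops(std::int64_t m, std::int64_t n, std::int64_t k,
                     double ms) {
    assert(ms > 0.0);
    return gemm_flops(m, n, k) / (ms * 1e-3) / 1e9;
}
```

For m = 250000 and n = k = 6 this is 18 million operations per call, so `gflops(250000, 6, 6, 6.34)` is roughly 2.8 and `gflops(250000, 6, 6, 27.96)` is roughly 0.64.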

Re:cblas_sgemm performance bug with AVX512

Gennady_F_Intel — Fri, 11 Sep 2020 11:36:33 GMT

Ok, I see, thanks.

I think that for tall and skinny matrices like these, there is no opportunity to use the wide (512-bit) registers for vectorization.

As nk grows, the performance of the AVX-512 code path improves and eventually exceeds that of the AVX2 code.
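
The register-width argument can be illustrated with a small calculation. This is only a sketch under a loud assumption: that the kernel vectorizes along the short nk dimension and pads or masks the unused lanes. MKL's actual kernels are not described in this thread, and `lane_utilization` is a hypothetical helper, not an MKL function.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kAvx2FloatLanes = 8;     // 256 bits / 32-bit float
constexpr std::size_t kAvx512FloatLanes = 16;  // 512 bits / 32-bit float

// Fraction of SIMD lanes doing useful work when a dimension of length `len`
// is processed in vectors of `lanes` floats, with the tail padded or masked.
// ASSUMPTION: vectorization runs along this dimension only.
inline double lane_utilization(std::size_t len, std::size_t lanes) {
    assert(len > 0 && lanes > 0);
    const std::size_t vectors = (len + lanes - 1) / lanes;  // ceil(len / lanes)
    return static_cast<double>(len) / static_cast<double>(vectors * lanes);
}
```

Under that assumption, nk = 6 fills 6 of 8 lanes with AVX2 (`lane_utilization(6, kAvx2FloatLanes)` is 0.75) but only 6 of 16 with AVX-512 (0.375). That would explain AVX-512 offering no gain at this size, but on its own it does not explain the AVX-512 path being slower than the AVX2 path.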

Re:cblas_sgemm performance bug with AVX512

Gennady_F_Intel — Thu, 24 Sep 2020 10:08:27 GMT

This issue is being closed, and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.