Intel® oneAPI Math Kernel Library

AVX512 slower than AVX2? What am I doing wrong?

tomasz_j_2 — Tue, 17 Apr 2018 20:03:30 GMT

Hello All,

I was so excited to test the new Intel Xeon Silver 4114 CPU, only to find out that with AVX512 enabled the performance of matrix multiplication is the same as with legacy SSE4. If I restrict the MKL library to AVX2 only, the computation is twice as fast. What am I doing wrong here? The library seems to respond OK to the following call (here in FORTRAN):

stat=mkl_cbwr_set (MKL_CBWR_AVX512)

with stat == 0. But the computation slows down by a factor of two compared to the speed I get after setting the environment variable MKL_ENABLE_INSTRUCTIONS to AVX2. Is it possible that this is what I should expect from this particular CPU? The MKL version is 2018.2.199.

Thanks!

Tomasz

Gennady_F_Intel — Wed, 18 Apr 2018 02:46:58 GMT

Tomasz, at what input size do you observe this gap? We will check. Is that on a Linux OS?

tomasz_j_2 — Wed, 18 Apr 2018 12:26:42 GMT

Yes, it is Linux, kernel 4.9.0-0, Debian OS.

I am testing on 3000x3000 double-precision matrices. I did some research and I suspect that this is what I should expect. The Intel website says that the Silver 4114 has one AVX512-capable FPU per core. If this is true, then the efficiency gain from AVX512 is offset by fewer FPUs being available on the chip (I suspect there are 2 FPUs capable of AVX2). The numbers I get for 3000x3000 matrices are as follows:

24.668 Gflop/s AVX512

40.540 Gflop/s AVX2

21.739 Gflop/s AVX

11.668 Gflop/s SSE4_2

If my suspicion about the number of FPUs is correct then MKL should fall back to AVX2 on Xeon Silver to get the max throughput.
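To make the FPU argument concrete, here is a rough sketch of the peak double-precision flops per cycle per core. It assumes, as guessed above, one 512-bit FMA unit on the Silver versus two 256-bit FMA units for AVX2. The unit counts and the helper name peak_flops_per_cycle are assumptions for illustration, not figures confirmed in this thread.

```python
def peak_flops_per_cycle(vector_bits, fma_units, element_bits=64):
    """Peak flops per cycle per core.

    lanes = how many doubles fit in one vector register;
    each FMA counts as 2 flops (one multiply plus one add).
    """
    lanes = vector_bits // element_bits
    return lanes * 2 * fma_units


# Assumed configurations (hypothetical unit counts, see the note above).
avx512_one_unit = peak_flops_per_cycle(512, fma_units=1)
avx2_two_units = peak_flops_per_cycle(256, fma_units=2)
avx512_two_units = peak_flops_per_cycle(512, fma_units=2)

print(avx512_one_unit, avx2_two_units, avx512_two_units)  # 16 16 32
```

Under these assumptions AVX512 with a single unit gives no per-cycle advantage over AVX2, so any clock reduction while running AVX512 code would make it the slower path.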

I hope my guess is not correct, otherwise what would be the goal of retrofitting the CPU with a crippled AVX512 capability?

Thanks!

Tomasz

Gennady_F_Intel — Fri, 20 Apr 2018 04:23:08 GMT

Could you please set the MKL_VERBOSE=1 environment variable to check whether the AVX-512 branch of the MKL code has been executed?

tomasz_j_2 — Fri, 20 Apr 2018 12:37:02 GMT

Here is the result:

MKL_VERBOSE Intel(R) MKL 2018.0 Update 2 Product build 20180127 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.20GHz lp64 sequential
MKL_VERBOSE DGEMM(n,n,3000,3000,3000,0x7ffe29d0bda8,0x7f8b5bfe1620,3000,0x7f8b6048b820,3000,0x7ffe29d0bdb0,0x7f8b64935a20,3000) 2.14s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
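As a cross-check, the DGEMM line above can be converted to a rate using the usual 2*m*n*k flop count for a matrix multiply. This is only a sketch: the helper name dgemm_gflops is made up for illustration, and the regular expression assumes the line layout shown here with the time reported in seconds.

```python
import re

# DGEMM line copied from the MKL_VERBOSE output above.
line = (
    "MKL_VERBOSE DGEMM(n,n,3000,3000,3000,0x7ffe29d0bda8,0x7f8b5bfe1620,3000,"
    "0x7f8b6048b820,3000,0x7ffe29d0bdb0,0x7f8b64935a20,3000) 2.14s "
    "CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1"
)


def dgemm_gflops(verbose_line):
    """Return Gflop/s for a DGEMM verbose line whose time is given in seconds."""
    match = re.search(
        r"DGEMM\(\w,\w,(\d+),(\d+),(\d+),.*?\)\s+([\d.]+)s", verbose_line
    )
    if match is None:
        raise ValueError("not a DGEMM line with a time in seconds")
    m, n, k = (int(match.group(i)) for i in (1, 2, 3))
    seconds = float(match.group(4))
    # A matrix multiply performs one multiply and one add per (i, j, k) triple.
    return 2.0 * m * n * k / seconds / 1e9


print(f"{dgemm_gflops(line):.2f} Gflop/s")  # 25.23 Gflop/s
```

That is in line with the 24.668 Gflop/s measured for AVX512. The line also reports "sequential" and NThr:1, so these are single-thread numbers.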

tomasz_j_2 — Tue, 24 Apr 2018 20:29:36 GMT

So, can I conclude that this is what I should expect from this processor, and that AVX512 is, in fact, slower than legacy AVX2 on Xeon Silver? I see that the "Gold" series has two FPUs per core.

I still hope that the answer is negative and something can be done.

Bernard — Wed, 12 Dec 2018 17:06:06 GMT

The PMU of low-end "Silver" processors will probably be more eager to lower the clock frequency of cores that execute AVX512 code.

You should invest in a Gold SKU or maybe in HEDT Skylake-X processors.
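A rough way to see how much clock reduction would be needed to explain the gap: divide each measured rate by an assumed per-cycle peak. The sketch below uses the AVX512 and AVX2 numbers reported earlier in the thread and assumes both code paths peak at 16 double-precision flops per cycle per core (one 512-bit FMA unit versus two 256-bit units). That peak and the helper name min_clock_ghz are assumptions, not confirmed specifications.

```python
# Measured DGEMM rates reported earlier in this thread (Gflop/s).
measured = {"AVX512": 24.668, "AVX2": 40.540}

# Assumption: both code paths peak at 16 double-precision flops per cycle.
FLOPS_PER_CYCLE = 16


def min_clock_ghz(gflops, flops_per_cycle=FLOPS_PER_CYCLE):
    """Lowest core clock (GHz) that could sustain the given rate.

    Real code runs below peak, so the actual clock can only be higher.
    """
    return gflops / flops_per_cycle


for isa, rate in measured.items():
    print(f"{isa}: at least {min_clock_ghz(rate):.2f} GHz")

ratio = measured["AVX512"] / measured["AVX2"]
print(f"AVX512 / AVX2 throughput ratio: {ratio:.2f}")
```

If both paths ran at the same efficiency, the AVX512 clock would have to be roughly 61% of the AVX2 clock to account for the difference. Whether frequency reduction alone explains the gap cannot be settled from the numbers in this thread.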