I was so excited to test the new Intel Xeon Silver 4114 CPU, only to find out that with AVX512 enabled the performance of matrix multiplication is the same as with legacy SSE4. If I restrict the MKL library to AVX2 only, the computation runs twice as fast. What am I doing wrong here? The library seems to respond OK to the following call (here in FORTRAN):
with stat == 0. But the computation slows down by a factor of two compared to the speed I get after setting the environment variable MKL_ENABLE_INSTRUCTIONS to AVX2. Is it possible that this is what I should get for this particular CPU? The MKL version is 2018.2.199.
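For anyone who wants to reproduce the comparison, here is a minimal sketch of running the same benchmark once per instruction-set setting through the MKL_ENABLE_INSTRUCTIONS environment variable. The executable name `./dgemm_bench` is hypothetical and stands in for whatever program does the 3000x3000 DGEMM; the helper names are mine, not part of MKL.

```python
import os
import subprocess

# Settings of MKL_ENABLE_INSTRUCTIONS matching the cases compared in this thread.
ISA_LEVELS = ["AVX512", "AVX2", "AVX", "SSE4_2"]

def env_for(isa, base=None):
    """Return a copy of the environment with MKL limited to `isa`."""
    env = dict(os.environ if base is None else base)
    env["MKL_ENABLE_INSTRUCTIONS"] = isa
    env["MKL_VERBOSE"] = "1"  # MKL then prints one timing line per BLAS call
    return env

def run_all(benchmark="./dgemm_bench"):  # hypothetical benchmark executable
    """Run the benchmark once per instruction-set setting."""
    for isa in ISA_LEVELS:
        print(f"--- {isa} ---")
        subprocess.run([benchmark], env=env_for(isa), check=True)

# Usage (assumes the benchmark binary exists): run_all()
```

The variable has to be set before MKL is initialized in the process, which is why the sketch passes it through the child's environment instead of changing it mid-run.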
Yes, it is Linux, kernel 4.9.0-0, Debian OS.
I am testing on 3000x3000 matrices of double-precision numbers. I did some research and I suspect that this is what I should get. Intel's website says that the Silver 4114 has one FPU per core that is capable of AVX512. If this is true, then the efficiency gain from AVX512 is offset by fewer FPUs being available on the chip (I suspect there are 2 FPUs capable of AVX2). The numbers I get for 3000x3000 matrices are as follows:
24.668 Gflop/s AVX512
40.540 Gflop/s AVX2
21.739 Gflop/s AVX
11.668 Gflop/s SSE4_2
If my suspicion about the number of FPUs is correct, then MKL should fall back to AVX2 on Xeon Silver to get the maximum throughput.
I hope my guess is not correct; otherwise, what would be the goal of retrofitting the CPU with a crippled AVX512 capability?
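The FPU argument can be put into numbers with a small sketch. It takes the unit counts guessed above (one 512-bit FMA unit versus two 256-bit FMA units per core) as an assumption, counts one fused multiply-add as 2 flops, and treats the reported figures as single-thread results. The function names are illustrative only.

```python
def peak_flops_per_cycle(vector_bits, fma_units, element_bits=64):
    """Peak double-precision flops per cycle for one core."""
    lanes = vector_bits // element_bits   # doubles per vector register
    return lanes * 2 * fma_units          # one FMA counts as 2 flops per lane

avx512_one_unit = peak_flops_per_cycle(512, 1)    # assumed Silver configuration
avx2_two_units = peak_flops_per_cycle(256, 2)     # assumed AVX2 configuration
avx512_two_units = peak_flops_per_cycle(512, 2)   # two-unit case, for comparison

def implied_clock_ghz(gflops, flops_per_cycle):
    """Lowest core clock that could sustain `gflops` at the given peak rate."""
    return gflops / flops_per_cycle

print(avx512_one_unit, avx2_two_units, avx512_two_units)
print(implied_clock_ghz(40.540, avx2_two_units))   # AVX2 measurement
print(implied_clock_ghz(24.668, avx512_one_unit))  # AVX512 measurement
```

Under these assumptions the per-cycle peak is the same for one AVX512 unit and two AVX2 units, so the unit count alone would predict equal speed, not a factor of two. The remaining gap would have to come from something else, such as the clock the core runs at or the efficiency of the kernel.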
Here is the result:
MKL_VERBOSE Intel(R) MKL 2018.0 Update 2 Product build 20180127 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.20GHz lp64 sequential
MKL_VERBOSE DGEMM(n,n,3000,3000,3000,0x7ffe29d0bda8,0x7f8b5bfe1620,3000,0x7f8b6048b820,3000,0x7ffe29d0bdb0,0x7f8b64935a20,3000) 2.14s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
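As a cross-check, the verbose line can be turned back into a Gflop/s figure using the usual 2*m*n*k operation count for a matrix multiply. This is a rough sketch: the regular expression is written only for the line format shown above, with the time reported in seconds.

```python
import re

line = ("MKL_VERBOSE DGEMM(n,n,3000,3000,3000,0x7ffe29d0bda8,0x7f8b5bfe1620,3000,"
        "0x7f8b6048b820,3000,0x7ffe29d0bdb0,0x7f8b64935a20,3000) 2.14s "
        "CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1")

def dgemm_gflops(verbose_line):
    """Extract m, n, k and the elapsed seconds, then return Gflop/s."""
    match = re.search(
        r"DGEMM\(\w,\w,(\d+),(\d+),(\d+),.*?\)\s+([\d.]+)s", verbose_line
    )
    if match is None:
        raise ValueError("not a DGEMM verbose line with a time in seconds")
    m, n, k = (int(match.group(i)) for i in (1, 2, 3))
    seconds = float(match.group(4))
    return 2.0 * m * n * k / seconds / 1e9  # 2*m*n*k flops for C = A*B

print(f"{dgemm_gflops(line):.2f} Gflop/s")
```

For the line above this gives roughly 25.2 Gflop/s on one thread, which is in line with the 24.668 Gflop/s AVX512 figure reported earlier.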
So, can I conclude that this is what I should get from this processor, and that AVX512 is, in fact, slower than legacy AVX2 on Xeon Silver? I see that the "Gold" series has two FPUs per core.
I still hope that the answer is negative and something can be done.
The PMU of the low-end "Silver" processors will probably lower the clock more eagerly on cores that execute AVX512 code.
You should invest in a Gold SKU or maybe in an HEDT Skylake-X processor.