Why is AVX512 not autodetected and used in MKL?

erling_andersen · ‎01-18-2019

Hi

I have a binary built with Intel C 19.0.1 that is a heavy user of MKL.

When I do

export MKL_ENABLE_INSTRUCTIONS=AVX512

the run time goes from 88s to 55s for one problem instance.

That is a HUGE saving.

How come MKL does not figure that out by itself? What is the philosophy?

Also it seems if I instead do

mkl_enable_instructions(MKL_ENABLE_AVX512)

then it has ZERO effect. Why?

Erling

Gennady_F_Intel · ‎01-21-2019

Hi Erling,

By default, Intel MKL automatically queries your Intel® processor and dispatches to the code path optimized for the instruction set architecture (ISA) it supports. For more details, see this Developer Guide section: https://software.intel.com/en-us/mkl-linux-developer-guide-instruction-set-specific-dispatching-on-intel-architectures

Regarding the second question: could you check which code path is actually called by setting MKL_VERBOSE? Run export MKL_VERBOSE=1 and then run your program; the output shows which MKL code path was used.

--Gennady

Gennady_F_Intel · ‎01-21-2019

You may see something like the following:

MKL_VERBOSE Intel(R) MKL 2019.0 Update 1 Product build 20180928 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.70GHz lp64 intel_thread

MKL_VERBOSE SGEMM(N,N,1024,1024,1024,0x7fff1bca3b84,0x2b3ac5e1f080,1024,0x2b3ac6220080,1024,0x7fff1bca3b98,0x2b3ac6621080,1024) 248.28ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
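
To check the dispatched code path from a script rather than by eye, the ISA name can be pulled out of the version line that MKL_VERBOSE prints first. This is only a minimal sketch: the function name isa_from_header is made up for illustration, and the pattern has been checked only against the two header formats quoted in this thread (AVX2 and AVX-512), so other MKL versions may need a different pattern.

```python
import re

# Assumption: the ISA abbreviation appears in parentheses as "(Intel(R) <ISA>)",
# as in the two header lines quoted in this thread.
ISA_RE = re.compile(r"\(Intel\(R\) ([A-Z0-9.\-]+)\)")


def isa_from_header(line):
    """Return the ISA abbreviation from an MKL_VERBOSE version line, or None."""
    if not line.startswith("MKL_VERBOSE Intel(R) MKL"):
        return None  # not the version line (for example a per-call line)
    match = ISA_RE.search(line)
    return match.group(1) if match else None


if __name__ == "__main__":
    header = (
        "MKL_VERBOSE Intel(R) MKL 2019.0 Update 1 Product build 20180928 "
        "for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 "
        "(Intel(R) AVX-512) enabled processors, Lnx 2.70GHz lp64 intel_thread"
    )
    print(isa_from_header(header))
```

Running the application once with each MKL_ENABLE_INSTRUCTIONS setting and comparing the returned value is one way to confirm that the two runs really used different code paths.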

erling_andersen · ‎01-21-2019

I apologize, my initial comparison was wrong. I compared the MT version to the ST version, so the speedup came from using multithreading.

It seems AVX512 is turned on based on the method you specify. However, the run time is unchanged compared to MKL_ENABLE_AVX2.

I also compared code compiled with Intel C 16 and 19 on the AVX512 machine, but the run time was the same.

What kind of speed up should I expect?

How do I make sure AVX512 is turned off in the comparison?

Gennady_F_Intel · ‎01-21-2019

>> It seems AVX512 is turned on based on the method you specify.

Please check whether this CPU supports these instruction sets. Run lscpu | grep 512 and look for flags such as avx512cd, avx512bw, avx512vl, ...

>> What kind of speed up should I expect?

<< This will depend on which MKL functions you run and on the problem sizes.

>> How do I make sure AVX512 is turned off in the comparison?

<< MKL_VERBOSE==ON|OFF

erling_andersen · ‎01-21-2019

My CPU is:

https://ark.intel.com/products/120483/Intel-Xeon-Gold-6126-Processor-19-25M-Cache-2-60-GHz-

eda@nordborg:~/mosekdbg$ lscpu | grep 512
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke spec_ctrl intel_stibp flush_l1d
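
The same check can be scripted. The sketch below filters a flags string for entries starting with avx512; the helper names avx512_flags and read_cpu_flags are hypothetical, and reading /proc/cpuinfo assumes an x86 Linux system like the one above.

```python
def avx512_flags(flags_text):
    """Return the sorted AVX-512 feature flags found in a CPU flags string."""
    return sorted({tok for tok in flags_text.split() if tok.startswith("avx512")})


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the first 'flags' entry from /proc/cpuinfo, or '' if unavailable."""
    try:
        with open(path) as handle:
            for line in handle:
                if line.startswith("flags"):
                    return line.split(":", 1)[1]
    except OSError:
        pass  # not Linux, or the file is not readable
    return ""


if __name__ == "__main__":
    found = avx512_flags(read_cpu_flags())
    print(found if found else "no AVX-512 flags reported")
```

An empty result means the kernel does not report AVX-512 support, in which case requesting AVX512 through MKL_ENABLE_INSTRUCTIONS cannot be expected to change anything.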

I get the same performance with

MKL_VERBOSE Intel(R) MKL 11.3 Update 2 Product build 20160120 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.60GHz lp64 sequential

and

MKL_VERBOSE Intel(R) MKL 2019.0 Update 1 Product build 20180928 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.60GHz lp64 sequential

Gennady_F_Intel · ‎01-21-2019

OK, sequential code. Which functions do you run, and what is the typical problem size?

erling_andersen · ‎01-21-2019

MKL is used within a multithreaded sparse Cholesky.
MKL is only used in sequential mode.
Only double-precision computations.
Most of the time is spent in GEMM, some in SYRK. Many of the calls are smallish; think n<=256.

Gennady_F_Intel · ‎01-22-2019

Sparse Cholesky? What is the exact function name?

Regarding dgemm, AVX2 vs. AVX-512 code branch: here are the quick results I see on my side (1 thread, small sizes):

>>>>>>>>>>>>>>>>>>........ 128x128 ..........
MKL_VERBOSE Intel(R) MKL 2019.0 Update 1 Product build 20180928 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.70GHz lp64 intel_thread
MKL_VERBOSE DGEMM(N,N,128,128,128,0x7ffd8c2e1d40,0x10890c0,128,0x10ba2c0,128,0x7ffd8c2e1d48,0x2b58b743c080,128) 18.56ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_NUM_THREADS == 1
size == 128, GFlops == 48.915
Done

MKL_VERBOSE Intel(R) MKL 2019.0 Update 1 Product build 20180928 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.70GHz lp64 intel_thread
MKL_VERBOSE DGEMM(N,N,128,128,128,0x7fff506767c0,0x19050c0,128,0x19362c0,128,0x7fff506767c8,0x2b600ea5a080,128) 17.75ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_NUM_THREADS == 1
size == 128, GFlops == 96.141
Done

erling_andersen · ‎01-22-2019

The sparse Cholesky is one I wrote. I doubt its name is relevant. It was just to tell you the context in which I do GEMM.

Using my own dense POTRF for n=5000 I get

eda@nordborg:~/mosekdbg$ export MKL_ENABLE_INSTRUCTIONS=AVX2
eda@nordborg:~/mosekdbg$ ~/mosekprj/dev/bld/nordborg/final/default/intelc-19.0.1/bin/runchol ++5000
Dense potrf time: 8.65e-01

eda@nordborg:~/mosekdbg$ export MKL_ENABLE_INSTRUCTIONS=AVX512
eda@nordborg:~/mosekdbg$ ~/mosekprj/dev/bld/nordborg/final/default/intelc-19.0.1/bin/runchol ++5000
Dense potrf time: 5.09e-01

A decent speedup: about a 40% time reduction. So far so good.

Returning to the sparse Cholesky (which I wrote myself):

MKL_VERBOSE Intel(R) MKL 2019.0 Update 1 Product build 20180928 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.60GHz lp64 sequential
MKL_VERBOSE DGEMM(T,N,1980,56,255,0x7ffed89cf480,0x9f61e00,256,0x9c1b3c0,256,0x7ffed89cf488,0x9c91280,1980) 1.19ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1

MKL_VERBOSE Intel(R) MKL 2019.0 Update 1 Product build 20180928 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.60GHz lp64 sequential
MKL_VERBOSE DGEMM(T,N,1980,56,255,0x7fff40530680,0xaf7ae00,256,0xac343c0,256,0x7fff40530688,0xacaa280,1980) 1.78ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1

I see AVX512 really speeds up this particular matrix multiplication just like dense POTRF. I assume the ms numbers are time to do the operation. Correct?
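
To put the two timings above on a common scale, the DGEMM dimensions in the verbose line can be turned into a rate using the usual 2*m*n*k flop count. This is a rough sketch only: the function name dgemm_gflops is made up, the argument order is assumed to follow the standard DGEMM signature (transa, transb, m, n, k, ...), and the unit suffixes other than ms and us are a guess about what MKL_VERBOSE may print.

```python
import re

# One MKL_VERBOSE call line: routine name, argument list, then the elapsed time.
CALL_RE = re.compile(r"MKL_VERBOSE\s+(\w+)\(([^)]*)\)\s+([0-9.]+)(ns|us|ms|s)\b")
# Only "ms" and "us" appear in this thread; "ns" and "s" are assumptions.
UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}


def dgemm_gflops(line):
    """Return GFLOP/s for a DGEMM verbose line, or None for any other line."""
    match = CALL_RE.match(line)
    if not match or match.group(1) != "DGEMM":
        return None
    args = match.group(2).split(",")
    m, n, k = (int(value) for value in args[2:5])  # after transa, transb
    seconds = float(match.group(3)) * UNIT_SECONDS[match.group(4)]
    return 2.0 * m * n * k / seconds / 1e9


if __name__ == "__main__":
    avx512 = ("MKL_VERBOSE DGEMM(T,N,1980,56,255,0x7ffed89cf480,0x9f61e00,256,"
              "0x9c1b3c0,256,0x7ffed89cf488,0x9c91280,1980) 1.19ms CNR:OFF")
    avx2 = ("MKL_VERBOSE DGEMM(T,N,1980,56,255,0x7fff40530680,0xaf7ae00,256,"
            "0xac343c0,256,0x7fff40530688,0xacaa280,1980) 1.78ms CNR:OFF")
    print("AVX-512: %.2f GFLOP/s" % dgemm_gflops(avx512))
    print("AVX2:    %.2f GFLOP/s" % dgemm_gflops(avx2))
```

For the two calls above this works out to roughly 47.5 GFLOP/s with AVX-512 against roughly 31.8 GFLOP/s with AVX2, a speedup of about 1.5x for this shape.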

However, my application does tons of other matrix multiplications, daxpy calls, etc., although GEMM dominates. Most of the GEMMs are much smaller than the one above. This leads to an improvement in total runtime of only about 10% for the whole application. Disappointing.

I wonder what the cause of the poor AVX512 result is.

* Down clocking?
* Does the benefit of AVX512 only materialize for GEMMs that are not too small?
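
One way to put numbers on this is Amdahl's law: if a fraction f of the runtime is sped up by a factor s, the new runtime is (1 - f) + f/s of the old one. The sketch below solves that for f. It rests on two assumptions that the measurements above do not confirm: that "about 10%" means the total time drops to 0.90 of its AVX2 value, and that all of the accelerated work speeds up like the single DGEMM shown (1.78ms vs 1.19ms). The function name accelerated_fraction is made up for illustration.

```python
def accelerated_fraction(overall_time_ratio, kernel_speedup):
    """Solve Amdahl's law for the fraction of runtime that was accelerated.

    overall_time_ratio: new total time divided by old total time (e.g. 0.90).
    kernel_speedup: speedup factor s of the accelerated part (must be > 1).
    From T_new / T_old = (1 - f) + f / s it follows that
    f = (1 - T_new / T_old) / (1 - 1 / s).
    """
    if kernel_speedup <= 1.0:
        raise ValueError("kernel_speedup must be greater than 1")
    return (1.0 - overall_time_ratio) / (1.0 - 1.0 / kernel_speedup)


if __name__ == "__main__":
    speedup = 1.78 / 1.19  # the DGEMM(T,N,1980,56,255) timings, AVX2 / AVX-512
    fraction = accelerated_fraction(0.90, speedup)
    print("speedup of the large DGEMM: %.2fx" % speedup)
    print("implied accelerated fraction of runtime: %.0f%%" % (100 * fraction))
```

Under those assumptions only about 30% of the AVX2 runtime would be benefiting at the rate of the large DGEMM, which is consistent with most of the time going to calls that gain little from AVX512 but does not say which of the two causes above is responsible.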

TimP · ‎01-23-2019

Are your matrices small enough to benefit from MKL_DIRECT_CALL? If this feature applies to AVX512, you might well expect the threshold where it is needed to increase with the wider data format.

AVX512 for small data groups also might be expected to be more sensitive to alignment and performance peaking with special sizes such as some unspecified multiple of AVX512 width.

These considerations would have stronger effect if the GEMM usage is one which benefits from internal dot product organization. This would include but not be limited to those where the netlib reference uses a dot product (although not written in f90). There could be additional cases where MKL needs to perform partial transpose to reach full performance for large matrices, where a quite large matrix would be needed to see the benefit. These questions get into the trade secret aspects of MKL, so the most you could do to understand them is to profile with a tool such as Advisor or VTune in an attempt to analyze execution paths and where time is "wasted" on small matrices.

Gennady_F_Intel · ‎01-23-2019

Tim -- direct call will not help for these problem sizes. As you can see from the verbose output, the problems are too big:

MKL_VERBOSE DGEMM(T,N,1980,56,255,0x7ffed89cf480,0x9f61e00,256,0x9c1b3c0,256,0x7ffed89cf488,0x9c91280,1980) 1.19ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1

Gennady_F_Intel · ‎01-24-2019

Erling,

>>MKL_VERBOSE DGEMM(T,N,1980,56,255,0x7fff40530680,0xaf7ae00,256,0xac343c0,256,0x7fff40530688,0xacaa280,1980) 1.78ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1

I see AVX512 really speeds up this particular matrix multiplication just like dense POTRF. I assume the ms numbers are time to do the operation. Correct?

<< Yes, that's correct. Here is the link to the section of the MKL user's guide that describes the verbose format: https://software.intel.com/en-us/mkl-windows-developer-guide-call-description-line

>> However, my application does tons of other matrix multiplications, daxpy calls, etc., although GEMM dominates. Most of the GEMMs are much smaller than the one above. This leads to an improvement in total runtime of only about 10% for the whole application. Disappointing.

<< You need to check the whole pipeline of your Cholesky implementation and see where the AVX-512 speedup is not sufficient. You can set MKL_VERBOSE=1 and compare the AVX2 and AVX-512 log files for all the MKL routines you use.

--Gennady
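
The log comparison suggested above can be automated along these lines. This is a sketch under assumptions, not a finished tool: it assumes each run's MKL_VERBOSE output was captured to its own file (the two paths are passed on the command line), it sums the reported time per routine name, and it treats the ns and s unit suffixes as guesses since only ms and us appear in this thread. The function names are made up for illustration.

```python
import re
import sys
from collections import defaultdict

CALL_RE = re.compile(r"MKL_VERBOSE\s+(\w+)\(([^)]*)\)\s+([0-9.]+)(ns|us|ms|s)\b")
UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}


def totals_by_routine(lines):
    """Map routine name -> (call count, total seconds) for MKL_VERBOSE call lines."""
    totals = defaultdict(lambda: [0, 0.0])
    for line in lines:
        match = CALL_RE.match(line.strip())
        if match:  # version lines and application output are skipped
            entry = totals[match.group(1)]
            entry[0] += 1
            entry[1] += float(match.group(3)) * UNIT_SECONDS[match.group(4)]
    return {name: (count, secs) for name, (count, secs) in totals.items()}


def time_ratios(avx2_lines, avx512_lines):
    """Return AVX-512 time divided by AVX2 time per routine (below 1 is faster)."""
    base = totals_by_routine(avx2_lines)
    new = totals_by_routine(avx512_lines)
    return {name: new[name][1] / base[name][1]
            for name in base if name in new and base[name][1] > 0}


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f_avx2, open(sys.argv[2]) as f_avx512:
        ratios = time_ratios(f_avx2, f_avx512)
    for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
        print("%-8s AVX-512/AVX2 time ratio: %.2f" % (name, ratio))
```

Routines with a ratio near or above 1 are the ones where AVX-512 is not paying off. Grouping by routine name alone hides size effects, so a natural next step would be to key the totals on the dimensions as well.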

erling_andersen · ‎01-25-2019

Direct call did not make a significant change.

A lot of hard work is needed. Got it.

Is there any information about the optimal alignment, leading dimension, and stride when using MKL with AVX-512? Or any other advice?

erling_andersen · ‎01-28-2019

If we look at the results below, we see that AVX-512 works fantastically for a large DGEMM, just as expected. However, DGEMV is much slower with it. How do I get the best of both worlds?

AVX-512:

MKL_VERBOSE DGEMM(T,N,256,1536,208,0x7ffc4289e370,0x3b99ef0,208,0x3c01ef0,208,0x7ffc4289e378,0x3ef3900,256) 1.68ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGEMV(T,1,1791,0x7ffc4289e670,0x3e74100,256,0x3e74100,1,0x7ffc4289e678,0x3e74108,256) 12.84us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGEMV(T,2,1790,0x7ffc4289e670,0x3e74900,256,0x3e74900,1,0x7ffc4289e678,0x3e74910,256) 12.59us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGEMV(T,3,1789,0x7ffc4289e670,0x3e75100,256,0x3e75100,1,0x7ffc4289e678,0x3e75118,256) 12.57us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGEMV(T,4,1788,0x7ffc4289e670,0x3e75900,256,0x3e75900,1,0x7ffc4289e678,0x3e75920,256) 12.49us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1

AVX2:

MKL_VERBOSE DGEMM(T,N,256,1536,208,0x7ffd0f77b3f0,0x3df0ef0,208,0x3e58ef0,208,0x7ffd0f77b3f8,0x414a900,256) 3.12ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGEMV(T,1,1791,0x7ffd0f77b6f0,0x40cb100,256,0x40cb100,1,0x7ffd0f77b6f8,0x40cb108,256) 7.34us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGEMV(T,2,1790,0x7ffd0f77b6f0,0x40cb900,256,0x40cb900,1,0x7ffd0f77b6f8,0x40cb910,256) 6.63us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGEMV(T,3,1789,0x7ffd0f77b6f0,0x40cc100,256,0x40cc100,1,0x7ffd0f77b6f8,0x40cc118,256) 6.71us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGEMV(T,4,1788,0x7ffd0f77b6f0,0x40cc900,256,0x40cc900,1,0x7ffd0f77b6f8,0x40cc920,256) 6.32us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1

Gennady_F_Intel · ‎01-29-2019

Regarding gemv: this is a different case. Tasks of this kind are memory-bound rather than compute-bound, so the performance results depend on hardware features.

erling_andersen · ‎01-29-2019

Agreed. But why does it get slower when AVX-512 is enabled? I would guess the same amount of memory is read for AVX2 and AVX-512.

It seems the DGEMVs take roughly 75% to 100% more time when AVX-512 is enabled.

The runs are on the same machine.