Intel MKL (CBLAS) doesn't use more than 8 processors. Is that true?

yuryserdyuk · ‎03-28-2010

Hi!

I have a machine with 2 Intel Xeon X5570 CPUs, so the number of logical processors is 16.
Now I am trying to run

[cpp]mkl_set_num_threads(P);  // P = requested number of threads

cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, N, N, N, 1.0f, A, N, B, N, 0.0f, C, N);[/cpp]

For odd P with 1 < P <= 8, the program runs on P - 1 processors.
For P > 8, the program always runs on 8 processors.

How can I force the program to use more than 8 processors?

MKL version used: 10.2.4.032.

Thanks.

TimP · ‎03-28-2010

Have you seen the previous discussions about how MKL uses 1 thread per core unless you override the default, in order to avoid an accidental performance reduction?

Gennady_F_Intel · ‎03-29-2010

Yury,

Please try changing the MKL_DYNAMIC setting: mkl_set_dynamic( FALSE ). See the User's Guide for more details. Note that in this case you may see performance degradation.

--Gennady

yuryserdyuk · ‎03-29-2010

Yes, you are right: mkl_set_dynamic helps, but the results degrade considerably:

N        cblas_sgemm (8 proc)    cblas_sgemm (16 proc)    cuBLAS (Tesla 1060 GPU)
8192     6.06                    7.26                     2.71
10240    11.72                   13.90                    5.26
12288    20.23                   24.32                    9.07
14336    32.16                   38.06                    14.37
16384    48.46                   58.80                    21.42
18432    68.59                   82.60                    30.46

N is the matrix size, and times are given in seconds.
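To put these timings in perspective, they can be converted into effective throughput. The following is only a rough sketch: it assumes the conventional count of 2*N^3 floating-point operations for an N x N by N x N multiply, and the helper names gemm_gflops and slowdown are made up for this illustration (they are not part of MKL). The timings are the ones posted above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// Effective throughput in GFLOP/s for an N x N by N x N matrix multiply.
// Assumption: the usual 2*N^3 operation count (N^3 multiplies + N^3 adds).
double gemm_gflops(double n, double seconds) {
    assert(n > 0.0 && seconds > 0.0);
    return 2.0 * n * n * n / seconds / 1e9;
}

// Ratio of two timings for the same N; a value above 1 means the
// first run took longer than the second.
double slowdown(double slower_seconds, double faster_seconds) {
    assert(slower_seconds > 0.0 && faster_seconds > 0.0);
    return slower_seconds / faster_seconds;
}

// One row of the timing table posted in this thread (times in seconds).
struct Row {
    int n;
    double t8;    // cblas_sgemm, 8 threads
    double t16;   // cblas_sgemm, 16 threads
    double tgpu;  // cuBLAS on the Tesla GPU
};

const Row kRows[] = {
    {8192, 6.06, 7.26, 2.71},     {10240, 11.72, 13.90, 5.26},
    {12288, 20.23, 24.32, 9.07},  {14336, 32.16, 38.06, 14.37},
    {16384, 48.46, 58.80, 21.42}, {18432, 68.59, 82.60, 30.46},
};

// Print throughput per configuration and the 16-vs-8-thread slowdown.
void print_throughput_table() {
    std::printf("%6s %10s %10s %10s %10s\n",
                "N", "8thr GF/s", "16thr GF/s", "GPU GF/s", "16/8 time");
    for (const Row& r : kRows) {
        std::printf("%6d %10.1f %10.1f %10.1f %10.2f\n", r.n,
                    gemm_gflops(r.n, r.t8), gemm_gflops(r.n, r.t16),
                    gemm_gflops(r.n, r.tgpu), slowdown(r.t16, r.t8));
    }
}
```

Calling print_throughput_table() from main prints one line per matrix size. For N = 8192 this works out to roughly 181 GFLOP/s with 8 threads, about 151 GFLOP/s with 16 threads, and about 406 GFLOP/s on the GPU, so the 16-thread run takes about 1.2 times as long as the 8-thread run.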

So, evidently, Intel MKL doesn't scale beyond 8 threads on processors with Hyper-Threading ...

The same picture is observed for the cblas_dgemm function ...

Gennady_F_Intel · ‎03-29-2010

This is the expected behavior of Intel MKL. We don't recommend enabling HT in this case.

Please read more in the User's Guide section "The use of Hyper-Threading Technology".

--Gennady

TimP · ‎03-29-2010

That section is in the user guide, found in the Documentation/en_us/mkl/ directory of the compiler installation, page 6-16. It can't be found by the search function in Adobe.
In short, since MKL already keeps the floating-point adder and multiplier fully busy when running 1 thread per core, and the hyper-threads share the paths to higher-level cache and memory, the interference from additional threads should not be a surprise.