Intel® oneAPI Math Kernel Library

BLAS Level 2 uses more than one core.

yuriisig
Beginner
I have noticed that on my Core i7 860 processor, BLAS Level 2 routines use more than one core. What is the point? It would be better to implement a good algorithm on one core than to reduce the efficiency of the processor.
TimP
Honored Contributor III
MKL didn't have Level 2 threading available until recently, but it was requested frequently. It takes a large vector size to make threading pay off. If your case is using more threads than is optimal, you have several options: linking with mkl_sequential, setting the number of threads by environment variable or OpenMP call, or compiling from source.
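
As a rough illustration of the thread-count options mentioned above, here is a minimal C sketch that caps MKL at one thread. The helper name limit_mkl_to_one_thread and the USE_MKL build macro are hypothetical, made up for this example; mkl_set_num_threads and the MKL_NUM_THREADS environment variable are the documented MKL controls. The fallback branch assumes a POSIX system (for setenv) and that it runs before the first MKL call in the process.

```c
#include <stdlib.h>

#ifdef USE_MKL          /* hypothetical macro: define it when building against MKL */
#include <mkl.h>
#endif

/* Ask MKL to run on a single thread.
 * Returns 0 on success, non-zero if the environment could not be changed. */
int limit_mkl_to_one_thread(void)
{
#ifdef USE_MKL
    /* Runtime call from the MKL service API. */
    mkl_set_num_threads(1);
    return 0;
#else
    /* Without the MKL headers, fall back to the environment variable.
     * Assumption: this is done before MKL is first used. */
    return setenv("MKL_NUM_THREADS", "1", 1);
#endif
}
```

Linking against the sequential layer (mkl_sequential) instead of the threaded one avoids the runtime call altogether, and exporting MKL_NUM_THREADS=1 in the shell before launching the program has the same effect as the fallback branch.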
yuriisig
Beginner
Quoting - tim18
...but it was requested frequently...

Why? I think it is related to inefficiency in the Intel MKL code. In my tridiagonalization of packed matrices, multiple cores are not required for BLAS Level 2. My DSPTRD on one core takes 21.1 s for a 5000*5000 matrix, while Intel MKL DSPTRD takes 28.7 s and Intel MKL DSYTRD takes 26.4 s (i7 860).
TimP
Honored Contributor III
MKL has to include all the functionality of the standard BLAS versions of those functions. You should easily be able to improve on the performance of most Level 2 BLAS, particularly those like these which call Level 1 BLAS, by writing code for your own usage. I'm not so familiar with these particular functions; assuming that dspr2 or dspmv or the like may be important, they would require OpenMP schedule(guided) if threading were applied to the public source. So one would expect some gain from threading on Core i7 for problems in a certain size range, though not as large as for functions that suit the default schedule, and only while the problem is not so large that cache misses dominate over the influence of threading.
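
To make the schedule(guided) remark concrete, here is a minimal sketch of a dspr2-style rank-2 update on an upper-packed matrix, threaded over columns. It is written from the mathematical definition of the operation, not taken from MKL or from the reference BLAS source, and the names packed_upper_index and packed_rank2_update are hypothetical. Column j touches j + 1 elements, so the work per iteration is triangular, which is why a guided schedule is used here instead of the default one.

```c
#include <stddef.h>

/* Offset of element (i, j), i <= j, in upper-packed column-major storage. */
static size_t packed_upper_index(size_t i, size_t j)
{
    return i + j * (j + 1) / 2;
}

/* Symmetric rank-2 update on the packed upper triangle:
 *   A := alpha*x*y^T + alpha*y*x^T + A
 * ap holds n*(n+1)/2 doubles. Each column is written by exactly one
 * iteration of the outer loop, so the columns can be updated in parallel. */
void packed_rank2_update(int n, double alpha,
                         const double *x, const double *y, double *ap)
{
#ifdef _OPENMP
    /* Column j costs j + 1 updates; guided scheduling evens out the load. */
    #pragma omp parallel for schedule(guided)
#endif
    for (int j = 0; j < n; ++j) {
        const double tx = alpha * x[j];
        const double ty = alpha * y[j];
        double *col = ap + packed_upper_index(0, (size_t)j);
        for (int i = 0; i <= j; ++i) {
            col[i] += x[i] * ty + y[i] * tx;
        }
    }
}
```

Built without OpenMP the pragma is skipped and the loop runs on one core, which makes it easy to compare the two variants on the same machine and matrix size.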
yuriisig
Beginner
I used IDA Pro to examine the Intel MKL functions. I use different algorithms. You can look at my earlier work: http://www.thesa-store.com/products/
Ying_H_Intel
Employee
Hello,

Just to add some comments:
Some BLAS Level 1 and Level 2 functions have been threaded since MKL 10.2; please see
http://software.intel.com/en-us/articles/threaded-blas-level-1-and-2-on-nehalem/
or
http://software.intel.com/en-us/articles/intel-mkl-threaded-functions/

But performance depends mainly on where the data sits in cache, and on other factors. For example,
http://software.intel.com/en-us/articles/performance-slow-down-when-dynamically-linking-with-intel-mkl/
describes a slowdown that can show up when:
1) the data set in the application is small;
2) the second run may have better performance than the first run;
3) the application is dynamically linked with Intel MKL.

You may check these. If your case is not related to any of the above, could you provide a test case (including the input data)? It would be helpful.

Regards,
Ying