
Concurrency Problem with Intel MKL BLAS

nunoxic
Beginner
613 Views
I am implementing an iterative matrix solver in Fortran, which basically works like this:
while (error > tolerance AND iterations < max iterations)
{
A few DGEMV calls
A few DDOT calls
A few DNRM2 calls
}
The problem is as follows :
1. The time required for solving increases erratically with the increase in number of threads.
I ran it in a loop along these lines (not the actual code):
export MKL_DYNAMIC=FALSE
for i in 1 2 3 4 5 6 7 8; do
    export MKL_NUM_THREADS=$i
    time ./a.out
done
The CPU usage is in accordance with the number of threads at all times (For instance, 200% for 2 threads, 800% for 8 threads etc.) but the lowest time (and hence best performance) is observed at 2 Threads which is bothering me.
2. I am linking my program as :
ifort -fp-model source ParaQMR.f90 -xSSE3 -lmkl_blas95 -lmkl_intel -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -m32 && time ./a.out
(I can attach the source code but it will be of no real use since the code reads data from a 2 GB text file which obviously I can't upload)
My hardware is an Intel Xeon with 8 cores and 8 threads, running 32-bit Ubuntu 10.04 (I know 32-bit is a mistake, but I doubt it makes a difference).
3. I tried VTune and, although I am not a computer science major, I did my best to interpret the results. What I gather from the graph below is that most of the time is spent creating and destroying threads each time a new iteration starts, but I am not sure. Is there any way to circumvent this problem?
I tried everything but nothing seems to work
I have tried :
1. Using my own Matrix-Vector Multiplication code. It suffers immensely when compared to MKL BLAS.
2. Intel Inspector. I couldn't make head or tail of what was happening.
3. Experimenting with different flags, a different compiler (gfortran), gfortran with the -fopenmp flag instead, etc. Didn't work.
(By didn't work I mean, didn't produce the decrease in time as I expected)
Please help me out and let me know if I should upload more VTune Graphs.
8 Replies
TimP
Honored Contributor III
The following comments might be more topical, and might get more expert suggestions, on the MKL forum:
By resetting MKL_DYNAMIC, I believe you are disabling MKL's own effort to place threads efficiently, so you should be setting KMP_AFFINITY (or the MKL equivalent) directly for each number of threads, according to your platform. For 2 threads that probably means 1 thread per CPU on a dual-CPU system, or 1 thread per cache on a split-cache CPU; never 2 threads per core when other cores are idle.
As you are using dgemv, you might expect performance to drop as soon as you run 2 threads each on 1 or more cores; are you trying to quantify that?
If your cache footprint is very large, it is possible to see your performance peak as soon as you have enough threads to use all of the cache. In such a case, VTune cache events could clarify it.
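As a rough illustration of the placement advice above, here is a minimal sketch of setting the affinity explicitly for a 2-thread run. It assumes the Intel OpenMP runtime (libiomp5), which reads KMP_AFFINITY; which policy is best depends on the machine's actual topology, so the choice of "scatter" here is only an assumption to be checked against the verbose output.

```shell
# Sketch only: assumes the Intel OpenMP runtime (libiomp5), which reads KMP_AFFINITY.
export MKL_DYNAMIC=FALSE
export MKL_NUM_THREADS=2

# "scatter" spreads threads as evenly as possible across the machine,
# e.g. one thread per socket on a dual-CPU system.
# "verbose" makes the runtime print the resulting thread-to-CPU bindings.
export KMP_AFFINITY=verbose,granularity=fine,scatter

# Alternative: "compact" places each thread as close as possible to the
# previous one, which can put two threads on one core when SMT is on.
# export KMP_AFFINITY=verbose,granularity=fine,compact
```

Running the program after these exports and reading the verbose output shows whether the two threads really ended up where intended.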
nunoxic
Beginner
I used KMP_AFFINITY and set it as :
export MKL_DYNAMIC=FALSE
export MKL_NUM_THREADS=8
export KMP_AFFINITY="verbose,granularity=fine,proclist=[0,1,2,3,4,5,6,7],explicit"
ifort blah blah
It didn't help. As far as I know, cache was being managed optimally and a second thread was not started on any core if another core was sitting idle.
I still believe that what is happening is:
it = 0
create threads > carry out dgemv > destroy threads
create threads > carry out ddot > destroy threads
create threads > carry out dnrm > destroy threads
it = 1
create threads > carry out dgemv > destroy threads
.
.
.
Can this be the problem? If yes, can it be avoided?
Chao_Y_Intel
Moderator

Hi,

Have you tried export MKL_DYNAMIC=TRUE?

It lets MKL choose a suitable number of threads for the problem. As Tim noted, for the DGEMV and DDOT functions, increasing the number of threads may not improve performance. If MKL_DYNAMIC is FALSE, MKL is forced to use the number of threads you set.
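To compare the two MKL_DYNAMIC settings side by side, something like the following sketch could be used. The helper name run_both_dynamic is made up for illustration, and ./a.out in the usage comment is assumed to be the binary built in the original post.

```shell
# run_both_dynamic N CMD...: run CMD once with MKL_DYNAMIC=TRUE and once with
# MKL_DYNAMIC=FALSE, requesting N threads via MKL_NUM_THREADS in both cases.
# (Hypothetical helper, not part of MKL.)
run_both_dynamic() {
  threads=$1
  shift
  for dyn in TRUE FALSE; do
    # Label goes to stderr so it does not mix with the program's own output.
    echo "MKL_DYNAMIC=$dyn MKL_NUM_THREADS=$threads" >&2
    MKL_DYNAMIC=$dyn MKL_NUM_THREADS=$threads "$@"
  done
}

# Usage (assumes ./a.out exists and reports its own timing):
# run_both_dynamic 8 ./a.out
```

With TRUE, MKL treats the requested thread count as an upper limit; with FALSE it uses the count as given, so any difference between the two runs shows how much MKL's own choice matters for this problem.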

Thanks,
Chao

barragan_villanueva_
Valued Contributor I
Hi,

The overhead of joining OMP threads can be significant if the data volume in your MKL functions is not big enough.
Is SMT (hyper-threading) on?
If so, 8 CPUs would mean 4 cores with hyper-threading. Please clarify.
nunoxic
Beginner
I didn't get your point about joining OMP threads. I have a 10000x10000 matrix being multiplied by a 10000x1 vector.
I don't think SMT is on (I googled it and got results related to BIOS settings, which I swore never to fiddle with). The only settings I have tweaked are the ones mentioned (MKL_NUM_THREADS, MKL_DYNAMIC and, more recently, KMP_AFFINITY).
Also,
MKL_DYNAMIC=TRUE uses 2 threads most of the time, which is irritating since my goal is to write code that uses the maximum system resources.
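For a sense of scale, the 10000x10000 DGEMV above can be worked through with a little arithmetic. This is only a back-of-the-envelope sketch assuming 8-byte double-precision elements and counting the matrix data alone; it does not measure anything on the actual machine.

```shell
n=10000               # matrix order, from the post
bytes_per_double=8    # assumption: double precision (DGEMV)

matrix_bytes=$(( n * n * bytes_per_double ))   # storage for the matrix
flops=$(( 2 * n * n ))                         # one multiply and one add per element
bytes_per_flop=$(( matrix_bytes / flops ))

echo "matrix size   : $(( matrix_bytes / 1000000 )) MB"   # 800 MB
echo "flops / DGEMV : $flops"                              # 200000000
echo "bytes / flop  : $bytes_per_flop"                     # 4
```

Each matrix element is read once and used for only two floating-point operations, and 800 MB is far larger than any cache, so every DGEMV call streams the whole matrix from main memory. If memory bandwidth is the limit here, that would be one possible reason why extra threads stop helping early, though nothing in this thread confirms it.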
barragan_villanueva_
Valued Contributor I
To see the SMT setting from libiomp5, please add 'verbose' to your KMP_AFFINITY setting:
KMP_AFFINITY=verbose,$KMP_AFFINITY
If a logical thread is bound to more than one CPU (likely two), then SMT is ON.

Also please look at related MKL articles/discussions:
http://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-intel-mkl-100-threading/
http://software.intel.com/en-us/articles/setting-thread-affinity-on-smt-or-ht-enabled-systems/
http://software.intel.com/en-us/forums/showthread.php?t=68141

BTW, if Hyper-Threading Technology is enabled on the system, it is recommended that the number of threads be set equal to the number of real processors or cores, that is, only half the number of logical processors.
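A minimal sketch of applying that recommendation on Linux follows. It assumes /proc/cpuinfo lists "physical id" and "core id" fields, which is not the case on every system or virtual machine; count_physical_cores is a made-up helper name, not an MKL or OpenMP tool.

```shell
# Count distinct (physical id, core id) pairs in /proc/cpuinfo-style input
# read from stdin. Logical processors that share a core are counted once.
count_physical_cores() {
  awk -F: '
    /^physical id/ { pkg = $2 }
    /^core id/     { seen[pkg "," $2] = 1 }
    END { n = 0; for (k in seen) n++; print n }
  '
}

# Set MKL_NUM_THREADS to the physical core count when it can be determined;
# otherwise leave the variable untouched.
if [ -r /proc/cpuinfo ]; then
  cores=$(count_physical_cores < /proc/cpuinfo)
  if [ "${cores:-0}" -gt 0 ]; then
    export MKL_NUM_THREADS=$cores
  fi
fi
```

If the count equals the number of logical processors the system reports, SMT is most likely off and no halving is needed.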



nunoxic
Beginner
Just one more question:
Will changing my code to C benefit the speed in any way?
TimP
Honored Contributor III
Quoting nunoxic

Will changing my code to C benefit the speed in any way ?

Unlikely from what you have said so far.
