I have a few newbie questions about the best practices for using MKL in a managed environment.
- It's said that by default, MKL will try to pick the best number of threads. Does this mean that it is done every time a BLAS function is invoked, because the functions are stateless?
- If not, could you please explain what actually happens?
- If yes, can I request MKL to estimate the best number of threads once, and use that number for all subsequent function calls?
- Are there any tricks for minimizing the overhead of calling MKL from a managed environment?
To the best of my knowledge, you could limit the number of threads by setting OMP_NUM_THREADS or MKL_NUM_THREADS. If you always want 1 thread per MKL call, linking mkl_sequential could be more efficient. MKL will not choose more threads than physical cores when HyperThreading is recognized, unless you override MKL_DYNAMIC. MKL can choose a smaller number of threads on each call according to the size of the problem. It will not remember a choice from a previous call, and I doubt you could capture its choice of number of threads so as to reduce the limit you have set.
I'm not sure what you mean about reducing the overhead from a managed environment. As always, you could tinker with KMP_BLOCKTIME in case you want to keep the thread pool active for more or less time than the default 200 milliseconds, which may help you take advantage of your KMP affinity settings.
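For reference, the variables mentioned above could be set along these lines before launching the process that loads MKL. This is only a sketch assuming a POSIX shell (in a Windows cmd prompt the equivalent would be `set NAME=value`); the values 4 and 0 are arbitrary illustrations, not recommendations.

```shell
# Upper limit on the number of threads MKL may use
# (4 is an arbitrary example value).
export MKL_NUM_THREADS=4

# Time in milliseconds that worker threads stay active after finishing
# parallel work (0 is an example value; the default mentioned above is 200).
export KMP_BLOCKTIME=0

# Show what the child process will inherit.
echo "MKL_NUM_THREADS=$MKL_NUM_THREADS KMP_BLOCKTIME=$KMP_BLOCKTIME"
```

Whether a shorter or longer block time helps is workload-dependent, so it is worth timing both.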
>>>Are there any tricks for minimizing the overhead of calling MKL from a managed environment?>>>
I suppose that the .NET Framework JIT compiler will be able to optimize managed code hot spots. For example, where a C# wrapper function calls into an MKL library function, the MSIL call instruction will be translated into a native processor call instruction and stored in some kind of cache, so the next time the JIT compiler will not need to perform the translation again.
Regarding the number of threads:
I do want to let MKL decide the best number of threads. However, I don't want MKL to do it in every function call. So the question is: if the cost of determining the best number of threads is not negligible, can I set things up so that MKL determines that number only once?
MKL_DYNAMIC being TRUE means that Intel MKL will always try to pick what it considers the best number of threads, up to the maximum specified by the user. MKL_DYNAMIC being FALSE means that Intel MKL will not deviate from the number of threads the user requested, unless there are reasons why it has no choice. The value of MKL_DYNAMIC is by default set to TRUE, regardless of OMP_DYNAMIC, whose default value may be FALSE.
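As a sketch of the FALSE case described above, assuming a POSIX shell and an arbitrary thread count of 4: with these two settings MKL is asked to stay with the requested number of threads rather than adjusting it, unless it has no choice.

```shell
# Ask MKL not to adjust the thread count on its own.
export MKL_DYNAMIC=FALSE

# The thread count MKL is then asked to use
# (4 is an arbitrary example value).
export MKL_NUM_THREADS=4

# Show what the child process will inherit.
echo "MKL_DYNAMIC=$MKL_DYNAMIC MKL_NUM_THREADS=$MKL_NUM_THREADS"
```

The variables need to be in the environment before the process that loads MKL starts.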
I didn't get your message. Or maybe I didn't make myself clear enough. So let me explain my question again.
In my C# code, I have
public static extern void cblas_dgemv(params);
The main program calls cblas_dgemv about 1 million times. I don't know whether MKL tries to pick the best number of threads 1 million times or not. It seems that it does, because the reported runtime of each iteration varies quite a bit.
In that case, I just want MKL to pick the optimal number of threads only once and use that number of threads in every dgemv call.
Sergey Kostrov wrote:
>>...I don't want MKL to do it in every function call...
Take into account that MKL could be used in 3 ways:
- sequential, or
- parallel, or
So, it is Not possible to mix, for example, sequential with parallel, and so on. Another thing is that Not all MKL functions are threaded.
There is also a more advanced method to verify whether OpenMP threads are mapped directly to raw threads (Windows threads). This involves API monitor(s) and/or the logexts.dll tracking extension for WinDbg. You simply observe the number of calls to the CreateThread function issued from the MKL library. But for many users that is overkill, so, as Sergey suggested, you can use Task Manager to verify the number of threads created.