I'm using both TBB (parallel_for and parallel_reduce) and MKL (BLAS and VML) in various stages of computation. In the stages where I parallelize the program manually with TBB, I still need to use MKL (say, sdot), but I don't want MKL threads to compete for CPU cores with TBB threads and with other MKL threads started from other TBB threads, so I'm fine with sequential MKL mode there. During the other, single-threaded stages, I can give all CPU cores to MKL while executing the same operation (sdot).
Is it possible to switch at run time, or is the sequential/parallel choice made solely at link time? Also, all computation stages are very short (less than 1 ms), so I'd like to avoid any slow method of reconfiguring the library. What is the best way to address this?
I'm using MKL 10.2, if it matters; please let me know if MKL 10.3 has anything to help with this particular problem. I can also consider switching from TBB to OpenMP (at least then I'd have the same OMP thread pool for both MKL and manual parallelization, unlike the MKL+TBB solution, where the thread pools are separate).
At run time, you can call mkl_set_num_threads(nthr) during the single-threaded stages of your application.
In that case you have to link with the threaded libraries.
But as you said, all your computation stages are very short, so it might not help: the task size may be too small to take advantage of the threaded version. You can check this yourself.
If I understood correctly, your suggestion is to call mkl_set_num_threads(1) before diving into TBB-parallelized code, and then set it back with mkl_set_num_threads(N) when I'm back in single-threaded code and ready to crunch matrices using MKL parallelism.
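To make the pattern I have in mind concrete, here is a minimal sketch of a scope guard that sets a thread count on entry and restores it on exit. The class name ScopedThreadCount is my own invention, not part of MKL. The setter is passed in as a plain function pointer, so the same guard can wrap mkl_set_num_threads or a test stub. The MKL usage in the trailing comment assumes the application is linked against the threaded MKL libraries and that mkl_get_max_threads() reports the count to restore.

```cpp
// Hypothetical helper (not an MKL API): sets a thread count for the
// lifetime of the object and restores the previous value on destruction.
class ScopedThreadCount {
public:
    typedef void (*Setter)(int);

    // set:     function that applies a thread count, e.g. mkl_set_num_threads
    // scoped:  value to use while the guard is alive
    // restore: value to put back when the guard goes out of scope
    ScopedThreadCount(Setter set, int scoped, int restore)
        : set_(set), restore_(restore) {
        set_(scoped);
    }

    ~ScopedThreadCount() { set_(restore_); }

    // Non-copyable: the restore must happen exactly once.
    ScopedThreadCount(const ScopedThreadCount&) = delete;
    ScopedThreadCount& operator=(const ScopedThreadCount&) = delete;

private:
    Setter set_;
    int restore_;
};

// Intended use with MKL (assumption: threaded MKL is linked in):
//
//   #include <mkl.h>
//   const int n = mkl_get_max_threads();
//   {
//       ScopedThreadCount guard(mkl_set_num_threads, 1, n);
//       // ... tbb::parallel_for whose body calls sdot ...
//   }   // MKL is back to n threads here
```

The guard has to be created on the thread that drives the stages, before the TBB region starts, because the setting is applied globally rather than per TBB worker. It says nothing about how expensive the two calls are, which is the question that follows.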
May I ask for more information?
What is actually going on under the hood when I call mkl_set_num_threads to switch to 1 and back? Is it just a single integer manipulation (which would be perfect), or does it actually create and kill threads? When I set num_threads back, will it affect the latest CPU affinities of the MKL threads? Is the performance of the threaded MKL with num_threads==1 the same as that of the sequential MKL? When num_threads==1 with the threaded MKL, will everything be executed in the caller threads (I hope so), or will there be one separate thread from the MKL thread pool that executes all requests from all threads sequentially (unlikely, but worth checking)?
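Until someone can say what happens internally, the cost of the switch can at least be measured. Below is a rough sketch of a timing helper; the name average_nanoseconds is mine, and it only reports wall-clock time per call, so it cannot show whether threads are created or destroyed. The commented usage assumes the threaded MKL is linked in.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: runs fn() `iterations` times and returns the mean
// wall-clock time per call in nanoseconds (0.0 if iterations is zero).
template <typename F>
double average_nanoseconds(F&& fn, std::size_t iterations) {
    typedef std::chrono::steady_clock clock;
    const clock::time_point start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        fn();
    }
    const clock::time_point stop = clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return iterations ? elapsed.count() / static_cast<double>(iterations) : 0.0;
}

// Intended use with MKL (assumption: threaded MKL is linked in):
//
//   #include <mkl.h>
//   const int n = mkl_get_max_threads();
//   const double ns = average_nanoseconds([n] {
//       mkl_set_num_threads(1);
//       mkl_set_num_threads(n);
//   }, 100000);
```

If the per-toggle figure is tiny compared with the sub-millisecond stage length, the switch is probably cheap enough for my case. Timing an actual sdot call right after each toggle would be a fairer test, since any thread wake-up cost may only show up on the first parallel call.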