I have been using the KMP_AFFINITY envvar to display and set the affinity settings for the OMP threads in an MPI code whose tasks each use several OMP threads.
I have noticed that when Intel MPI is used (and the KMP_AFFINITY setting requires pinning of OMP threads), the OMP library "knows" to pin OMP threads belonging to different MPI tasks onto disjoint sets of cores. However, when I try to do this with the same code compiled against a non-Intel MPI stack, the OMP runtime pins the OMP threads to the same cores for all MPI tasks running on the same node.
Is there any way to instruct the OMP runtime to pin OMP threads in a more reasonable way for the non-Intel MPI case? How could I replicate, under a non-Intel MPI, the behavior the OMP runtime shows when it runs under Intel MPI?
For example, assume a 2-socket SMP node with 4 or 6 cores / socket: how would I ask the OMP runtime to bind the OMP threads used by task k only to the sockets (or cores) that task is supposed to run on?
Another Q: the KMP_AFFINITY also directly affects MKL's behavior, correct?
Satisfactory support for multiple THREAD_FUNNELED ranks per node is a value-added feature specific to each MPI. I think it is still under development for OpenMPI; with other open source MPIs you may have no way to do it except to write a script which gives each rank an appropriate sched_setaffinity() or taskset command (or use 1 rank per node). Your MPI is out of date; it may support affinity automatically only for Intel CPU models which were released prior to the MPI release date.

Many applications which use MKL link against mkl_sequential, which leaves the affinity entirely under the control of the application in the normal way. If you are using threaded MKL in THREAD_FUNNELED mode, MKL should see the KMP_AFFINITY passed to the rank where it starts. With the default setting of MKL_DYNAMIC, MKL will attempt to limit itself to 1 thread per core. MKL_NUM_THREADS and OMP_NESTED may also come into play. So I'm not certain you could say that KMP_AFFINITY has a "direct effect." You could ask questions about MKL under Intel MPI on the HPC forum.
I have seen several postings here from users with a multi-threaded application that also uses MKL. The general problem they run into is dual thread pools.
One option for the user is to configure MKL as serial; this works well when the user's multi-threaded code can make concurrent calls to MKL .AND. the work done/sec exceeds that of running a serial app with MKL in parallel.
A second option is to run both in parallel and set KMP_BLOCKTIME=0 (or some small number), so that idle OpenMP threads go to sleep quickly instead of spinning and stealing cycles from the other thread pool.
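The second option amounts to a handful of environment settings. A sketch, assuming an 8-core node split evenly between the two pools (the split is illustrative, not a recommendation):

```shell
# Dual thread pools coexisting on one 8-core node (assumed layout).
export OMP_NUM_THREADS=4     # the application's own OpenMP threads
export MKL_NUM_THREADS=4     # MKL's internal thread pool
export KMP_BLOCKTIME=0       # idle threads sleep immediately rather than
                             # busy-waiting on cores the other pool needs
```

Without KMP_BLOCKTIME=0, threads from one pool spin for the default block time (200 ms) after finishing a parallel region, competing with the other pool for cores.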
The performance of MKL in parallel on large matrix sizes is superb. So if you do a substantial amount of work with large matrices then you would rather not run MKL serially... but what about the user application that can make use of some level of parallelization?
Note, I am not an MKL user/guru so I do not know all the tuning options.
It would seem appropriate to me to provide for an MKL_AFFINITY as well as the KMP_AFFINITY, and perhaps an initialization tuning function where the user can use general terms as to how to partition the cores between the application OpenMP threads and the MKL (OpenMP) threads. These could be either init-once or dynamic (TBD).
On a 4P system, the user might elect to ask for 1P for his application's OpenMP threads and 3Ps for MKL, or other combinations such as:

- n threads for the app, remainder for MKL
- n reserved application threads out of m total application threads (MKL uses m - n)

Something like those options.
This functionality would provide for cooperative multi-threading.
When I use Intel MPI with KMP_AFFINITY set to "compact" or "scatter", MKL binds the threads to separate cores per MPI rank, as expected. When I run the same code compiled against another MPI stack (OpenMPI or MVAPICH2), the MKL threads are bound to the same cores for every rank on a node.
I guess that under Intel MPI the MKL library has global knowledge of the MKL threads on each MPI rank, so it can assign disjoint sets of cores to each rank's threads, something it does not do with other MPI stacks.
One way around this core overcommitment with non-Intel MPI is to bind each rank to a socket (or the entire node) and then use "respect" or "none" in KMP_AFFINITY.
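For the socket-binding workaround, a sketch using Open MPI's launcher on the 2-socket, 6-cores-per-socket node discussed above (the flags are Open MPI's; MVAPICH2 and others have their own binding options, and ./a.out is a placeholder):

```shell
# Bind each of the 2 ranks to its own socket; with "respect", the OpenMP/MKL
# runtime stays inside the inherited mask, so each rank's 6 threads remain
# on that rank's socket instead of piling onto the same cores.
export OMP_NUM_THREADS=6
export KMP_AFFINITY="respect,granularity=core,compact"
mpirun --map-by socket --bind-to socket -np 2 ./a.out
```

The key point is that the MPI launcher, not the OpenMP runtime, provides the per-rank separation; KMP_AFFINITY then only arranges threads within each rank's mask.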
Several users in our environment saw very negative speedups with non-Intel MPI + MKL exactly because several MKL threads were bound to the same core...
I am trying to upgrade our Intel MPI library to the latest version.