Hi, I have a question about an efficient way to extend well-implemented scientific code, written in C++ and OpenMP, with an MPI layer.
The code is an architecture-aware implementation (ccNUMA, affinity, caches, etc.) that can exploit different aspects of the hardware, and in particular it uses all available threads.
The main goal is to implement the MPI layer efficiently and without performance losses in the existing shared-memory code.
So I have to overlap MPI communication with OpenMP computation. My application lends itself to this because I already use a loop-blocking technique.
In short: while the results of one block are being sent to another MPI rank, the OpenMP threads can compute the next block. This scheme is repeated several times, after which a synchronization point is necessary. The whole structure is then run thousands of times.
The main requirement/limitation of the MPI communication is a large number of small portions of data to exchange (many data bars of 1.5 KB or 3 KB taken from 3D arrays).
This code will be run on fairly modern hardware and software:
The general question is how to do this efficiently: I am not asking about implementation details, but about which MPI scenarios can guarantee the best performance.
In fact, option 2.b is the most suitable for my application, but the programmer is then responsible for guaranteeing the right MPI communication paths between MPI ranks and OMP threads.
If anyone can help me or share their experience, I will be very happy.
Dedicating a thread to MPI communication is a good idea: it ensures that resources are available for communication. You say you aren't asking about implementation details, but this really does depend on the implementation. For example, should the MPI implementation spawn a separate thread for communication progress? This isn't specified in the standard, so it is up to the implementation developers. Some applications benefit from spawning that thread; others are better served by leaving those resources available for computation.
I'd recommend keeping all of your communication in one thread. The easiest way to do this is to put all MPI calls in the master thread, but depending on your application you might do better with an OMP SINGLE construct (if the master thread is delayed elsewhere, you could in theory start/complete the communication before the master thread reaches that point).