Efficient method for hybrid OpenMP and MPI

Hi, I have a question about an efficient method for extending well-implemented scientific code, written in C++ with OpenMP, to an MPI layer.

This code is an architecture-aware implementation (ccNUMA, affinity, caches, etc.) that can exploit different aspects of the architecture, and in particular keeps all threads busy.

The main goal is to implement the MPI layer efficiently, without performance losses in the existing shared-memory code.

So I have to overlap MPI communication with OpenMP computation. My application allows this because I use a loop-blocking technique.

In short: while the results from one block are being sent to another MPI rank, the OpenMP threads can continue computing. This schema is repeated several times, after which a synchronization point is necessary. Then the whole structure is run thousands of times.

The main requirement/limitation of the MPI communication is a large number of small portions of data to exchange (many data bars of size 1.5 KB or 3 KB taken from 3D arrays).
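To make the schema concrete, here is a minimal sketch of the blocked loop with nonblocking sends overlapped with computation. All names (`nblocks`, `block_len`, `neighbor`, the stencil body) are illustrative placeholders, not taken from the original code; 384 doubles corresponds to a 3 KB bar.

```cpp
// Sketch: overlap nonblocking MPI sends of finished blocks with OpenMP
// computation on the remaining blocks. Requires an MPI launcher to run.
#include <mpi.h>
#include <omp.h>
#include <vector>

void timestep(std::vector<double>& grid, int nblocks, int block_len,
              int neighbor, MPI_Comm comm) {
    std::vector<MPI_Request> reqs(nblocks);
    for (int b = 0; b < nblocks; ++b) {
        double* block = grid.data() + static_cast<long>(b) * block_len;

        // Compute one block with all OpenMP threads...
        #pragma omp parallel for
        for (int i = 0; i < block_len; ++i)
            block[i] += 1.0;  // stand-in for the real stencil update

        // ...then post a nonblocking send of its boundary bar (3 KB)
        // while the threads move on to the next block.
        MPI_Isend(block, 384, MPI_DOUBLE, neighbor, /*tag=*/b, comm, &reqs[b]);
    }
    // Synchronization point after all blocks, as described above.
    MPI_Waitall(nblocks, reqs.data(), MPI_STATUSES_IGNORE);
}
```

With many such small messages in flight, the nonblocking sends give the MPI library a chance to progress them while the threads compute.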

This code will run on rather novel hardware and software:

  1. Intel CPU cluster
  2. Intel MIC cluster: MPI communication between KNCs (KNL is similar to 1.)
  3. Hybrid: MPI communication between CPUs and MICs

The general question is how to do this efficiently: I am not asking about implementation details, but about which MPI scenarios can guarantee the best performance.

In detail:

  1. Does MPI communication cause any per-core overhead? I mean, when I run MPI communication and OMP computations at the same time, but on different memory regions.
  2. Should I dedicate a separate core to MPI communication while the other cores perform OMP computations? Which scenario would be more efficient:
    1. the OMP master or a single thread, pinned to one physical core, runs the communication only, while the other OMP threads use the remaining cores for computation
      - which kind of communication is better here, synchronous or asynchronous?
    2. a selected group of OMP threads handles both MPI communication and computation, while the other OMP threads compute only
    3. or some other solution?

In fact, 2.b is the most suitable for my application, but then the programmer is responsible for guaranteeing the right MPI communication paths between MPI ranks and OMP threads.
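One thing worth noting: the scenarios map directly onto the MPI thread-support level that must be requested at startup. Scenario 2.a needs only MPI_THREAD_FUNNELED (master thread communicates) or MPI_THREAD_SERIALIZED (one thread at a time), while 2.b, with several threads calling MPI concurrently, needs MPI_THREAD_MULTIPLE, which not every implementation provides at full performance. A sketch (error handling omitted):

```cpp
// Request the threading level matching the chosen scenario and check
// what the MPI library actually provides.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    int provided;
    // Scenario 2.a: MPI_THREAD_FUNNELED (master) or MPI_THREAD_SERIALIZED.
    // Scenario 2.b: MPI_THREAD_MULTIPLE (concurrent calls from threads).
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        std::fprintf(stderr, "warning: MPI_THREAD_MULTIPLE not provided\n");
    MPI_Finalize();
    return 0;
}
```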

If anyone can help me or share their experience, I will be very grateful.

1 Reply

Dedicating a thread for MPI communications is a good idea.  This ensures that resources can be available for communications.  You say you aren't asking about implementation details, but this is dependent on the implementation details.  For example, should a separate thread be spawned for communication by the MPI implementation?  This isn't specified in the standard, and is therefore up to the implementation developers.  Some applications will benefit from spawning that thread, others would be better served leaving those resources available for computation.

I'd recommend leaving all of your communications in one thread.  The easiest way to do this is to put all of your MPI communications in the master thread, but depending on your application, you might do better to have it as an OMP SINGLE call (if the master thread is delayed elsewhere, you could in theory start/complete the communication before the master thread reaches that point).
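The suggested structure might look like the sketch below: the master thread drives a nonblocking halo exchange while the remaining threads compute on an independent memory region, meeting at the implicit barrier at the end of the parallel region. This requires at least MPI_THREAD_FUNNELED; buffer names and the kernel body are illustrative, not from the poster's code.

```cpp
// Sketch: master thread communicates, other threads compute (FUNNELED).
#include <mpi.h>
#include <omp.h>

void exchange_and_compute(const double* send_halo, double* recv_halo,
                          int halo_len, double* interior, int n,
                          int neighbor, MPI_Comm comm) {
    #pragma omp parallel
    {
        #pragma omp master
        {
            // Communication thread: nonblocking exchange of the halo bars.
            MPI_Request reqs[2];
            MPI_Isend(send_halo, halo_len, MPI_DOUBLE, neighbor, 0, comm,
                      &reqs[0]);
            MPI_Irecv(recv_halo, halo_len, MPI_DOUBLE, neighbor, 0, comm,
                      &reqs[1]);
            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        }
        // The other threads start on the interior immediately; the master
        // joins this loop once the exchange completes.
        #pragma omp for schedule(dynamic) nowait
        for (int i = 0; i < n; ++i)
            interior[i] += 1.0;  // stand-in for the real kernel
    }  // implicit barrier: communication and computation both finished here
}
```

Dynamic scheduling is used here so the compute threads are not forced to wait for the master's chunk while it is busy in MPI; with static scheduling the master's iterations would sit idle until it arrives.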
