Intel® MPI Library

MPI_WIN_ALLOCATE_SHARED direct/RMA access

jimdempseyatthecove
Honored Contributor III

From the MPI 3.1 specification:

This is a collective call executed by all processes in the group of comm. On each
process, it allocates memory of at least size bytes that is shared among all processes in
comm, and returns a pointer to the locally allocated segment in baseptr that can be used
for load/store accesses on the calling process. The locally allocated memory can be the
target of load/store accesses by remote processes; the base pointers for other processes
can be queried using the function MPI_WIN_SHARED_QUERY. The call also returns a
window object that can be used by all processes in comm to perform RMA operations.
The size argument may be different at each process and size = 0 is valid. It is the user's
responsibility to ensure that the communicator comm represents a group of processes that
can create a shared memory segment that can be accessed by all processes in the group.

On a single SMP host with multiple ranks it is clear that you can use this to construct a window to a multi-process shared memory buffer that can be accessed (with care) either with direct load/store instructions or by way of RMA operations. Note, each rank/process may have a different virtual address base for the baseptr.

The MPI 3.1 specification states (or at least implies) that the processes in the group of comm must be able to access the same physical memory (which may be mapped at different virtual addresses in different processes).

Now as a simplification of my query, consider the situation of say 8 processes running on 2 hosts, 4 processes per host (and the hosts do not have sharable memory between them).

Can all 8 processes issue MPI_WIN_ALLOCATE_SHARED using MPI_COMM_WORLD, returning 8 win objects (4 per host), with:

4 processes on host 0 having shared memory (and direct access by those processes)
4 processes on host 1 having shared memory (different from host 0, and direct access by those processes)
All 8 processes having RMA access to every process's win window.

What I wish to do is improve the performance of intra-host access without excluding inter-host access (and without each process having to use 2 windows to do this).

Note, I am not currently set up to run this test.

Jim Dempsey

AThar2
Beginner

Hi Jim, 

I would not use MPI_COMM_WORLD to allocate the shared memory. I don't think you can do this since, as you pointed out, the ranks in MPI_COMM_WORLD are not necessarily on the same SMP node.

What you need to do is create a communicator for each SMP node, or in fact for any group of ranks that do share memory.

One way is to do the following call: 

 

MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, key, info, MPI_SHM_COMM,mpierr);

 

 

Now, say your MPI_COMM_WORLD has 8 processes, with ranks 0-3 on host 0 and ranks 4-7 on host 1. The call then produces two MPI_SHM_COMM communicators: one corresponds to ranks 0-3 and the other to ranks 4-7. So you can now use that communicator to allocate shared memory, since MPI_SHM_COMM only contains ranks which are on the same host/SMP node.

 

best

Ali

jimdempseyatthecove
Honored Contributor III

OK, but note that each rank now has two comms: one for all ranks (MPI_COMM_WORLD) and one for the ranks sharing its node, which gives two different shared collections, one per host.

However, this does not answer the nuances of the question I asked.

Using split, the ranks on each host can use the respective comm-shared to obtain a shareable direct/RMA window within their respective hosts...

... (nuance coming) if the MPI_WIN_ALLOCATE_SHARED occurs at the same point in the MPI_WIN... create sequence amongst all ranks, can an RMA occur to the same sequenced node on a different host?

MPI_Win_allocate(...,MPI_COMM_WORLD,..,win_world) // RMA all ranks to all ranks
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, key, info, comm_shared);
MPI_Win_allocate_shared(...,comm_shared,..,win_host) // direct/RMA within host
**** can inter-host rank access a different host's win_host via RMA (MPI_Get/MPI_Put) ****

If such access is not permitted, then each rank would have to maintain two buffers, which would complicate programming, or else use RMA only.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

How about flipping this around?

First have each rank perform

MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, key, info, comm_shared);
MPI_Win_allocate_shared(...,comm_shared,..,win_host) // direct/RMA within host

Then have each rank perform

MPI_Win_create(b_buff_from_Win_allocate_shared,..., win_world)

In other words, supply the address of the intra-node (shared) buffer to the inter-node window.

Jim Dempsey

AThar2
Beginner

Hi Jim, 

Okay, so you are trying to get both possibilities at the same time. You would like to create a shared buffer among inter node ranks while you want the intra-node ranks to be able to MPI_Get/pull to that shared buffer?

Actually, I don't know for sure whether that is possible, and I would like to know myself. For now, I am handling all intra-node communication by MPI_ISEND/IRECV.
Even if that is possible, you would still need a mechanism to check whether you're trying to access data not allocated on the same host. When you access data from a rank within the same host, you don't need MPI_Get/pull, as you would directly access the data by dereferencing the pointer.

 

 

 

jimdempseyatthecove
Honored Contributor III

>>So you are trying to get both possibilities at the same time.

Precisely

>>I am handling all intra-node communication by MPI_ISEND/IRECV.

These incur unnecessary memory-to-memory copies, at least 2 or 3 times. While Intel Hydra should be able to eliminate I/O via a switch, it will still incur the latencies and induce unnecessary memory channel overhead for the intra-node communication.

In my case, I am using one-sided communication by MPI_Get/MPI_Put for inter-node xfers and direct updates via CPU instructions for intra-node "xfers". Of course I must be careful to maintain cache coherency but this is doable.

I am hoping that what I posted in #4 works. With MPI_Win_create, the programmer supplies the base address of the buffer... which presumably can be the address of the buffer obtained from MPI_Win_allocate_shared (using a different subset comm of MPI_COMM_WORLD). This, though, requires 2 win handles. I would prefer to use 1, as this aids in eliminating a "who's on first" situation. I will experiment using 2 win handles and see how it goes.

Jim Dempsey

AThar2
Beginner

Hi Jim

No, I got my wording completely wrong.

I am using shared memory MPI for intra node communication using MPI_Win_allocate_shared, and handling inter node communication with MPI_Irecv and MPI_Isend.
I was myself considering replacing MPI_Irecv and MPI_Isend with MPI_Get/pull.
 

While I understand what you want to achieve, I just wonder: even if your example works, you will need a mechanism to check whether you are “accessing/retrieving” data from a rank within the same host or outside it.
When I want to get data from a rank on the same host, I would not use any MPI commands, since I just need to access the content of the pointer.
Whereas if you access data from another host, you would have to use MPI_Get/pull. So I am wondering, why not just create separate windows for inter node?
 

Also, how do you do CPU instructions for intra node communication? Sounds interesting; do you have a reference?

 

best

Ali

Kevin_O_Intel1
Employee

Hi,

From this thread, it looks like your questions were answered. If so, I will close the issue.

Regards

jimdempseyatthecove
Honored Contributor III

>>how do you do CPU instructions for intra node communication

sharedVariable = value

or their atomic alternatives if necessary

In the specific application that I am writing, the functionality is that of an aggregator. The intra-node ranks produce an aggregate, and the inter-node accumulator uses MPI_Get to perform the aggregation.

Jim Dempsey
