Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Can you help me figure out why this program cannot be executed with more than 2 processes?

ArthurRatz
Novice

Dear Colleagues,

Recently, I developed an MPI program that sorts an array of N = 10^6 data items of type __int64. The sorting work is shared among multiple (more than 10) processes. To share data between the processes, I create an MPI window using the MPI_Win_allocate_shared function. When the array of N = 10^6 items is sorted with 10 or more processes, the program hangs (i.e., the sorting never finishes). The program sorts correctly only when run with no more than 2 processes.

Can you help me figure out why this program cannot be executed with more than 2 processes (see attachment)?

I've compiled and run the program as follows:

mpiicpc -o sortmpi_shared.exe sortmpi_shared.cpp

mpiexec -np 10 sortmpi_shared.exe

Thanks a lot. Waiting for your reply.

Cheers, Arthur.

Mark_L_Intel
Moderator

Hello Arthur,

Can you provide the source? I found only the executable and cfg files in your zip file.

Thanks,

Mark
ArthurRatz
Novice

This problem is already solved. Thank you.

Actually, I have another question: for some unknown reason, my MPI program doesn't work (it hangs) when the processes are launched on different nodes (hosts). In my program I use the MPI_Win_allocate_shared function to allocate shared memory through an RMA window, and I'm wondering what the possible cause might be. Do I actually need to implement intercommunicators for that purpose?

I'm sorry, but I can't provide any sources yet.

Waiting for your reply.

Cheers, Arthur.

Mark_L_Intel
Moderator

You do not need to implement intercommunicators. This paper

http://goparallel.sourceforge.net/wp-content/uploads/2015/06/PUM21-2-An_Introduction_to_MPI-3.pdf

contains links to downloadable sources illustrating the MPI-3 shared memory programming model in a multi-node setting, e.g.:

http://tinyurl.com/MPI-SHM-example

Could you try running this first example from the paper on your cluster (and share the results)?

Here is another quote from the paper that might help: "The function MPI_Comm_split_type enables programmers to determine the maximum groups of MPI ranks that allow such memory sharing. This function has a powerful capability to create “islands” of processes on each node that belong to the output communicator shmcomm". Do you use this function?

You'd also need to distinguish between ranks on the same node and ranks belonging to different nodes. As you can see in the example, we used MPI_Group_translate_ranks for this purpose.
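For illustration, here is a rough sketch of that pattern: MPI_Comm_split_type builds the node-local "island" communicator, and MPI_Group_translate_ranks maps the node-local ranks back to MPI_COMM_WORLD ranks. This is not the paper's source, just a minimal example with illustrative variable names:

#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Build the node-local "island": all ranks that can share memory
    // with this rank end up together in shmcomm.
    MPI_Comm shmcomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &shmcomm);

    int shm_rank, shm_size;
    MPI_Comm_rank(shmcomm, &shm_rank);
    MPI_Comm_size(shmcomm, &shm_size);

    // Translate node-local ranks back to MPI_COMM_WORLD ranks, so that
    // intranode partners can be told apart from ranks on other nodes.
    MPI_Group world_group, shm_group;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Comm_group(shmcomm, &shm_group);

    std::vector<int> shm_ranks(shm_size), world_ranks(shm_size);
    for (int i = 0; i < shm_size; ++i)
        shm_ranks[i] = i;
    MPI_Group_translate_ranks(shm_group, shm_size, shm_ranks.data(),
                              world_group, world_ranks.data());

    std::printf("world rank %d is node-local rank %d of %d on this node\n",
                world_rank, shm_rank, shm_size);

    MPI_Group_free(&shm_group);
    MPI_Group_free(&world_group);
    MPI_Comm_free(&shmcomm);
    MPI_Finalize();
    return 0;
}

Any MPI_COMM_WORLD rank that does not appear in world_ranks lives on another node and can only be reached through ordinary MPI communication.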

Cheers,

Mark
ArthurRatz
Novice

Hello, Mark.

I've tested the example you provided on my cluster. Here are the results:

E:\>mpiexec -n 4 -ppn 2 -hosts 2 192.168.0.100 1 192.168.0.150 1 1.exe
Fatal error in MPI_Win_lock_all: Invalid MPI_Win, error stack:
MPI_Win_lock_all(158): MPI_Win_lock_all(MPI_MODE_NOCHECK, win=0x0) failed
MPI_Win_lock_all(103): Invalid MPI_Win
Fatal error in MPI_Win_lock_all: Invalid MPI_Win, error stack:
MPI_Win_lock_all(158): MPI_Win_lock_all(MPI_MODE_NOCHECK, win=0x0) failed
MPI_Win_lock_all(103): Invalid MPI_Win
Fatal error in MPI_Win_lock_all: Invalid MPI_Win, error stack:
MPI_Win_lock_all(158): MPI_Win_lock_all(MPI_MODE_NOCHECK, win=0x5f) failed
MPI_Win_lock_all(103): Invalid MPI_Win
Fatal error in MPI_Win_lock_all: Invalid MPI_Win, error stack:
MPI_Win_lock_all(158): MPI_Win_lock_all(MPI_MODE_NOCHECK, win=0x98) failed
MPI_Win_lock_all(103): Invalid MPI_Win

E:\>mpiexec -n 4 1.exe
i'm rank 2 with 2 intranode partners, 1 (1), 3 (3)
load MPI/SHM values from neighbour: rank 1, numtasks 4 on COMP-PC.MYHOME.NET
load MPI/SHM values from neighbour: rank 3, numtasks 4 on COMP-PC.MYHOME.NET
i'm rank 3 with 2 intranode partners, 2 (2), 0 (0)
load MPI/SHM values from neighbour: rank 2, numtasks 4 on COMP-PC.MYHOME.NET
load MPI/SHM values from neighbour: rank 0, numtasks 4 on COMP-PC.MYHOME.NET
i'm rank 1 with 2 intranode partners, 0 (0), 2 (2)
load MPI/SHM values from neighbour: rank 0, numtasks 4 on COMP-PC.MYHOME.NET
load MPI/SHM values from neighbour: rank 2, numtasks 4 on COMP-PC.MYHOME.NET
i'm rank 0 with 2 intranode partners, 3 (3), 1 (1)
load MPI/SHM values from neighbour: rank 3, numtasks 4 on COMP-PC.MYHOME.NET
load MPI/SHM values from neighbour: rank 1, numtasks 4 on COMP-PC.MYHOME.NET

*BUT* I actually can't figure out how this sample can be used to solve the problem I stated.

My goal is to avoid using MPI_Send/MPI_Recv between processes on different nodes.

Normally, I need to use the MPI_Comm_split_type, MPI_Win_allocate_shared, and MPI_Win_shared_query functions.

In your recent post, you told me that MPI_Comm_split_type has a powerful capability to create process islands on different nodes (hosts). Can you tell me, or provide a sample showing, how to do that?

Thanks in advance. Waiting for your reply.

Cheers, Arthur.

Mark_L_Intel
Moderator

I'd need to reproduce this error.

Some quick comments regarding your questions.

MPI-3 SHM should not be confused with PGAS (with its global address space) or with one-sided/RMA, even though it relies on the MPI-3 RMA framework. The MPI-3 SHM programming model enables MPI ranks within a shared memory domain (typically processes on the same node) to allocate shared memory for direct load/store access. In this sense, it is exactly like the hybrid MPI + OpenMP (or threads) model. So, when you say that you do not want to use MPI_Send/MPI_Recv between the nodes, what mechanism or functions do you want to use instead?

The sample and paper I referenced in my previous post already contain all the API functions you mentioned, including the recommended usage model for MPI_Comm_split_type. Figure 2 in the paper might be helpful too. That said, please do not hesitate to ask additional questions.
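In case it helps while you look at the paper's sources, here is a minimal sketch of how MPI_Comm_split_type, MPI_Win_allocate_shared, and MPI_Win_shared_query fit together; the slice size and variable names are just illustrative, and this is not the paper's code:

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    // Node-local communicator, as in the earlier sketch.
    MPI_Comm shmcomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &shmcomm);

    int shm_rank, shm_size;
    MPI_Comm_rank(shmcomm, &shm_rank);
    MPI_Comm_size(shmcomm, &shm_size);

    // Every rank on the node contributes one slice of the shared window.
    const MPI_Aint slice = 1024;          // illustrative element count
    long long* mybase = nullptr;          // __int64 on Windows builds
    MPI_Win win;
    MPI_Win_allocate_shared(slice * sizeof(long long), sizeof(long long),
                            MPI_INFO_NULL, shmcomm, &mybase, &win);

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

    // Direct store into my own slice -- no MPI calls involved.
    for (MPI_Aint i = 0; i < slice; ++i)
        mybase[i] = shm_rank;

    // Make the stores visible to the other ranks on the node.
    MPI_Win_sync(win);
    MPI_Barrier(shmcomm);
    MPI_Win_sync(win);

    // Look up a neighbour's slice and read it directly -- again no
    // MPI_Send/MPI_Recv, just an ordinary load from shared memory.
    int nbr = (shm_rank + 1) % shm_size;
    MPI_Aint nbr_bytes;
    int nbr_disp_unit;
    long long* nbr_base = nullptr;
    MPI_Win_shared_query(win, nbr, &nbr_bytes, &nbr_disp_unit, &nbr_base);
    std::printf("node-local rank %d reads %lld from neighbour %d\n",
                shm_rank, nbr_base[0], nbr);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Comm_free(&shmcomm);
    MPI_Finalize();
    return 0;
}

Note that shmcomm (and therefore the shared window) only ever contains ranks on the same node; data exchange with ranks on other nodes still has to go through regular MPI calls.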

Mark
ArthurRatz
Novice

Thanks a lot for your answer, Mark. I really appreciate it.

ArthurRatz
Novice

And one more question: is it possible to implement a global address space shared between multiple nodes (hosts) using MPI rather than PGAS? Can you point me to a particular framework, such as MPI-3 RMA, that can be used for that purpose? Thanks in advance.

ArthurRatz
Novice

And a last question: how can PGAS be used along with the MPI library? Can you post an example, if that's possible?

ArthurRatz
Novice

And one more thing: recently I tried to allocate memory on multiple nodes through an RMA window using the MPI_Win_create, MPI_Get, and MPI_Put functions, and it worked for me, much as if I had used the MPI_Send and MPI_Recv functions. Can you explain why it only fails when I use the MPI_Win_allocate_shared and MPI_Comm_split_type functions?

Mark_L_Intel
Moderator

Yes, PGAS can be implemented using MPI-3 RMA; for example, please see (and the references therein):

DART: http://arxiv.org/pdf/1507.01773.pdf

OpenSHMEM: http://www.csm.ornl.gov/workshops/openshmem2013/documents/ImplementingOpenSHMEM%20UsingMPI-3.pdf

http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/2014/hammond.pdf

These two preprints from ANL are also excellent:

http://www.mcs.anl.gov/uploads/cels/papers/P4014-0113.pdf

http://www.mcs.anl.gov/papers/P4062-0413_1.pdf

Yes, PGAS can be used along with MPI; e.g., the MVAPICH team at OSU supports such an MPI/PGAS hybrid model through its MVAPICH2-X offering:

http://mvapich.cse.ohio-state.edu/

This is a good presentation from this group on the subject:

http://mvapich.cse.ohio-state.edu/static/media/talks/slide/osc_theater-PGAS.pdf

On your last question:

"Recently I tried to allocate memory on multiple nodes through an RMA window using the MPI_Win_create, MPI_Get, and MPI_Put functions, and it worked for me, much as if I had used the MPI_Send and MPI_Recv functions. Can you explain why it only fails when I use the MPI_Win_allocate_shared and MPI_Comm_split_type functions?"

As I said above, the MPI-3 SHM model (using MPI_Win_allocate_shared, MPI_Comm_split_type, etc.) is closer to the hybrid MPI + OpenMP model than to RMA, even though it relies on RMA. If you look under the hood, MPI-3 SHM provides direct load/store memory access, exactly as in the case of threads (with all of its well-known pitfalls, such as data races). A shared-memory window can only span ranks that are actually able to share physical memory, i.e., ranks on the same node, which is why MPI_Win_allocate_shared does not give you a window across nodes.

Citing http://www.mcs.anl.gov/~thakur/papers/shmem-win.pdf,

while in the

"one-sided communication interface, the user allocates memory and then exposes it in a window. This model of window creation is not compatible with the inter-process shared-memory support provided by most operating systems",

in MPI-3 SHM, through the mechanism described in that paper, we end up with a truly shared memory environment, so that, for example,

"Load/store operations do not pass through the MPI library; and, as a result, MPI is unaware of which locations were accessed and whether data was updated".

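For contrast, here is a minimal sketch of that ordinary one-sided style: MPI_Win_create exposes a per-rank buffer over MPI_COMM_WORLD, and MPI_Get goes through the MPI library, so it also works when the target rank is on another node. The buffer size and names are illustrative, not taken from your program:

#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each rank exposes its own local buffer in a window spanning
    // MPI_COMM_WORLD; the ranks may live on different nodes.
    const int count = 1024;                        // illustrative size
    std::vector<long long> buf(count, rank);
    MPI_Win win;
    MPI_Win_create(buf.data(), count * sizeof(long long), sizeof(long long),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

    // One-sided access: the MPI library performs the transfer, moving the
    // data over the network when the target rank is on another node.
    int target = (rank + 1) % size;
    long long remote_value = -1;
    MPI_Get(&remote_value, 1, MPI_LONG_LONG, target,
            0 /* displacement in the target window */, 1, MPI_LONG_LONG, win);
    MPI_Win_flush(target, win);                    // complete the MPI_Get

    std::printf("rank %d got %lld from rank %d\n", rank, remote_value, target);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

This is why your MPI_Win_create/MPI_Get/MPI_Put version works across nodes, while the MPI_Win_allocate_shared version cannot.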
Best,

Mark
ArthurRatz
Novice

Thanks for the reference links, Mark. I'm going to read through this documentation.

ArthurRatz
Novice

Can you give me an example of using OpenSHMEM along with the MPI library?
