I have a master/slave MPI program, and I'd like the master to spawn the slave processes dynamically. I tried MPI_Comm_spawn, but it seems I can only start slave processes on nodes where mpd.py is already running, i.e., nodes listed in mpd.hosts. In my case, however, I don't know which nodes I will use when I start the program with mpirun; the nodes where the slave processes run are determined at run time.
I also tried the Hydra process manager in Intel MPI 4.1, and MPI_Comm_spawn failed. Does Hydra support spawning at all?
Could anyone give me some insight or advice on how to solve my problem? Thanks.
Do you have a reproducer? I have been able to use MPI_Comm_spawn with Hydra. Let me check on the exact method for specifying a host outside of the provided host list for launching.
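For reference, here is a minimal sketch of how a master can request a spawn host at run time through the "host" info key, which the MPI standard reserves for MPI_Comm_spawn. The helper names (workers_to_spawn, spawn_on_host, query_universe_size) are hypothetical and do not come from the attached test program. Whether a host outside the launch-time host list is accepted depends on the MPI implementation and process manager, so treat this as an illustration under those assumptions rather than a confirmed fix.

```cpp
// Sketch only: helper names are hypothetical, not taken from master.cpp.
#include <algorithm>
#include <string>

// Include MPI only when its headers are available, so the pure helper
// below can still be compiled and checked on a machine without MPI.
#if __has_include(<mpi.h>)
#include <mpi.h>
#define SKETCH_HAVE_MPI 1
#endif

// Number of workers the master may still start: the universe size minus
// the processes already running, never negative.
inline int workers_to_spawn(int universe_size, int world_size) {
    return std::max(0, universe_size - world_size);
}

#ifdef SKETCH_HAVE_MPI
// Read the MPI_UNIVERSE_SIZE attribute; returns -1 if the runtime does
// not provide it.
inline int query_universe_size() {
    int* value = nullptr;
    int found = 0;
    MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &value, &found);
    return found ? *value : -1;
}

// Ask the runtime to start `count` copies of `command` on `host`.
// Assumption: the implementation honors the reserved "host" info key for
// a host chosen at run time; this is implementation dependent.
inline MPI_Comm spawn_on_host(const std::string& command,
                              const std::string& host, int count) {
    MPI_Info info;
    MPI_Info_create(&info);
    // const_cast keeps this compatible with older MPI-2 headers, whose
    // prototypes take non-const char*.
    MPI_Info_set(info, const_cast<char*>("host"),
                 const_cast<char*>(host.c_str()));

    MPI_Comm workers = MPI_COMM_NULL;
    MPI_Comm_spawn(const_cast<char*>(command.c_str()), MPI_ARGV_NULL, count,
                   info, 0 /* root rank */, MPI_COMM_SELF, &workers,
                   MPI_ERRCODES_IGNORE);
    MPI_Info_free(&info);
    return workers;  // intercommunicator to the spawned workers
}
#endif
```

In a master, this would be called between MPI_Init and MPI_Finalize, for example spawn_on_host("./worker", chosen_host, n), where chosen_host is whatever node name the program settles on at run time and n comes from workers_to_spawn(query_universe_size(), world_size).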
Technical Consulting Engineer
Thanks for replying.
I am attaching the test programs I've been using. I compiled them using:
mpiicpc -o master master.cpp
mpiicpc -o worker worker.cpp
and ran it using:
mpirun -n 1 -env MPI_UNIVERSE_SIZE 3 ./master
The program completed successfully with impi 4.0.2.003, but when run with impi 4.1.1.036 on the same nodes, I got the following output:
universe_size = 8
node1:2d9d: dapl_cma_connect: rdma_connect ERR -1 Function not implemented
[0:node1] unexpected DAPL connection event 0x4006 from 7
Assertion failed in file ../../dapl_poll_rc.c at line 1679: 0
internal ABORT - process 0
APPLICATION TERMINATED WITH THE EXIT STRING: Interrupt (signal 2)
I also tried mpdboot and mpiexec and got the same error, so the Hydra process manager does not appear to be at fault. Do you know what is wrong? Thanks.