Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI program behavior on node crash ...

Tofu
Beginner

In our production environment, it happens that some nodes crash once in a while. What is the behavior of Intel MPI when an MPI program loses contact with some of its processes? Would there be any difference if the crashed node contains rank 0? And is there any Intel MPI option to control the behavior in such a situation, so that the program is cleaned up when one of its MPI processes is lost?

Thank you very much,
Tofu

James_T_Intel
Moderator
Hi Tofu,

If a node containing a process crashes, the entire job will end. You can use the -cleanup option (or the I_MPI_HYDRA_CLEANUP environment variable) to create a temporary file listing the PID of each process; the mpicleanup utility will then use this file to clean up the environment if the job does not end correctly. If you are using MPD instead of Hydra, use I_MPI_MPIRUN_CLEANUP instead.
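For example, a minimal sketch of that workflow (the host list and program here are placeholders, and the exact name and location of the generated file depend on your configuration):

# Enable creation of the cleanup file, per run or via the environment:
mpirun -cleanup -hosts node1,node2 -n 32 ./a.out
# or, equivalently:
export I_MPI_HYDRA_CLEANUP=1
mpirun -hosts node1,node2 -n 32 ./a.out

# After an abnormal termination, run the cleanup utility to kill any
# leftover MPI processes recorded in that file:
mpicleanup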

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
YY_C_
Beginner

Hi,

I ran into a similar situation in which the mpirun command does not terminate even when some of the processes fail to start up properly. Here I have two nodes, p01 and p02, running the test program /opt/intel/impi/4.1.0.024/test/test.f90. Here is what I've done:

cp /opt/intel/impi/4.1.0.024/test/test.f90 /path/to/shared/storage

cd /path/to/shared/storage

mpiifort test.f90

mpirun -hosts p01,p02 -n 32 ./a.out

 Hello world: rank            0  of           32  running on p01
 Hello world: rank            1  of           32  running on p01
 Hello world: rank            2  of           32  running on p01
 Hello world: rank            3  of           32  running on p01
 Hello world: rank            4  of           32  running on p01
 Hello world: rank            5  of           32  running on p01
 Hello world: rank            6  of           32  running on p01
 Hello world: rank            7  of           32  running on p01
 Hello world: rank            8  of           32  running on p01
 Hello world: rank            9  of           32  running on p01
 Hello world: rank           10  of           32  running on p01
 Hello world: rank           11  of           32  running on p01
 Hello world: rank           12  of           32  running on p01
 Hello world: rank           13  of           32  running on p01
 Hello world: rank           14  of           32  running on p01
 Hello world: rank           15  of           32  running on p01
 Hello world: rank           16  of           32  running on p02
 Hello world: rank           17  of           32  running on p02
 Hello world: rank           18  of           32  running on p02
 Hello world: rank           19  of           32  running on p02
 Hello world: rank           20  of           32  running on p02
 Hello world: rank           21  of           32  running on p02
 Hello world: rank           22  of           32  running on p02
 Hello world: rank           23  of           32  running on p02
 Hello world: rank           24  of           32  running on p02
 Hello world: rank           25  of           32  running on p02
 Hello world: rank           26  of           32  running on p02
 Hello world: rank           27  of           32  running on p02
 Hello world: rank           28  of           32  running on p02
 Hello world: rank           29  of           32  running on p02
 Hello world: rank           30  of           32  running on p02
 Hello world: rank           31  of           32  running on p02

Now, on p02, I unmount the shared storage and then issue the command again:

mpirun -hosts p01,p02 -n 32 ./a.out

[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)
(the same execvp error is printed 16 times, once for each of the 16 ranks launched on p02)

However, the mpirun process does not terminate, and the ps tree shows the following:

100776 pts/14   S      0:00      \_ /bin/sh /opt/intel/impi/4.1.0.024/intel64/bin/mpirun -hosts p01,p02 -ppn 1 -n 2 ./a.out
100781 pts/14   S      0:00      |   \_ mpiexec.hydra -hosts p01 p02 -ppn 1 -n 2 ./a.out
100782 pts/14   S      0:00      |       \_ /usr/bin/ssh -x -q p01 /opt/intel/impi/4.1.0.024/intel64/bin/pmi_proxy --control-port metro:36671 --pmi-connect lazy-cache --pmi-aggregate -s 0 --rmk slurm --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --proxy-id 0
100783 pts/14   Z      0:00      |       \_ [ssh] <defunct>

 

I just wonder whether there is any option that can help in this situation, so that mpirun can terminate properly instead of hanging.

 

regards,

C. Bean
TimP
Honored Contributor III

Could you use mpdallexit?
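(Note: mpdallexit applies only to the older MPD process manager, not to the Hydra manager visible in the ps output above. A minimal sketch, assuming an MPD ring started from a host file mpd.hosts that lists both nodes:)

# Shut down every mpd daemon in the ring:
mpdallexit
# Bring the ring back up for later runs:
mpdboot -n 2 -f mpd.hosts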

James_T_Intel
Moderator

We have corrected some problems related to ranks not exiting correctly.  Please try with Version 4.1 Update 2 and see if this resolves the problem.
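(A quick way to confirm which version a given mpirun actually resolves to, assuming Intel MPI's standard -V option:)

# Print the Intel MPI Library version string:
mpirun -V
# and check which launcher is first on the PATH:
which mpirun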

Tofu
Beginner

Hi,

Our situation is slightly different, but we encountered a similar problem, even though we're using version 4.1 Update 2. We started running the HPL benchmark and one of the nodes crashed in the middle of the run. However, mpiexec.hydra does not terminate:

 28351 pts/4    Ss     0:00  \_ /bin/bash
 32295 pts/4    S+     0:00      \_ /bin/sh /opt/intel/impi/4.1.2.040/intel64/bin/mpirun -hosts node107,node213 -n 32 ./xhpl_intel64_dynamic
 32300 pts/4    S+     0:00          \_ mpiexec.hydra -hosts node107 node213 -n 32 ./xhpl_intel64_dynamic
 32301 pts/4    Z      0:00              \_ [ssh] <defunct>
 32302 pts/4    S      0:00              \_ /usr/bin/ssh -x -q node213 /opt/intel/impi/4.1.2.040/intel64/bin/pmi_proxy --control-port master:49817 --pmi-connect lazy-cache --pmi-aggregate -s 0 --rmk slurm --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1138594473 --proxy-id 1

 

Any clue?

regards,

Tofu

 

YY_C_
Beginner

Intel MPI 4.1 Update 2 works fine for the initial start-up issue; i.e., if a node does not mount the shared storage, mpirun terminates properly.

We also tried unplugging a compute node in the middle of a run and found that mpirun hangs with an [ssh] <defunct> child. Is there any way to make mpirun terminate in such a situation?

regards,

C. Bean
Tofu
Beginner

Any update on this issue? We tried compiling the application with MVAPICH2, and with their mpiexec.hydra the whole application is terminated whenever a compute node goes down.

regards,

Tofu
