In our production environment, nodes crash once in a while. What is the behavior of Intel MPI when an MPI program loses contact with some of its processes? Would there be any difference if the crashed node contains rank 0? Is there any Intel MPI option to control the behavior in such a situation so that the program is cleaned up if one of the MPI processes is lost?
Thank you very much,
Tofu
If a node containing a process crashes, the entire job will end. You can use the -cleanup option (or I_MPI_HYDRA_CLEANUP) to create a temporary file that will list the PID of each process, and the mpicleanup utility will use this file to clean the environment if the job does not end correctly. You can also use I_MPI_MPIRUN_CLEANUP if you are using MPD instead of Hydra.
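A minimal sketch of this workflow, assuming Hydra is the process manager (the host names and the cleanup file path are placeholders; mpirun creates the actual file under the directory given by I_MPI_TMPDIR, /tmp by default, and the exact mpicleanup options are listed in the Intel MPI Library Reference Manual):
export I_MPI_HYDRA_CLEANUP=1                 # or pass the -cleanup option to mpirun
mpirun -hosts <node1>,<node2> -n 32 ./your_app   # mpirun records the PIDs of the launched processes in a temporary file
mpicleanup -i /tmp/<cleanup_file>            # run after an abnormal termination to kill the leftover processes listed in that file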
Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
Hi,
I ran into a similar situation where the mpirun command does not terminate even when some of the processes fail to start up properly. I have two nodes, p01 and p02, running the test program /opt/intel/impi/4.1.0.024/test/test.f90. Here is what I did:
cp /opt/intel/impi/4.1.0.024/test/test.f90 /path/to/shared/storage
cd /path/to/shared/storage
mpiifort test.f90
mpirun -hosts p01,p02 -n 32 ./a.out
Hello world: rank 0 of 32 running on p01
Hello world: rank 1 of 32 running on p01
Hello world: rank 2 of 32 running on p01
Hello world: rank 3 of 32 running on p01
Hello world: rank 4 of 32 running on p01
Hello world: rank 5 of 32 running on p01
Hello world: rank 6 of 32 running on p01
Hello world: rank 7 of 32 running on p01
Hello world: rank 8 of 32 running on p01
Hello world: rank 9 of 32 running on p01
Hello world: rank 10 of 32 running on p01
Hello world: rank 11 of 32 running on p01
Hello world: rank 12 of 32 running on p01
Hello world: rank 13 of 32 running on p01
Hello world: rank 14 of 32 running on p01
Hello world: rank 15 of 32 running on p01
Hello world: rank 16 of 32 running on p02
Hello world: rank 17 of 32 running on p02
Hello world: rank 18 of 32 running on p02
Hello world: rank 19 of 32 running on p02
Hello world: rank 20 of 32 running on p02
Hello world: rank 21 of 32 running on p02
Hello world: rank 22 of 32 running on p02
Hello world: rank 23 of 32 running on p02
Hello world: rank 24 of 32 running on p02
Hello world: rank 25 of 32 running on p02
Hello world: rank 26 of 32 running on p02
Hello world: rank 27 of 32 running on p02
Hello world: rank 28 of 32 running on p02
Hello world: rank 29 of 32 running on p02
Hello world: rank 30 of 32 running on p02
Hello world: rank 31 of 32 running on p02
Now, on p02, I unmount the shared storage and then issue the command again:
mpirun -hosts p01,p02 -n 32 ./a.out
[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)
[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)
[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)
[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)
[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)
[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)
[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)
[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)
[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)
[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)
[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)
[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)
[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)
[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)
[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)
[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)
However, the mpirun process does not terminate, and the ps tree shows the following:
100776 pts/14 S 0:00 \_ /bin/sh /opt/intel/impi/4.1.0.024/intel64/bin/mpirun -hosts p01,p02 -ppn 1 -n 2 ./a.out
100781 pts/14 S 0:00 | \_ mpiexec.hydra -hosts p01 p02 -ppn 1 -n 2 ./a.out
100782 pts/14 S 0:00 | \_ /usr/bin/ssh -x -q p01 /opt/intel/impi/4.1.0.024/intel64/bin/pmi_proxy --control-port metro:36671 --pmi-connect lazy-cache --pmi-aggregate -s 0 --rmk slurm --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --proxy-id 0
100783 pts/14 Z 0:00 | \_ [ssh] <defunct>
I just wonder whether there is any option that can help in this situation so that mpirun terminates properly instead of hanging.
regards,
C. Bean
Could you use mpdallexit?
We have corrected some problems related to ranks not exiting correctly. Please try with Version 4.1 Update 2 and see if this resolves the problem.
Hi,
Our situation is slightly different, but we encountered a similar problem even though we are using version 4.1 Update 2. We were running the HPL benchmark and one of the nodes crashed in the middle of the run. However, mpiexec.hydra does not terminate:
28351 pts/4 Ss 0:00 \_ /bin/bash
32295 pts/4 S+ 0:00 \_ /bin/sh /opt/intel/impi/4.1.2.040/intel64/bin/mpirun -hosts node107,node213 -n 32 ./xhpl_intel64_dynamic
32300 pts/4 S+ 0:00 \_ mpiexec.hydra -hosts node107 node213 -n 32 ./xhpl_intel64_dynamic
32301 pts/4 Z 0:00 \_ [ssh] <defunct>
32302 pts/4 S 0:00 \_ /usr/bin/ssh -x -q node213 /opt/intel/impi/4.1.2.040/intel64/bin/pmi_proxy --control-port master:49817 --pmi-connect lazy-cache --pmi-aggregate -s 0 --rmk slurm --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1138594473 --proxy-id 1
Any clue?
regards,
Tofu
Intel MPI 4.1 Update 2 works fine for the initial start-up issue; i.e., if a node does not mount the shared storage, mpirun terminates properly.
We also tried unplugging a compute node in the middle of a run and found that mpirun hangs with [ssh] <defunct>. Is there any way to make mpirun terminate in such a situation?
regards,
C. Bean
Any update on this issue? We tried compiling the application with MVAPICH2 and using their mpiexec.hydra; with it, the whole application is terminated whenever a compute node goes down.
regards,
Tofu