We have been experiencing hangs with our MPI-based application, and our investigation led us to observe the following behaviour of mpirun:
mpirun -n 1 -host <good_hostname> hostname works as expected
mpirun -n 1 -host <bad_hostname> hostname hangs, during which ps shows:
21465 pts/11 S+ 0:00 | | | \_ /bin/sh /opt/soft1/intel-mpi/impi/5.0.1.035/intel64/bin/mpirun -n 1 -host bad_hostname hostname
21470 pts/11 S+ 0:00 | | | \_ mpiexec.hydra -n 1 -host bad_hostname hostname
21471 pts/11 Z 0:00 | | | \_ [ssh] <defunct>
Once I press Enter on the terminal from which I ran the mpirun command, the command exits with no output and exit code 141:
$ mpirun -n 1 -host bad_hostname hostname; echo $?
141
$
I tried running it under strace, and the command appears to get stuck in the following wait4() system call:
...
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f0610228a10) = 19905
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGINT, {0x43e840, [], SA_RESTORER, 0x307a635cd0}, {SIG_DFL, [], SA_RESTORER, 0x307a635cd0}, 8) = 0
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGPIPE}], 0, NULL)
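For reference, the exit code 141 and the SIGPIPE in the wait4() line are consistent with each other: shells conventionally report a command killed by signal N as status 128 + N, and SIGPIPE is signal 13 on Linux. A minimal sketch for decoding such a status (signal_from_status is a hypothetical helper name, not something shipped with Intel MPI):

```shell
# signal_from_status is a hypothetical helper, not part of Intel MPI.
# It maps a shell exit status to the name of the signal that killed the command.
signal_from_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        # By shell convention, a status above 128 means "killed by signal (status - 128)".
        # kill -l prints the signal name for a given number; it fails if the
        # number does not correspond to a valid signal.
        kill -l "$((status - 128))"
    else
        echo "none"
    fi
}

# 141 - 128 = 13, which is SIGPIPE on Linux.
signal_from_status 141
```

This only explains what the 141 means; it does not say which pipe was closed or why.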
Full mpirun -v and strace output attached.
I tried it with both 4.1.3.049 and 5.0.1.035, and the behaviour is the same.
Any help is much appreciated.
Hello Adrian,
Such a scenario (a hang caused by an incorrect hostname or incorrect network settings) is supposed to be controlled by I_MPI_JOB_TIMEOUT.
Hi Artem,
I'm afraid that's not a solution:
[adrian@tomcat ~]$ mpirun -n 1 -hosts tomcat bash -c "sleep 5; echo Job finished"
Job finished
[adrian@tomcat ~]$ export I_MPI_JOB_TIMEOUT=3
[adrian@tomcat ~]$ mpirun -n 1 -hosts tomcat bash -c "sleep 5; echo Job finished"
APPLICATION TERMINATED WITH THE EXIT STRING: job ending due to timeout = 3
What I mean is that I_MPI_JOB_TIMEOUT simply terminates the job after the specified number of seconds, whereas we don't have an upper limit on how long our job might take. Even if we had one, it would not be feasible to wait that long.
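One possible way around this, as a sketch only: put a time limit on a short probe launch per host instead of on the whole job, and start the real job without any timeout once the probe succeeds. The example below assumes GNU coreutils timeout is available; probe_with_timeout is a hypothetical helper name, and sleep stands in for a launch that hangs.

```shell
# probe_with_timeout is a hypothetical helper, not part of Intel MPI.
# Usage: probe_with_timeout SECONDS COMMAND [ARGS...]
# Assumes GNU coreutils `timeout`, which exits with status 124 when the limit expires.
probe_with_timeout() {
    limit=$1
    shift
    # stdin comes from /dev/null so the probe never waits on terminal input;
    # output is discarded because only the exit status matters here.
    timeout "$limit" "$@" </dev/null >/dev/null 2>&1
}

# Stand-in for a hanging launch: `sleep 5` limited to 1 second.
if probe_with_timeout 1 sleep 5; then
    echo "probe finished in time"
else
    echo "probe failed or timed out"
fi
```

In the setup from this thread, the probe command would be something like mpirun -n 1 -host <hostname> hostname, run once per host before the real job. Whether the probe itself winds down cleanly on a bad host is an assumption that would need checking.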
Hi Adrian,
Unfortunately, for now this is the only option for such hanging scenarios.
You can submit a ticket via Intel Premier Support to improve this behavior in future Intel MPI Library releases.
Unfortunately, our Premier Support subscription expired, so we have to renew it before I can file a ticket.
But I have to say I find it very surprising that this is a limitation of the current version of Intel MPI rather than just a silly bug or configuration issue. What makes me think that is the fact that the command seems to be just waiting for some input before winding down.