Intel® MPI Library

mpirun with bad hostname hangs with [ssh] <defunct> until Enter is pressed

Adrian_I_
Beginner

We have been experiencing hangs with our MPI-based application, and our investigation led us to observe the following behaviour of mpirun:

mpirun -n 1 -host <good_hostname> hostname works as expected

mpirun -n 1 -host <bad_hostname> hostname hangs, during which ps shows: 

21465 pts/11   S+     0:00  |   |   |   \_ /bin/sh /opt/soft1/intel-mpi/impi/5.0.1.035/intel64/bin/mpirun -n 1 -host bad_hostname hostname
21470 pts/11   S+     0:00  |   |   |       \_ mpiexec.hydra -n 1 -host bad_hostname hostname
21471 pts/11   Z      0:00  |   |   |           \_ [ssh] <defunct>

Once I press Enter on the terminal from which I ran the mpirun command, the command exits with no output and exit code 141:

$ mpirun -n 1 -host bad_hostname hostname; echo $?

141
$
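For reference, a shell exit status above 128 conventionally means the process was killed by signal number (status - 128), so 141 corresponds to signal 13, which is SIGPIPE on Linux. A minimal sketch of that decoding, assuming bash; `signal_from_status` is a hypothetical helper name used only for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a shell exit status to the name of the
# signal that terminated the process, or "none" if it exited normally.
signal_from_status() {
    local status=$1
    if [ "$status" -gt 128 ]; then
        # The kill builtin accepts an exit status with -l and prints
        # the corresponding signal name without the SIG prefix.
        kill -l "$status"
    else
        echo "none"
    fi
}

signal_from_status 141
```

This only identifies the signal; it does not say which write hit the closed pipe.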

I tried running it under strace, and the command appears to get stuck in the following wait4() system call:

...

clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f0610228a10) = 19905
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGINT, {0x43e840, [], SA_RESTORER, 0x307a635cd0}, {SIG_DFL, [], SA_RESTORER, 0x307a635cd0}, 8) = 0
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGPIPE}], 0, NULL)

 

Full mpirun -v and strace output attached.
I tried both 4.1.3.049 and 5.0.1.035, and the behaviour is the same.

Any help is much appreciated.

4 Replies
Artem_R_Intel1
Employee

Hello Adrian,

Such a scenario (a hang caused by an incorrect hostname or incorrect network settings) is supposed to be controlled by I_MPI_JOB_TIMEOUT.

Adrian_I_
Beginner

Hi Artem,

I'm afraid that's not a solution:

[adrian@tomcat ~]$ mpirun -n 1 -hosts tomcat bash -c "sleep 5; echo Job finished"
Job finished
[adrian@tomcat ~]$ export I_MPI_JOB_TIMEOUT=3
[adrian@tomcat ~]$ mpirun -n 1 -hosts tomcat bash -c "sleep 5; echo Job finished"
APPLICATION TERMINATED WITH THE EXIT STRING: job ending due to timeout = 3

What I mean is that I_MPI_JOB_TIMEOUT simply terminates the job after the specified number of seconds, whereas we have no upper limit on how long our job might take. Even if we had one, it would not be feasible to wait that long before detecting a bad hostname.
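One possible workaround that avoids a job-wide timeout is to check each host before launching, so that a bad hostname fails fast and never reaches mpirun. A minimal sketch, assuming bash and assuming that name resolution via `getent hosts` is an adequate test for a "bad" hostname in this environment; `check_hosts` is a hypothetical helper name and is not part of Intel MPI:

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: returns non-zero if any of the given
# hostnames cannot be resolved, printing each failure to stderr.
check_hosts() {
    local host rc=0
    for host in "$@"; do
        if ! getent hosts "$host" >/dev/null 2>&1; then
            echo "unresolvable host: $host" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Usage sketch, reusing the hostname and command from the example above:
# check_hosts tomcat && mpirun -n 1 -hosts tomcat bash -c "sleep 5; echo Job finished"
```

This catches only names that do not resolve. A stricter check could attempt a connection per host, for example `ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true`, at the cost of one SSH session per host.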

Artem_R_Intel1
Employee

Hi Adrian,

Unfortunately, for now this is the only option for such hanging scenarios.
You can submit a ticket via Intel Premier Support to improve this behavior in future Intel MPI Library releases.

Adrian_I_
Beginner

Unfortunately, our Premier Support subscription expired, so we have to renew it before I can file a ticket.

But I have to say I find it very surprising that this is a limitation of the current version of Intel MPI rather than just a silly bug or configuration issue. The fact that the command seems to be waiting for some input before winding down is what makes me think it is the latter.

 
