ptsouts
Beginner
659 Views

INTEL MPI Hydra Crash

I am getting the following message intermittently when running a parallel job with the latest Intel Fortran compiler and Intel MPI:
[proxy:0:12@n020] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:70): assert (!(pollfds.revents & ~POLLIN & ~POLLOUT & ~POLLHUP)) failed
[proxy:0:12@n020] main (./pm/pmiserv/pmip.c:387): demux engine error waiting for event
[mpiexec@n032] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated badly; aborting
[mpiexec@n032] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[mpiexec@n032] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:521): bootstrap server returned error waiting for completion
[mpiexec@n032] main (./ui/mpich/mpiexec.c:548): process manager error waiting for completion
I am currently using the following command:
mpirun -np N ./a.exe
Should I specify anything else to ensure this error does not happen again?
4 Replies
James_T_Intel
Moderator

Hi ptsouts,

The error you are seeing occurs because one of the processes in your job ended abnormally. However, the information you have provided isn't sufficient by itself to pin down the cause. Can you try running this command:
[bash]mpirun -np N -check_mpi ./a.exe[/bash]

That will provide additional information about the MPI calls being made. Please post the output of this command, preferably from one of the failed runs.

Can you provide any details of the program you are attempting to run? It would be best if you could provide a small snippet of the program that shows this behavior, so I can attempt to reproduce it here. Or if it is a publicly available code, a link to the source would work as well.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
James_T_Intel
Moderator

Hi ptsouts,

Have you tried running your program with the -check_mpi option? Are you able to provide any of the source code for the program, or another that can reproduce this behavior?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
gryghash
Beginner

I seem to be having a similar problem. I either get the same error messages as in the original post, or I get "APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)".

I am running this example program: http://en.wikipedia.org/wiki/Message_Passing_Interface#Example_program

It has worked with OpenMPI. It also works with Intel MPI on a single node using multiple cores. However, the multi-node runs all crash.

I am using mpiexec.hydra's Torque/PBS integration, and the integration itself works: it finds all the assigned nodes and knows how many cores per node to use.

Here's the job:
cd ${PBS_O_WORKDIR}
mpiexec.hydra -verbose -rmk pbs -tmpdir /scratch/${PBS_JOBID} ./hello_mpi
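For context, a complete Torque/PBS submission script wrapping those two lines might look like the sketch below. The job name, resource requests (node count, processors per node, walltime), and the assumption that /scratch/${PBS_JOBID} must be created before use are all illustrative guesses, not values from the original post:

```shell
#!/bin/bash
# Illustrative PBS directives -- adjust to the actual cluster policy.
#PBS -N hello_mpi
#PBS -l nodes=2:ppn=8
#PBS -l walltime=00:10:00
#PBS -j oe

cd ${PBS_O_WORKDIR}
# Ensure the per-job scratch directory exists on the submission host;
# some sites create it automatically via a prologue script.
mkdir -p /scratch/${PBS_JOBID}
mpiexec.hydra -verbose -rmk pbs -tmpdir /scratch/${PBS_JOBID} ./hello_mpi
```

The -tmpdir directory must exist and be writable on every node the job lands on, so if the scratch space is node-local rather than shared, the mkdir above would need to run on each node.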

I am attaching the verbose output, redacted.

I have tried turning off the firewall on the compute nodes, but it didn't help; the errors remained the same.

Thanks for your attention,
--Dave Chin
James_T_Intel
Moderator

Hi Dave,

What version of OFED are you using? What does your /etc/dat.conf file look like?
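For reference, /etc/dat.conf lists one DAPL provider per line; a typical entry for an InfiniBand adapter looks like the following sketch. The adapter name (mlx4_0), port numbers, and library version here are illustrative examples, not values from this system:

```
ofa-v2-mlx4_0-1 u2.0 nonthreaded default libdaplofa.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2 u2.0 nonthreaded default libdaplofa.so.2 dapl.2.0 "mlx4_0 2" ""
```

A stale or mismatched dat.conf (for example, one referencing an adapter that is not present) is a common cause of multi-node startup failures with Intel MPI over DAPL.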

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools