
Debugging problems with mpiexec.hydra

Hello,

I have provided an ifort + Intel MPI build of a CFD solver to a customer, who runs it with the Intel MPI runtime environment. The customer's cluster uses an SGE scheduler. The user is attempting to run on a single node with mpirun and the default Hydra process manager, but the job hangs without any meaningful error message. I instructed the user to run with I_MPI_HYDRA_DEBUG=1 and I_MPI_DEBUG=6, but I don't see anything unusual or unexpected in the output. The last lines of the output are the following:

[mpiexec@compute-31] Launch arguments: /cm/shared/apps/sge/current/bin/linux-x64/qrsh -inherit -V compute-31.cm.cluster /a/fine/10.2/fine102/LINUX/_mpi/_impi5.0.3/intel64/bin/pmi_proxy --control-port compute-31.cm.cluster:46905 --debug --pmi-connect lazy-cache --pmi-aggregate -s 0 --rmk user --launcher sge --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1687715305 --usize -2 --proxy-id 0 
[mpiexec@compute-31] STDIN will be redirected to 1 fd(s): 9
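
For anyone reproducing this, the two debug settings mentioned above could be applied per run with a small wrapper like the sketch below. The function name run_with_hydra_debug, the log file name, and the mpirun arguments in the usage comment are placeholders, not part of the actual job script.

```shell
# Hypothetical helper: run any command with the two Intel MPI debug
# variables set for that command only, capturing stdout and stderr.
run_with_hydra_debug() {
    log_file="$1"   # where the combined output is written
    shift           # the remaining arguments are the command to run
    I_MPI_HYDRA_DEBUG=1 I_MPI_DEBUG=6 "$@" > "$log_file" 2>&1
}

# Example call (placeholder process count and solver path):
#   run_with_hydra_debug hydra_debug.log mpirun -n 4 ./solver
```

Setting the variables on the command line rather than exporting them keeps them out of the rest of the job environment.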

I ran a similar test on another cluster, where it worked correctly. There, the following lines with the [proxy:0:0@nmc-0066] prefix appear directly after the point where the customer cluster's output stops.

[mpiexec@nmc-0066] STDIN will be redirected to 1 fd(s): 9
[proxy:0:0@nmc-0066] Start PMI_proxy 0
[proxy:0:0@nmc-0066] STDIN will be redirected to 1 fd(s): 9

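From these tests I think the hang occurs while pmi_proxy is being launched or started, since the customer's output stops before any [proxy:0:0@...] line appears. Are there any additional verbose or debugging modes that could help identify the problem on the user's cluster?

Thank you for your help,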
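
The comparison between the two clusters can be automated with a quick check on a saved debug log. This is only a sketch: it assumes the combined output was saved to a file, and the function name proxy_started is made up for illustration. It looks for the "Start PMI_proxy" line that shows up on the working cluster.

```shell
# Hypothetical check: succeeds (exit status 0) if the log contains a
# line written by the proxy itself, e.g.
#   [proxy:0:0@nmc-0066] Start PMI_proxy 0
proxy_started() {
    grep -q '^\[proxy:.*Start PMI_proxy' "$1"
}

# Example call (placeholder log file name):
#   proxy_started hydra_debug.log && echo "proxy started" || echo "no proxy output"
```

If the check fails on a log that does contain the "Launch arguments" line, mpiexec issued the launch command but no output from the proxy was captured.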

-David

 

 

James_T_Intel
Moderator

David,

Those options enable the full debug information that we provide.  If you would like assistance with resolving this issue, can you submit a ticket at Intel® Premier Support or provide the full Hydra debug output?

James.
Intel Developer Support
