I have provided an ifort + Intel MPI build of a CFD solver, using the Intel MPI runtime environment, to a customer whose cluster uses an SGE scheduler. The user is attempting to run on a single node with mpirun and the default Hydra process manager, but the job hangs without any meaningful error message. I instructed the user to rerun with I_MPI_HYDRA_DEBUG=1 and I_MPI_DEBUG=6, but I don't see anything unusual or unexpected in the output. The last lines of the output are the following:
[mpiexec@compute-31] Launch arguments: /cm/shared/apps/sge/current/bin/linux-x64/qrsh -inherit -V compute-31.cm.cluster /a/fine/10.2/fine102/LINUX/_mpi/_impi5.0.3/intel64/bin/pmi_proxy --control-port compute-31.cm.cluster:46905 --debug --pmi-connect lazy-cache --pmi-aggregate -s 0 --rmk user --launcher sge --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1687715305 --usize -2 --proxy-id 0
[mpiexec@compute-31] STDIN will be redirected to 1 fd(s): 9
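For reference, the debug run was invoked roughly as follows (the solver binary name and rank count here are placeholders, not the customer's actual command line):

export I_MPI_HYDRA_DEBUG=1   # trace Hydra's launch steps, including pmi_proxy startup
export I_MPI_DEBUG=6         # verbose diagnostics from the MPI library on each rank
mpirun -np 16 ./cfd_solver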
I ran a similar test on another cluster, where the job completed correctly. On that cluster, the following lines with the [proxy:0:0@nmc-0066] prefix appear directly after the last lines reported on the customer's cluster:
[mpiexec@nmc-0066] STDIN will be redirected to 1 fd(s): 9
[proxy:0:0@nmc-0066] Start PMI_proxy 0
[proxy:0:0@nmc-0066] STDIN will be redirected to 1 fd(s): 9
Since the [proxy:0:0@...] lines never appear on the customer's cluster, I suspect the hang occurs before pmi_proxy ever starts, i.e., while Hydra is launching pmi_proxy through qrsh. Are there any additional verbose or debugging modes that could help pinpoint the problem on the user's cluster?
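One isolation test I am considering is to reproduce Hydra's launch step by hand, using the qrsh path and hostname taken from the launch arguments above (hostname is just a stand-in payload command, and this must be run from inside an SGE job, since -inherit relies on the job's environment):

/cm/shared/apps/sge/current/bin/linux-x64/qrsh -inherit -V compute-31.cm.cluster hostname

If this also hangs, the problem would be in the SGE qrsh -inherit configuration rather than in pmi_proxy itself.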
Thank you for your help,
Those options enable all of the debug information that we provide. If you would like assistance with resolving this issue, please submit a ticket at Intel® Premier Support or post the full Hydra debug output here.
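For example, the complete output could be captured with something like the following (the application name and rank count are placeholders for your actual command):

I_MPI_HYDRA_DEBUG=1 I_MPI_DEBUG=6 mpirun -np 16 ./your_app > hydra_debug.log 2>&1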
Intel Developer Support