Intel® HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Using Intel MPI with PBSPro and Kerberos



We are having trouble using Intel MPI with PBSPro on our cluster in a Kerberized environment.

The problem is that PBSPro doesn't forward Kerberos tickets, which prevents us from having password-less ssh. Our security officers reject ssh keys without a passphrase; besides, we are expected to rely on Kerberos to connect through ssh.
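For anyone wanting to confirm the same symptom, a minimal sketch of a ticket check that could be placed at the top of a job script is shown below. It assumes a `klist` that supports the silent `-s` option (MIT Kerberos `klist` does); `has_krb_ticket` is a hypothetical helper name, not something provided by PBSPro or Intel MPI.

```shell
#!/usr/bin/env bash
# Sketch: report whether the current shell holds a valid Kerberos ticket.
# Assumption: klist supports -s (silent mode, exit status only), as in MIT Kerberos.

has_krb_ticket() {
  # No klist available at all: treat it as "no ticket".
  command -v klist >/dev/null 2>&1 || return 1
  # klist -s prints nothing; it exits 0 only if the cache is readable and not expired.
  klist -s
}

if has_krb_ticket; then
  echo "Kerberos ticket present: ssh via GSSAPI may work from this job"
else
  echo "No valid Kerberos ticket: an ssh-based launch will likely stall or fail"
fi
```

If the second message is printed inside the job but not in an interactive session after kinit, the ticket is indeed not reaching the job environment.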

As you can expect, a simple

mpirun -l -v -n $nb_procs "${PBS_O_WORKDIR}/" # that simply calls bash builtin echo

fails because pmi_proxy hangs; eventually the walltime is exceeded, and we observe:

[] Launch arguments: /work/logiciels/rhall/intel/parallel_studio_xe_2017_u2/compilers_and_libraries_2017.2.174/linux/mpi/intel64/bin/pmi_proxy --control-port --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk pbs --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1939201911 --usize -2 --proxy-id 0
[] Launch arguments: /bin/ssh -x -q /work/logiciels/rhall/intel/parallel_studio_xe_2017_u2/compilers_and_libraries_2017.2.174/linux/mpi/intel64/bin/pmi_proxy --control-port --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk pbs --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1939201911 --usize -2 --proxy-id 1
[] Start PMI_proxy 0
[] STDIN will be redirected to 1 fd(s): 17
[0] node: 0 /  /
=>> PBS: job killed: walltime 23 exceeded limit 15
[] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:252): unable to write data to proxy
[] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:174): unable to send signal downstream
[] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[] main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion
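The "Launch arguments" lines above already show which launcher and resource manager Hydra selected (`--launcher ssh`, `--rmk pbs`). As a small sketch, the helper below pulls such a flag value out of saved job output; `launch_flag` is a hypothetical name, and the sample line is shortened from the log in this post.

```shell
# Sketch: extract the value of a Hydra flag from "Launch arguments" debug lines.
# launch_flag is a hypothetical helper, not part of Intel MPI.
launch_flag() {
  # $1: flag name without the leading dashes (e.g. launcher, rmk).
  # Log text is read from stdin; each distinct value is printed once.
  sed -n "s/.*--$1 \([^ ]*\).*/\1/p" | sort -u
}

# Example line, shortened from the job output in this post.
sample='[] Launch arguments: /bin/ssh -x -q pmi_proxy --rmk pbs --launcher ssh --demux poll'
printf '%s\n' "$sample" | launch_flag launcher
printf '%s\n' "$sample" | launch_flag rmk
```

Running the same helper on the real job output file is a quick way to see whether a change of launcher settings actually took effect.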

If we instead log onto the master node, run kinit, and then run mpirun, everything works fine, but this isn't an acceptable workaround.

I've tried changing the fabrics, as the nodes are also connected with InfiniBand, but I had no luck there. If I'm not mistaken, pmi_proxy requires password-less ssh whatever fabrics we have. Am I right?

By the way, I've also tried Altair PBSPro's pbsdsh. The parameters it expects are not compatible with the ones mpirun passes. Besides, even when I wrap pbsdsh, pmi_proxy still fails with:

[] HYDU_sock_connect (../../utils/sock/sock.c:268): unable to connect from "" to "" (Connection refused)
[] main (../../pm/pmiserv/pmip.c:461): unable to connect to server at port 49813 (check for firewalls!)

So my question is: is there a workaround? Is there something I've missed? Every clue I can gather from googling and experimenting points towards "password-less ssh". So far the only workaround we've found consists in using another MPI framework :(


1 Reply

Answering my own question: the solution appears to lie in the PBSPro User Guide §



fixed my issue. I was misled by the fact that we don't have `rsh` installed.
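The exact settings did not survive in this post. For readers in the same situation, one configuration commonly used with PBS Professional keeps Hydra's `rsh` bootstrap but points it at `pbs_tmrsh`, which starts remote processes through the PBS TM interface rather than through ssh. That would be consistent with the remark about `rsh` not being installed, but it is an assumption about the missing lines, not a quote from them.

```shell
# Assumption: these are NOT the poster's verbatim settings, which are missing above.
# I_MPI_HYDRA_BOOTSTRAP and I_MPI_HYDRA_BOOTSTRAP_EXEC are Intel MPI Hydra variables.
# pbs_tmrsh ships with PBS Professional; it is assumed here to be on PATH
# (otherwise give its full path for your installation).
export I_MPI_HYDRA_BOOTSTRAP=rsh
export I_MPI_HYDRA_BOOTSTRAP_EXEC=pbs_tmrsh

# Then launch as before from inside the job script, for example:
#   mpirun -l -n "$nb_procs" ./your_program    # your_program is a placeholder
echo "bootstrap=$I_MPI_HYDRA_BOOTSTRAP exec=$I_MPI_HYDRA_BOOTSTRAP_EXEC"
```

With a setup like this, the launcher is independent of the fabric choice: the fabric only affects MPI traffic once the proxies are running, while the bootstrap decides how the proxies get started on the other nodes.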
