Community
cancel
Showing results for 
Search instead for 
Did you mean: 
Luc_H_
Beginner
211 Views

Using Intel MPI with PBSPro and Kerberos

Hello,

We have some troubles on our cluster to use Intel MPI with PBSPro under a Kerberized environment.

The thing is PBSPro doesn't forward Kerberos tickets which prevents us to have a password-less ssh. Security officers rejects ssh keys without a passphrase, beside, we are expected to rely on Kerberos in order to connect through ssh.

As you can expect, a simple

mpirun -l -v -n $nb_procs "${PBS_O_WORKDIR}/echo-node.sh" # that simply calls bash builtin echo

fails because of pmi_proxy that hangs, and in the end the walltime is exceeded, and we observe:

[...]
[mpiexec@node028.sis.cnes.fr] Launch arguments: /work/logiciels/rhall/intel/parallel_studio_xe_2017_u2/compilers_and_libraries_2017.2.174/linux/mpi/intel64/bin/pmi_proxy --control-port node028.sis.cnes.fr:41735 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk pbs --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1939201911 --usize -2 --proxy-id 0
[mpiexec@node028.sis.cnes.fr] Launch arguments: /bin/ssh -x -q node029.sis.cnes.fr /work/logiciels/rhall/intel/parallel_studio_xe_2017_u2/compilers_and_libraries_2017.2.174/linux/mpi/intel64/bin/pmi_proxy --control-port node028.sis.cnes.fr:41735 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk pbs --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1939201911 --usize -2 --proxy-id 1
[proxy:0:0@node028.sis.cnes.fr] Start PMI_proxy 0
[proxy:0:0@node028.sis.cnes.fr] STDIN will be redirected to 1 fd(s): 17
[0] node: 0 /  /
=>> PBS: job killed: walltime 23 exceeded limit 15
[mpiexec@node028.sis.cnes.fr] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@node028.sis.cnes.fr] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:252): unable to write data to proxy
[mpiexec@node028.sis.cnes.fr] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:174): unable to send signal downstream
[mpiexec@node028.sis.cnes.fr] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@node028.sis.cnes.fr] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@node028.sis.cnes.fr] main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion

If instead we log onto the master node, execute kinit, and then run mpirun, everything works fine. Except this isn't exactly an acceptable workaround.

I've tried to play with the fabrics as the nodes are also connected with infiband, but I had no luck there. If I'm not mistaken, pmi_proxy does require password-less ssh whatever fabrics we have. Am I right ?

BTW, I've also tried to play with Altair PBSPro's pbsdsh. I've observed that the parameters it expects are not compatible with the one fed by mpirun. Besides, even if I encapsulate pbsdsh, pmi_proxy still fails with a

[proxy:0:0@node028.sis.cnes.fr] HYDU_sock_connect (../../utils/sock/sock.c:268): unable to connect from "node028.sis.cnes.fr" to "node028.sis.cnes.fr" (Connection refused)
[proxy:0:0@node028.sis.cnes.fr] main (../../pm/pmiserv/pmip.c:461): unable to connect to server node028.sis.cnes.fr at port 49813 (check for firewalls!)

So. My question, is there a workaround? Something that I've missed? Every clue I can gather googling and experimenting points me towards "password-less ssh". So far the only workaround we've found consist in using another MPI framework :(

Regards,

0 Kudos
1 Reply
Luc_H_
Beginner
211 Views

I answer my own question, it appears the solution lies in PBSPro User Guide §6.2.6.1.

Setting

export I_MPI_HYDRA_BOOTSTRAP=rsh
export I_MPI_HYDRA_BOOTSTRAP_EXEC=pbs_tmrsh

fixed my issue. I was mislead by the fact we don't have `rsh` installed.

Reply