Hello,
We are having trouble using Intel MPI with PBSPro on our cluster in a Kerberized environment.
The problem is that PBSPro doesn't forward Kerberos tickets, which prevents us from having password-less ssh. Our security officers reject ssh keys without a passphrase; besides, we are expected to rely on Kerberos to connect through ssh.
As you can expect, a simple
mpirun -l -v -n $nb_procs "${PBS_O_WORKDIR}/echo-node.sh" # that simply calls bash builtin echo
fails because pmi_proxy hangs; eventually the walltime is exceeded, and we observe:
[...]
[mpiexec@node028.sis.cnes.fr] Launch arguments: /work/logiciels/rhall/intel/parallel_studio_xe_2017_u2/compilers_and_libraries_2017.2.174/linux/mpi/intel64/bin/pmi_proxy --control-port node028.sis.cnes.fr:41735 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk pbs --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1939201911 --usize -2 --proxy-id 0
[mpiexec@node028.sis.cnes.fr] Launch arguments: /bin/ssh -x -q node029.sis.cnes.fr /work/logiciels/rhall/intel/parallel_studio_xe_2017_u2/compilers_and_libraries_2017.2.174/linux/mpi/intel64/bin/pmi_proxy --control-port node028.sis.cnes.fr:41735 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk pbs --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1939201911 --usize -2 --proxy-id 1
[proxy:0:0@node028.sis.cnes.fr] Start PMI_proxy 0
[proxy:0:0@node028.sis.cnes.fr] STDIN will be redirected to 1 fd(s): 17
[0] node: 0 / /
=>> PBS: job killed: walltime 23 exceeded limit 15
[mpiexec@node028.sis.cnes.fr] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@node028.sis.cnes.fr] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:252): unable to write data to proxy
[mpiexec@node028.sis.cnes.fr] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:174): unable to send signal downstream
[mpiexec@node028.sis.cnes.fr] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@node028.sis.cnes.fr] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@node028.sis.cnes.fr] main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion
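(For completeness, echo-node.sh is nothing fancier than the sketch below; the real script just calls the bash builtin echo, and the use of PMI_RANK as the rank variable is an assumption on my part:)

#!/bin/bash
# echo-node.sh (sketch): print which rank/node we run on.
# PMI_RANK is exported by Hydra to each launched process (assumed here).
echo "node: ${PMI_RANK} / $(hostname -s) / $(hostname -f)"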
If instead we log onto the master node, execute kinit, and then run mpirun, everything works fine. Except this isn't exactly an acceptable workaround.
I've tried to play with the fabrics, as the nodes are also connected over InfiniBand, but I had no luck there. If I'm not mistaken, pmi_proxy requires password-less ssh whatever fabric we use. Am I right?
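For reference, the fabric experiments amounted to overriding Hydra's fabric selection before mpirun, along these lines (illustrative values, Intel MPI 2017 syntax):

export I_MPI_FABRICS=shm:dapl   # shared memory intra-node, DAPL over InfiniBand inter-node
export I_MPI_FABRICS=shm:tcp    # or plain TCP, as a sanity check

Neither made any difference to the launch itself.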
BTW, I've also tried to play with Altair PBSPro's pbsdsh. I've observed that the parameters it expects are not compatible with the ones fed by mpirun. Besides, even if I wrap pbsdsh to translate the arguments, pmi_proxy still fails with:
[proxy:0:0@node028.sis.cnes.fr] HYDU_sock_connect (../../utils/sock/sock.c:268): unable to connect from "node028.sis.cnes.fr" to "node028.sis.cnes.fr" (Connection refused)
[proxy:0:0@node028.sis.cnes.fr] main (../../pm/pmiserv/pmip.c:461): unable to connect to server node028.sis.cnes.fr at port 49813 (check for firewalls!)
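For the record, my wrapper was roughly the following. This is a hypothetical sketch: Hydra invokes its launcher as `<launcher> [ssh-style flags] <host> <command...>`, and the hostname-to-index mapping for pbsdsh is an assumption:

#!/bin/bash
# ssh-like wrapper around pbsdsh (sketch).
while [ "${1#-}" != "$1" ]; do shift; done   # drop leading ssh-style flags (-x, -q, ...)
host=$1; shift
# Map the hostname to its (assumed 0-based) index in $PBS_NODEFILE,
# since pbsdsh addresses nodes by index rather than by name.
idx=$(awk -v h="$host" '$1 == h { print NR - 1; exit }' "$PBS_NODEFILE")
exec pbsdsh -n "$idx" -- "$@"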
So, my question: is there a workaround? Something I've missed? Every clue I can gather from googling and experimenting points me towards "password-less ssh". So far the only workaround we've found consists of using another MPI framework :(
Regards,
Answering my own question: it appears the solution lies in the PBSPro User Guide, §6.2.6.1.
Setting
export I_MPI_HYDRA_BOOTSTRAP=rsh
export I_MPI_HYDRA_BOOTSTRAP_EXEC=pbs_tmrsh
fixed my issue. I was misled by the fact that we don't have `rsh` installed.
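For anyone else hitting this, a minimal job script applying the fix could look like the sketch below (the resource requests and process count are hypothetical; the two exports are the actual fix):

#!/bin/bash
#PBS -l select=2:ncpus=4:mpiprocs=4
#PBS -l walltime=00:15:00
# Launch the pmi_proxy processes through PBS's task manager interface
# (pbs_tmrsh) instead of ssh, so no Kerberos ticket forwarding is needed
# on the compute nodes (PBSPro User Guide §6.2.6.1).
export I_MPI_HYDRA_BOOTSTRAP=rsh
export I_MPI_HYDRA_BOOTSTRAP_EXEC=pbs_tmrsh
nb_procs=8   # hypothetical process count
mpirun -n "$nb_procs" "${PBS_O_WORKDIR}/echo-node.sh"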
