I'm trying to run Intel MPI-3.2.1 on an SGI Altix Linux cluster under SGE-6.2. The job fails with the following error:
cat output.32.Hello
/var/sge/default/spool/r1i0n12/active_jobs/32.1/pe_hostfile
r1i0n12
r1i0n12
r1i0n12
r1i0n12
r1i0n12
r1i0n12
r1i0n12
r1i0n12
mpdroot: cannot connect to local mpd at: /tmp/32.1.all.q/mpd2.console_root_r1i0n12
    probable cause: no mpd daemon on this machine
    possible cause: unix socket /tmp/32.1.all.q/mpd2.console_root_r1i0n12 has been removed
mpiexec_r1i0n12 (__init__ 1162): forked process failed; status=255
However, if the job is submitted without SGE (i.e. directly from the command line), it works fine on the same set of nodes.
The MPI job is launched with the mpiexec command. The mpds have already been booted by root, and the user has MPD_USE_ROOT_MPD=1 in the .mpd.conf file in his home directory.
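It seems to me that SGE changes the TMPDIR environment variable, and after that mpdroot cannot find the console file: the error shows it looking in /tmp/32.1.all.q, which looks like the per-job temporary directory SGE assigns, rather than in /tmp. Could you set I_MPI_MPD_TMPDIR=/tmp before you create the mpd ring and give it a try? Don't forget to set this variable for the user as well.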
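As a rough sketch of that suggestion, the lines below could go near the top of the SGE job script, before the existing mpiexec call. They print the TMPDIR value the job actually sees, point I_MPI_MPD_TMPDIR at /tmp, and check whether any mpd console socket is present there. The mpd2.console_* pattern is taken from the error message above; I am assuming root's console socket was created under /tmp, so adjust the path if the ring was booted with a different temporary directory.

```shell
# Show the temporary directory assigned to this job (SGE normally overrides TMPDIR).
echo "TMPDIR seen by the job: ${TMPDIR:-unset}"

# Tell the MPD tools to look for the console socket in /tmp instead of $TMPDIR.
# Assumption: root's mpd ring was started with its console socket under /tmp.
export I_MPI_MPD_TMPDIR=/tmp

# Diagnostic only: list any mpd console sockets found in that directory.
ls -l "$I_MPI_MPD_TMPDIR"/mpd2.console_* 2>/dev/null \
  || echo "no mpd console socket found in $I_MPI_MPD_TMPDIR"
```

If the listing shows a console socket for root, the mpiexec line that follows should be able to reach the ring. The same variable also has to be set in root's environment before the mpds are booted, so that both sides agree on the directory.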