We are intermittently seeing this error message when running an MPI job with the latest MPI Run-Time Library V4:
/usr/diags/mpi/impi/4.1.1.036/bin64/mpiexec -genv LD_LIBRARY_PATH /usr/diags/mpi/impi/4.1.1.036/lib64 -machinefile /tmp/mymachlist.103060.run -n 32 /usr/diags/mpi/intel/intel/bin/olconft.intel RUNTIME=2
mpdroot: cannot connect to local mpd at: /tmp/mpd2.console_root
probable cause: no mpd daemon on this machine
possible cause: unix socket /tmp/mpd2.console_root has been removed
mpiexec_A00A6D99 (__init__ 1524): forked process failed; status=255
Any idea what causes this error, or can you help us determine the exact reason for the fork failure?
Thanks.
Hi Richard,
Are you able to run with Hydra? Please use either mpirun or mpiexec.hydra instead of mpiexec, and all of the other options should remain the same.
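For reference, a sketch of what the original invocation would look like under Hydra, keeping the paths and options from the first post unchanged:

```shell
# Same job as before, launched with the Hydra process manager
# instead of the MPD-based mpiexec
/usr/diags/mpi/impi/4.1.1.036/bin64/mpiexec.hydra \
    -genv LD_LIBRARY_PATH /usr/diags/mpi/impi/4.1.1.036/lib64 \
    -machinefile /tmp/mymachlist.103060.run \
    -n 32 /usr/diags/mpi/intel/intel/bin/olconft.intel RUNTIME=2
```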
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
I updated our olconft Perl script to use Hydra instead of MPD by setting this environment variable before executing mpdboot:
$ENV{"I_MPI_PROCESS_MANAGER"} = "hydra";
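The same selection can be made from the shell before launching, which may be convenient for testing outside the Perl wrapper:

```shell
# Equivalent shell setting; Intel MPI reads this at startup to pick
# the Hydra process manager instead of MPD
export I_MPI_PROCESS_MANAGER=hydra
```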
olconf Start time: Mon Sep 23 15:09:48 CDT 2013
Running /usr/diags/mpi/intel/intel/bin/olconft.intel on nodes: A00A6D61.
/usr/diags/mpi/impi/4.1.1.036/bin64/mpiexec -genv LD_LIBRARY_PATH /usr/diags/mpi/impi/4.1.1.036/lib64 -machinefile /tmp/mymachlist.42867.run -n 32 /usr/diags/mpi/intel/intel/bin/olconft.intel RUNTIME=2
mpdroot: cannot connect to local mpd at: /tmp/mpd2.console_root
probable cause: no mpd daemon on this machine
possible cause: unix socket /tmp/mpd2.console_root has been removed
mpiexec_A00A6D61 (__init__ 1524): forked process failed; status=255
Error: Return Status for /usr/diags/mpi/intel/intel/bin/olconft.intel is: 65280
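The return status 65280 above is a raw wait status, not a plain exit code; a quick sketch of decoding it (shell arithmetic, the same layout Perl's `$?` uses):

```shell
# A wait status packs the exit code into the high byte and the
# terminating signal (if any) into the low 7 bits.  65280 = 0xFF00.
status=65280
echo "exit code: $(( status >> 8 ))"    # 255 -- matches "forked process failed; status=255"
echo "signal:    $(( status & 127 ))"   # 0   -- the process was not killed by a signal
```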
We found that by removing this sysctl command from our MPI execution script the MPD problem was eliminated:
sysctl -w vm.drop_caches=3
This command drops the page cache and the dentry/inode caches; it was added as a workaround for an unrelated problem about a year ago. We do not know why running it causes the fork error reported as the MPD connect failure.
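For reference, the values accepted by this sysctl, per the Linux kernel's sysctl documentation:

```shell
# vm.drop_caches accepts:
#   1 - drop the page cache
#   2 - drop reclaimable slab objects (dentries and inodes)
#   3 - drop both
# Only clean caches are dropped; dirty pages are untouched, so running
# sync first frees the most memory.
sync
sysctl -w vm.drop_caches=3
```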
However, we also are intermittently seeing a second error related to MPD:
mpdboot failed: Inappropriate ioctl for device at /usr/diags/bin/../lib/mpi_setup.pm line 173.
when attempting to execute this mpdboot command from our Perl script:
system("$MPDBOOT --rsh=/usr/bin/ssh --totalnum=$nhosts_plus1 --file=$NODE_LIST_FILE > /dev/null 2>&1") && die "mpdboot failed: $!";
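One thing worth noting about the error text: after `system()`, Perl's `$!` is only meaningful when `system()` itself returned -1 (the fork or exec failed); in other cases it can hold a stale errno, and "Inappropriate ioctl for device" is a common leftover value. A sketch of decoding the wait status in `$?` instead, reusing the variables from the call above:

```perl
# Hypothetical rewrite: report the wait status rather than $!,
# which is stale unless system() itself returned -1.
my $rc = system("$MPDBOOT --rsh=/usr/bin/ssh --totalnum=$nhosts_plus1 --file=$NODE_LIST_FILE > /dev/null 2>&1");
if ($rc == -1) {
    die "mpdboot could not be launched: $!";   # errno IS meaningful here
} elsif ($rc & 127) {
    die sprintf("mpdboot killed by signal %d\n", $rc & 127);
} elsif ($rc != 0) {
    die sprintf("mpdboot exited with status %d\n", $rc >> 8);
}
```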
This problem also goes away when the sysctl command is removed, but it still occurs when the sysctl command is present, even when running under Hydra.
In summary, running the sysctl command to clear caches intermittently causes two distinct MPD-related errors, and removing it eliminates both.
Leaving the sysctl command in our script and running with Hydra eliminates the connect failure caused by the fork error, but not the mpdboot ioctl error.