I get the following error on my cluster when I submit jobs:
mpiexec_node050: cannot connect to local mpd (/tmp/mpd2.console_sudharshan); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
I see that this error has been discussed in earlier threads, but in my case it pops up quite unpredictably. A job runs fine with a particular number of processors, and then when I submit it again with a different number of processors, the error comes up. It is not clear under what conditions it occurs: I have also gotten it with the same number of processors that previously ran fine, using the same scripts and the same code. Any suggestions or help would be sincerely appreciated.
Before you execute the mpiexec command, does mpdtrace show the list of all the nodes on which you want to run your job? Are you using the -machinefile option in your mpiexec command?
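
One way to check this automatically in a job script is sketched below. It is only a rough sketch under some assumptions: the function name ring_covers_hosts is made up for illustration, the machinefile is assumed to list one host per line (optionally as host:ncpus), and the host names in it are assumed to be spelled the same way mpdtrace prints them. It also assumes mpdtrace exits with a non-zero status when it cannot reach a local mpd.

```shell
#!/bin/sh
# Hypothetical helper (not part of MPICH2): succeeds only if mpdtrace can
# reach a local mpd AND every host in the machinefile is in the mpd ring.
ring_covers_hosts() {
    machinefile=$1

    # Assumption: mpdtrace fails (non-zero exit) when no local mpd is reachable.
    ring=$(mpdtrace 2>/dev/null) || {
        echo "mpdtrace failed: no mpd reachable on this host?" >&2
        return 1
    }

    # Drop comments and blank lines; reduce "host:ncpus" entries to "host".
    hosts=$(awk '{ sub(/#.*/, ""); if (NF) { split($1, a, ":"); print a[1] } }' "$machinefile")

    for host in $hosts; do
        # Exact, whole-line match against the mpdtrace output.
        printf '%s\n' "$ring" | grep -qxF -- "$host" || {
            echo "host missing from mpd ring: $host" >&2
            return 1
        }
    done
    return 0
}
```

In a submit script it could be used as "ring_covers_hosts hosts.txt && mpiexec -machinefile hosts.txt -n 8 ./a.out", where hosts.txt, the process count, and ./a.out are placeholders for your own values. If the check fails only on some submissions, that would suggest the job sometimes starts on a node that is not part of the ring.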