`mpiexec -genv I_MPI_PERHOST 1 -n 2 hostname`
`mpiexec -genv I_MPI_PERHOST 1 -n 2 ./testcpp`
`mpirun -r ssh -f ./mpd.hosts -genv I_MPI_PERHOST 1 -n 2 ./testcpp`
```
Running on the host n2114.nodes
This job runs on the following processors:
n2114.nodes n2114.nodes n2113.nodes n2113.nodes n2112.nodes n2112.nodes
running mpdallexit on n2114.nodes
LAUNCHED mpd on n2114.nodes via
RUNNING: mpd on n2114.nodes
LAUNCHED mpd on n2113.nodes via n2114.nodes
LAUNCHED mpd on n2112.nodes via n2114.nodes
mpdboot_n2114.nodes (handle_mpd_output 730): Failed to establish a socket connection with n2112.nodes:43606 : (111, 'Connection refused')
mpdboot_n2114.nodes (handle_mpd_output 747): Failed to connect to mpd on n2112.nodes
```

Does anybody know how to fix this problem using only user access to the cluster?
From the first part of your question, it seems to me that a firewall doesn't allow a connection to be established between mpiexec and mpd. To get more info, you can use the --verbose switch.
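A sketch of what I mean (assuming the mpd toolset from your post is on your PATH; `mpdcheck` ships with it and can test raw socket connectivity between two nodes, the node count of 3 matches your mpd.hosts):

```
# re-run the boot step with more diagnostics
mpdboot -r ssh -f ./mpd.hosts -n 3 --verbose --debug

# test whether a plain socket connection works at all between the nodes:
# on n2112.nodes:
mpdcheck -s                              # prints a host and port to use below
# on n2114.nodes:
mpdcheck -c n2112.nodes <port_printed_above>
```

If `mpdcheck -c` also gets "Connection refused", the problem is in the network/firewall setup, not in MPI itself.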
Could you send the log file from the cluster slave node? It is the most interesting file here.
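By default the mpd log usually ends up in /tmp on each node; the exact name depends on your username and setup, so treat this path as an assumption:

```
# path follows the default mpd naming scheme; adjust <your_username>
ssh n2112.nodes cat /tmp/mpd2.logfile_<your_username>
```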
Second part: mpd itself is smart enough to rebuild the mpd ring. The message "with the right neighboring mpd daemon was lost; attempting to re-enter the mpd ring" means that a new ring will be created. In your case only one node is left, so your task will be executed on one node only.
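You can check which daemons are actually in the ring after a node drops out:

```
# lists the hosts (and, with -l, ports) of the mpds currently in the ring
mpdtrace -l
```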
You can start all processes on one node, no problem, but the performance won't be as good as when you start them in parallel across several nodes.
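For example, to run everything on the local node only (a sketch; the rank count of 6 is just illustrative):

```
# start a one-node ring on the local host, then place all ranks there
mpdboot -n 1
mpiexec -n 6 ./testcpp
mpdallexit
```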
You can write to me directly: dmitry.kuzmin (at) intel.com