I am a little out of my depth here, so bear with me. I am trying to configure mpirun and mpiexec to run software called Materials Studio on a one-node, two-processor, 12-core cluster. The scheduler is PBS. I had everything set up so that I could submit jobs and they would run well, but after a few days I ran into issues where I would get this sort of error:
mpiexec_server.org: cannot connect to local mpd (/tmp/mpd2.console_user); possible causes: 1. no mpd is running on this host 2. an mpd is running but was started without a "console" (-n option)
It seemed like the mpd daemon was set up somehow but eventually terminated. I had some luck adding this to my submission script:
mpdboot -n 1 -f ~/mpd.hosts
nohup mpd &
/data1/opt/MD/Linux-x86_64/IntelMPI/bin/mpiexec -n 6 /data1/opt/MD/2.0/TaskServer/Tools/vasp5.3.3/Linux-x86_64/vasp_parallel
The job now submits and runs properly but times out after 30 minutes or so. I tried adding -r ssh to the end of the mpdboot line, but I am not sure whether that is the right strategy. I am also a little confused about why I need to run this daemon in my script and why I need to supply a hosts file at all; I thought PBS creates that when the job starts. Could anyone please give me some advice on where to go next? Basically, how can I prevent a running job from quitting because of something to do with the MPI daemon?
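For reference, here is roughly what my full submission script looks like at the moment (the PBS directives are from memory, and the resource values and paths are specific to our install):

```shell
#!/bin/bash
#PBS -l nodes=1:ppn=12
#PBS -l walltime=48:00:00

cd "$PBS_O_WORKDIR"

# Start the MPD daemon for this job (this is the part I am unsure about):
mpdboot -n 1 -f ~/mpd.hosts
nohup mpd &

/data1/opt/MD/Linux-x86_64/IntelMPI/bin/mpiexec -n 6 \
    /data1/opt/MD/2.0/TaskServer/Tools/vasp5.3.3/Linux-x86_64/vasp_parallel
```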
Thanks so much for your help!
Try using the following instead:
source /data1/opt/MD/Linux-x86_64/IntelMPI/bin/mpivars.sh
mpirun -n 6 /data1/opt/MD/2.0/TaskServer/Tools/vasp5.3.3/Linux-x86_64/vasp_parallel
Sourcing mpivars.sh from the bin directory of the Intel MPI installation will set up the PATH and LD_LIBRARY_PATH environment variables for you. By using mpirun (or mpiexec.hydra) instead of mpiexec, you will use Hydra, which is simpler and more scalable than MPD. Please let me know if this helps.
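Under PBS, a submission script using Hydra can be as simple as the following sketch (the walltime and resource values are placeholders, and the mpivars.sh path should be adjusted to your installation):

```shell
#!/bin/bash
#PBS -l nodes=1:ppn=12
#PBS -l walltime=24:00:00

cd "$PBS_O_WORKDIR"

# Set up PATH and LD_LIBRARY_PATH for the Intel MPI Library:
source /data1/opt/MD/Linux-x86_64/IntelMPI/bin/mpivars.sh

# Hydra detects the PBS environment and reads $PBS_NODEFILE on its own,
# so no mpdboot, no nohup mpd, and no manual hosts file are needed:
mpirun -n 6 /data1/opt/MD/2.0/TaskServer/Tools/vasp5.3.3/Linux-x86_64/vasp_parallel
```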
Technical Consulting Engineer
Intel® Cluster Tools
Thanks so much for the response! Unfortunately, I can't seem to locate mpivars.sh in that folder or in the lib folder. I think it might be due to the version number; here is what I am told the software was compiled with. I believe it is version 3.2.
OK, try simply removing the calls that start the MPD within your script. Start an MPD ring ahead of time, and use that ring for all of your jobs.
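As a sketch, the one-time setup could look like this (run once on the node, outside any job script; -r ssh assumes passwordless ssh is configured):

```shell
# Start a one-node MPD ring once, ahead of time:
mpdboot -n 1 -f ~/mpd.hosts -r ssh

# Verify that the ring is up:
mpdtrace

# Job scripts then call mpiexec directly against this ring,
# with no mpdboot or "nohup mpd &" lines of their own.

# When the ring is no longer needed, shut it down with:
# mpdallexit
```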
Also, would it be possible to get a version compiled with a current version of the Intel® MPI Library and then try running with that version?
OK, so I think I figured out the issue, and it turned out to be in no way related to a failure in Intel MPI. My administrator set up a cron task several years ago to kill any job matching a certain string, which this program happened to match by dumb luck. Everything now runs beautifully.
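For anyone hitting something similar: a quick way to check for this kind of cleanup job is to list the relevant crontabs and grep for kill commands. The 'pkill -f vasp' pattern mentioned in the comments below is a made-up example, not the actual string in our case:

```shell
# Inspect root's crontab (and your own) for anything that kills processes:
sudo crontab -l | grep -nE '\b(kill|pkill|killall)\b'
crontab -l | grep -nE '\b(kill|pkill|killall)\b'

# Note that something like 'pkill -f vasp' matches against the full command
# line, so any binary whose path merely contains "vasp" would be killed too.
```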
Thanks again for all the help,