Here are the steps, assuming you have two hosts, host1 and host2:
1. Start an mpd ring on the two hosts as a normal user:
export MPD_CON_EXT=1234
mpdboot -n 2 -f $hfile
where $hfile contains the two hosts host1 and host2.
2. On host1, kill -9 the mpd process and all Intel MPI processes in one shot (see the sketch after the list).
3. On host2 (the remote host), you see a leftover mpd.py process.
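A minimal sketch of these steps (the hostfile name, application name, and process patterns are only illustrative):
# assumes passwordless ssh between host1 and host2
cat > hfile <<EOF
host1
host2
EOF
export MPD_CON_EXT=1234
mpdboot -n 2 -f hfile              # step 1: start the 2-node mpd ring as a normal user
pkill -9 -f mpd.py                 # step 2, on host1: kill the local mpd ...
pkill -9 -f my_mpi_application     # ... and the MPI job (hypothetical name) in one shot
ssh host2 'pgrep -fl mpd.py'       # step 3: a leftover mpd.py still shows up on host2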
Is there a way to make the mpd ring exit by itself in a clean way?
Thanks.
- Jin
Could you try out a script like:
export MPD_CON_EXT=1234
mpdboot -n 2 -f $hfile  # (you might need -r ssh)
mpiexec -n NNN ./my_application
mpdcleanup -a
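Or, to make sure the cleanup step runs even if mpiexec fails, you could wrap it in a shell trap (a sketch; it only helps while this script itself is still running):
export MPD_CON_EXT=1234
trap 'mpdcleanup -a' EXIT          # cleanup runs on normal exit and on most failures
mpdboot -n 2 -f $hfile -r ssh
mpiexec -n NNN ./my_application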
I hope it works.
Regards!
Dmitry
If the first node crashes, I would not be able to know the MPD_CON_EXT on the other nodes.
mpdcleanup -a may terminate other mpd rings which do not include the first node.
The key reason for using MPD_CON_EXT is to start an individual mpd ring for each application.
So it seems like mpdcleanup -a may not work for me.
What I expect is that the mpd ring terminates itself automatically if one mpd in the ring
stops responding, and that each mpd process exits in a clean way, with no leftover processes.
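To illustrate the setup (job names are only examples):
# each application gets its own ring
export MPD_CON_EXT=jobA
mpdboot -n 2 -f hfile_A
mpiexec -n 4 ./app_A
mpdallexit                  # should tear down only jobA's ring, via its MPD_CON_EXT console

# meanwhile another job runs its own ring on overlapping hosts
export MPD_CON_EXT=jobB
mpdboot -n 2 -f hfile_B
mpiexec -n 4 ./app_B
# running 'mpdcleanup -a' on a shared host would kill both rings at once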
Thanks.
- Jin
It seems I didn't understand the problem.
If you need to run each application with an individual ring, you can use the 'mpirun' utility. It uses a unique MPD_CON_EXT internally, creates an mpd ring, and destroys that ring when the application is finished. All mpds related to this task will be killed automatically.
Other mpds will not be affected.
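Roughly, per invocation mpirun does something like this internally (a simplified sketch, not the exact script):
export MPD_CON_EXT=`date +%y%m%d.%H%M%S`    # unique extension per run
mpdboot -n $np_boot -f $hosts_file -r ssh   # private ring for this run
mpiexec -n $np ./my_application
mpdallexit                                  # teardown step: this run's ring is destroyed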
If you create your own MPD_CON_EXT and start a new ring using 'mpdboot', the mpds will live until you kill them. They will NOT be killed automatically.
If one mpd stops responding, your mpd ring will just be one node smaller.
Would you like to change the existing behaviour?
Regards!
Dmitry
I checked mpirun; it's a script, and it does the same thing:
it constructs a host file, sets a unique MPD_CON_EXT, and then calls the mpdboot command.
In this case, if the first host crashes (where mpirun is running),
I'd assume I would still see the same problem I described: leftover mpd.py processes
on the other hosts. Is that right? Is that behavior by design?
Thanks!
For reference, here is the relevant part of the mpirun script:
# Start an exclusive MPD ring by setting an unique MPD_CON_EXT variable
if [ -n "$ENVIRONMENT" -a -n "$QSUB_REQID" -a -n "$QSUB_NODEINF" ] ; then
    export MPD_CON_EXT=$QSUB_REQID # Called under Fujitsu NQS
else
    export MPD_CON_EXT=`date +%y%m%d.%H%M%S`
fi
#echo "mpdboot -n $np_boot $hosts_opt $other_mpdboot_opt"
#echo "HOSTFILE:"
#cat $hosts_file
mpdboot -n $np_boot $hosts_opt $other_mpdboot_opt
#mpdtrace
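So if the node running mpirun crashes, nothing tears this ring down, and the remote host seems to need manual cleanup, e.g. (an illustrative workaround, not something mpirun does):
ssh host2 'pgrep -fl mpd.py'      # confirm the leftover mpd
ssh host2 'pkill -9 -f mpd.py'    # remove it by hand (note: this kills every mpd.py on host2)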
If you are talking about abnormal termination then 'yes' - mpirun will not be able to stop all mpds, and the mpds don't know that they need to be killed.
According to the existing logic, a node (mpd) can disappear from a ring at any time - no problem.
In version 4.0 of the Intel MPI Library there is an experimental process manager called Hydra.
You can run 'mpiexec.hydra' instead of 'mpirun' (all the other parameters are the same). All processes running on remote nodes should be killed automatically in case of any abnormal termination of the application. Please give it a try.
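For example (the options below assume the usual Hydra syntax; check 'mpiexec.hydra -help' for your build):
mpiexec.hydra -f $hfile -n NNN ./my_application
# no mpd ring is involved; if the application or a node dies, the remote
# processes should be cleaned up by Hydra itself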
Regards!
Dmitry