I'm using Intel MPI with PBS.
When I send SIGTERM to my job using qdel, mpirun exits immediately, and the program launched by mpirun has no time to finish its cleanup work.
(I already have
if [ "x$PBS_ENVIRONMENT" != x ]; then
    trap "" SIGTERM
fi
in my ~/.profile to prevent any shell from exiting when it gets SIGTERM.)
How can I tell IntelMPI's mpirun not to exit on SIGTERM?
Thanks for posting here.
Personally, I don't understand why you need to send SIGTERM and then run cleanup code.
Anyway, I've tried to kill mpirun (it was actually SIGKILL rather than SIGTERM, but I don't think that matters much):
[user1@mpiserver100 spawn1]$ mpirun -r ssh -f mpd.hosts -n 2 IMB-MPI1 > out_IMB
From another console:
[user1@mpiserver100 spawn1]$ ps xf
  PID TTY      STAT   TIME COMMAND
20989 pts/0    Ss     0:00 -bash
23276 pts/0    R+     0:00  \_ ps xf
14865 pts/6    Ss+    0:00 -bash
23269 pts/0    S      0:00 python /user1/intel/impi/4.0/intel64/bin/mpiexec -n 2 IMB-MPI1
23270 pts/0    Z      0:00  \_ [sh]
23255 ?        S      0:00 python /user1/intel/impi/4.0/intel64/bin/mpd.py --ncpus=1 --myhost=mpiserver100 -e -d -s 2
23271 ?        S      0:00  \_ python /user1/intel/impi/4.0/intel64/bin/mpd.py --ncpus=1 --myhost=mpiserver100 -e -d -s 2
23274 ?        R      0:09  |   \_ IMB-MPI1
23272 ?        S      0:00  \_ python /user1/intel/impi/4.0/intel64/bin/mpd.py --ncpus=1 --myhost=mpiserver100 -e -d -s 2
23273 ?        R      0:09      \_ IMB-MPI1
So, you can see that mpiexec and the application itself are still running: mpirun doesn't forward signals to them. PBS is probably responsible for the problem you mentioned; it seems PBS can kill not only the parent process but all of its children as well. Could you tell me your version of PBS? I'll try to reproduce the problem.
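For reference, forwarding a signal from a launcher-like parent to its child can be sketched in plain bash as below. This is only an illustration of the idea, not how mpirun works internally; `forward_term` is a hypothetical helper name, and whether such a wrapper helps under PBS depends on which processes PBS actually signals.

```shell
#!/bin/bash
# Hypothetical helper (not part of Intel MPI or PBS): run a command and
# forward SIGTERM to it instead of exiting right away.
forward_term() {
    "$@" &                      # start the supervised command in the background
    local child=$!
    local status=0

    # On SIGTERM, pass the signal on to the child; do not exit here.
    trap 'kill -TERM "$child" 2>/dev/null' TERM

    while :; do
        # wait returns early (status > 128) when a trapped signal arrives,
        # so keep waiting until the child has really gone away.
        wait "$child"
        status=$?
        kill -0 "$child" 2>/dev/null || break
    done

    trap - TERM                 # restore the default SIGTERM behaviour
    return "$status"
}
```

It would be called as `forward_term some_command arg1 arg2`, where `some_command` stands for whatever program has to receive the signal and finish its own cleanup.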
I have the same problem.
If I send SIGUSR1, it gets passed to the subprocesses, and they can save their state and shut down cleanly.
If I send SIGINT (Ctrl-C), mpirun exits and my processes get killed without being able to save state. How do I make mpirun ignore all signals and pass them on to the subprocesses?
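To make the "save state and shut down cleanly" part concrete, here is a minimal bash sketch of a process that checkpoints on SIGUSR1. The function name `run_worker`, the state file and the step counter are all made up for illustration; a real MPI program would install the equivalent handler in its own language.

```shell
#!/bin/bash
# Hypothetical worker: checkpoints and exits cleanly when it receives SIGUSR1.
run_worker() {
    local state_file=$1   # where to write the checkpoint (made-up path)
    local max_steps=$2    # upper bound so the sketch always terminates
    local step=0

    # Handler: write the current progress, then exit with status 0.
    trap 'echo "step=$step" > "$state_file"; exit 0' USR1

    while [ "$step" -lt "$max_steps" ]; do
        step=$((step + 1))
        # bash runs a pending trap only after the current foreground
        # command returns, so keep each unit of work short.
        sleep 1
    done
}
```

For example, start it with `run_worker state.txt 3600 &` and later send `kill -USR1 $!`; the state file should then contain the last step reached.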