Signal propagation with Intel MPI

John_S_1 · 01-07-2017

Our lab recently added another HPC cluster and updated most of the other clusters from RHEL 6 to RHEL 7. The PBS system was also updated to a newer supported version. My codes implement signal handlers to effect a controlled shutdown, with or without checkpoints. I have been using the 2015 version of the Intel tool chain, in particular Intel MPI, to run the jobs. I use SIGUSR1 and SIGUSR2, and before the upgrade (late December 2016) this all worked correctly. Now it does not. My code has always worked with OpenMPI as well (tested on many different systems, including large national lab computers).

After the upgrade, SIGUSR1 is not passed to the MPI processes and SIGUSR2 crashes the code. I have tested this with a very small program that just catches the signals and prints something. The problem is independent of PBS (i.e., it also occurs when running the job interactively from a head node on the cluster). Note that the Intel tool chain, and presumably the MPI libraries within it, have not changed! I cannot find anything on using these signals, other than one web post from years ago (2013?) indicating it's a "bug". Also, SIGINT and SIGTERM are passed, even though the manual (for hydra exec variables, https://software.intel.com/en-us/node/528782) indicates that they should not be. Changing I_MPI_JOB_SIGNAL_PROPAGATION has no effect on this behavior.
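For reference, a test of this kind can be sketched roughly as below. This is a Python stand-in written for illustration, not the actual test program, and it does not use MPI at all. The names `received`, `on_signal`, and `install_handlers` are hypothetical. The sketch installs handlers for SIGUSR1 and SIGUSR2, then sends both signals to its own process to confirm that the handlers fire.

```python
import os
import signal

# Count of each signal handled by this process (illustrative bookkeeping).
received = {"SIGUSR1": 0, "SIGUSR2": 0}


def on_signal(signum, frame):
    """Record the signal and print a short message identifying it."""
    name = signal.Signals(signum).name
    received[name] += 1
    print(f"pid {os.getpid()}: caught {name}", flush=True)


def install_handlers():
    """Install the same handler for both user-defined signals."""
    for sig in (signal.SIGUSR1, signal.SIGUSR2):
        signal.signal(sig, on_signal)


install_handlers()

# Self-check: deliver both signals to this process directly.
# Under a launcher, the signal would instead be sent to the launcher,
# which may or may not forward it to the ranks.
os.kill(os.getpid(), signal.SIGUSR1)
os.kill(os.getpid(), signal.SIGUSR2)
```

To reproduce the situation described above, the self-check at the end would presumably be replaced by a wait loop (for example, repeated calls to `signal.pause()`), and the signal sent to the launcher process from another shell. If nothing is printed by the ranks, the signal was not propagated.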

1. Does anyone understand what is going on?

2. What is Intel MPI's position on passing SIGUSR1 and SIGUSR2 to the MPI processes? Why would these ever be blocked or not propagated? Even if something else used these signals (like a checkpoint library), my handler should catch the signals first and supersede all others.

3. Why do some hydra parameters (such as I_MPI_JOB_SIGNAL_PROPAGATION) not work as advertised?

tnx ... John G. Shaw

Laboratory for Laser Energetics
University of Rochester