Our lab recently added another HPC cluster and upgraded most of the other clusters from RHEL 6 to RHEL 7. The PBS system was also updated to a newer supported version. My codes implement signal handlers to effect a controlled shutdown, with or without checkpoints. I have been using the 2015 version of the Intel tool chain, in particular Intel MPI, to run the jobs. I use SIGUSR1 and SIGUSR2. Before the upgrade (late December 2016) this all worked correctly; now it does not. My code always works with OpenMPI as well (tested on many different systems, including large national lab computers).
After the upgrade, SIGUSR1 is not passed to the MPI processes and SIGUSR2 crashes the code. I have tested this with a very small program that just catches the signals and prints something. The behavior is independent of PBS (i.e., it also occurs when running the job interactively from a head node on the cluster). Note that the Intel tool chain, and presumably the MPI libraries within it, have not changed! I cannot find anything on using these signals, other than one web post from years ago (2013?) indicating it's a "bug". Also, SIGINT and SIGTERM are passed, even though the manual (for hydra exec variables, https://software.intel.com/en-us/node/528782) indicates that they should not be. Changing I_MPI_JOB_SIGNAL_PROPAGATION has no effect on this behavior.
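For anyone who wants to reproduce this, a test program of the kind described above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual test code, and it makes no MPI calls; the names `received`, `handler`, `install_handlers`, and `main` are hypothetical. It assumes a POSIX system where `signal.pause()` is available.

```python
import os
import signal

# Signal numbers recorded by the handler, in arrival order (illustrative name).
received = []


def handler(signum, frame):
    # Keep the handler trivial: only record which signal arrived.
    received.append(signum)


def install_handlers():
    # Catch the two user signals plus the usual termination signals.
    for sig in (signal.SIGUSR1, signal.SIGUSR2, signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, handler)


def main():
    install_handlers()
    print(f"pid {os.getpid()} waiting for signals", flush=True)
    reported = 0
    while True:
        signal.pause()  # sleep until some signal is delivered
        # Report every signal recorded since the last wake-up.
        while reported < len(received):
            name = signal.Signals(received[reported]).name
            print(f"pid {os.getpid()} caught {name}", flush=True)
            reported += 1
            if name in ("SIGINT", "SIGTERM"):
                return  # controlled shutdown


if __name__ == "__main__":
    main()
```

One way to use a program like this is to start it under the MPI launcher, send a signal (for example with `kill -USR1`) to the launcher process, and check whether each rank prints a "caught" line. If the ranks print nothing, the signal presumably never reached them.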
1. Does anyone understand what is going on?
2. What is Intel MPI's position on passing USR1 and USR2 to the MPI processes? Why would these ever be blocked or not propagated? Even if something else used these signals (like a checkpoint library), my handler should catch the signals first and supersede all others.
3. Why do some hydra parameters not work as advertised?
Thanks ... John G. Shaw
Laboratory for Laser Energetics
University of Rochester