I'm running WRF with over 300 processes. Sometimes one of the processes crashes, but the other processes keep burning CPU. Is there any way for Intel MPI to terminate the program automatically whenever one of the processes exits?
Thank you very much
Tofu
Hi Tofu,
The default behavior should be for the entire job to end if one rank fails. Is something different happening?
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
When WRF starts up, every process needs to access a directory. Sometimes a compute node has not mounted the shared folder and cannot access that directory. The processes on that node crash, but the other processes keep running. Of course we can re-mount the directory to fix the problem, but we wonder whether Intel MPI has any setting that forces all processes to terminate whenever one or more processes crash.
Tofu
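As a workaround for the missing-mount case described above, a pre-flight check can be run before the real job so the launch fails early instead of leaving ranks spinning. The sketch below is only an illustration: the function name `check_shared_dir` is hypothetical, and it only tests whether a directory exists and is readable and searchable on the node where it runs.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of Intel MPI):
# report whether a required directory is accessible on this node.
check_shared_dir() {
    dir="$1"
    node=$(uname -n)
    # -d: exists and is a directory, -r: readable, -x: searchable
    if [ -d "$dir" ] && [ -r "$dir" ] && [ -x "$dir" ]; then
        echo "OK: $node can access $dir"
        return 0
    fi
    echo "FAIL: $node cannot access $dir" >&2
    return 1
}

# When the script is given a path argument, check it and exit with the result.
if [ "$#" -gt 0 ]; then
    check_shared_dir "$1"
    exit $?
fi
```

One way to use it, assuming the script is saved under a name such as `check_dir.sh`, is to run it once on every node through the same launcher (or through ssh) and start WRF only if every node reports OK. How the per-node run is arranged depends on the cluster setup.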
Hi Tofu,
Ending the whole job is the default, but there are settings that prevent it. Setting I_MPI_FAULT_CONTINUE=on allows a program that is designed to handle errors to continue running.
Please send the output from
[plain]env | grep I_MPI[/plain]
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
There is only one variable that starts with I_MPI_, which is I_MPI_ROOT=/opt/intel/...
Tofu
We ran a similar test with Intel's HPL and saw the same thing. Some xhpl_intel64 processes crash (because of the directory access problem), but the rest keep burning CPU. Even with I_MPI_FAULT_CONTINUE=off, the behavior is the same.
Tofu
Hi Tofu,
We do have a utility for cleaning up after a failed job launched from Hydra. To use this, set I_MPI_HYDRA_CLEANUP=enable before the run. A cleanup file will be generated in /tmp (you can set I_MPI_TMPDIR to a different path to change this). The file will be named mpiexec_${username}_$PPID.log. Run the command
[plain]mpicleanup -i <cleanup file>[/plain]
to clean up after the failed job. If the job completes successfully, the file will automatically be deleted.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
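The cleanup steps described above could be wrapped in a small script along these lines. This is a sketch under stated assumptions, not a documented recipe: the helper names `cleanup_file_path` and `run_with_cleanup` are hypothetical, the file name pattern is taken from the post, and the PID in the name is assumed to be that of the shell that starts the launcher, which may not hold on every installation.

```shell
#!/bin/sh
# Sketch: run a launcher command with Hydra cleanup enabled and,
# if it fails, hand the cleanup file to mpicleanup.

# Build the expected cleanup file path.
# $1 = directory, $2 = user name, $3 = PID used in the file name
cleanup_file_path() {
    printf '%s/mpiexec_%s_%s.log\n' "$1" "$2" "$3"
}

# Run the given command; on failure, look for the cleanup file.
run_with_cleanup() {
    I_MPI_HYDRA_CLEANUP=enable
    export I_MPI_HYDRA_CLEANUP
    "$@"
    status=$?
    if [ "$status" -ne 0 ]; then
        # Assumption: this shell ($$) is the parent of the launcher.
        file=$(cleanup_file_path "${I_MPI_TMPDIR:-/tmp}" "$(id -un)" "$$")
        if [ -f "$file" ]; then
            mpicleanup -i "$file"
        else
            echo "no cleanup file found at $file" >&2
        fi
    fi
    return "$status"
}
```

It would be called as `run_with_cleanup mpirun ...` with the usual launcher arguments. Note that this only cleans up after the launcher itself has returned with an error; it does not make a hung job exit.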
We have already been using the -cleanup option instead of I_MPI_HYDRA_CLEANUP, but it doesn't help terminate the MPI program when one of its processes crashes at startup. Setting I_MPI_DEBUG=5 doesn't give any hint either. Are there any other options that would give extra debugging information?
Thank you very much
Tofu