I have a strange error that I am not sure how to deal with. I have a scientific numerical code that runs in parallel with MPI. If I run it on a single process (one CPU), the code can run essentially forever. However, when I run it on multiple processes, it crashes after a fixed number of iterations (over a million, which is quite a lot); in this case it runs for a couple of days before crashing with the error:
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine     Line     Source
libc.so.6          00000031EDACF3A7  Unknown     Unknown  Unknown
libopen-pal.so.5   00002AF51C68D6CF  Unknown     Unknown  Unknown
libmpi.so.1        00002AF51BB098BD  Unknown     Unknown  Unknown
libmpi.so.1        00002AF51BB36987  Unknown     Unknown  Unknown
libmpi_mpifh.so.2  00002AF51C1F3666  Unknown     Unknown  Unknown
zeusmp.time        0000000000450265  bvalemf2_   672      bvalemf.f
I debugged the code and found that it stalls in one of two calls:
1. MPI_WAIT
2. MPI_ALLREDUCE
What I noticed is that ALL the simulations crash after the same fixed number of iterations. So I believe that the MPI_ALLREDUCE call somehow fills up memory somewhere, and when that memory reaches its maximum, the code crashes.
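If memory really does grow with the iteration count, the growth should be visible in the resident-set size of the MPI ranks long before the crash. A minimal shell sketch to collect that evidence (the binary name zeusmp.time is taken from the traceback above; rss_log.txt is a hypothetical log file name):

```shell
#!/bin/sh
# Append one RSS sample (in kB) per MPI rank to rss_log.txt; call this
# periodically (e.g. every 60 s) while the job runs, then plot the result
# to see whether memory grows steadily toward the node's limit.
sample_rss() {
    # ps -C selects processes by command name; prints "PID RSS" per match
    ps -o pid=,rss= -C "$1" >> rss_log.txt
}
sample_rss zeusmp.time 2>/dev/null || true   # no-op if the job is not running
```

A steadily climbing RSS would support the leak hypothesis; a flat RSS would point back toward an external killer instead.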
Here is my 'ulimit -a' output:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 514818
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
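Note that the limits that matter are the ones enforced on the MPI processes on the compute nodes, which under a batch scheduler can differ from the login-shell 'ulimit -a' shown above (and note the two values above that are not unlimited: open files and max user processes, both 1024). A small Linux-specific sketch to check the kernel-enforced limits from inside a job:

```shell
#!/bin/sh
# /proc/<pid>/limits shows the limits the kernel actually enforces on a
# given process. Here $$ (this shell) stands in for one MPI rank -- in a
# real job you would substitute the PID of a running rank.
grep -E 'Max (open files|processes|resident set)' "/proc/$$/limits"
```

If these differ from the interactive ulimit values, the scheduler's per-job limits are a likely place to look.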
I searched the net to see if others have encountered anything similar, but so far I have been unsuccessful in finding a solution. One suggestion was to compile with -heap-arrays, but this did nothing at all in this case. Any suggestion will be highly welcome.
Compilation options I used: /act/openmpi-1.7/intel/bin/mpif90 -w -FI -c -O2 -Wno-globals -traceback -fpe0 -heap-arrays -L/act/openmpi-1.7/intel/lib -ldl -lpthread -lmpi
and I run it with:
/act/openmpi-1.7/intel/bin/mpirun -n 16
I've seen reports of this behavior before. In all cases it was due to some sort of "watchdog" process killing programs it thought were "runaway".
Right, thanks for your reply. It is eventually killed while in MPI_WAIT or MPI_ALLREDUCE. But what is the best way to prevent this behavior? I have compiled with -CB (check bounds), and I also tried gfortran, with the same result, except that the crash occurs earlier, after fewer (although still substantial, over a million) iterations.
Talk to your system/cluster administrator. It's an outside process that is killing your job.
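One way to confirm that the SIGTERM really comes from outside (a watchdog, the scheduler's wall-clock limit, or the OOM killer) is to wrap the launch in a script that records exactly when the signal arrives, then compare that timestamp with the scheduler's limits and the system logs. A hedged sketch; the log file name is an assumption:

```shell
#!/bin/sh
# Record the exact time an external SIGTERM reaches the job; afterwards,
# compare the timestamp with the batch system's wall-clock limit and
# with syslog/dmesg on the node.
trap 'echo "received SIGTERM at $(date)" >> sigterm.log' TERM
# In a real job, the launch line would follow here, e.g.:
#   /act/openmpi-1.7/intel/bin/mpirun -n 16 ./zeusmp.time
```

If the kill always lands at the same wall-clock offset rather than the same iteration count, that points strongly at an external time limit rather than a memory leak.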