I have a strange error that I am not sure how to deal with. I have a scientific numerical code that runs in parallel with MPI. If I run it on a single process (one CPU), the code can run essentially forever. However, when I run it on multiple processes, it crashes after a fixed number of iterations (over a million, which is quite a lot); in this case it runs for a couple of days before crashing with the error:
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine     Line     Source
libc.so.6          00000031EDACF3A7  Unknown     Unknown  Unknown
libopen-pal.so.5   00002AF51C68D6CF  Unknown     Unknown  Unknown
libmpi.so.1        00002AF51BB098BD  Unknown     Unknown  Unknown
libmpi.so.1        00002AF51BB36987  Unknown     Unknown  Unknown
libmpi_mpifh.so.2  00002AF51C1F3666  Unknown     Unknown  Unknown
zeusmp.time        0000000000450265  bvalemf2_   672      bvalemf.f
I debugged the code and found that it stalls in one of two calls:
1. MPI_WAIT
2. MPI_ALLREDUCE
What I noticed is that ALL the simulations crash after the same fixed number of iterations. So I believe that the MPI_ALLREDUCE call somehow fills up memory somewhere, and when that memory reaches its maximum, the code crashes.
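If memory really does grow with the iteration count, the growth should be visible in the resident-set size of the MPI ranks long before the crash. A minimal shell sketch to collect that evidence (the binary name zeusmp.time is taken from the traceback above; rss_log.txt is a hypothetical log file name):

```shell
#!/bin/sh
# Append one RSS sample (in kB) per MPI rank to rss_log.txt; call this
# periodically (e.g. every 60 s) while the job runs, then plot the result
# to see whether memory grows steadily toward the node's limit.
sample_rss() {
    # ps -C selects processes by command name; prints "PID RSS" per match
    ps -o pid=,rss= -C "$1" >> rss_log.txt
}
sample_rss zeusmp.time 2>/dev/null || true   # no-op if the job is not running
```

A steadily climbing RSS would support the leak hypothesis; a flat RSS would point back toward an external killer instead.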
Here is my 'ulimit -a' output:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 514818
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
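Note that the limits that matter are the ones enforced on the MPI processes on the compute nodes, which under a batch scheduler can differ from the login-shell 'ulimit -a' shown above (and note the two values above that are not unlimited: open files and max user processes, both 1024). A small Linux-specific sketch to check the kernel-enforced limits from inside a job:

```shell
#!/bin/sh
# /proc/<pid>/limits shows the limits the kernel actually enforces on a
# given process. Here $$ (this shell) stands in for one MPI rank -- in a
# real job you would substitute the PID of a running rank.
grep -E 'Max (open files|processes|resident set)' "/proc/$$/limits"
```

If these differ from the interactive ulimit values, the scheduler's per-job limits are a likely place to look.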
I searched the net to see if others have encountered anything similar, but so far I have been unsuccessful in finding a solution. One suggestion was to compile with -heap-arrays, but this did nothing at all in this case. Any suggestion will be highly welcome.
Compilation options I used: /act/openmpi-1.7/intel/bin/mpif90 -w -FI -c -O2 -Wno-globals -traceback -fpe0 -heap-arrays -L/act/openmpi-1.7/intel/lib -ldl -lpthread -lmpi
and I run it with:
/act/openmpi-1.7/intel/bin/mpirun -n 16
I've seen reports of this behavior before. In all cases it was due to some sort of "watchdog" process killing programs it thought were "runaway".
Right, thanks for your reply. It is eventually killed while in MPI_WAIT or MPI_ALLREDUCE. But what is the best way to prevent this behavior? I have compiled with -CB (check bounds), and I also tried gfortran, with the same result, except that the crash occurs earlier, after fewer (although still substantial, over a million) iterations.
Talk to your system/cluster administrator. It's an outside process that is killing your job.
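One way to confirm that the SIGTERM really comes from outside (a watchdog, the scheduler's wall-clock limit, or the OOM killer) is to wrap the launch in a script that records exactly when the signal arrives, then compare that timestamp with the scheduler's limits and the system logs. A hedged sketch; the log file name is an assumption:

```shell
#!/bin/sh
# Record the exact time an external SIGTERM reaches the job; afterwards,
# compare the timestamp with the batch system's wall-clock limit and
# with syslog/dmesg on the node.
trap 'echo "received SIGTERM at $(date)" >> sigterm.log' TERM
# In a real job, the launch line would follow here, e.g.:
#   /act/openmpi-1.7/intel/bin/mpirun -n 16 ./zeusmp.time
```

If the kill always lands at the same wall-clock offset rather than the same iteration count, that points strongly at an external time limit rather than a memory leak.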