Intel® Fortran Compiler

forrtl: error (78): process killed (SIGTERM)

Asif_u_
Beginner

I have a strange error that I am not sure how to deal with. I have a scientific numerical code that runs on multiple processors. If I run it on a single process (CPU), the code can run essentially forever. However, when I run it on multiple processors it crashes after a fixed number of 'iterations' (over a million, which is quite a lot); in this case it runs for a couple of days before crashing with the error:

forrtl: error (78): process killed (SIGTERM)

Image              PC                Routine            Line        Source             
libc.so.6          00000031EDACF3A7  Unknown               Unknown  Unknown
libopen-pal.so.5   00002AF51C68D6CF  Unknown               Unknown  Unknown
libmpi.so.1        00002AF51BB098BD  Unknown               Unknown  Unknown
libmpi.so.1        00002AF51BB36987  Unknown               Unknown  Unknown
libmpi_mpifh.so.2  00002AF51C1F3666  Unknown               Unknown  Unknown
zeusmp.time        0000000000450265  bvalemf2_                 672  bvalemf.f

I debugged the code and found that it stalls in one of two possible calls:
1. MPI_WAIT
2. MPI_ALLREDUCE

What I noticed is that ALL the simulations crash after a fixed number of iterations. So I believe that the MPI_ALLREDUCE call somehow fills up memory somewhere, and when that reaches its maximum, the code crashes.
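
For reference, here is a minimal sketch of the pattern I am worried about (it is not taken from my code; the message size and the ring-style exchange are just placeholders). Every MPI_ISEND/MPI_IRECV request has to be completed with MPI_WAIT or MPI_WAITALL (or released with MPI_REQUEST_FREE); if a request is left outstanding every iteration, requests accumulate inside the MPI library and its memory use grows with the iteration count. A blocking MPI_ALLREDUCE returns no request object, so it seems a less likely source of a slow leak than an incomplete non-blocking exchange.

! Minimal sketch, NOT from zeusmp: a per-iteration non-blocking exchange
! followed by a reduction.  The point is that both requests are completed
! every iteration, so nothing accumulates inside the MPI library.
program wait_pattern
  implicit none
  include 'mpif.h'
  integer :: ierr, rank, nprocs, iter, left, right
  integer :: reqs(2), stats(MPI_STATUS_SIZE, 2)
  double precision :: sendbuf(100), recvbuf(100)
  double precision :: local_sum, global_sum

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
  left  = mod(rank - 1 + nprocs, nprocs)   ! neighbours in a ring (placeholder topology)
  right = mod(rank + 1, nprocs)
  sendbuf = dble(rank)

  do iter = 1, 1000000
    ! Post the non-blocking boundary exchange ...
    call MPI_IRECV(recvbuf, 100, MPI_DOUBLE_PRECISION, left,  0, &
                   MPI_COMM_WORLD, reqs(1), ierr)
    call MPI_ISEND(sendbuf, 100, MPI_DOUBLE_PRECISION, right, 0, &
                   MPI_COMM_WORLD, reqs(2), ierr)
    ! ... and complete BOTH requests before reusing the buffers.
    ! Waiting on only one of them would leak a request every iteration.
    call MPI_WAITALL(2, reqs, stats, ierr)

    ! One blocking collective per iteration; it returns no request to complete.
    local_sum = sum(recvbuf)
    call MPI_ALLREDUCE(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                       MPI_SUM, MPI_COMM_WORLD, ierr)
  end do

  call MPI_FINALIZE(ierr)
end program wait_pattern

If requests were being leaked this way, the resident memory of each rank should grow steadily over the run; if it stays flat, the problem is probably elsewhere.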

Here is my 'ulimit -a'
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 514818
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited 

I searched the net to see if others have encountered anything similar, but so far I have been unsuccessful in finding a solution. One suggestion was to compile with -heap-arrays, but that did nothing at all in this case. Any suggestions would be highly welcome.

Compilation options I used: /act/openmpi-1.7/intel/bin/mpif90 -w -FI -c -O2 -Wno-globals -traceback -fpe0 -heap-arrays -L/act/openmpi-1.7/intel/lib -ldl -lpthread -lmpi

and I run it with:

/act/openmpi-1.7/intel/bin/mpirun -n 16

Steven_L_Intel1
Employee

I've seen reports of this behavior before. In all cases it was due to some sort of "watchdog" process killing programs it thought were "runaway".

Asif_u_
Beginner

Right, thanks for your reply. The process is eventually killed while inside MPI_WAIT or MPI_ALLREDUCE. But what is the best way to prevent such behavior? I have compiled with -CB (check bounds), and I have also tried gfortran, with the same result, except that the crash occurs earlier, after fewer (although still over a million) iterations.

Steven_L_Intel1
Employee

Talk to your system/cluster administrator. It's an outside process that is killing your job.
