Vague error message while executing MPI-Fortran program

Michael_M_2 · ‎10-24-2017

Dear all,

When compiling and running a Fortran program on Linux (OpenSUSE Leap 42.3), I get an obscure error message stating that a "Boundary Run-Time Check Failure" occurred for variable "ARGBLOCK_0.0.2". I neither know nor use this variable in my code, and the traceback points to the line of a "CONTAINS" statement in a module.

I am using the Intel Fortran Compiler from Intel Composer XE 2013 with the following options:

ifort -fPIC -g -traceback -O2 -check all,noarg_temp_created -warn all

Furthermore, the program uses Intel MKL with the functions

DGETRF, DGETRS, DSYGV, DGEMM, DGGEV

The complete error message looks like:

Boundary Run-Time Check Failure for variable 'ARGBLOCK_0.0.2'

forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source             
libc.so.6          00007F2BF06CC8D7  Unknown               Unknown  Unknown
libc.so.6          00007F2BF06CDCAA  Unknown               Unknown  Unknown
geops              00000000006A863F  Unknown               Unknown  Unknown
libmodell.so       00007F2BF119E54D  strukturtest_mod_         223  strukturtest_mod.f90
libmodell.so       00007F2BF1184056  modell_start_             169  modell_start.f90
geops              000000000045D1A3  Unknown               Unknown  Unknown
geops              000000000042C2C6  Unknown               Unknown  Unknown
geops              000000000040A14C  Unknown               Unknown  Unknown
libc.so.6          00007F2BF06B86E5  Unknown               Unknown  Unknown
geops              000000000040A049  Unknown               Unknown  Unknown

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

The program has the following structure:
- basic functions linked into static library (*.a), containing only modules --> using MKL routines
- main program linked into a dynamic library, containing one bare subroutine and otherwise only modules
- calling program (executed with mpiexec), calls mentioned subroutine in main program

Without the calling program (in Open MPI) the subroutine runs without problems. But when invoking it with the MPI program I get the error message above.

Maybe some of you have encountered a similar problem and are able to help me. I would be really grateful.

Thanks,

Michael

jimdempseyatthecove · ‎10-24-2017

The error message indicates that a runtime check detected an array being accessed outside its bounds.

Try running the program under MPI, specifying 1 process. If that runs, then I suspect a programming error where you partitioned the work by the number of ranks (number of processes), yet each rank attempts to iterate over the entire size of some array that was split. The ARGBLOCK_0.0.2 sounds like you are using the new Fortran BLOCK / ENDBLOCK sections for code. I'd start by looking for something wrong in those BLOCKs.
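To illustrate the kind of partitioning mistake described above, here is a minimal sketch. It does not use MPI at all: a plain loop stands in for the ranks, and the names local_bounds, n and nranks are hypothetical, not taken from the original program. Each "rank" allocates only its own slice and must loop over that slice only.

```fortran
program partition_sketch
  implicit none
  integer, parameter :: n = 10        ! hypothetical global array size
  integer, parameter :: nranks = 3    ! hypothetical number of MPI processes
  integer :: rank, lo, hi, i
  real(kind=8), allocatable :: local(:)

  do rank = 0, nranks - 1             ! stands in for the rank MPI would report
    call local_bounds(n, nranks, rank, lo, hi)
    allocate(local(lo:hi))            ! each rank owns only its slice
    do i = lo, hi                     ! correct: loop over the local slice
      local(i) = real(i, kind=8)
    end do
    ! Writing "do i = 1, n" here instead would index outside local(lo:hi)
    ! whenever the slice is smaller than 1..n, which -check bounds reports.
    print '(a,i0,a,i0,a,i0)', 'rank ', rank, ' owns ', lo, ' .. ', hi
    deallocate(local)
  end do

contains

  ! Split 1..n into nranks contiguous slices; the first mod(n, nranks)
  ! ranks receive one extra element.
  subroutine local_bounds(n, nranks, rank, lo, hi)
    integer, intent(in)  :: n, nranks, rank
    integer, intent(out) :: lo, hi
    integer :: base, extra
    base  = n / nranks
    extra = mod(n, nranks)
    lo = rank * base + min(rank, extra) + 1
    hi = lo + base - 1
    if (rank < extra) hi = hi + 1
  end subroutine local_bounds

end program partition_sketch
```

With a single process the slice is the whole array, so a loop over 1..n would be harmless there, which is why the 1-process run is a useful test.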

Jim Dempsey

Michael_M_2 · ‎11-05-2017

Thank you for the hint. I tried it out but the error still appeared in the same way.

What I did now was to change some code within my basic functions library. I had a module with PRIVATE variables (declared with the PRIVATE attribute) whose values changed according to the kind of calculation. That is, some special procedures were called to set up the private variables, so that a main procedure could use this special setup to run a general routine.

Maybe the example code can somehow illustrate my intent.

MODULE my_module

...
REAL*8, PRIVATE :: some_variables
!~ <comprising REALs, INTEGERs, ARRAYs, ...>
...

CONTAINS


SUBROUTINE general_sub(return_arg)
...
  REAL*8, INTENT(out) :: return_arg
...
  !~ <do some special things with "some_variables">
  return_arg = 2.d0 * some_variables
...
END SUBROUTINE


SUBROUTINE special_sub1(some_variables_arg1)

  REAL*8, INTENT(in) :: some_variables_arg1
...
  some_variables = some_variables_arg1
  !~ <assigning argument values to private variables, allocating-deallocating of arrays included>
...
  CALL general_sub(...)
...

END SUBROUTINE


SUBROUTINE special_sub2(some_variables_arg2)

  REAL*8, INTENT(in) :: some_variables_arg2
...
  some_variables = some_variables_arg2
  !~ <assigning argument values to private variables, allocating-deallocating of arrays included>
...
  CALL general_sub(...)
...

END SUBROUTINE



END MODULE

Now I have changed it to avoid those private variables by passing the values to the main procedure (in this module) as arguments, and it works.
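A minimal sketch of what the argument-passing variant might look like, reusing the names from the example above. The extra return_arg on special_sub1, the demo program and the value 1.5 are assumptions made only so the sketch compiles and runs; the real routines carry more state than a single scalar.

```fortran
module my_module
  implicit none
  private
  public :: special_sub1

contains

  ! General routine: everything it needs arrives through the argument list,
  ! so it no longer depends on module-level state set up by another call.
  subroutine general_sub(some_variables, return_arg)
    real(kind=8), intent(in)  :: some_variables
    real(kind=8), intent(out) :: return_arg
    return_arg = 2.d0 * some_variables
  end subroutine general_sub

  ! Special routine: forwards its setup value instead of storing it
  ! in a PRIVATE module variable.
  subroutine special_sub1(some_variables_arg1, return_arg)
    real(kind=8), intent(in)  :: some_variables_arg1
    real(kind=8), intent(out) :: return_arg
    call general_sub(some_variables_arg1, return_arg)
  end subroutine special_sub1

end module my_module

program demo
  use my_module, only: special_sub1
  implicit none
  real(kind=8) :: res
  call special_sub1(1.5d0, res)   ! hypothetical input value
  print *, res                    ! res is 2 * 1.5 = 3.0
end program demo
```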

So maybe you have an idea why using private variables for this special setup was a problem? If not, there is no longer a problem for me. But I would like to understand not only how I solved it but also why it could be solved this way.

Best regards

Michael

jimdempseyatthecove · ‎11-06-2017

You provided too little information to resolve the problem.

Using the original code (with the problem), if you can, run the executable as a non-MPI program in the debugger. When the error occurs it should trap into the debugger, and then you can examine the state of the variables causing the error.

If the error does not show up in the debugger but shows up when running without the debugger, then you can use the traceback to help identify the section of code causing the error. It is a little harder to find the error this way, because you have to inspect the code visually. In your original post you had:

Boundary Run-Time Check Failure for variable 'ARGBLOCK_0.0.2'

forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source             
libc.so.6          00007F2BF06CC8D7  Unknown               Unknown  Unknown
libc.so.6          00007F2BF06CDCAA  Unknown               Unknown  Unknown
geops              00000000006A863F  Unknown               Unknown  Unknown
libmodell.so       00007F2BF119E54D  strukturtest_mod_         223  strukturtest_mod.f90
libmodell.so       00007F2BF1184056  modell_start_             169  modell_start.f90
...

The Boundary Run-Time Check Failure is generally a subscript out of bounds error. This can be an actual occurrence of indexing an array out of bounds, or it can be something that, to the runtime system, looks like an array indexing out of bounds.

With the above dump, disregard the Source Unknown entries for the libc.so.6 lines.
geops has neither a line number nor a source file. It may be a procedure from a third-party (static) library, or a procedure you wrote but compiled without traceback information. This makes it harder to determine what inside geops caused the error. Lacking that information, look higher up the call stack (lower down in the traceback listing above) to find what the caller is passing, then deduce the error from there. From the description of the error, you are likely passing incorrect arguments at line 223 in strukturtest_mod.f90.

From the name of the procedure "geops" I will guess that this is a routine to obtain geo-positioning information, though I could be wrong. If it is, you may be passing a Fortran CHARACTER string (which is not NULL terminated) to a C function that expects a NULL terminated string. You may have forgotten to TRIM the trailing spaces from the Fortran string and then append a null character.

This is just a guess.
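If that guess applies, the usual pattern looks roughly like the sketch below. It calls the C library strlen only as a stand-in for whatever C routine is actually involved; the variable names and the string 'station_01' are hypothetical. It also assumes that the default character kind and c_char are the same kind, which holds for the common compilers.

```fortran
program cstring_sketch
  use, intrinsic :: iso_c_binding, only: c_char, c_null_char, c_size_t
  implicit none

  interface
    ! C library strlen, used here only to show what the C side sees.
    function c_strlen(s) bind(c, name='strlen') result(n)
      import :: c_char, c_size_t
      character(kind=c_char), intent(in) :: s(*)
      integer(c_size_t) :: n
    end function c_strlen
  end interface

  character(len=32) :: name                          ! blank-padded Fortran string
  character(kind=c_char, len=:), allocatable :: c_name

  name = 'station_01'
  ! Drop the trailing blanks, then append the terminator C expects.
  c_name = trim(name) // c_null_char

  print *, 'Fortran length (padded): ', len(name)
  print *, 'C length (terminated):   ', c_strlen(c_name)
end program cstring_sketch
```

Passing name directly would hand the C function 32 characters with no terminator, so it would keep reading past the end of the string.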

Jim Dempsey