Tracking down random bugs

Dishaw__Jim · ‎07-06-2007

When I compile with the options

/nodebug /check:none /G7 /O2 /Og /QaxN /QxK /Qglobal-hoist /Qip /Qopenmp /Qparallel

I get two types of crashes, and they are truly random. When I run the program against the test suite, it may get through the entire suite or it may crash on some set of parameters, and the set of parameters that causes the crash is not consistent from run to run.

One type of crash is an "Access Violation" and I am not sure where it occurs because the debugging information does not exist.

The other type of crash is a bit more interesting. It occurs in a routine that uses INQUIRE to determine whether a file exists, then does an OPEN followed by a CLOSE with STATUS='DELETE' to remove the file (yes, it is arcane, but it is a legacy piece of code that I do not control). The CLOSE fails because the file does not exist. I typically run this code with all the I/O being done on a network share, so I need to experiment with a local filesystem and see if the error still occurs.
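
A minimal sketch of the INQUIRE / OPEN / CLOSE pattern described above, assuming a routine shaped roughly like the legacy one. The names remove_file and delete_demo.tmp, the unit numbers, and the IOSTAT= specifiers are illustrative assumptions, not taken from the original code. Capturing IOSTAT is one way to have a failed delete reported back to the caller instead of ending the run:

```fortran
! Sketch of the INQUIRE / OPEN / CLOSE(STATUS='DELETE') pattern.
! remove_file, the file name and the unit numbers are hypothetical.
program delete_demo
  implicit none
  integer :: ios

  ! Create a scratch file so there is something to delete.
  open(unit=10, file='delete_demo.tmp', status='replace', action='write')
  write(10, '(a)') 'temporary'
  close(10)

  call remove_file('delete_demo.tmp', ios)
  print *, 'first delete, iostat  =', ios
  call remove_file('delete_demo.tmp', ios)   ! file is already gone
  print *, 'second delete, iostat =', ios

contains

  subroutine remove_file(name, ios)
    character(len=*), intent(in)  :: name
    integer,          intent(out) :: ios
    logical :: exists

    ios = 0
    inquire(file=name, exist=exists)
    if (.not. exists) return

    ! The file may vanish between INQUIRE and OPEN/CLOSE (for example on
    ! a network share), so capture the status rather than let the
    ! runtime abort the program.
    open(unit=11, file=name, status='old', iostat=ios)
    if (ios /= 0) return
    close(unit=11, status='delete', iostat=ios)
  end subroutine remove_file

end program delete_demo
```

A nonzero ios returned from remove_file would mean the OPEN or the CLOSE failed, and the caller can decide whether that matters.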

When I compile the debug version with

/nopdbfile /Od /debug:minimal

I do not get any crashes. My primary question is: what is a good way to identify the location of the "Access Violation"? Would adding /traceback perturb the optimization sufficiently to make the bug go away?

Steven_L_Intel1 · ‎07-06-2007

/traceback won't affect the generated code but MIGHT affect how the program is laid out in memory.

What I usually do is start removing one switch at a time to find out which one is triggering the problem. That doesn't necessarily mean there's a bug in the implementation of the switch, but the fewer variables the better. You've got parallel and CPU dispatch code in there, both of which make significant changes to the generated code.

jimdempseyatthecove · ‎07-06-2007

James,

On the delete file problem: I am experiencing a similar problem with a non-Fortran program (C++) that I wrote back in 1994. It was working fine up until Windows XP, Windows Server 2003, etc. Then it began experiencing problems.

The program is a file synchronization utility. When I push synchronize (e.g. from a local folder to a network shared folder) I now receive random but frequent BAD_NET_PATH error codes from the process, as well as FILE_NOT_FOUND. The push synchronization will complete if I run it several times. A pull synchronization (from remote system to local system) always works.

The systems are running anti-virus software and I have not experimented with turning that off. However, I expect the FILE_NOT_FOUND is due to a lazy delete or directory cache operation in Windows. That is, on the second and later runs the directory scan (FindNext) still sees a file on the remote system that was deleted on a prior pass, and the delete then fails because the file is not actually in the target directory.

On your crash problem, I notice you are using parallel programming (/Qopenmp as well as /Qparallel). An access violation typically occurs from dereferencing a NULL, uninitialized, or trashed pointer. I suspect you have a parallel programming bug. These are hard to find when they only show up in the release build. You can compile the release build with debug information. Full optimization may make the line numbers hard to determine, but the name of the module where the crash occurs will usually show up, and this will get you close to the error. The technique then is to insert trace and diagnostic output into the offending module. Or, if you are lucky, compiling only that module with full debug information will still produce a failing executable.

A "gotcha" that gets me from time to time with parallel programming is forgetting to add ", AUTOMATIC" to small temporary arrays. Example: use "REAL, AUTOMATIC :: VECTOR(3)", or else VECTOR(3) might default to STATIC.
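
A minimal sketch of that gotcha, assuming Intel Fortran with OpenMP enabled. AUTOMATIC is a compiler extension rather than standard Fortran, and the routine name scale_point and the array total are made up for illustration:

```fortran
! Sketch only. AUTOMATIC is a compiler extension (Intel Fortran).
! scale_point and total are hypothetical names.
program automatic_demo
  implicit none
  integer :: i
  real :: total(8)

  !$omp parallel do
  do i = 1, 8
     call scale_point(real(i), total(i))
  end do
  !$omp end parallel do

  print *, total

contains

  subroutine scale_point(x, s)
    real, intent(in)  :: x
    real, intent(out) :: s
    ! Each call (and so each thread) gets its own stack copy of vector.
    ! If vector were static (SAVE, or a compiler default), all threads
    ! would share one copy and could overwrite each other's values.
    real, automatic :: vector(3)

    vector = (/ x, 2.0*x, 3.0*x /)
    s = sum(vector)
  end subroutine scale_point

end program automatic_demo
```

With a static vector the results in total could vary from run to run, which is the kind of symptom that only shows up intermittently.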

Jim Dempsey

Dishaw__Jim · ‎07-10-2007

I think that I have gotten closer to the source of the "Access Violation" bug. I littered my code with WRITE statements and tracked it down to a gemm call in MKL (actually dgemm, because I am using the Fortran 95 interface). I'm guessing the problem lies in gemm() because I have bracketed the call with WRITE statements and I only see one of the two WRITEs.

I'm invoking gemm with the call

CALL gemm(sigma, sub, temp, 'N', 'N', 1.0_dp, 0.0_dp)

The array "temp" was allocated in the same subroutine as the gemm() call as

REAL(dp) :: temp(3*nMax, (3+5*spatial_order)*nMax)

where spatial_order can either be 1 or 0 and nMax can be in the range of 10 to 200. The sigma and sub arrays were passed to the subroutine via the following call

CALL compute_star_matrix(&
   sigmatrix(:,:,region), Asub(:,:,region))

The sigmatrix and Asub matrices were dynamically allocated via the ALLOCATE call as

ALLOCATE(Asub(1:nMax * (2*spatial_order + 1), &
              1:4*(spatial_order + 1) * nMax, &
              1:region_max))
ALLOCATE(sigmatrix(1:nMax * (2*spatial_order + 1), &
                   1:nMax * (2*spatial_order + 1), &
                   1:region_max))
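
As a sanity check on the extents quoted above, here is a small sketch that prints the shapes implied by those formulas and tests whether temp = sigma * sub conforms. It assumes sigma and sub arrive in compute_star_matrix with the full allocated extents, which the post does not show. The program name and the nMax value of 10 are arbitrary choices:

```fortran
! Sketch: print the extents implied by the declarations in this post.
! Assumes sigma and sub arrive with the full allocated extents.
program gemm_shapes
  implicit none
  integer, parameter :: nMax = 10      ! any value in 10..200 works
  integer :: spatial_order
  integer :: sig(2), sub(2), tmp(2)

  do spatial_order = 0, 1
     sig = (/ nMax*(2*spatial_order + 1), nMax*(2*spatial_order + 1) /)
     sub = (/ nMax*(2*spatial_order + 1), 4*(spatial_order + 1)*nMax /)
     tmp = (/ 3*nMax, (3 + 5*spatial_order)*nMax /)

     print '(a,i0)', 'spatial_order = ', spatial_order
     print '(3(a,i0,a,i0))', '  sigma ', sig(1), ' x ', sig(2), &
           '   sub ', sub(1), ' x ', sub(2), '   temp ', tmp(1), ' x ', tmp(2)

     ! temp = sigma * sub needs temp to be sig(1) x sub(2)
     if (tmp(1) == sig(1) .and. tmp(2) == sub(2) .and. sig(2) == sub(1)) then
        print '(a)', '  shapes conform'
     else
        print '(a)', '  shapes do NOT conform'
     end if
  end do
end program gemm_shapes
```

With the formulas exactly as quoted, the spatial_order = 1 case conforms (3*nMax x 3*nMax times 3*nMax x 8*nMax), while the spatial_order = 0 extents do not line up, so it may be worth confirming the actual declarations inside the subroutine.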

The program will run and produces results that I have verified are correct. Occasionally, it will apparently terminate in the gemm() routine with the following output

forrtl: severe (157): Program Exception - access violation
Image       PC         Routine     Line      Source
di.dll      1038508A   Unknown     Unknown   Unknown

I decided to replace the gemm call with a MATMUL and I still experience the random crashes. When I have the MATMUL version, I do get a traceback which points to the MATMUL call.

It appears the bug only occurs when spatial_order = 1. If I set up my test cases so that the "spatial_order = 1" test case runs first, I don't experience the random crash. The only difference when it runs first is that no memory has yet been allocated and then deallocated.

It may be possible that I don't have a matched DEALLOCATE call for every ALLOCATE call. I have checked the matrices involved in the multiplication and they are fine. Could this be an alignment problem? Any suggestions on how to proceed? Incidentally, this problem is occurring with the Debug build of my program. I haven't been able to reproduce the problem when I run within the Visual Studio debugger; I only see it when I run from the command line.

Dishaw__Jim · ‎07-10-2007

One quick update, I did get a crash when I moved the "spatial_order =1" test case first.

jimdempseyatthecove · ‎07-10-2007

James,

Since your code appears to be calling compute_star_matrix one region at a time, how about this:

Allocate Asub and sigmatrix with the last index running over 0:region_max+1.

Then fill the 0th and region_max+1 elements with sentinels, such as 123454321 or some other identifying data that would not be part of the data set (do not use all zeros). Do this immediately after allocation.

Then, prior to processing the loops, check the 0th and region_max+1 elements for the sentinels. If the sentinels are not found, then something overwrote the arrays.

Also, prior to DEALLOCATE, change the sentinels to a different sentinel, such as 5432112345.

If, while processing your 1 to region_max loop, you discover a sentinel error and the sentinel reads 5432112345, this indicates a programming error on your part: you are using a DEALLOCATED buffer.

If you see some other junk you could still be using a DEALLOCATED buffer or something in the library you called is overwriting the buffer.
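
The guard-plane steps above can be sketched as follows. The array extents and the helper name check_guards are assumptions for illustration; the sentinel values are the ones suggested in the post:

```fortran
! Sketch of the guard-plane idea. Sentinel values follow the post;
! the array extents and check_guards are hypothetical.
program sentinel_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp), parameter :: live_mark = 123454321.0_dp
  real(dp), parameter :: dead_mark = 5432112345.0_dp
  integer, parameter :: n = 4, region_max = 3
  real(dp), allocatable :: Asub(:,:,:)

  ! Two extra planes: index 0 and index region_max+1.
  allocate(Asub(n, n, 0:region_max+1))
  Asub = 0.0_dp
  Asub(:,:,0)            = live_mark
  Asub(:,:,region_max+1) = live_mark

  call check_guards(Asub, 'before loop')

  Asub(1,1,region_max+1) = -1.0_dp      ! simulate a stray write
  call check_guards(Asub, 'after stray write')

  ! Re-mark just before freeing so stale use is recognisable later.
  Asub(:,:,0)            = dead_mark
  Asub(:,:,region_max+1) = dead_mark
  deallocate(Asub)

contains

  subroutine check_guards(a, label)
    real(dp), intent(in) :: a(:,:,0:)
    character(len=*), intent(in) :: label
    integer :: hi

    hi = ubound(a, 3)
    if (all(a(:,:,0) == live_mark) .and. all(a(:,:,hi) == live_mark)) then
       print *, label, ': guards intact'
    else if (any(a(:,:,0) == dead_mark) .or. any(a(:,:,hi) == dead_mark)) then
       print *, label, ': guard holds the "deallocated" mark'
    else
       print *, label, ': guards overwritten'
    end if
  end subroutine check_guards

end program sentinel_demo
```

Both sentinel values are exactly representable in double precision, so the equality tests are safe here. In the real program the check would go just before each call to compute_star_matrix.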

Second approach

Maintain a log (ring buffer) containing the LOC of each region allocated. Upon deallocation, zero out the corresponding LOC entries. Prior to each call of compute_star_matrix, check the ring buffer for an entry with the matching LOC value. If it is not found, then your code is passing a deleted entry into the subroutine.
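
A minimal sketch of such an address log, assuming a compiler that provides the LOC extension (it is not standard Fortran). The module name alloc_log, its routines, and the log size of 64 are hypothetical, and the sketch is not thread-safe as written:

```fortran
! Sketch of an address log. LOC is a common compiler extension (not
! standard Fortran); the module and routine names are hypothetical.
module alloc_log
  use iso_c_binding, only: c_intptr_t
  implicit none
  integer, parameter :: log_size = 64
  integer(c_intptr_t) :: entries(log_size) = 0
  integer :: next_slot = 1
contains
  subroutine log_alloc(addr)
    integer(c_intptr_t), intent(in) :: addr
    entries(next_slot) = addr
    next_slot = mod(next_slot, log_size) + 1     ! wrap around
  end subroutine log_alloc

  subroutine log_free(addr)
    integer(c_intptr_t), intent(in) :: addr
    where (entries == addr) entries = 0
  end subroutine log_free

  logical function is_live(addr)
    integer(c_intptr_t), intent(in) :: addr
    is_live = any(entries == addr)
  end function is_live
end module alloc_log

program loc_demo
  use alloc_log
  use iso_c_binding, only: c_intptr_t
  implicit none
  real, allocatable :: work(:,:)
  integer(c_intptr_t) :: addr

  allocate(work(3,3))
  addr = loc(work)
  call log_alloc(addr)
  print *, 'live after allocate:  ', is_live(addr)

  call log_free(addr)          ! clear the entry just before DEALLOCATE
  deallocate(work)
  print *, 'live after deallocate:', is_live(addr)
end program loc_demo
```

In the real code, is_live(loc(...)) would be tested on the arrays about to be passed to compute_star_matrix, and a .false. result would flag a buffer that has already been released.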

By the way, I assume you have run this with Uninitialized Variables Check and Array Bounds Check enabled.

Often, when you are "absolutely sure" you did everything right, you find that you didn't, e.g. using 1-based indexing for a call expecting 0-based indexing.

Good luck in your bug hunt

Jim Dempsey