program hangs when a stack corruption is detected with /check:stack

Karanta__Antti · ‎05-07-2019

I recently turned on /check:stack. It seems to detect stack corruption nicely. If a debugger is attached, execution stops right at the end of the function or subroutine and reports which variable the corruption occurred near.

However, if no debugger is attached, the process hangs. On inspection, I found that the thread where the problem occurred had just died, and this resulted in another thread (of ours) waiting for it forever. Both threads are started from .NET, so I would have expected to see some exception, but did not.
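The hang described above (one thread dies silently, another waits on it forever) can be guarded against on the waiting side. The following is only a minimal sketch in Python, used here as a stand-in for the .NET host; `wait_for_worker`, `crashing_task`, and the timeout values are hypothetical names and numbers chosen for illustration, not part of the original application.

```python
import threading


def wait_for_worker(worker, done, timeout_s=5.0, poll_s=0.1):
    """Wait until `done` is set; give up if the worker dies or the deadline passes."""
    waited = 0.0
    while waited < timeout_s:
        if done.wait(poll_s):
            return "completed"
        if not worker.is_alive():
            # Re-check the event: the worker may have finished just after the wait expired.
            return "completed" if done.is_set() else "worker died"
        waited += poll_s
    return "timed out"


def crashing_task():
    # Stand-in for a native call that ends its thread without ever signalling completion.
    return


done = threading.Event()
worker = threading.Thread(target=crashing_task)
worker.start()
status = wait_for_worker(worker, done)
print(status)
```

With a bounded wait like this, the waiting thread can log the failure and terminate the process instead of blocking indefinitely.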

It would be fine if, for example, the whole process were terminated with appropriate information about the problem written to the console.

Any ideas how I could proceed with this? What kind of exception should the ifort compiled code be raising?

Environment: ifort 2019.3, Windows 10

Steve_Lionel · ‎05-07-2019

Normally it would give an error message, but if the stack is corrupted that can prevent the error from being reported outside of the debugger. In the debugger it gets picked up as it is raised, but otherwise it goes through various error handling processes before dumping out at the end.

Karanta__Antti · ‎05-07-2019

The hang also happened in cases where the stack corruption was not very severe and did not overwrite anything of consequence, e.g. an integer*4 being passed to a routine that wrote it as integer*8. What I mean is that this should not really corrupt any return addresses or the like on the stack, so execution could continue normally.

What could be preventing the diagnostic message from being reported?

The hang is a real problem because it stalls our continuous integration builds and test runs.
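Until the cause of the hang is understood, one way to keep a CI run from stalling is to launch each test executable under a deadline. This is a minimal sketch, assuming the tests can be started as a child process from a Python script; `run_with_deadline` and the sleeping child used as a stand-in for a hung test are hypothetical.

```python
import subprocess
import sys


def run_with_deadline(cmd, timeout_s):
    """Run cmd and return (exit_code, timed_out); a hung child is killed at the deadline."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
        return completed.returncode, False
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired.
        return None, True


# Hypothetical stand-in for a hung test executable: a child that sleeps far past the deadline.
hung_cmd = [sys.executable, "-c", "import time; time.sleep(60)"]
exit_code, timed_out = run_with_deadline(hung_cmd, timeout_s=1.0)
print("timed out" if timed_out else f"exit code {exit_code}")
```

Note that this only terminates the direct child; a test that spawns further processes would need additional cleanup.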

andrew_4619 · ‎05-08-2019

integer*4 being passed to a routine that wrote it as integer*8
As an aside, you should really be catching most of those at compile time: explicit interfaces, interface checking, etc. I pretty much never seem to have stack issues these days.

jimdempseyatthecove · ‎05-10-2019

While the return address might not get corrupted (because the return values are passed into the subroutine by reference), the integer*8 result will (can) overwrite something important that follows the expected integer*4 result, say an index into an array. On return, the next array element referenced after the call (with the argument mismatch) can then be off in never-never land. This can:

a) immediately crash
b) cause a cascade of errors that does not exhibit symptoms until much later
c) be benign (error does not cause incorrect behavior of program)
d) corrupt the stack return address of higher-level calls
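The overwrite described above can be simulated safely with a byte buffer. This is only an illustrative sketch: the frame layout (a 4-byte result slot immediately followed by a 4-byte index) is an assumption, since the compiler decides where locals actually live, and `callee_writes_int64` is a hypothetical stand-in for the mismatched routine.

```python
import struct

# Assumed layout for illustration: 4-byte result slot, then a 4-byte array index.
frame = bytearray(8)
RESULT_OFFSET = 0
INDEX_OFFSET = 4

struct.pack_into("<i", frame, RESULT_OFFSET, 0)  # result, declared as a 4-byte integer
struct.pack_into("<i", frame, INDEX_OFFSET, 3)   # neighbouring local: index = 3


def callee_writes_int64(buffer, offset, value):
    """Stand-in for a routine that treats its argument as an 8-byte integer."""
    struct.pack_into("<q", buffer, offset, value)


callee_writes_int64(frame, RESULT_OFFSET, 42)

result = struct.unpack_from("<i", frame, RESULT_OFFSET)[0]
index = struct.unpack_from("<i", frame, INDEX_OFFSET)[0]
print(result, index)
```

The result slot holds the expected 42, but the upper four bytes of the 8-byte write land on the neighbouring index, which changes from 3 to 0 without any immediate symptom.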

Andrew is right about (at least once) enabling all of the compile time diagnostics, as well as runtime diagnostics.

Jim Dempsey

Karanta__Antti · ‎05-12-2019

I expressed myself unclearly. I did not mean that the overwrite in the described case is harmless. I wanted to point out that it seems unlikely to be the cause for the runtime check to hang the program instead of reporting the error, which was the reason I posted the question starting this thread. Hanging the program makes this very useful check painful to use, especially as it hangs (and already has hung) our continuous integration builds.

I am in the process of turning all the possible checks on. Fixing the found issues does not happen instantly for a large code base.

andrew_4619 · ‎05-13-2019

Karanta, Antti wrote:
I am in the process of turning all the possible checks on. Fixing the found issues does not happen instantly for a large code base.

I fully understand your pain! In my case I found the effort was time well spent, and it pretty much eliminates many really hard-to-find bugs.

jimdempseyatthecove · ‎05-13-2019

Inserting runtime checks moves code around and/or may affect code optimization. When an errant section of code has distant side effects (e.g. writing to code, modifying data it is not intended to modify, corrupting the stack, etc.), the point of execution where the symptom (a crash, say) occurs often changes, or the symptom is no longer observed.

Most mysterious errors (they are only mysterious until found) are programming errors.

Some (few) are compiler generated errors, typically associated with new optimization features. Turning off optimization for the affected source file will usually fix the problem.

And in a few instances the errors are really bizarre and the cause is most difficult to locate. Several years ago I had a case where the program execution didn't make any sense. After many days using the debugger and inspecting assembly code, I could only deduce that either a) the CPU was executing the instruction incorrectly (which was not believable), or b) the instruction displayed in the debugger was not the instruction actually being executed. Fortunately the "instruction" at issue caused an invalid instruction trap.

To catch this error, I instrumented the code to snapshot the series of bytes surrounding the faulting instruction. Then, on program abort, examining the capture buffer showed that a single byte of the instruction stream had been overwritten with 0x03. This byte (when appearing as the first byte in the instruction stream) is the instruction for trap, IOW what is inserted for breakpoints. Note that the debugger should only insert this byte as the first byte of an instruction. In this case it wasn't.

Using the IDE to examine the breakpoints, there was no breakpoint anywhere near this location. My first approach was to delete any breakpoints located in the source file exhibiting the error. This did not work. Then I deleted each of the remaining breakpoints, trying to locate the problematic one. This too did not work. Now I was completely perplexed: the IDE was inserting a breakpoint (at an incorrect location) without any breakpoints specified. When the error occurred, examining the disassembly showed no evidence of the error, yet the trace buffer showed the problem. On a whim, I clicked on the XX in the breakpoints property page, and to my delight, the problem went away.
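The snapshot-and-compare step in that story can be sketched roughly as follows. This is not the instrumentation that was actually used; `diff_snapshots` is a hypothetical helper, and the byte values and base address are made up for illustration.

```python
def diff_snapshots(expected, observed, base_address=0):
    """Return (address, expected_byte, observed_byte) for every byte that differs."""
    if len(expected) != len(observed):
        raise ValueError("snapshots must cover the same byte range")
    return [
        (base_address + i, e, o)
        for i, (e, o) in enumerate(zip(expected, observed))
        if e != o
    ]


# Made-up bytes for illustration only; they are not taken from the program in the story.
expected = bytes([0x48, 0x8B, 0x45, 0xF8, 0x48, 0x83, 0xC0, 0x01])
observed = bytearray(expected)
observed[5] = 0x03  # simulate a single byte overwritten in the middle of an instruction

changes = diff_snapshots(expected, bytes(observed), base_address=0x1000)
for address, was, now in changes:
    print(f"{address:#x}: {was:#04x} -> {now:#04x}")
```

Comparing a known-good copy of the bytes against what is in memory at the time of the fault narrows the search to the exact address that changed, even when the disassembly view shows nothing unusual.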

I hope that the compile-time diagnostics or runtime diagnostics locate your issue. Failing that, experiment with optimizations. If you fall into the IDE issue, then I wish you good luck.

One more note. On a different thread on IDZ, a user had a problem that resulted from calling a 3rd party library with an incorrect argument type: in his case, passing an INTEGER(4) to a subroutine that required an INTEGER(8). The INTEGER(4) was introduced by substituting a literal 1 where the code formerly used a default INTEGER variable, with the default kind set to 8. To complicate this, the subroutine interface was not defined in a module, so the compiler could not check the arguments. Please ensure that your interfaces are correct.

Jim Dempsey

FortranFan · ‎05-13-2019

Karanta,Antti wrote:
.. On inspection, I found that the thread where the problem occurred had just died, and this resulted in another thread (of ours) waiting for it forever. Both threads are started from .NET, so I would have expected to see some exception, but did not. ..
Any ideas how I could proceed with this? What kind of exception should the ifort compiled code be raising? ..

Karanta,Antti wrote:
I expressed myself unclearly. I did not mean that the overwrite in the described case is harmless. I wanted to point out that it seems unlikely to be the cause for the runtime check to hang the program instead of reporting the error, which was the reason I posted the question starting this thread. Hanging the program makes this very useful check painful to use, especially as it hangs (and already has hung) our continuous integration builds.
I am in the process of turning all the possible checks on. Fixing the found issues does not happen instantly for a large code base.

My suggestion is not to fix anything in the "large code base" YET, but to get to the root of the matter with a simpler reproducer: as small a working example as possible, built as a prototype of the "large code base", focusing mainly on the interface(s) between .NET and Fortran and spawning threads from .NET to recreate the stack corruption that results in the "hung" scenario. As alluded to in the other comments, the stack issue is highly likely due to some form of data misalignment in the interface(s) between Microsoft's 'managed memory' environment in .NET and the so-called 'unmanaged' one with Fortran.

See Quote #7 in this other thread: https://software.intel.com/en-us/node/807421#comment-form. More often than not, a trivial example like the one in Quote #7, but with similar interface(s) and calling mechanisms (e.g., STDCALL) in a "large code base", is sufficient to reproduce the problem and to devise a proper resolution to the issue.

Karanta__Antti · ‎05-14-2019

Thanks for all the advice!

To make sure we are speaking of the same thing: I found four stack corruptions using /check:stack. All of them had previously gone unnoticed without causing any observable symptoms (which is not to say they did not cause anything, we just did not notice).

With /check:stack turned on and no debugger attached, all four of the mentioned test cases hung. That's why I suspect there may be something wrong with the notification mechanism, not just our code (which definitely was faulty).

When a debugger was attached (prior to the stack corruption), the problematic spots were very clearly pointed out and easy to fix.

jimdempseyatthecove · ‎05-14-2019

Glad to see you found and fixed the coding problem. Program hangs with no apparent cause are difficult to track down. I hope you found all the problems.

Jim Dempsey