My Kingdom for an Error Message

Dogbite · 07-21-2009

My program has now taken to stopping (without completion) without an error message. I have added some tell-tales to the code to try to locate the point of failure, but the failure moves around as I add tell-tales.

I am debugging my OMP implementation, but am setting max processors to one. I got error messages for the previous bug (record number out of range) but not for this current problem.

Is there any way to force the output of a stack trace?

Does the lack of an error message suggest anything about the nature (or location) of the problem?
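One way to make tell-tales more trustworthy when a program exits silently is to flush each one as it is written, so buffered output is not lost when the process dies. The following is a minimal sketch in standard Fortran, not the poster's code; the module name `telltale_mod` and the routine `telltale` are hypothetical. The commented lines assume the Intel-specific `TRACEBACKQQ` routine from the `IFCORE` module, which is not available in other compilers.

```fortran
module telltale_mod
  use, intrinsic :: iso_fortran_env, only: error_unit
  implicit none
contains
  ! Write a tagged marker and flush it at once, so the last line seen
  ! before a silent exit is the last tell-tale that was actually reached.
  subroutine telltale(tag, ival)
    character(len=*), intent(in) :: tag
    integer, intent(in) :: ival
    write (error_unit, '(a,a,a,i0)') 'TT ', tag, ' value=', ival
    flush (error_unit)
    ! Assumption: with Intel Fortran, a stack trace could also be
    ! requested here without stopping the program:
    !   use ifcore                                    (in the declarations)
    !   call tracebackqq(string=tag, user_exit_code=-1)
  end subroutine telltale
end module telltale_mod

program demo
  use telltale_mod
  implicit none
  call telltale('before solver', 1)
  call telltale('after solver', 2)
end program demo
```

Flushing does not stop the failure point from moving if memory is being corrupted, but it does rule out lost buffered output as a cause of misleading tell-tales.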

Steven_L_Intel1 · 07-21-2009

I have seen this when memory is severely corrupted. You might try setting the stack reserve size higher or linking with the DLL libraries to see if that changes the behavior.

Dogbite · 07-22-2009

Well, I hadn't thought of stack size as a possibility. I had set stack reserve to 6,000,000, but had not specified a value for stack commit. Setting them both to 6,000,000 made no difference, however. Neither did setting them both to 60,000,000. For that matter, I get the same results after setting them both to 6.

Ditto for using kmp_set_stacksize_s to set the individual thread stack size to 16000K.

If I understand the (frankly murky) documentation, since most of my thread-specific data is defined in threadprivate common blocks, that wouldn't be placed on the stack anyway. Or should I be setting some reserve values for the heap?

I don't understand your suggestion about linking with the DLL libraries -- which libraries and specified exactly where?

Steven_L_Intel1 · 07-22-2009

Fortran > Libraries > Use Run-Time Library. Change to "Multithreaded DLL".

Alternatively, try adding this under Fortran > Command line > Additional Options: /Qopenmp-link:static

I'm just grasping at straws here. But I have seen programs "just exit" when the stack has been severely corrupted.

Dogbite · 07-22-2009

Well, changing to multithreaded DLL certainly provoked a change: "The system cannot execute the specified program." Oh my.

And the link:static alternative returned the program to its previous error pattern.

Steven_L_Intel1 · 07-22-2009

Hmm - do you have any very large COMMON arrays? You may want to reset the stack settings back to the defaults (or blank).

Dogbite · 07-24-2009

Steve,

The program has many arrays in Common, but I don't know as you'd consider them large.

I have compiled without the OMP translation, and I'm getting error messages, although they still seem a bit squirrely. I think part of what is happening is that the program is more sensitive to memory allocations than before. (Heck, even now the program does no runtime memory allocation -- it's rather staid in that way.)

But I did trace one FP overflow error. In the routines lying within the OMP region, I had dutifully moved all locally preset data to shared common areas where the values would be available to all threads. It turns out that in the original implementation, one of the variables was spelled differently (DRIVDEN vs. DRIVEDEN) in its declaration and in the body of the procedure. This worked before because the compiler assumed implicit declaration, local variables were SAVEd, and everything was jake. Now, even with OMP not active, the compiler is no longer instructed to SAVE locals, and the value on entry to the procedure was, eventually, big enough to produce an overflow. At least that's what I'm telling myself.
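A misspelling like DRIVDEN vs. DRIVEDEN is the kind of error that `implicit none` turns into a compile-time failure instead of a silently created second variable. The sketch below is illustrative only: the function `scale_by_density` and the value assigned are hypothetical, and only the variable name is borrowed from the post.

```fortran
program implicit_demo
  implicit none                 ! internal procedures inherit this
  print *, 'scaled =', scale_by_density(4.0)
contains
  function scale_by_density(x) result(r)
    real, intent(in) :: x
    real :: r
    real :: drivden             ! hypothetical local, name taken from the post
    drivden = 0.5               ! assigned on every entry, no reliance on SAVE
    ! r = x * driveden          ! misspelled name: rejected at compile time
    !                           ! because DRIVEDEN has no declaration
    r = x * drivden
  end function scale_by_density
end program implicit_demo
```

For legacy code where adding `implicit none` to every routine is impractical, Intel Fortran's `/warn:declarations` option should give a similar diagnostic for undeclared names.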

But I'm still getting the curious effect where adding tell-tales changes the results, usually shooting right by previous error points.

But a question: do I understand correctly that each time I enter a routine, the local variables allocated to the stack will have different values? (In one case, the errmsg identifies an invalid FP at a tell-tale where I'm printing out the value of a variable that hadn't yet been set by the routine.)

Steven_L_Intel1 · 07-24-2009

Any local variables that do not have the SAVE attribute will have undefined values on entry to the routine.
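A small illustration of that rule, as a sketch rather than anything from the program under discussion (the names `visit`, `calls`, and `scratch` are hypothetical): a variable with SAVE keeps its value between calls, while one without it must be assigned before it is read.

```fortran
program save_demo
  implicit none
  integer :: i
  do i = 1, 3
    call visit()
  end do
contains
  subroutine visit()
    integer, save :: calls = 0   ! SAVE: value persists from call to call
    integer :: scratch           ! no SAVE: undefined on entry
    scratch = 0                  ! so it is assigned before any read
    calls = calls + 1
    scratch = scratch + calls
    print *, 'call', calls, 'scratch', scratch
  end subroutine visit
end program save_demo
```

Reading `scratch` before the assignment would be the same kind of error as printing a not-yet-set variable in a tell-tale: the value is whatever happened to be at that stack location.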

Dogbite · 07-31-2009

Well, I've spent the interim cleaning up a few uninitialized variables in builds with the OMP translation disabled. (It occurred to me that I should ensure that the changes to data and control mechanisms hadn't been the problem.)

Compiling with the OMP enabled, the program runs with a single thread. (Number of threads is controlled by a run-time input to the program.) Running with two threads, the program gets barely started before it dies -- again without an error message. The stack settings were: reserve = 6000000; commit = 0; kmp_set_stacksize_s(16000).

I increased kmp_set_stacksize_s to (16000000) and Shazam! I've got error messages again. In fact, "(157) Program Exception - access violation" runs right off the top of the window! These messages are followed by a single "(170) Program Exception - stack overflow" statement.

Reading the tattletales, I find that thread0 died shortly after thread1 was initialized. Thread0 put out a standard tt message, thread1 did the same, and the next message from thread0 indicates that a crucial variable (located in threadprivate common) is set at the bad value of zero. In getting to this error message, thread0 passed over a tattletale message conditioned on the same flag (also in a threadprivate common) that prompted the previous tt msg. The bottom line here being that thread0 lost its connection to its threadprivate data.
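For reference, here is a minimal sketch of the threadprivate common pattern in standard OpenMP Fortran. It is not a diagnosis of the failure above; the block name `/tpblk/` and its members are hypothetical. Two points it illustrates: the `threadprivate` directive has to follow the COMMON declaration in every routine that declares the block, and `copyin` is what gives the other threads the master thread's values on entry to the parallel region.

```fortran
program tp_demo
  use omp_lib
  implicit none
  integer :: flag, crucial
  common /tpblk/ flag, crucial
  !$omp threadprivate(/tpblk/)   ! repeat after every declaration of /tpblk/

  flag = 1                       ! set in the master thread's copy
  crucial = 42

  ! copyin copies the master thread's values into each thread's own copy;
  ! without it, the copies in the other threads start out undefined.
  !$omp parallel num_threads(2) copyin(/tpblk/)
  !$omp critical
  print *, 'thread', omp_get_thread_num(), 'flag', flag, 'crucial', crucial
  !$omp end critical
  !$omp end parallel
end program tp_demo
```

If one routine declares the common block without the directive, that routine refers to a different, shared copy, which can look like a thread losing its threadprivate data.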

So, is this a step forward, or a step back? And where do I step next?

I suppose that the stack overflow message came after the earlier access violations. In the hope that the first message might have more specific data (fault address), how do I set up the runtime environment to copy the screen output to a file? (Yes, I've seen the documentation on qdiag, but it seems to be for compiler diagnostic messages.)

I imagine that perhaps the stack overflow message might be generated in response to the error handling. Is that a reasonable guess?

And are there any glaring problems with the stack settings I'm using -- coming long ago from a mainframe background, I've never been comfortable with stack architecture. Should I try and stuff stuff on the heap?

Steven_L_Intel1 · 07-31-2009

Setting kmp_stacksize doesn't help if you haven't also increased the linker's stacksize setting ("stack reserve size"). You may find that adding /heap-arrays reduces stack overflow issues.
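As a source-level complement to /heap-arrays, large work arrays can be made allocatable so that they are obtained from the heap regardless of compiler options. This is a minimal sketch; the function `work_on_heap` and the array size are hypothetical.

```fortran
program heap_demo
  implicit none
  print *, 'sum =', work_on_heap(1000000)
contains
  function work_on_heap(n) result(s)
    integer, intent(in) :: n
    double precision :: s
    ! An automatic array, "double precision :: buf(n)", would typically be
    ! placed on the stack and could overflow it when n is large.
    double precision, allocatable :: buf(:)
    integer :: istat
    allocate (buf(n), stat=istat)    ! heap allocation with an error check
    if (istat /= 0) stop 'allocation failed'
    buf = 1.0d0
    s = sum(buf)
    deallocate (buf)                 ! would also happen automatically on return
  end function work_on_heap
end program heap_demo
```

Unlike a stack overflow, a failed allocation can be detected through `stat=` and reported, which helps when the program otherwise exits without a message.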