M6201 runtime error (DOMAIN)

TommyCee · 07-28-2012

I have a rather large F77 program that I have carefully converted to F90. To further move it to F90, I am tediously removing each of a number of COMMON blocks. I have successfully dismantled 4 of them, and am working on a 5th. At each stage, I verify output vs. the original test case to ensure numerical fidelity.

In general, when I remove a COMMON block, I add its variables to the declaration section of the program's module, and then, for each subroutine that used that COMMON block, I add 'use ModuleName'. You know the drill. This has worked fine.
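As a minimal sketch of the conversion described above: the module name gas_data, the variables rgas and tref, and the routine density are all hypothetical stand-ins, since the real code is not shown.

```fortran
! Before (F77 style), repeated in every routine that needs the data:
!     REAL RGAS, TREF
!     COMMON /GASDAT/ RGAS, TREF

module gas_data                     ! hypothetical module replacing COMMON /GASDAT/
  implicit none
  real :: rgas = 8.314              ! hypothetical former COMMON members
  real :: tref = 298.15
end module gas_data

subroutine density(p, rho)          ! hypothetical routine that used the COMMON block
  use gas_data, only: rgas, tref    ! replaces the COMMON declaration
  implicit none
  real, intent(in)  :: p
  real, intent(out) :: rho
  rho = p / (rgas * tref)
end subroutine density

program demo
  implicit none
  real :: rho
  call density(101325.0, rho)
  print *, 'rho =', rho
end program demo
```

One thing to watch: a COMMON block is matched between routines by storage position, not by name. If any routine declared the block with different names, types, or array shapes, a straight move to a module will not reproduce that overlay.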

I'm using, btw, the CVF v6.6C compiler (sorry) w/ Windows XP Pro.

In the latest case, I removed the 5th COMMON block and it compiled fine. However, during execution, the program choked on a certain line in one of its many subroutines, a line which involved SQRT. The 3 key lines are as follows (real U is passed in as an argument; real array Y of 8 elements is set in another subroutine and also passed in as an argument):

[bash]
W = Y(2)/Y(1)
sum = (w**2 + u**2)
V = SQRT(W**2+U**2)
[/bash]
I traced the problem to this point and added some diagnostic writes just below this choke point. Strangely,
(1) real array Y is not even in the COMMON block I removed.
(2) the subroutine is executed 128,646 times before any problem arises, at which time the values are as follows:

Y(2) = Infinity
Y(1) = Infinity
w = NaN
u = 3.5
Sum = NaN

When Y(2) & Y(1) go to infinity, their quotient is NaN, and SQRT(NaN) is BUST!

Mind you, in the version before I removed the 5th COMMON block, this sub. is executed 473,351 times. So the program only went to ~27% completion before crashing.

FWIW: Y is used in a couple subs. and always passed in or out via ARGs (never via COMMON or any other way).

I realize that most of you will have a difficult time deciphering this w/o the code, which I can't provide b/c it's lengthy & proprietary. But I'm looking for general Fortran principles. Why would a variable (array Y) that worked fine before suddenly begin taking on values that reach infinity, thereby crashing the intrinsic SQRT function? Can I try any declaration tricks to "stabilize" Y? Any thoughts or suggestions are appreciated.

Steven_L_Intel1 · 07-28-2012

This has the hallmarks of classic data corruption - probably caused by accessing some variable outside its declared bounds. When you change the memory layout by removing COMMONs, you "move" the corruption until it hits a spot that causes trouble down the road.
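To make the mechanism concrete, here is a small sketch of an out-of-bounds write landing on a neighbouring variable. All names (blk, work, rgas, fill) are hypothetical and this is not the poster's code. The program is deliberately non-conforming, so the exact result depends on the compiler, but with COMMON storage order the stray write typically lands on rgas.

```fortran
program corrupt_demo
  implicit none
  real :: work(4), rgas
  common /blk/ work, rgas      ! rgas is stored immediately after work(4)

  rgas = 8.314
  call fill(work, 5)           ! bug: the array has 4 elements, not 5
  print *, 'rgas after fill =', rgas   ! typically no longer 8.314
end program corrupt_demo

subroutine fill(a, n)
  implicit none
  integer, intent(in) :: n
  real :: a(*)                 ! assumed-size: the routine cannot know the true extent
  integer :: i
  do i = 1, n
     a(i) = 0.0                ! a(5) is outside work and overwrites its neighbour
  end do
end subroutine fill
```

If the variables are later moved out of COMMON and into a module, the same stray write hits whatever now follows the array in memory, which is why the symptom appears to move.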

This sort of thing is difficult to track down. Since you're using CVF you're missing many of the "tools" available to users of Intel Visual Fortran, such as generated interface checking, but I will suggest initially that you enable /fpe0 (under Floating Point > Exception handling). This should cause an error when a variable goes to Inf, but only if it does so during a computation. If it's just garbage data being written somewhere, that's harder. I assume you have array bounds checking on, but that won't help with assumed-size arrays.
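For readers on a newer compiler, a similar check can be written in source with the Fortran 2003 intrinsic module ieee_exceptions. This is only a sketch under that assumption (CVF predates the 2003 standard, so the compiler switch is the route there), and the variable names are hypothetical. It polls the divide-by-zero flag right after a suspect computation; calling ieee_set_halting_mode instead would make the program stop at the faulting operation, which is closer in spirit to the switch.

```fortran
program fpe_flag_demo
  use, intrinsic :: ieee_exceptions
  implicit none
  real, volatile :: rgas       ! volatile so the division happens at run time
  real :: p, rho
  logical :: div0

  rgas = 0.0                   ! stands in for a "constant" that got corrupted
  p    = 101325.0

  call ieee_set_flag(ieee_divide_by_zero, .false.)   ! clear any stale flag
  rho = p / rgas
  call ieee_get_flag(ieee_divide_by_zero, div0)      ! was the exception raised?

  if (div0) then
     print *, 'divide by zero raised while computing rho; rho =', rho
  else
     print *, 'rho =', rho
  end if
end program fpe_flag_demo
```

As with the compiler switch, this only catches values produced by a computation; an Inf or NaN bit pattern written into memory by a stray store raises no flag.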

You could try a data breakpoint in the debugger to see when that array element changes value, but this option is not terribly reliable in the VS debugger. What I do when all else fails is start selectively disabling earlier parts of the program to see if the symptom changes. Often, but not always, that leads me to the culprit.

TommyCee · 07-30-2012

Thanks a lot for your usual sage comments & advice, Steve. I meant to post back on Sunday but the day got away from me. As many things in Fortran do, it came to me in my sleep Sat. night that it may not matter that the Y-array is not contained in the COMMON block I removed. Rather, I surmised, it might be that the Y-array is dependent somehow on something passed via that COMMON block. I knew that the Y-array was set in another sub. and imagined that that sub. had once used the COMMON block I removed. Sure enough, I checked and that was exactly the case.

But which variable was the culprit? It remained for me to soup up my debugger to trace the actual choke point. I added the other switch you suggested (actually, by a little trial & error, I found that it was /fpe:0, added to the list under Project|Settings|Fortran Project Settings:). This added a lot of horsepower. At runtime, the choke point was actually trapped upstream in the earlier subroutine where Y() was set, and instead of a SQRT problem, it was

forrtl: error(73): floating divide by zero

A gas constant value got corrupted (details spared here) and was, shall we say, not really constant. This trace enabled me to fix the problem. The fpe switch really helped.

Your point about the superiority of iF is well taken. I'm making plans to go there soon, I swear.

On another note, I have noticed in my manipulation to convert this complex program to at least F90 standards (including the removal of COMMON blocks) an interesting artifact. At each incremental step, I compare the vast numerical output w/ a previous benchmark test case. Only as a result of "changing the memory layout", I see somewhat random occurrences of differences in the 3rd, 4th, or even 5th decimal place. Does this make any sense? Has anyone else seen this kind of behavior? I wonder if there have been any studies (reports) that test changes only having to do w/ rearranging memory access in a computational algorithm. I'd be curious to know of any citations or anecdotes. I'm hoping someone weighs in and tells me that this does (can), indeed, happen.

Steven_L_Intel1 · 07-30-2012

You should have found a project setting for /fpe under Floating Point - I think. It's been a very long time since I've seen the CVF/VS6 interface.

It is remotely possible that relocating a variable caused some permutation in the optimizer so that the order of operations changed. Or that an intermediate result was kept in the 80-bit x87 registers longer (or shorter). Or it may be you still have data corruption and you move the target of corruption. If it is really important, then I suggest you "instrument" the program, dumping to a file intermediate results, run it for both versions, and see where the results start to diverge. You'll have to REALLY want to track this down to go this route - it can be brain-numbing.
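The instrumentation idea could look roughly like the sketch below. The module trace_mod, its routines, the file name, and the sample values are all hypothetical. Each traced value is written both as its raw bit pattern in hex and in decimal, so that a plain file comparison of the two runs shows the first line where they differ, even when the difference is in the last bit. It assumes default REAL and default INTEGER are the same size (4 bytes on most compilers).

```fortran
module trace_mod                    ! hypothetical helper, not from the thread
  implicit none
  integer, parameter :: trace_unit = 99
  integer :: ncall = 0              ! running count of traced values
contains
  subroutine trace_open(fname)
    character(len=*), intent(in) :: fname
    open(unit=trace_unit, file=fname, status='replace', action='write')
  end subroutine trace_open

  subroutine trace_real(tag, x)
    character(len=*), intent(in) :: tag
    real, intent(in) :: x
    ncall = ncall + 1
    ! transfer() reinterprets the bits of x as an integer for exact hex output
    write(trace_unit, '(I10,1X,A,1X,Z8.8,1X,ES16.8)') ncall, tag, transfer(x, 0), x
  end subroutine trace_real

  subroutine trace_close()
    close(trace_unit)
  end subroutine trace_close
end module trace_mod

program trace_demo
  use trace_mod
  implicit none
  real :: y(8), u, w, v

  y = 0.0
  y(1) = 2.0                        ! made-up inputs for illustration
  y(2) = 3.0
  u = 3.5

  call trace_open('trace_new.txt')  ! use a different file name for each version
  w = y(2)/y(1)
  call trace_real('W', w)
  v = sqrt(w**2 + u**2)
  call trace_real('V', v)
  call trace_close()
end program trace_demo
```

Run the old and new builds with different output file names and compare the files with any text diff tool; the call counter in the first column tells you how far into the run the first divergence occurs.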