Re: Trouble with long equation - Page 2

h-faber · ‎09-09-2005

Hi,
I have some strange behaviour with a "normal" equation.
Consider this (hard to read, I know) line:

C(7)=(E(8)-(2*E(10)+2*E(46)+E(42)))*PI/E(5)-E(42)

All individual values are of type REAL, and none of them is NULL or 0.
When this line executes, the result I get is a frustrating "NaN". I have several more such examples with the same behaviour. Now when I split this long equation into

HILF1C7=2*E(10)
HILF2C7=2*E(46)
HILF3C7=HILF1C7+HILF2C7+E(42)
HILF4C7=E(8)-HILF3C7
HILF5C7= HILF4C7*PI
HILF6C7=E(5)-E(42)
C(7)=HILF5C7/HILF6C7

guess what: no more NaN, but the expected value that I can continue calculating with.
I have dozens of such equations, so I cannot believe I need to split all of them into such small pieces.
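
Note that the one-line form and the split form quoted above are not the same expression. In Fortran, division binds tighter than subtraction, so `...*PI/E(5)-E(42)` divides by E(5) and then subtracts E(42), while HILF6C7 puts E(5)-E(42) in the denominator. A minimal sketch of the difference, using made-up stand-in values (the names num, e5 and e42 are hypothetical, not from the original code):

```fortran
program precedence_demo
  implicit none
  real :: num, e5, e42
  real :: one_line, split_form

  ! Hypothetical stand-in values, chosen only to make the arithmetic easy to follow.
  num = 10.0   ! stands for (E(8)-(2*E(10)+2*E(46)+E(42)))*PI
  e5  = 2.0    ! stands for E(5)
  e42 = 3.0    ! stands for E(42)

  one_line   = num/e5 - e42     ! parsed as (num/e5) - e42  = 2
  split_form = num/(e5 - e42)   ! what HILF5C7/HILF6C7 computes = -10

  print *, 'one-line form: ', one_line
  print *, 'split form:    ', split_form
end program precedence_demo
```

This does not by itself explain the NaN, but it does mean the two versions cannot be compared directly until the parentheses agree.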

This behaviour occurs under Intel Visual Fortran Compiler 8.1 within Visual Studio 2003.
Does anyone here have an idea how to avoid the time-consuming effort of splitting these equations?

Thanks again in advance.
Harald

anthonyrichards · ‎09-13-2005

It would help a lot if

a) you gave the result of the simple computation for the values that apply when you get a NaN, and

b) a set of values, including the answer you get, for one of the preceding passes through the loop in the actual complicated program when you do not get a NaN.

That is, we need inputs for a,b,c,d,e etc. and output X when you try to compute X=a/b/(c-d/e), or whatever, in both the program that gives the NaN and the simple example that fails to reproduce it. You must be very careful to copy the equation exactly when running the simplified version.

h-faber · ‎09-13-2005

OK, I can be more concrete now with a much smaller equation.
Consider these lines of code within a SUBROUTINE:

BLABLA1 = ATAN(0.0036992922)
BLABLA2 = ATAN(0.047968611/12.96697)
ZZZZ = 0.047968611
YYYY = 12.96697
BLABLA3 = ATAN(ZZZZ/YYYY)

N=ATAN(R1/X(1))

Now consider that R1 and X(1) have the same values as ZZZZ and YYYY.
The strange behaviour is: In the problematic project, BLABLA3 as well as N are computed as NaN. Of course when I create a new project with

PROGRAM NaNTest
REAL BLABLA1,BLABLA2,BLABLA3,ZZZZ,YYYY
BLABLA1 = ATAN(0.0036992922)
BLABLA2 = ATAN(0.047968611/12.96697)
ZZZZ = 0.047968611
YYYY = 12.96697
BLABLA3 = ATAN(ZZZZ/YYYY)
STOP
END

all results are the same, no NaN in sight.

The main question is: Why does the program give NaN in our context, while the same calculation on its own in a test program does not? I admit I am out of ideas at the moment.

hansr · ‎09-14-2005

I assume, this program was once ok. What was changed? I have had terrible NaN-problems while we went from VMS IBM-Fortran to PC (IVF).

Hans

h-faber · ‎09-14-2005

Hi Hans,
this is old code which once worked on an old BS2000 or similar. I will check whether there is a trial version or something of IVF 9.x so I can test the problem with the newer compiler.

hansr · ‎09-14-2005

craig wrote:

remember: a/b/c=ac/b

I think

a/b/c = (a/b)/c <> ac/b

IVF 9 evaluates it this way, from left to right, as the standard says.
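
A small sketch of the left-to-right rule, with illustrative values chosen so the results are exact (the variable names are only for this example):

```fortran
program division_order
  implicit none
  real :: a, b, c

  a = 8.0
  b = 4.0
  c = 2.0

  print *, 'a/b/c   =', a/b/c      ! (8/4)/2 = 1
  print *, '(a/b)/c =', (a/b)/c    ! same grouping, also 1
  print *, 'a/(b*c) =', a/(b*c)    ! mathematically the same value, 1
  print *, 'a*c/b   =', a*c/b      ! (8*2)/4 = 4, a different value
end program division_order
```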

Hans

h-faber · ‎09-14-2005

Hi Hans,
but this does not explain the mystery with ATAN e.g.

hansr · ‎09-14-2005

No, sorry.

Please report here your experience with 9.0.

But I fear that the error is caused by different compiler behaviour on the host and on the PC. As Steve wrote, you may be addressing an array out of bounds.

And IVF is less tolerant.

Good luck

Hans

emc-nyc · ‎09-14-2005

hansruopp wrote:
I assume, this program was once ok. What was changed? I have had terrible NaN-problems while we went from VMS IBM-Fortran to PC (IVF).

Hans

Possibly it only seemed to be OK but was, nonetheless, broken; e.g., where IVF is giving NaN, the BS2000 compiler gave 0.0. I think this may require some serious delving into the behavior of the compiler used on the BS2000 system, possibly even talking to the guys who wrote the original program back when Fortran didn't have If..then..else..endif or a character data type (I'm that old...)

I, too, once worked with various flavors of IBM's Fortran compilers (big blue boxes in the bowels of the building). What I remember is that they did not use IEEE arithmetic, did not have NaN, did not maintain guard bits during floating point arithmetic, and (in some incarnations) had their optimizers introduce intriguing and difficult to trace bugs (why would one get a divide check after removing a write statement?).

The moral of this is that different hardware systems and different compilers do not necessarily behave the same.

h-faber · ‎09-15-2005

Hi Hans,
the bad news is: No change with the 9.0.018 compiler. If you want to take a look, see http://img390.imageshack.us/my.php?image=nan3zk.jpg
None of the explanations so far, including the array index out of bounds, accounts for the fact that a trivial (!) function such as ATAN(X/Y) should never give NaN for the X and Y values I have provided, which are visible in the VS watch window in the screenshot linked above. And why does this work twice in a DO 200 I=1,3 loop, but fail for I=3? It might be a compiler issue, but I do not see any logic in this problem. This really puzzles us.

anthonyrichards · ‎09-15-2005

I looked at the image. Several things immediately make me suspicious.

1) There are undefined variables, and at least one variable, E, is used as an array when not declared as such.

2) Your programming practice involves declaring variables such as I, K1, N as REAL, when you also refer to a Do-loop 'Do I=1,3'. I always thought it best to keep variables beginning with I to N as INTEGER (the default implicit typing), just for safety.

3) You do not show the HELPR1X1 or BLABLA4 value in your watch window.

I recommend adding IMPLICIT NONE everywhere. I also recommend that you should watch the whole of an array whose index is being looped over, not just particular elements; then you can see which ones have been addressed and given values during computations (I believe in DEBUG mode, REALs are initialised to very large values).
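
To illustrate why IMPLICIT NONE helps, here is a minimal sketch with invented names (radius, raduis and area are not from the original program). Without IMPLICIT NONE, a misspelled name silently becomes a new, never-assigned variable whose value is whatever happens to be in memory:

```fortran
program implicit_demo
  ! IMPLICIT NONE is deliberately omitted to show the default typing rules.
  real :: radius, area

  radius = 2.0
  ! Typo: RADUIS is not RADIUS. Under implicit typing it is accepted as a
  ! new REAL variable that is never assigned, so AREA is unpredictable.
  area = 3.14159 * raduis**2

  print *, 'area = ', area
end program implicit_demo
```

With IMPLICIT NONE at the top of the program unit, the compiler should reject the misspelled name at compile time instead of producing a garbage result at run time.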

Finally, it will probably save a lot of time if you list the WHOLE of the ACTUAL routine where the computation goes wrong, rather than edited and therefore altered extracts.

I agree with Steve Lionel that the problem almost certainly goes back to overwriting due to array index misaddressing problems, possibly caused by REAL/INTEGER confusion, and also maybe combined with uninitialised variables. I think we have taken your programming problem about as far as it can go now, as I believe that is what it is, not a compiler one.

hansr · ‎09-15-2005

Hi,

if I cut some of your source into a test program with a DO 200 I=1,9999999 loop, it works ok.

This means that the error is caused by a statement somewhere else entirely in the source, which overwrites some addresses in working storage (sorry for my English).

This occurred for us sometimes when arrays went out of bounds or parameters mismatched.
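
A sketch of how such an overwrite can happen, assuming (hypothetically) that an array and a scalar sit next to each other in a COMMON block. The block name /work/ and the wrong loop limit n are invented for this example; the names E and R1 are borrowed from the thread only for illustration:

```fortran
program clobber_demo
  implicit none
  real :: e(3), r1
  common /work/ e, r1     ! R1 is stored directly after E(3)
  integer :: i, n

  r1 = 0.047968611
  n  = 4                  ! hypothetical wrong loop limit

  do i = 1, n
     e(i) = -1.0          ! i = 4 is past the end of E and typically lands on R1
  end do

  print *, 'r1 = ', r1    ! very likely no longer 0.047968611
end program clobber_demo
```

Strictly speaking the out-of-bounds store is undefined behaviour, so what it prints may vary. Building with run-time bounds checking (in Intel Visual Fortran this is the /check:bounds option, as far as I know) should stop the program at the offending statement instead.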

I found it with the tool ftnchek, but it is very time-consuming.

Hans

h-faber · ‎09-15-2005

Hi,
thanks for all your patience.
Yes, a standalone program does the job perfectly.
Now we have a little change in this program/problem:
I changed the order of some DIMENSION, INTEGER and REAL lines so that the type declarations now come before the DIMENSION statements. The surprising result: no more NaN here on my machine!

BUT: This change does not help a bit for my co-worker. She still has NaN.
We first thought it was the new 9.0 compiler I used. So I uninstalled the 9.0 and installed the same 8.1 as she uses. Same result: On my machine it still works. So she uninstalled the 8.1 and reinstalled it. Still NaN. We compared the project settings within Visual Studio - all the same. Of course she uses the same sources as I do.

As if that were not confusing enough: I copied the .exe from my machine to the trouble computer. We executed this .exe with the correct parameters and access to the same files. And the mystery continues: still NaN. So I think the compiler and Visual Studio cannot be responsible. But what could cause the different results?

What is different between these two PCs?
- Different CPU: P-III 1GHz vs. Celeron 2GHz
- Different OS : Win 2000 Server vs. Win XP

Any hints welcome.

greldak · ‎09-16-2005

The different OSs would mean that you are probably using different versions of system DLLs so potentially this may change the results you get. Also check you have exactly the same compiler version (I'd be surprised if it wasn't) and runtime components on each machine.

I wouldn't expect the different hardware to affect it at all - unless one of the machines has an unpatched FPU bug - did this affect PIIIs or Celerons at all? (or of course if you had optimised your code for a very specific target processor and I can't see the default setting doing this)

Another thing I would check between the versions compiled on each machine is the size of the executables - if they're not identical I would suspect you had missed some compiler or linker setting differences.

Another possibility could be some other piece of software with a memory leak accessing the same memory locations as your code - especially if you haven't initialised your arrays - this would be a nightmare to track down though, or of course some other part of your program misbehaving and corrupting your arrays- check COMMON blocks and EQUIVALENCE statements as well as array bounds.

h-faber · ‎09-16-2005

Hi Craig,

>The different OSs would mean that you are probably using different versions of system DLLs so potentially this may change the results you get.

In the meantime we checked a third PC, also running Win2k: again the NaN (incorrect) result.

>Also check you have exactly the same compiler version (I'd be surprised if it wasn't) and runtime components on each machine.

All the same.

>I wouldn't expect the different hardware to affect it at all - unless one of the machines has an unpatched FPU bug - did this affect PIIIs or Celerons at all? (or of course if you had optimised your code for a very specific target processor and I can't see the default setting doing this)

There is no optimization enabled. The exe works on the P-III; the incorrect results occur on the Intel Celeron 2GHz and on an AMD Sempron 2.4GHz.

>Another thing I would check between the versions compiled on each machine is the size of the executables - if they're not identical I would suspect you had missed some compiler or linker setting differences.

We compared them in the project properties step by step.

>Another possibility could be some other piece of software with a memory leak accessing the same memory locations as your code - especially if you haven't initialised your arrays - this would be a nightmare to track down though, or of course some other part of your program misbehaving and corrupting your arrays- check COMMON blocks and EQUIVALENCE statements as well as array bounds.

I will look at that, although it will be very hard to track down.
As I wrote in my other post, what is frustrating is that even the working .exe on its own does not work on the other two machines.

sabalan · ‎09-20-2005

I would add a new point to all others, without going into the details of your problem:

In your example you use constants such as 0.0036992922 and 0.047968611, and as you can see in your own watch window, these are rounded to other values with fewer digits and shown with an exponent such as E-3 or E-2. This happens because you are using more digits than REAL(4) can hold. My suggestion would then be:

1- Use IMPLICIT NONE in all of your program units (subroutines, functions);

2- Declare all variables;

3- Use double precision real variables (REAL(8));

4- Write all real constants with a "D" exponent, e.g. 0.5D0 rather than 0.5 and 0.0036992922D0 rather than the same number without D0.
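
A short sketch of point 4, using the constant from the thread (the variable names are made up for this example). A constant without D0 is single precision even when it is assigned to a REAL(8) variable, so the extra digits are lost before the assignment happens:

```fortran
program precision_demo
  implicit none
  real(4) :: x4
  real(8) :: x8_single_const, x8_double_const

  x4              = 0.0036992922      ! stored with roughly 7 significant digits
  x8_single_const = 0.0036992922      ! constant is REAL(4), then widened to REAL(8)
  x8_double_const = 0.0036992922D0    ! constant is double precision from the start

  print *, 'REAL(4) variable:            ', x4
  print *, 'REAL(8), constant without D0:', x8_single_const
  print *, 'REAL(8), constant with D0:   ', x8_double_const
end program precision_demo
```

The second and third lines of output should differ after about the seventh significant digit. This affects accuracy only; by itself it would not be expected to produce a NaN.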

Sabalan.