Intel® Fortran Compiler

numerics & optimizations

Nick2
New Contributor I
Hello,

I am considering upgrading from CVF 6.6 to the Intel 9.1 compiler. Most people in my company have a "just get me something that works" mentality (hence we've stuck with CVF this long), and we recently ran across some issues... I'd like to know if Intel addresses them.

The software codes we develop are pretty much big integrators from time=0 to time=94829048 (whatever).

In producing "optimized" versus "non-optimized" code, inoperable statements have a significant impact on our result. For example, by adding a
WRITE(*,*) 'test_print'
line at one or two random places, my final numerical answers differ significantly. I confirmed this with the G95 compiler on Linux, comparing -O0 vs. -O2.

First question - can I expect to see the same issue with Intel 9.1? And why?

The next problem can be summarized like this:

      program hi
      real a, b, c, d
      real x, y, z
      a = .013
      b = .027
      c = .0937
      d = .79
c     two algebraically equivalent arrangements of the same expression
      y = -a/b + (a/b + c)*EXP(d)
      print *, y
      z = (-a)/b + (a/b + c)*EXP(d)
      print *, z
c     difference between the two arrangements
      x = y - z
      print *, x
      end


C:>g95 -o hi -mfpmath=387 -fzero -ftrace=full -fsloppy-char hi.for
C:>hi

0.78587145
0.7858714
5.9604645E-8


So, in other words, using the x87 as-is, how you arrange your algebra determines where the 80-bit intermediate results get rounded to the declared (32- or 64-bit) precision, and a simple rearrangement like that has a significant impact on the final answers (with both CVF and G95). Our customers are not very happy about this. I recommended the -d8 -ffloat-store flags, to immediately round all intermediates to declared precision and keep variables in memory at 64 bits.
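
For illustration only (not our production code): rewriting the expression with an explicit temporary does by hand what -ffloat-store does everywhere, since storing to a REAL variable rounds the 80-bit intermediate to declared precision:

      program store_sketch
      real a, b, c, d, t, y
      a = .013
      b = .027
      c = .0937
      d = .79
c     forcing the intermediate through a REAL variable rounds it to 32 bits
      t = a/b + c
      y = -a/b + t*EXP(d)
      print *, y
      end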

Then I realized that using the SSE2 FPU is much much faster.

So, second question...does Intel 9.1 address this? Does it allow me to either exclusively use SSE2, or otherwise truncate the precision of the x87 arithmetic?

Third, what is the default FPU with the Intel compiler? Using 64-bit (either truncated x87 or SSE2) gives me significantly different results versus 80-bit (which is a bad idea anyway, because of what I described earlier). What is your position on this?

Our customers are migrating from the 64-bit FPU on SUN/SPARC to x86 Linux... ah, never mind this one :)

Nick
18 Replies
TimP
Honored Contributor III
The value of Intel compilers is strongly dependent on using the SSE2 code generation options, particularly when your code permits vectorization.
The 9.1 compilers include the (Windows) option /fp:precise, which allows more conservative optimization, like CVF -fltconsistency, without changing from SSE/SSE2 to x87 as the fltconsistency option does.
As you hinted, changes in results with re-arrangement of the code are most likely to occur with the use of x87 instructions. I don't know whether g95 normally uses x87 instructions at -O0, even when SSE code generation options are set. While it is possible to generate SSE2 code with ifort -O0, that is not a normally used option.
Options of the -ffloat-store variety are an extremely slow way to avoid use of extra precision. Typically, -O0 implies that option. As you suggest, SSE code generation options (with optimization raised above -O0) round results much more efficiently to declared precision.
It's hard to imagine anyone moving from a 64-bit system to a 32-bit one nowadays, when even most laptop systems will be 64-bit capable soon. If non-extended precision is your goal, you get much the same result with SSE regardless of whether you choose 32-bit or 64-bit system.
Nick2
New Contributor I
Thanks for the great info!

G95, being based on GCC, allows any combination, so pick one from each category:

-O0 (default), -O1, ...
-mfpmath=387 (default), -mfpmath=sse, -mfpmath=387,sse (use both)

If I compile a program with IVF, will debug versus release (default configuration for each, or after tweaking) both produce SSE2 (and no 387) instructions? If they don't, I'm in trouble come customer-has-a-crash-at-time-78453s.

Another thing we find very important: CVF offers a choice of handling denormalized numbers in one of two ways:
1) keep going
2) crash

Needless to say, tracing a NaN backwards in time through a 10MB dll is not fun. Does IVF maintain this option?
Nick2
New Contributor I
I forgot one more thing... does IVF allow all variables to be initialized to 0 at the beginning of the code?
Steven_L_Intel1
Employee
There is /fpe:0 which should cause an error on generation of or access to a NaN.

/Qzero initializes most static local variables to zero. I recommend fixing your code instead.

If you use /QxW or a "higher" switch, then SSE2 will tend to be used for arithmetic operations. The x87 instructions will be used for some things (such as returning function results.)
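
For example (the routine and variable names below are purely illustrative), an explicit initialization in the source removes any reliance on /Qzero:

      subroutine accum(x, total)
      real x, total
      real runsum
      save runsum
c     explicit initialization instead of assuming zeroed memory
      data runsum /0.0/
      runsum = runsum + x
      total = runsum
      end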
dherndon
Beginner
Hi,
I think both your problems are related and stem from the fact that the computer cannot represent most numbers exactly. A number is stored at the "closest" value the computer can represent. Numbers are truncated or rounded to this "closest" value.
In your test case, you are using single precision (32-bit) variables although the fp computation is done in 80-bit registers. Whenever the compiler stores and then reuses a variable, the value will not be the same. The lost bits are gone forever. This will introduce an "error" into the result. It seems that the compiler stores a different intermediate result in the two expressions and therefore gets different final results. In actuality, the results only differ by 1 in the least significant bit and should probably be considered equal. If you run the test case with double precision values, both expressions give identical results. (This can be done with the /real_size:64 compiler option.)
When optimizing, the compiler will try to keep as many intermediate results in registers as possible. Any statement that causes the compiler to store intermediate results will introduce a precision error. Inoperable statements that cause different intermediate-result storage will have different effects on the final answer.
It has been my experience that long integrations cause the storage of many intermediate results. If you are using single precision variables, you will see significant differences depending upon how many times values are stored and then reused.
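
A quick way to see both effects (just a sketch, compiler-agnostic): the stored single-precision value of 0.013 is not exactly 0.013, and the spacing between adjacent single-precision numbers near 0.786 matches the 5.9604645E-8 difference in the test case above:

      program rep_sketch
      real sa
      double precision da
      sa = 0.013
      da = 0.013d0
c     neither stored value is exactly 0.013
      print *, 'single 0.013  :', sa
      print *, 'double 0.013  :', da
c     one unit in the last place at the magnitude of the test result
      print *, 'ulp near 0.786:', spacing(0.786)
      end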
Hope this helps.
Dave
Nick2
New Contributor I
Dave,

I agree with everything you said; I have come to very similar conclusions.

In fact, I took it a bit further, and recommended that my company no longer use the 387 FPU at all; otherwise, the customer will ask, "if I use (-A)/B versus -A/B I get a significantly different final result; which one is correct?"

Got me. They should probably do sensitivity studies on input parameters anyway. Ultimately, this comes down to (overall) precision/consistency being perceived as more valuable than relative accuracy. So be it; I recommend that everyone dump the 387 and use SSE2 instead.

With single vs. double precision, I get significantly different final results; but oh well, at least they're consistent, with or without optimizations, and no matter what silly code modification you make. Some day 128-bit FPUs will be commonplace and we'll all be laughing at this. Or something like that.
TimP
Honored Contributor III
x87 precision mode should be initialized to 53-bit by default when you run an ifort-built .exe. For 32-bit, ifort conforms with MS practice by setting the precision mode at the top of the .exe. For 64-bit, Windows sets 53-bit mode before starting an .exe, and there is no x87 code generation mode anyway.
So, if you get differences between SSE2 and x87 double precision code, you may be running into underflow, and your single precision results would be suspect. If you do run single precision SSE, it might be important to find out where you are losing accuracy, or at least set gradual underflow (-Qftz- for 32-bit).
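
If it helps, a trivial probe (only a sketch) shows whether a given build flushes denormals to zero or keeps gradual underflow:

      program uflow_sketch
      real s
c     smallest normal single-precision number, about 1.18E-38
      s = tiny(1.0)
      print *, 'smallest normal:', s
c     nonzero only when gradual underflow (denormals) is in effect
      print *, 'half of that   :', s/2.0
c     far below the denormal range, so this underflows to zero either way
      print *, 'its square     :', s*s
      end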
Intel_C_Intel
Employee

Hello,

I can confirm that switching to Intel Fortran 9.1 with 64-bit SSE2 double-precision floating-point representation gives sufficient accuracy, in particular if the options /fpconstant /real_size:64 /QxN are used.
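
For anyone wondering what /fpconstant buys (a small sketch with made-up names): without it, an untyped constant like 0.1 stays single precision even when assigned to a DOUBLE PRECISION variable:

      program const_sketch
      double precision d1, d2
c     single-precision constant unless /fpconstant promotes it
      d1 = 0.1
c     explicitly double-precision constant
      d2 = 0.1d0
      print *, d1
      print *, d2
      end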

Lars Petter

Nick2
New Contributor I
It all sounds great... so I downloaded a trial edition of the compiler and used these flags:

/nologo /QaxN /QxN /Qzero /fpe:0 /module:"$(INTDIR)/" /object:"$(INTDIR)/" /libs:static /threads /c

and

/nologo /QaxN /QxN /integer_size:64 /real_size:64 /Qzero /fpe:0 /module:"$(INTDIR)/" /object:"$(INTDIR)/" /libs:static /threads /c

I must say, the code runs much, much faster: ~3 times faster when I compared one single-precision run against either G95 or CVF!

But running the same case, with versus without randomly inserted
WRITE(*,*) 'test_print'
statements ... gives me different numerical answers...

I did notice quite a few underflow messages. But an inoperable statement like this should not change the numerics.

Ideas?
TimP
Honored Contributor III
I guess that /QxN overrides /QaxN, but it does cloud the issue a bit, since /QaxN requests both a non-SSE and an SSE code path. Likewise, I'm not certain of the effect of /fpe:0 with respect to your concerns. It might be interesting to know the effect of -fp:precise.
Beyond that, more specifics about the situations you are concerned about could be useful. Your code changes could prevent recognition of common expressions, causing them to be calculated differently, or prevent constant propagation.
Les_Neilson
Valued Contributor II

How different are the results? A possible cause could be a mismatch between actual and dummy argument kinds.

One of my colleagues recently had a problem which changed when he added debug write statements.

He eventually found it was caused by him passing a real constant 0.0 (4 bytes) to a subroutine that expected a double precision argument.
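
A minimal sketch of that kind of mismatch (the names are made up); with an implicit interface nothing warns you, and the subroutine reads 8 bytes where only 4 were passed:

      program kind_mismatch
c     passes a 4-byte constant where 8 bytes are expected: garbage result
      call update(0.0)
c     correct: an 8-byte constant
      call update(0.0d0)
      end

      subroutine update(offset)
      double precision offset
      print *, 'received:', offset
      end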

Les

Nick2
New Contributor I
I tried it without the /QaxN; good catch. I also killed /fpe:0 for the moment. My command line is now:

/nologo /QxN /Qzero /module:"$(INTDIR)/" /object:"$(INTDIR)/" /libs:static /threads /c

And yet, I am still having the same issue. I tried looking at the disassembly, and, for example,

A = B**(10.0E0*T + 11.0E0)

disassembles to the instructions

add, fld, fstp, fmulp, fchs, faddp

errrr...aren't these 387 ??? Bad 387 :)
Nick2
New Contributor I
Les,

Depending on which output parameter you look at, the differences could be anywhere from 0% to 30%. IMO the biggest difference is that with print statements versus without print statements I get a different number of diagnostic non-convergence messages from some of the numerical schemes. I am not sure if there might be any mismatched arguments (though there shouldn't be, as we automatically generate source code for both single & double precision).

In fact, this issue is not something we discovered; one of our customers did (after upgrading from an all-64-bit system to Linux with the darn 387 FPU), and then blamed it on a memory management issue. Well, as it turns out, with G95 on Linux, I compiled the code to use SSE2 math only, and this solved the problem. Except that we don't officially support Linux, and our Windows compiler of choice is CVF, which doesn't do SSE2 to my knowledge. My ultimate goal is to replicate my Linux results on Windows.
Nick2
New Contributor I
I forgot...my debug command line is

/nologo /Zi /Od /QxN /Qzero /module:"$(INTDIR)/" /object:"$(INTDIR)/" /traceback /check:bounds /libs:static /threads /dbglibs /c
Nick2
New Contributor I
Looks like we've got a winner! /fp:precise it is.

Though it gives me warnings like this:

Command line warning: /fp:precise evaluates in source precision with Fortran.

for every source file. And it still seems to be generating 387 instructions. What's the warning for?
TimP
Honored Contributor III
I've never seen this warning, but I would always set at least -O1 when using -Zi or -fp:precise, so as to engage the generation of SSE code.
Nick2
New Contributor I
Hmm, I tried that... I'm getting a funny message when I build in debug mode. With /O1 or /O2, fortcom.exe chokes on one of my files; it keeps using very little CPU time, yet picks up 600 MB of virtual memory.

When I'm done with building the project, I get

Error 529 error PRJ0019: A tool returned an error code

And I get no additional details...

If I run the debug build with /fp:precise and no optimization, it takes way too long to execute (and gives me different results from the release version).
Nick2
New Contributor I
I'm moving this linker error to a new thread; it's too different an issue.