Compiling with SIMD (SSE2, SSE3) dramatically lowers precision, why?

Christian_Altenbach · ‎11-11-2010

I have a complicated fortran program to simulate EPR spectra. It is built into a DLL and used inside a LabVIEW program for nonlinear fitting.

I noticed some serious instabilities in the numerical partial derivatives calculations that seem to depend on the actual fitting parameter and the parameter increment in a chaotic manner.Tiniestchanges in inputs causes weirdfluctuationsin the result. Playing will all compile options, I noticed that the default was to generate SSE2 code. The same problem showed up when selecting SSE3.

Disabling all SIMD (option IA32) miraculously stabilized the computations and made things much more predictable.

The attached image shows my testbench LabVIEW program to analyse the issues in detail and compare DLLs side by side. There are no changes in the fortran code. The only difference here is /arch:SSE3 vs /arch:IA32 in the code generation option.Compiled with SSE3 (red) or IA32 (white).

To show problems with the partial derivative computations, I calculate

Sqrt(sum((f(x)-f(x+dx))/(x+dx))^2)

and graph it vs. dx from 1e0 to 1e-12.

For very small dx (left side), we get numerical errors because we divide by a very small number and for very large dx(right side), we get deviation because the linear approximation is no longer valid.

IA32: looks very nice. Partial derivatives are flat for good increments over eight orders of magitude(!!). In the flat region, the partial derivatives are independent of dx as they should be.

SSE3: Shows oscillations for dx<1e-1 and there is a weird jump between 1e-3 and 1e-4. Overall, things get bad many orders of magnitude earlier than the same code compiled with IA32. Basically there is no dx that gives a stable partial derivative and levenberg marquardt often gets stuck way before the minimum. tiniest changes in x or dx cause dramatic changes in the details of the shown curves. I'ts like trying to play golf in a field full of rabbit holes.

When compiled under IA32, levenberg-marquardt behaves perfectly!!!

So, is there an explanation??? Is it due to reordering? Do some operations get implicit padded arrays to fit the SIMD registers? What else could it be? Not only are the SSE3 results slightly different from the IA32 results, the SSE3 show serious artifacts over nearly the entire range. The artifacts vary strongly for even smallest changes is either x or dx.

(All test were done on a core2duo T7600).

mecej4 · ‎11-11-2010

You presented many conclusions, but little evidence. We need some example code that demonstrates the problems that you describe.

Consider the evaluation of the derivative of tan(x) at x = 1, using forward differences. Here is the program:

[fortran]program tstder
implicit none
double precision x,dx,der,exder
integer i
x=1d0
exder=1d0/(cos(x))**2
dx=1d0/65536
do i=1,20 
   der=(tan(x+dx)-tan(x))/dx
   write(*,10)i,dx,der,der/exder-1d0
   dx=dx*0.5d0
end do
10 format(i3,3ES14.6)
end program tstder
[/fortran]

With x87 code, the results are

\delta f'_{approx} rel. error in f'

[bash]  1  9.765625E-04  3.430738E+00  1.523542E-03
  2  4.882812E-04  3.428126E+00  7.611113E-04
  3  2.441406E-04  3.426822E+00  3.803910E-04
  4  1.220703E-04  3.426170E+00  1.901544E-04
  5  6.103516E-05  3.425844E+00  9.506690E-05
  6  3.051758E-05  3.425682E+00  4.753088E-05
  7  1.525879E-05  3.425600E+00  2.376480E-05
  8  7.629395E-06  3.425560E+00  1.188223E-05
  9  3.814697E-06  3.425539E+00  5.941075E-06
 10  1.907349E-06  3.425529E+00  2.970521E-06
 11  9.536743E-07  3.425524E+00  1.485253E-06
 12  4.768372E-07  3.425521E+00  7.425505E-07
 13  2.384186E-07  3.425520E+00  3.711654E-07
 14  1.192093E-07  3.425519E+00  1.857448E-07
 15  5.960464E-08  3.425519E+00  9.276254E-08
 16  2.980232E-08  3.425519E+00  4.491205E-08
 17  1.490116E-08  3.425519E+00  2.316182E-08
 18  7.450581E-09  3.425519E+00  5.761643E-09
 19  3.725290E-09  3.425519E+00 -2.938447E-09
 20  1.862645E-09  3.425519E+00 -2.033863E-08
[/bash]

With SSE3, the results change to

[bash]  1  1.525879E-05  3.425600E+00  2.376480E-05
  2  7.629395E-06  3.425560E+00  1.188223E-05
  3  3.814697E-06  3.425539E+00  5.941075E-06
  4  1.907349E-06  3.425529E+00  2.970521E-06
  5  9.536743E-07  3.425524E+00  1.485253E-06
  6  4.768372E-07  3.425521E+00  7.425505E-07
  7  2.384186E-07  3.425520E+00  3.711654E-07
  8  1.192093E-07  3.425519E+00  1.857448E-07
  9  5.960464E-08  3.425519E+00  9.276254E-08
 10  2.980232E-08  3.425519E+00  4.491205E-08
 11  1.490116E-08  3.425519E+00  2.316182E-08
 12  7.450581E-09  3.425519E+00  5.761643E-09
 13  3.725290E-09  3.425519E+00 -2.938447E-09
 14  1.862645E-09  3.425519E+00 -2.033863E-08
 15  9.313226E-10  3.425519E+00 -2.033863E-08
 16  4.656613E-10  3.425519E+00 -8.993935E-08
 17  2.328306E-10  3.425519E+00  4.926209E-08
 18  1.164153E-10  3.425518E+00 -2.291408E-07
 19  5.820766E-11  3.425518E+00 -2.291408E-07
 20  2.910383E-11  3.425514E+00 -1.342752E-06
[/bash]

From numerical analysis, we know that the optimum dx for evaluating one-sided differences is ~ \sqrt( \delta /f''(x)), where \delta is the error in the function evaluation. In this example f, f' and f'' are all O(1), so the optimum dx is O(\sqrt \epsilon_{machine}).

Looking at the printouts above, and locating the line with the smallest relative error in f', we see that the results agree with the theory quite well, since \epsilon_{machine} ~ 10^{-16} for 64-bit IEEE (SSEx) and ~ 10^{-20} for 80-bit floating point (x87).

Steven_L_Intel1 · ‎11-11-2010

The reason IA32 looks "nice" is that you are getting single precision calculations done in double precision many places, whereas SSE always uses declared precision. Perhaps you should switch to double precision for these calculations rather than relying on the unpredictable effects of the x87 floating point instructions.

Christian_Altenbach · ‎11-11-2010

Thanks. All floating point variables are declared double or double complex. Am I missing something?

Christian_Altenbach · ‎11-11-2010

OK, setting "/real_size:64"explicitlyfixes the SSE3 problems and at the same time makes it even faster by a few percents. Good! Can somebody point me to some documentation that describes all that?

Thanks!

(Sorry, last time I've done fortran was way before SIMD, back in the Compaq days. ;))

TimP · ‎11-11-2010

If /real_size:64 made a difference, you must have missed double precision declaration somewhere. One of the simpler precautions is to set implicit none so as to force all variables to be declared. You must also give attention to constants, in those case where single precision isn't exact.

Christian_Altenbach · ‎11-11-2010

Well, the bulk of the code goes way back to the late 80'ies to mid 90'ies, is not mine and is pretty massive. I don't think I will spend days hacking around... Maybe i'll try to contact some of the original programmers. :)

(Only the dll interface code is relatively new.).

The code is generally very polished and classic style. Everywhere I looked, there is an "implicit none" statement, but of course it is possible that something is missing in some odd corner. Thanks again.

mecej4 · ‎11-12-2010

We have not seen the code. If, indeed, "all variables are declared double..." is true, /real_size:64 should have no effect.

Perhaps, you have expressions whose dominant type is real*4. If so, assigning the value of the expression to a double precision variable should still count as a single-precision calculation. An example:

[fortran]double precision :: pi
..
..
pi = 4*atan(1.0)
..
..[/fortran]

Single-precision floating point evaluations cause numerically evaluated derivatives to have relative errors in the range of 1E-4 to 1E-3.

Christian_Altenbach · ‎11-12-2010

mecej4 wrote: We have not seen the code. If, indeed, "all variables are declared double..." is true, /real_size:64 should have no effect.

The bulk of the code can be found here, specifically have a look at the programNLSL.MOMD. It is probably beyond my expertise to go much further with this investigation. If there is no bug in the compiler, there must be a subtle language construct somewhere that throws it to single precision without warning. I have not found it.

I am just glad I found a solution:/real_size:64 is the key! (... but it is not yet clear why it is actually needed here!)

The speed increase using SSE3 is not that dramatic in this particular implementation, but still worth it.

I have not investigated fortran optimizations for multicore. The intrinsic parallel nature of LabVIEW naturally lends itself for optimization in this regard and I am much more familiar with the language. Since I migrated all the fitting code outside of the fortran DLL anyway, it is better if each dll call is single core.

(This is maybe a question for some other time and new thread: In order to be able to call the dll 4x in parallel in LabVIEW, I currently make 4 copies of it, each with a different name (...A.dll, ...B.dll, etc.). without that, they step on each other. Works great though! Can anyone think of a way to keep it a single dll?)

(OTOH, disabling SSE3 has of course the advantage that my program also runs on older hardware, such as an old athlon XP that is still sitting in a lab corner somewhere. However, a fast modern machine is of course much more desirable for all these long computations :))

This forums is great! Thanks for all the expert advice and the quick solution!

mecej4 · ‎11-12-2010

The README.1st file in the repository that you listed contains the following note:

"It should be noted that the program overruns array boundaries in the parsing 
routines. When stack and array bound checking is used in the compilation, 
errors occur (?)"

And, indeed, it does. Not only are arrays overrun, but at least one variable was used without initialization. Such code might have worked correctly on some machine some day. Today, I would consider it buggy.

abhimodak · ‎11-12-2010

I did a few random searches and I see:

(1) "rmi=0.25*(dx-dy)/ct" as the third executable line in file /NLSL.MOMD/matrll.f
(2) No implicit none in caldlm.f

While this does not mean that it is the cause of the issue you see, as mecej4 and others have commented one needs to be careful about numerical precision. There are some very good articles by Steve Lionel on this topic.

It is good that using /real:64 is working for you now. Just in case that you have to write new Fortran program, you may want to read the precision related articles.

Abhi

Select Your Language

Using Intel.com Search

Quick Links

Recent Searches

Advanced Search

Only search in

Compiling with SIMD (SSE2, SSE3) dramatically lowers precision, why?