Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

loss of precision when replacing division by multiplication

mguy44
Beginner
Hello,

I searched the forum and the knowledge base before writing this, but I didn't find anything relevant!

I work on a Fortran/MPI application and I have to optimize it. The heart of the code is a flux-computing routine with some divisions by the same number:
DO k = Kmin-3, Kmax+3
   DO j = Jmin-3, Jmax+3
      DO i = Imin-3, Imax+3
         Rho = Q(5,i,j,k)
         ...
         Ux = Q(1,i,j,k) / Rho
         Uy = Q(2,i,j,k) / Rho
         Uz = Q(3,i,j,k) / Rho
         ...
      END DO
   END DO
END DO

If I replace these divisions by multiplications by the reciprocal, say Rhoq = 1.0d0 / Rho, like:
DO k = Kmin-3, Kmax+3
   DO j = Jmin-3, Jmax+3
      DO i = Imin-3, Imax+3
         Rho  = Q(5,i,j,k)
         Rhoq = 1.0d0 / Rho
         ...
         Ux = Q(1,i,j,k) * Rhoq
         Uy = Q(2,i,j,k) * Rhoq
         Uz = Q(3,i,j,k) * Rhoq
         ...
      END DO
   END DO
END DO

I get quite large differences after some iterations when I compare unformatted files of the solution for these two versions:
Errors on Q(1:5,:,:,:): 0.66905E-10 0.10867E-09 0.00000E+00 0.56811E-07 0.14566E-12

All variables are declared and in double precision; all scalars are promoted to double precision too.
Of course, this is the only thing that has changed between the two runs.
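For reference, the comparison amounts to something like the following sketch (file names, record layout, and array extents here are illustrative, not the real ones):

! Sketch: per-component maximum absolute difference between two runs
PROGRAM compare_runs
   IMPLICIT NONE
   INTEGER, PARAMETER :: ni = 64, nj = 64, nk = 64   ! illustrative extents
   REAL(8), ALLOCATABLE :: qa(:,:,:,:), qb(:,:,:,:)
   INTEGER :: m
   ALLOCATE (qa(5,ni,nj,nk), qb(5,ni,nj,nk))
   OPEN (10, FILE='solution_div.dat', FORM='unformatted')
   READ (10) qa
   CLOSE (10)
   OPEN (11, FILE='solution_mul.dat', FORM='unformatted')
   READ (11) qb
   CLOSE (11)
   DO m = 1, 5
      PRINT '(A,I1,A,ES12.5)', 'Error on Q(', m, ',:,:,:) : ', &
            MAXVAL(ABS(qa(m,:,:,:) - qb(m,:,:,:)))
   END DO
END PROGRAM compare_runs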


The sequence of compilation options is
ifort -xSSE4.2 -O3 -pad -opt-prefetch -mp1 -ftz -prec-div -fp-model strict -132 -implicitnone -traceback -g -sox -c

I think I already have the right compilation options to keep good precision with divisions.

The ifort version is ifort (IFORT) 11.1 20100806
Jobs are run on an SGI Altix ICE 8400 LX with Intel Xeon X5650 processors.


What could be done in order to get closer results in this situation?
Thank you in advance for your advice.

Regards


Ron_Green
Moderator
A lot depends on the typical values of Rho relative to 1.0 and Q. What are typical values of Rho, roughly: very large numbers, very small numbers? And what about Q - small, large, around 1.0?

Also, you're flushing denormals with -ftz. Is that important? Are you generating denormalized values? Have you tried leaving off -ftz and using -no-ftz -fpe0 to trap numeric issues?

ron
mguy44
Beginner
Hello Ronald,

It seems that Rho and Q may have very different values (from 1E-10 to 1E+10, depending on the place and the time), because a shock wave is running across the computing domain.
I thought the "magnitude" of the numbers (very large or very small) doesn't matter in multiplication and division? Of course, it matters with additions and subtractions.
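A tiny check of what I mean (a sketch with made-up values):

! Multiplication keeps the relative error at machine-epsilon level
! whatever the magnitudes; subtraction of nearly equal numbers does not
PROGRAM mag_demo
   IMPLICIT NONE
   REAL(8) :: a, b
   a = 1.0d+10
   b = 1.0d-10
   PRINT *, 'product error      :', ABS(a*b - 1.0d0)  ! ~ epsilon, or zero
   a = 1.0d0 + 1.0d-15
   b = 1.0d0
   PRINT *, 'cancellation result:', a - b             ! ~11% off the true 1.0d-15
END PROGRAM mag_demo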

I flush denormals with -ftz because, if I do not, I have problems with some math functions like ATAN:
IF ( ABS(uxIP) < eps ) THEN
   theta = 0.5d0 * Pi
ELSE
   theta = ATAN( ABS(uyIP/uxIP) )
END IF

This kind of computation raises an underflow FPE when uyIP/uxIP is near 1E-10 inside the call to ATAN.
So I flush to zero and the program runs fine.
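That said, I wonder whether ATAN2 would avoid the division (and therefore the underflow) entirely; a sketch of what I mean, untested in this code:

! Possible alternative (untested): ATAN2 avoids the explicit
! division uyIP/uxIP, so there is nothing to underflow, and it
! returns 0.5d0*Pi directly when uxIP == 0 (and uyIP /= 0)
theta = ATAN2( ABS(uyIP), ABS(uxIP) )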

Moreover, the code runs well too (no crash) with the following options :
-O0 -ftrapuv -fpe0 -debug all -check all -warn general -debug-parameters all -fp-model strict -fp-stack-check -debug inline-debug-info -ftz


I did not try -no-ftz -fpe0, but I will.

jimdempseyatthecove
Honored Contributor III
You'd think that with doubles any difference would be down at the 15th significant digit or so.
Try adding code to make the determination.

Rho = Q(5,i,j,k)
Rhoq = 1.0d0 / Rho
Rhoqq = 1.0d0 / Rhoq

Assuming you can compile the above without the optimizer eliminating a statement or a variable, what is the difference between Rho and Rhoqq relative to Rho?

Diff = dabs(Rho - Rhoqq)
Ratio = Diff / dabs(Rho)

If your reciprocal was used for optimization purposes (under the assumption that multiplication is faster than division), then you could possibly choose whether to use the inverse-and-multiply based on the relative size of Rho to 1.0d0.

Run a test to check for the performance difference (and the value difference).
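A minimal harness for that test could look like the following untested sketch (array size and data are made up; note that in your real loop one reciprocal serves three multiplies, which a one-to-one comparison like this understates):

! Sketch: time and compare division vs. reciprocal-and-multiply
PROGRAM div_vs_mul
   IMPLICIT NONE
   INTEGER, PARAMETER :: n = 10000000          ! illustrative size
   REAL(8), ALLOCATABLE :: q(:), rho(:), u1(:), u2(:)
   REAL(8) :: rhoq, maxdiff
   INTEGER :: i, t0, t1, t2, rate
   ALLOCATE (q(n), rho(n), u1(n), u2(n))
   CALL RANDOM_NUMBER(q)
   CALL RANDOM_NUMBER(rho)
   rho = rho + 0.5d0                           ! keep rho away from zero
   CALL SYSTEM_CLOCK(t0, rate)
   DO i = 1, n
      u1(i) = q(i) / rho(i)                    ! division version
   END DO
   CALL SYSTEM_CLOCK(t1)
   DO i = 1, n
      rhoq  = 1.0d0 / rho(i)                   ! reciprocal-and-multiply version
      u2(i) = q(i) * rhoq
   END DO
   CALL SYSTEM_CLOCK(t2)
   maxdiff = MAXVAL(ABS(u1 - u2) / MAX(ABS(u1), 1.0d-300))
   PRINT *, 'division time      :', REAL(t1 - t0) / REAL(rate)
   PRINT *, 'multiply time      :', REAL(t2 - t1) / REAL(rate)
   PRINT *, 'max relative diff  :', maxdiff
END PROGRAM div_vs_mul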

Jim Dempsey
TimP
Honored Contributor III
Ah, the old days of CFD shock capture algorithms. Variation in the captured shock position due to numerical changes beyond 15 significant decimals casts doubt on robustness of the algorithm (at least for the purpose of finding that position).
My customers (not in CFD) do a fair number of these optimizations, and keep them only when they can measure improvement in performance with no QA indications. I saw that you set -prec-div to stop such replacements happening where you don't specify them.
mguy44
Beginner
Here is what I get from the test in the code at the end of the time iterations:
Proc = 15 Rhoqmax = 0.1115489914955E-15
Proc = 17 Rhoqmax = 0.0000000000000E+00
Proc = 16 Rhoqmax = 0.1569561440952E-15
Proc = 23 Rhoqmax = 0.0000000000000E+00
Proc = 15 Rhoqmax = 0.1115489914955E-15
And in the code I try:
Rhoq = one / Rho
Ux = Q(1,i,j,k) * Rhoq
Uy = Q(2,i,j,k) * Rhoq
Uz = Q(3,i,j,k) * Rhoq
Rhoqq = one / Rhoq
rDiff = ABS (Rho - Rhoqq)
Rhoqmax = MAX ( rDiff / ABS(Rho), Rhoqmax )
and
REAL(8) :: Rhoq, Rhoqq, rDiff, Rhoqmax
It seems OK, doesn't it?
mecej4
Honored Contributor III
> "It seems OK, isn't it ?"

I am afraid that this may be another instance of a programmer falling into a well-known trap of one's own making: a small change is made to a large code (in this case, multiplying three times by 1/q instead of dividing that many times by q), and changes in the results produced somewhere else in the program are believed to be caused solely by this change. While that is a possible explanation, there may be a hundred other more important causes.

Since you are asking us to support your conclusions about how the entire program behaves on the basis of only a tiny fragment, all that I can say in answer to your question is an unenthusiastic "It may be OK, if you say so."

To me, at least, the output values of Proc and Rhoqmax mean nothing since I do not have an inkling of the complete picture.

If you think that you have found an instance where the compiler is not putting out correct code, you will need to invest time into the creation of a small but complete "reproducer". Then there will be something that all can work with without having to know subjects such as gas dynamics.
jimdempseyatthecove
Honored Contributor III
I concur with mecej4.

If I were to guess at what is happening with your program...

You may have a convergence routine that is testing on IF(xxx .eq. yyy) as opposed to IF(dabs(xxx-yyy) .lt. Epsilon) where Epsilon is a reasonable value for the values of xxx and yyy together with your solution requirements.

For all practical purposes, 0.1115489914955E-15 ~= 0.0

You need to make your program insensitive to these small variations.
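For example, a tolerant comparison of this kind (a sketch only; the tolerance must be chosen for your values and your solution requirements):

! Sketch: tolerant comparison to use in place of IF (xxx .EQ. yyy)
LOGICAL FUNCTION nearly_equal(x, y, tol)
   IMPLICIT NONE
   REAL(8), INTENT(IN) :: x, y, tol
   ! scale the tolerance by the larger magnitude, with 1.0d0 as a floor
   nearly_equal = ABS(x - y) <= tol * MAX(ABS(x), ABS(y), 1.0d0)
END FUNCTION nearly_equal

Then the convergence test becomes IF (nearly_equal(xxx, yyy, someTol)), with someTol picked for your data.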

Jim Dempsey
mguy44
Beginner
Hello,
thank you for your messages.
I was just asking for an opinion on the results of this small test, not on the behaviour of the whole program. You do not have the program, and it is hard to talk about something one does not have; I understand this, it's OK.
I now have to run the program with the -no-ftz -fpe0 options, to see if its behaviour changes.
Of course, this program is not mine and I can't send it to anyone, even if someone asks for it.
I am careful with floating-point comparisons when I have to compare real numbers; thank you for the advice.
styc
Beginner

Variable names look like conservative-form Navier-Stokes to me. Are you dealing with turbulent flows, where the accumulated effect of even the smallest perturbations is expected to grow exponentially?

mguy44
Beginner
I'm sorry, but I'm not the scientist! My job is to optimize this application.
I have some information about what is done in the code:
"The implementation of immersed boundary method using the direct-forcing concept is applied to a high-order WENO scheme to simulate the complex fluid-solid interactions.
Keywords: Immersed boundary method, Shock obstacle interaction, WENO scheme"

I ran the code with -no-ftz -fpe0 and I get underflows elsewhere in the application.
So the problem seems to be more complex than I thought.


jimdempseyatthecove
Honored Contributor III
>> I'm sorry, but I'm not the scientist! My job is to optimize this application.

On the newer processors, floating-point division (SSE) is between 22 and 35 clock ticks for latency and throughput, while multiplication is about 5 clock ticks for latency and 1 clock tick for throughput. Therefore it would seem appropriate for a loop such as you sketched to use oox = 1.0D0 / x and then multiply. The problem, as you have experienced, is that in some cases a very small loss in precision, which may be attributable to round-off errors, has a significant impact on the solution.

One thing to do is to revert to using division and spend your time looking for other optimization opportunities. The first thing to look for is to make sure all your vectorization opportunities are found; the second is to remove unnecessary indexing (e.g., collapse multiply-indexed arrays into singly-indexed dummy arguments via a function call when possible). Then look at parallelization opportunities. After these are tuned to the best of your ability, and remembering your statement "I'm sorry, but I'm not the scientist! My job is to optimize this application.", very carefully go about tweaking the working algorithms. Seemingly innocuous changes may have large undesired effects.

Now with respect to

oox = 1.0D0 / x ! then use multiply

try experimenting with

! produce s ~= largest number, as a power of 2,
! that can hold an additional integer
! without a drop in round-off
s = SCALE(1.0D0, 52)
sox = s / x    ! use sox in place of oox, then multiply
...            ! list of multiplies here
os = 1.0D0 / s
...            ! repeat list of multiplies here

You may have to do some source-code manipulation so that compiler optimization does not undo your handiwork. Do not combine the two multiplications in the same statement, as the compiler will likely optimize this out and produce the same oox = 1.0D0 / x. You may need to do this using two loops, each loop in a separate source file, with IPO disabled.

The above suggestion is untested. What you are attempting is to trade off a division against two multiplications, where the two multiplications might yield results more consistent with those of the division.
Using throughput clock counts, one division is 22-30 clocks against 2 clocks for the two multiplications, i.e. 10x to 15x faster (plus the additional array fetch and store). Experimentation on your part will indicate whether this yields an improvement without inducing undesired effects on your results data.
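Spelled out as the two passes described above, it might look like this (again an untested sketch; loop bounds and array names are illustrative):

! Pass 1: multiply by s/x, where s = 2**52; scaling by a power of
! two is exact in binary floating point
s   = SCALE(1.0D0, 52)
sox = s / Rho
DO i = Imin, Imax
   U(i) = Q(i) * sox
END DO
! Pass 2 (in a separate source file, IPO disabled, so the compiler
! cannot fuse the two multiplies back into one): undo the scaling
os = 1.0D0 / s
DO i = Imin, Imax
   U(i) = U(i) * os
END DO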

Jim Dempsey

TimP
Honored Contributor III
Also, you might consider that a SIMD parallel divide ought to be faster on the latest generation Core i3/i5/i7 CPUs, while avoiding the numerical questions, if you could persuade the compiler to generate it. So you may be spending time on an "optimization" which might be inadvisable in the near future, even if it works on last year's CPU.
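For instance, a stride-1 inner loop like the sketch below (array names are illustrative, not from your code) is the sort of thing the vectorizer can turn into packed divides; -vec-report2 will tell you whether it did:

! Sketch: contiguous inner loop that can vectorize to packed divides
DO i = Imin, Imax
   Ux(i) = Q1(i) / Rho(i)
   Uy(i) = Q2(i) / Rho(i)
   Uz(i) = Q3(i) / Rho(i)
END DO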
mguy44
Beginner
Thank you for your suggestions.
Trying to find a solution led me to find some strange things in the code:
- I have looked for underflows; the code no longer produces any.
- I have looked for divisions by zero (at the beginning Q is null!), and yet those divisions were being computed! That is no longer the case; see the sketch below.
Maybe something was done in the case where Q was null that gave me underflow values (?)
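The kind of guard I mean is something like this sketch (the threshold here is illustrative, not the exact test in the code):

! Sketch: skip the division while the density field is still zero
IF (ABS(Rho) > TINY(1.0d0)) THEN
   Rhoq = 1.0d0 / Rho
ELSE
   Rhoq = 0.0d0   ! or whatever the scheme requires in empty cells
END IF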

So, after some discussions with the people who developed the application, we modified some parts of the algorithms in order to avoid these behaviours. The results are still good, but of course they can no longer be compared with the initial ones.

This code is already parallelized with a classical decomposition of domain using MPI.

I'll take a look at the evolution of CPUs and their performance on floating-point operations.


Regards,






