ftz vs. fpe0

David_M_17 · ‎10-06-2016

I am working with some Fortran code that gives valid results when we compile with -fpe0 -no-vec, but fails if we use -ftz -no-vec.

The only difference I can see between -ftz and -fpe0 regarding floating point behavior is that both flush denormals to zero on SIMD registers while only -fpe0 also flushes denormals to zero in the x87 instructions. We don't believe there should be any x87 instructions in the code. The current compiler option is -xCORE-AVX2 -no-vec and either -ftz or -fpe0. This is with the Intel 16.0 compilers (16.0.4); Linux of course. Is there a way to turn off x87 code generation so everything is done in the SIMD execution units (even if it is just scalar mode)?

-fpe0 causes the code to abort when an exception occurs, while -ftz will let us go along happily calculating NaNs and infinities. But if everything is done in the SIMD registers, which floating-point values will behave differently?

David M

mecej4 · ‎10-06-2016

If you are running a 32-bit a.out on Linux-x86 or x64, the calling convention is to pass the return value of a function of type float or double in the x87 register ST0, even if the rest of the function code uses only the SIMD registers.

David_M_17 · ‎10-10-2016

That is very good to know. This is a 64-bit application, and all of the values passed in and out of the subroutine are arrays. I notice that with -fpe0 and -xCORE-AVX2 the loops (peel, main, remainder) all vectorize. When I use -fpe0 and -xMIC-AVX512, the remainder and peel loops do not vectorize, but it seems the main loop does. Is there extra cost with -fpe?

Martyn_C_Intel · ‎10-21-2016

Hi David,

A significant difference between your two scenarios is that -fpe0 sets -fp-speculation strict (since exceptions are being deliberately unmasked), whereas -ftz does not. I suppose that with -ftz, you are unmasking exceptions in some other way. Please try -fp-speculation strict with -ftz and see if it makes a difference.

It's true that we usually see the effect of -fp-speculation strict in vectorized loops. Background for the benefit of others reading the thread:

The vectorizer assumes the default floating-point environment (with FP exceptions masked). It may speculatively execute instructions that could cause exceptions, in order to enable efficient vectorization, e.g.:

DO I=1,N
   IF(A(I).GE.0) B(I) = SQRT(A(I))
ENDDO

The compiler may evaluate the SQRT for all values of I, potentially causing exceptions when A(I) is negative, but store only those for which A(I) >= 0 . The way to prevent this is with -fp-speculation safe (or strict).
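
For anyone who wants to try this on their own compiler, here is a minimal sketch of the loop above wrapped in a compilable unit. The module and routine names (guarded_sqrt_mod, guarded_sqrt) are made up for illustration and are not part of the original code; the array bounds are taken from the arguments rather than from N.

```fortran
! Hypothetical names; only the loop body comes from the post above.
module guarded_sqrt_mod
  implicit none
contains
  ! Store sqrt(a(i)) in b(i) only where a(i) is non-negative.
  ! Elements of b whose a(i) is negative are left untouched.
  subroutine guarded_sqrt(a, b)
    real, intent(in)    :: a(:)
    real, intent(inout) :: b(:)
    integer :: i
    do i = 1, min(size(a), size(b))
      ! Under the default (masked) FP environment a vectorizer may evaluate
      ! sqrt for every element and keep only the results that pass the test.
      if (a(i) >= 0.0) b(i) = sqrt(a(i))
    end do
  end subroutine guarded_sqrt
end module guarded_sqrt_mod
```

Calling guarded_sqrt with a = [4.0, -1.0, 9.0] and b preset to some marker value should leave b(2) at the marker. Whether the square root of the negative element is ever evaluated (and can therefore raise an exception when exceptions are unmasked) depends on how the compiler vectorizes the loop and on the -fp-speculation setting.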

Regards,

Martyn

Joachim_Herb · ‎10-26-2016

Hi Martyn,

In our code, we also hit exactly such a problem: division by zero in a vectorized loop.

IF(F1(K).GT.0.)X(N,K)=F2(K)/F1(K)

Now what is the recommended way to "fix" this problem?

Thank you for your help

Joachim

jimdempseyatthecove · ‎10-26-2016

TEMP = F1(K)
IF(TEMP.LE.0.) TEMP = 1.0
IF(F1(K).GT.0.)X(N,K)=F2(K)/TEMP

The above may work provided you can structure TEMP such that the compiler can figure out it will not be used later. Perhaps name it TempToProtectDivideByZero, then not use it afterwards.
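
As a rough sketch of how that might look inside a complete loop: the module and routine names below are invented for illustration, and X is simplified to a rank-1 array (the original indexes X(N,K)), so treat this as an assumption about the surrounding code rather than the actual routine.

```fortran
! Hypothetical names; X is simplified to one dimension for this sketch.
module guarded_divide_mod
  implicit none
contains
  ! Compute x(k) = f2(k) / f1(k) only where f1(k) is positive.
  ! The divisor is replaced by 1.0 elsewhere, so a speculatively
  ! executed division never sees a zero or negative denominator.
  subroutine guarded_divide(f1, f2, x)
    real, intent(in)    :: f1(:), f2(:)
    real, intent(inout) :: x(:)
    real    :: temp   ! local to the routine and not used after the loop
    integer :: k
    do k = 1, min(size(f1), size(f2), size(x))
      temp = f1(k)
      if (temp <= 0.0) temp = 1.0
      if (f1(k) > 0.0) x(k) = f2(k) / temp
    end do
  end subroutine guarded_divide
end module guarded_divide_mod
```

Elements of x where f1(k) is zero or negative keep their previous values, exactly as in the original IF statement. The extra assignment costs a compare and a select per element, which may or may not be cheaper than compiling with -fp-speculation safe; that would need to be measured.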

Jim Dempsey

Martyn_C_Intel · ‎10-26-2016

> in our code, we also hit exactly such a problem: Division by zero in a vectorized loop. Now what is the recommended way to "fix" this problem?

> IF(F1(K).GT.0.)X(N,K)=F2(K)/F1(K)

As stated in my previous post, compile with -fp-speculation safe (or -fp-speculation strict). This will prevent the compiler from speculatively doing the division before it has evaluated the IF condition.

Joachim_Herb · ‎10-28-2016

Thank you for your answers. This compiler option does in fact avoid the problem. I will report back on what impact it has on the overall performance of our code.

yuriisig · ‎05-07-2017

The well-known Gaussian program (www.gaussian.com) is compiled with the following flags together: -ftz, -fpe0 and -fp-speculation=safe.

David_M_17 · ‎06-26-2017

To provide an update - changing from -fpe0 to -ftz -fp-speculation strict provided a 5% performance improvement. Thank you Martyn

Martyn_C_Intel · ‎06-26-2017

Thanks for letting us know.

Abrupt underflow (flushing denormal results to zero) is done in hardware for SSE instructions, so there's essentially no cost (whereas doing arithmetic with denormal values is slow). There is no hardware to flush denormal results to zero for x87 instructions. When you use -fpe0, results have to be tested and denormals flushed to zero in software. This might be why you see slightly worse performance with -fpe0. Even 64-bit applications may sometimes use x87. I think some math functions may use them for the extra precision or dynamic range, for example.
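
A quick way to see which underflow mode a given build is actually running in is to produce a result just below the smallest normal number and check whether it comes back as zero. The sketch below is only an illustration; the module and function names are invented, and the VOLATILE attribute is there solely to keep the compiler from folding the multiplication at compile time.

```fortran
! Hypothetical names; a small probe for abrupt vs. gradual underflow.
module underflow_probe_mod
  implicit none
contains
  ! Return half of the smallest positive normal single-precision number.
  ! With gradual underflow the result is a nonzero denormal;
  ! with flush-to-zero active the result is 0.0.
  function half_of_smallest_normal() result(y)
    real :: y
    real, volatile :: x   ! volatile forces the multiply to happen at run time
    x = tiny(1.0)
    y = x * 0.5
  end function half_of_smallest_normal
end module underflow_probe_mod
```

Printing half_of_smallest_normal() == 0.0 from a main program should give F under gradual underflow and T when flush-to-zero is in effect for SSE/AVX arithmetic. As far as I know, -ftz takes effect through the main program's startup code, so the flag needs to be present when the main program is compiled.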