optimizer question - Page 2

Brian_Murphy · ‎03-13-2020

I have a program for which a Debug build runs ok, but a Release build doesn't (it runs, but results are wrong). So I changed Release Optimization from /O2 to /Od and then it runs ok. Is this an indication that the program has a bug somewhere?

A weird part about this is that the /O2 code fails to run correctly on one win7 system, but runs successfully on another win7 system. The program is built statically and its only dependencies are KERNEL32.DLL and IMAGEHLP.DLL

jimdempseyatthecove · ‎03-15-2020

I see an issue.

In the optimized code, the optimizer chose to compute the reciprocal of a register rather than perform a divide.

IMHO the computed reciprocal is abysmally off.

0032103F 0F 53 DA rcpps xmm3,xmm2

where:

XMM20 = +2.00000E+000 XMM21 = +2.00000E+000 XMM22 = +0.00000E+000 XMM23 = +0.00000E+000
XMM30 = +4.99878E-001 XMM31 = +4.99878E-001 XMM32 = +1.#INF0E+000 XMM33 = +1.#INF0E+000

IOW the reciprocal of 2.0 became 0.499878

Add /fp:precise to the affected code

2.0000000000E+00 2.0000000000E+00 1.0000000000E+00 1.0000000000E+00

40000000 40000000 3F800000 3F800000

Steve, is there an option other than /fp:precise (which is broad in nature) that instructs to not perform a fast reciprocal?

Also, in this case, the accuracy of the approximate fast reciprocal is less than desirable.

Jim Dempsey

jimdempseyatthecove · ‎03-15-2020

BTW: rcpps: Compute the approximate reciprocal of packed single-precision (32-bit) floating-point elements in a, and store the results in dst. The maximum relative error for this approximation is less than 1.5*2^-12.

1.5 / 4096 = 0.0003662109375 (~ 3.3 decimal places)

Jim Dempsey
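
As a rough check of the arithmetic above, here is a minimal Python sketch. It takes the documented bound of 1.5*2^-12 and the value 4.99878E-001 from the register dump (only six significant digits are known, so the observed error is approximate). The names rcpps_bound, relative_error and guaranteed_bits are made up for illustration.

```python
import math

# Documented RCPPS bound quoted in the thread: |relative error| <= 1.5 * 2^-12
rcpps_bound = 1.5 * 2.0 ** -12


def relative_error(approx: float, exact: float) -> float:
    """Relative error of approx with respect to exact."""
    return abs(approx - exact) / abs(exact)


# Value read off the register dump (printed with six significant digits only).
observed = 4.99878e-1
observed_error = relative_error(observed, 1.0 / 2.0)

# Number of leading bits the bound guarantees to be correct.
guaranteed_bits = -math.log2(rcpps_bound)

print(f"bound            = {rcpps_bound}")
print(f"observed error   = {observed_error:.3e}")
print(f"guaranteed bits  = {guaranteed_bits:.1f}")
print(f"within the bound = {observed_error <= rcpps_bound}")
```

So the 0.499878 result, poor as it looks, is inside what the instruction promises.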

mecej4 · ‎03-15-2020

Jim has found the problem! If the processor has an instruction to compute a reciprocal to the same precision as multiply and divide instructions, using that (instead of a divide instruction) as a means of optimization would be fine. RCPPS does not meet this criterion. A low-accuracy approximation such as that generated by RCPPS is appropriate for use as a starting value in an iterative process. I feel that its use in the example code is inappropriate, and should be considered an optimizer bug.

Please discuss or state opposing opinions.

P.S., added on 16 March 2020:

Examination of the assembly code and running in the debugger show that, in fact, after the initial approximation x0 to 1/N is produced using RCPPS, one step of Newton-Raphson is applied to obtain an improved approximation x1. For N = 2.0, x0 = 0.49975589, and x1 = 0.49999997 (Z'3EFFFFFF'), which is off in the LSB.

I see now that my claim that integer powers of 2 are represented exactly in the FPU has been proven wrong. I had not been aware that the compiler generated reduced precision instructions such as RCPPS. A compiler option to avoid the use of RCPPS and its ilk would be helpful to users. With Gfortran, for instance, I observed the same behavior as above, but only after I used the options -O3 -funsafe-math-optimizations -ffinite-math-only -fno-trapping-math.
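
To make the "off in the LSB" remark concrete, here is a small Python sketch that decodes the hex words quoted in this thread as IEEE-754 single-precision values. The helper names bits_to_float and float_to_bits are hypothetical; only the standard struct module is used.

```python
import struct


def bits_to_float(hex_word: str) -> float:
    """Interpret an 8-digit hex string as an IEEE-754 single-precision value."""
    return struct.unpack(">f", bytes.fromhex(hex_word))[0]


def float_to_bits(value: float) -> str:
    """Return the single-precision bit pattern of value as 8 hex digits."""
    return struct.pack(">f", value).hex().upper()


x1 = bits_to_float("3EFFFFFF")    # the refined reciprocal reported above
half = bits_to_float("3F000000")  # the exact reciprocal of 2.0

print(f"3EFFFFFF -> {x1!r}")
print(f"3F000000 -> {half!r}")
print(f"difference = {half - x1!r} (one unit in the last place below 0.5)")
print(f"2.0 -> {float_to_bits(2.0)}, 1.0 -> {float_to_bits(1.0)}")
```

The two patterns differ only in the last bit of the significand, a gap of 2^-25.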

JohnNichols · ‎03-15-2020

Performs a SIMD computation of the approximate reciprocals of the four packed single-precision floating-point values in the source operand (second operand) and stores the packed single-precision floating-point results in the destination operand. The source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. See Figure 10-5 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for an illustration of a SIMD single-precision floating-point operation.

The relative error for this approximation is:

|Relative Error| ≤ 1.5 * 2^-12

---------------------------------------------------------------------------------------------------------------------------------------------------

https://www.felixcloutier.com/x86/rcpps

Performing a reciprocal is always dangerous.

jimdempseyatthecove · ‎03-15-2020

Steve, I cannot locate a switch in the IVF V19.1 to specify to the compiler to use reciprocals together with one iteration of Newton-Raphson as outlined in the 64-ia-32-architectures-optimization-manual...pdf

Accuracy  Method                 SSE Performance        AVX Performance
 24 bits (V)DIVPS                Baseline               1X
~22 bits (V)RCPPS+Newton-Raphson 2.7X                   4.5X
~11 bits (V)RCPPS                6X                     8X

Is there such an option?

Jim Dempsey

Steve_Lionel · ‎03-15-2020

Sorry, I don't know. Please raise this to Intel Support.

JohnNichols · ‎03-16-2020

jimdempseyatthecove (Blackbelt) wrote:
Steve, I cannot locate a switch in the IVF V19.1 to specify to the compiler to use reciprocals together with one iteration of Newton-Raphson as outlined in the 64-ia-32-architectures-optimization-manual...pdf
Accuracy  Method                 SSE Performance        AVX Performance
 24 bits (V)DIVPS                Baseline               1X
~22 bits (V)RCPPS+Newton-Raphson 2.7X                   4.5X
~11 bits (V)RCPPS                6X                     8X
Is there such an option?

Jim Dempsey

This is like the old days, back with the 80286 processor: you had limited memory and often had to choose how to make routines faster. The speed-versus-accuracy numbers are interesting. I would have thought Newton Raphson would have caused a bigger hit -- it is quite a complex routine.

Got to love the modern era - now we have real viruses that Norton cannot kill

mecej4 · ‎03-16-2020

John Nichols wrote:
I would have thought Newton Raphson would have caused a bigger hit -- it is quite a complex routine.

The compiler would certainly not generate a call to a general purpose Newton-Raphson routine, external or inline.

The one-iteration Newton-Raphson calculation of 1/A, given A, is:

1. Use RCPPS to find x_0, a 12 bit approximation to x = 1/A.
2. Calculate x_1 = (2 - A*x_0) * x_0, which gives the 22 bit approximation x_1.
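
The two steps above can be sketched in Python by emulating single precision, rounding to binary32 after every operation. This is only an illustration, not the code the compiler emits. RCPPS itself is not executed here: x0 is an ASSUMED stand-in equal to 0.5 - 2^-13, chosen because it prints as the 4.99878E-001 seen in the register dump. The helper names f32 and newton_reciprocal_step are hypothetical.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def newton_reciprocal_step(a: float, x0: float) -> float:
    """One Newton-Raphson step x1 = (2 - a*x0) * x0, rounded after each operation."""
    t = f32(a * x0)
    t = f32(2.0 - t)
    return f32(t * x0)


a = 2.0
x0 = f32(0.5 - 2.0 ** -13)  # ASSUMPTION: stand-in for the RCPPS result, not a real RCPPS call
x1 = newton_reciprocal_step(a, x0)

err0 = abs(a * x0 - 1.0)  # relative error of the starting value
err1 = abs(a * x1 - 1.0)  # relative error after one step

print(f"x0 = {x0!r}, relative error = {err0!r}")
print(f"x1 = {x1!r}, relative error = {err1!r}")
print(f"x1 bits = {struct.pack('>f', x1).hex().upper()}")
```

For this input the error is squared by the step (2^-12 becomes 2^-24), and under the stated assumption the emulation lands on the Z'3EFFFFFF' pattern reported earlier in the thread, one bit short of the exact 0.5.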

jimdempseyatthecove · ‎03-16-2020

>>The compiler would certainly not generate a call to a general purpose Newton-Raphson routine, external or inline.

Yes, however, the compiler could generate the additional instructions in line.

Out of curiosity, I looked at Agner Fog's instruction set timings and I do not see how vrcpps + vmulps can beat vdivps (at least not by 8x).

Jim Dempsey

Barbara_P_Intel · ‎03-16-2020

Can you check -fltconsistency and -prec-div in the Developer Guide?