I have a program for which a Debug build runs ok, but a Release build doesn't (it runs, but results are wrong). So I changed Release Optimization from /O2 to /Od and then it runs ok. Is this an indication that the program has a bug somewhere?
A weird part about this is that the /O2 code fails to run correctly on one win7 system, but runs successfully on another win7 system. The program is built statically and its only dependencies are KERNEL32.DLL and IMAGEHLP.DLL
I see an issue.
In the optimized code, the optimization chose to produce the reciprocal of a register as opposed to performing a divide.
IMHO the produced reciprocal is abysmally off.
0032103F 0F 53 DA rcpps xmm3,xmm2
where:
XMM20 = +2.00000E+000 XMM21 = +2.00000E+000 XMM22 = +0.00000E+000 XMM23 = +0.00000E+000
XMM30 = +4.99878E-001 XMM31 = +4.99878E-001 XMM32 = +1.#INF0E+000 XMM33 = +1.#INF0E+000
IOW the reciprocal of 2.0 became 0.499878
Add /fp:precise to the affected code
2.0000000000E+00 2.0000000000E+00 1.0000000000E+00 1.0000000000E+00
40000000 40000000 3F800000 3F800000
Steve, is there an option other than /fp:precise (which is broad in nature) that instructs the compiler not to use a fast reciprocal?
Also, in this case, the accuracy of the approximate fast reciprocal is less than desirable.
Jim Dempsey
BTW: rcpps: Compute the approximate reciprocal of packed single-precision (32-bit) floating-point elements in a, and store the results in dst. The maximum relative error for this approximation is less than 1.5*2^-12.
1.5 / 4096 = 0.0003662109375 (~ 3.4 decimal places)
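A quick check of that bound against the register dump earlier in the thread. This is only a sketch: it assumes the dumped value 4.99878E-001 is exactly 0.5 - 2^-13 (the dump shows just six significant digits), and the variable names are illustrative.

```python
import math

# Documented maximum relative error of RCPPS.
bound = 1.5 * 2.0**-12

# True reciprocal of 2.0, and the value seen in XMM3.
# ASSUMPTION: the dumped 4.99878E-001 is exactly 0.5 - 2**-13.
exact = 0.5
observed = 0.5 - 2.0**-13

rel_err = abs(observed - exact) / exact

print(f"documented bound   = {bound:.10f}")
print(f"observed rel. err. = {rel_err:.10f}")
print(f"decimal digits guaranteed by the bound = {-math.log10(bound):.2f}")
print(f"bits of accuracy in the observed value = {-math.log2(rel_err):.1f}")
```

Under that assumption the observed relative error is 2^-12, which is inside the documented bound, so the instruction is behaving as specified.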
Jim Dempsey
Jim has found the problem! If the processor has an instruction to compute a reciprocal to the same precision as multiply and divide instructions, using that (instead of a divide instruction) as a means of optimization would be fine. RCPPS does not meet this criterion. A low-accuracy approximation such as that generated by RCPPS is appropriate for use as a starting value in an iterative process. I feel that its use in the example code is inappropriate, and should be considered an optimizer bug.
Please discuss or state opposing opinions.
P.S., added on 16 March 2020:
Examination of the assembly code and running in the debugger shows that, in fact, an initial approximation x0 to 1/N is produced using RCPPS, and one step of Newton-Raphson is then applied to obtain an improved approximation x1. For N = 2.0, x0 = 0.49975589, and x1 = 0.49999997 (Z'3EFFFFFF'), which is off in the LSB.
I see now that my claim that integer powers of 2 are represented exactly in the FPU has been proven wrong. I had not been aware that the compiler generated reduced precision instructions such as RCPPS. A compiler option to avoid the use of RCPPS and its ilk would be helpful to users. With Gfortran, for instance, I observed the same behavior as above, but only after I used the options -O3 -funsafe-math-optimizations -ffinite-math-only -fno-trapping-math.
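The off-by-one-LSB result in the P.S. can be reproduced without the hardware instruction. The sketch below simulates single-precision arithmetic in Python by rounding to binary32 after every operation. It assumes the seed is 0.5 - 2^-13, i.e. the 4.99878E-001 shown in the register dump earlier in the thread (the x0 quoted in the P.S. differs from the dump in the fourth decimal place; the dump value is the one for which this simulation lands on Z'3EFFFFFF'). The helper names `f32`, `bits` and `newton_step` are made up for illustration, and the compiler's actual instruction sequence (for example, fused multiply-add) may differ.

```python
import struct

def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def bits(x):
    """Return the binary32 bit pattern of x as 8 upper-case hex digits."""
    return struct.pack(">f", x).hex().upper()

def newton_step(a, x0):
    """One Newton-Raphson step for 1/a: x1 = (2 - a*x0) * x0,
    rounding to binary32 after each operation."""
    t = f32(a * x0)
    r = f32(2.0 - t)
    return f32(r * x0)

a = 2.0
# ASSUMPTION: seed taken from the register dump, not from real RCPPS output.
x0 = f32(0.5 - 2.0**-13)
x1 = newton_step(a, x0)

print("x0 =", repr(x0), bits(x0))
print("x1 =", repr(x1), bits(x1))
print("1/a =", repr(0.5), bits(0.5))
```

With this seed the refined value is the largest binary32 number below 0.5, one unit in the last place short of the exact answer 3F000000.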
Performs a SIMD computation of the approximate reciprocals of the four packed single-precision floating-point values in the source operand (second operand) and stores the packed single-precision floating-point results in the destination operand. The source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. See Figure 10-5 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for an illustration of a SIMD single-precision floating-point operation.
The relative error for this approximation is:
|Relative Error| ≤ 1.5 ∗ 2^-12
https://www.felixcloutier.com/x86/rcpps
Performing a reciprocal is always dangerous.
Steve, I cannot locate a switch in the IVF V19.1 to specify to the compiler to use reciprocals together with one iteration of Newton-Raphson as outlined in the 64-ia-32-architectures-optimization-manual...pdf
Accuracy   Method                      SSE Performance   AVX Performance
24 bits    (V)DIVPS                    Baseline          1X
~22 bits   (V)RCPPS+Newton-Raphson     2.7X              4.5X
~11 bits   (V)RCPPS                    6X                8X
Is there such an option?
Jim Dempsey
Sorry, I don't know. Please raise this to Intel Support.
jimdempseyatthecove (Blackbelt) wrote:
Steve, I cannot locate a switch in the IVF V19.1 to specify to the compiler to use reciprocals together with one iteration of Newton-Raphson as outlined in the 64-ia-32-architectures-optimization-manual...pdf
Accuracy   Method                      SSE Performance   AVX Performance
24 bits    (V)DIVPS                    Baseline          1X
~22 bits   (V)RCPPS+Newton-Raphson     2.7X              4.5X
~11 bits   (V)RCPPS                    6X                8X
Is there such an option?
Jim Dempsey
This is like the old days, back with the 80286 processor: you had limited memory and often had to choose how to make routines faster. The speed-for-accuracy numbers are interesting. I would have thought Newton-Raphson would have caused a bigger hit, since it is quite a complex routine.
Got to love the modern era - now we have real viruses that Norton cannot kill
John Nichols wrote:
I would have thought Newton Raphson would have caused a bigger hit -- it is quite a complex routine.
The compiler would certainly not generate a call to a general purpose Newton-Raphson routine, external or inline.
The one-iteration Newton-Raphson calculation of 1/A, given A, is:
Use RCPPS to find x_0, a 12-bit approximation to x = 1/A. Calculate x_1 = (2 - A*x_0) * x_0, which gives the 22-bit approximation x_1.
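A rough way to see the jump from about 12 bits to about 22 bits is to apply that single step to a deliberately poor seed and count the accurate bits before and after. The sketch below does this in simulated single precision. The seed is a hypothetical stand-in for RCPPS output: the exact reciprocal scaled down by the worst-case relative error 1.5*2^-12, since the values real hardware returns vary by processor and are often better. The function names and the sample values of A are illustrative only.

```python
import math
import struct

def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def refine(a, x0):
    """x1 = (2 - a*x0) * x0, rounded to binary32 after each operation."""
    return f32(f32(2.0 - f32(a * x0)) * x0)

def accurate_bits(approx, exact):
    """Number of accurate bits, measured as -log2(relative error)."""
    rel = abs(approx - exact) / exact
    return math.inf if rel == 0.0 else -math.log2(rel)

# ASSUMPTION: worst-case seed error taken from the documented RCPPS bound.
worst = 1.5 * 2.0**-12

for a in (3.0, 7.0, 10.0, 1234.5):
    exact = 1.0 / a
    x0 = f32(exact * (1.0 - worst))   # hypothetical stand-in for RCPPS
    x1 = refine(f32(a), x0)
    print(f"A = {a:7.1f}  seed bits = {accurate_bits(x0, exact):5.1f}"
          f"  refined bits = {accurate_bits(x1, exact):5.1f}")
```

The step squares the relative error of the seed, so a worst-case seed error of about 2^-11.4 becomes roughly 2^-22.8 before rounding, which is consistent with the "~22 bits" figure quoted from the optimization manual.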
>>The compiler would certainly not generate a call to a general purpose Newton-Raphson routine, external or inline.
Yes, however, the compiler could generate the additional instructions in line.
Out of curiosity, I looked at Agner Fog's instruction set timings and I do not see how vrcpps + vmulps can beat vdivps (at least not 8x).
Jim Dempsey