I've been writing some vectorized C++ code to compute the magnitude of a vector of 8-byte complex samples using AVX instructions on an Intel Xeon Sandy Bridge CPU. I recently noticed some minor differences between my output and that of the MKL vcAbs function. I've been able to match the speed of the MKL function by using the _mm256_rcp_ps() and _mm256_rsqrt_ps() intrinsics to approximate the square root of the magnitude squared (vs. using _mm256_sqrt_ps()), but for some reason my answers come out slightly different from those produced by vcAbs. For example, the hand-coded AVX might produce a magnitude value of 0.511695 where vcAbs produces 0.511643 for the same sample. When running my test, I have the VML mode set to VML_EP. If I modify the code to use _mm256_sqrt_ps() instead and set the VML mode to VML_HA, the answers match. I'm just curious whether there is more to the algorithm, or something else going on inside the VML function, that affects the answer in VML_EP mode beyond just using the reciprocal approximations. I've tried re-ordering my function calls to use rcp followed by rsqrt as well as rsqrt followed by rcp, but it doesn't seem to make a difference.
For reference, I am working on a CentOS 6 platform, using MKL v11.0, and compiling with Intel Composer XE 2013. I've been using the compile flags -O3, -xHost, -no-prec-div, and -fp-model fast, among others. Unfortunately this was all done at my office, and I don't have access to my source code to post, or to the specific input/output values I was providing to the algorithm. I'm just wondering if someone can help me understand what other factors might cause a difference in the answers, and what optimizations might be going on under the hood with VML_EP. I *think* the core algorithms (my rsqrt/rcp vs. VML) must be similar, as the execution speed is similar on a vector of 16384 elements, looping 1e6 times.
Thanks in advance,
Sergey, thank you for your response. I understand that VML_EP trades accuracy for speed; what I am trying to determine is how it makes that trade-off in the particular case of vcAbs. Before I can decide if the trade-off is acceptable for my particular application, I need to understand how it is getting its speedup (i.e. "how imprecise is imprecise?"). So far I've been able to simulate the accuracy/speed trade-off for other MKL methods by using explicit AVX/SSE intrinsics, which has helped me understand whether those trade-offs are acceptable. In this particular case I assumed the VML function was using rcp/rsqrt to achieve its speedup; however, given that I can get similar speed but cannot match the answers identically using these intrinsics, I believe that either A) I have made a faulty assumption, or B) vcAbs is doing something else under the hood in addition to these instructions which affects the precision (either making it more or less precise, I'm not sure which) but not the speed.
Can you help me understand what instructions vcAbs is using (or what compiler flags I could use) that might make the answers match?
Regarding "how imprecise is imprecise", please refer to "Data Types, Accuracy Modes, and Performance Tips" from "Intel® Math Kernel Library Reference Manual": "In the VML_EP accuracy mode, approximately half of the mantissa bits are correct."
As both of you said, VML_EP will rely on the rcp and rsqrt instructions, with no effort to get more than about 12 bits of precision.
For compiled source code, some of the relevant options for comparison might be
-complex-limited-range (included in -fp-model fast=2)
This will generate in-line code for divide and sqrt which doesn't protect against loss of range, so it would be valid roughly for ±1e18 in float data types. gcc has a similar option, which is included in -ffast-math.
This (together with limited-range and -ftz) seems equivalent to VML_EP half precision.
-imf-accuracy-bits=20 might be expected to have about the same effect as -no-prec-div -no-prec-sqrt (which are implied by -fp-model fast).
You would probably need to exercise these options to generate asm code in order to get more insight into where those rounding differences occur, but I don't think any of these options are designed to help you get consistently 13 bits of accuracy rather than 12.
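A hypothetical compile line for doing that (the file name is a placeholder; flags as discussed above) that dumps the generated asm for inspection:

```shell
# Emit vectorized asm for inspection; magnitude.cpp is a placeholder name.
icpc -O3 -xHost -fp-model fast=2 -no-prec-div -no-prec-sqrt \
     -imf-accuracy-bits=20 -S -fsource-asm magnitude.cpp
```

The -S switch writes magnitude.s instead of an object file, and -fsource-asm interleaves the source lines so you can see which statements produced the rcp/rsqrt sequences.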
Evgueni, thanks a lot for pointing me toward that documentation. I had been looking through the reference manual previously but somehow completely missed the note on EP and number of correct mantissa bits.
Tim, thanks for suggesting the compiler flags, I will try them tomorrow and see if I can get my vectorized answers to match MKL. The reason I'm so keen on getting the answers to match is that I am using this magnitude calculation as an interim step in a processing chain with other instructions (hence why I am using AVX intrinsics rather than chaining MKL commands), however elsewhere in my code I use MKL and I want to make sure my interim answer matches the output from the other parts of my code identically if possible.
Sergey, thank you, I will check the links out.
Tim, I tried the compiler flags but got the same results as before (though it's good to know that there is that fine a level of control over floating-point accuracy, as previously I didn't really know about flags like -imf-accuracy-bits).
After trying all manner of optimization flags from the compiler reference manual to no avail, I realized I may be a victim of my own stupidity. I believe the difference between my implementation and VML is algorithmic, not flag-related. I had been computing the magnitude squared, then using rsqrt followed by rcp to compute the magnitude. I'm betting VML computes the magnitude squared, then rsqrt, followed by a multiply of the magnitude squared by the reciprocal magnitude. This is not only fewer instructions, but I would guess it has higher accuracy as well, since it is only subject to the rounding error of the magnitude-squared computation and the multiply, whereas rcp is an approximation rather than a true inversion.
Unfortunately I won't be able to test this theory until Monday on the AVX server at my office, but I will post how it turns out. Thank you all for your helpful posts; I have learned a lot about floating-point precision and consistency in both MKL and icc/icpc.
Yes, the algorithm you describe seems a likely one for the combination of limited range and limited accuracy. Accuracy can be boosted to within 4 ulps of full precision by iteration. That's what's expected from the compiler for the single-precision "throughput" vectorization optimization when you set the options -complex-limited-range -no-prec-div -no-prec-sqrt. Both rsqrt and rcp are limited-accuracy (roughly half-precision) instructions. Ivy Bridge also introduces 16-bit half-float memory <=> 32-bit register conversions.
I don't know what promises were exchanged between marketing and customers to induce all this 12+-bit precision float support.
Follow up: it was indeed an algorithmic issue, not a flag one. When I switched my algorithm to multiply the magnitude squared by the rsqrt of the magnitude squared, I got the same results as the MKL VML vcAbs routine with mode=VML_EP.
On a side note, when I compare this calculation to a simple "for" loop using the Intel math library cabs routine, I was unable to produce any difference in either speed or accuracy no matter what compiler flags I used. I'm not sure if I missed something or if I'm just using the flags incorrectly, but it didn't seem to matter what I set for -imf-accuracy-bits, -complex-limited-range, -fp-model, -x, etc.; the answer and speed remained constant. I also tried a "for" loop computing sqrtf of the magnitude squared, and once again was not able to change the speed or accuracy with the compiler flags. Should I have expected these flags to affect the performance of methods in math.h or cmath?