AVX512 reciprocal approximations

TimP · ‎03-31-2016

In yesterday's Intel webinar it was stated that AVX512 reciprocal approximations are accurate to 28 bits of precision. I guess they should have said this is the goal for a reciprocal approximation plus one iterative refinement. Two of us asked about this in the chat, but it wasn't answered.

It was mentioned that the reciprocal approximations support the double data type for the first time. My take would be that the earlier reciprocal approximations would have required 3 iterative refinements for double, while 2 would be sufficient with AVX512, which may be the reason why compilers for AVX2 and earlier host targets don't choose this method. This was brought up in the context of KNL; maybe it implies that KNL still won't have IEEE divide or sqrt instructions, or, if there are such, that they aren't recommended. There seems to be a continued tendency to market this aspect of Intel(r) Xeon Phi(tm) in an obscure manner.
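The iteration counts above follow from the quadratic convergence of Newton-Raphson refinement, where each step roughly doubles the number of correct bits. A rough sketch of that arithmetic, assuming idealized doubling (rounding error inside the refinement steps is ignored) and taking about 12 bits for the older RCPPS-class estimate, which is an assumption here and was not stated in the webinar:

```python
def refinements_needed(initial_bits, target_bits):
    """Count Newton-Raphson steps, assuming each step doubles the correct bits."""
    steps = 0
    bits = initial_bits
    while bits < target_bits:
        bits *= 2
        steps += 1
    return steps

# Significand widths: 24 bits for binary32, 53 bits for binary64.
# The 12-bit starting point is an assumed figure for the older estimates.
for name, bits in [("12-bit estimate (assumed)", 12),
                   ("14-bit estimate", 14),
                   ("28-bit estimate", 28)]:
    print(name,
          "| steps for single:", refinements_needed(bits, 24),
          "| steps for double:", refinements_needed(bits, 53))
```

This counts bits only. It says nothing about whether the final result is correctly rounded, which is a stricter requirement.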

I suppose that AVX512 would imply adjustments in recommendations on -Qimf-accuracy-bits usage, but maybe that feature would be even less deserving of attention than in the past.

As to whether -Qimf-domain-exclusion would remain important for KNL, with unchanged value translations, that seems implied by the recommendation to tune on KNC.

Nikita_A_Intel · ‎04-01-2016

Please see the Intel® Architecture Instruction Set Extensions Programming Reference, at https://software.intel.com/en-us/isa-extensions. It also describes the new reciprocal approximation instructions, including VRCP28. The double precision version, e.g. VRCP28PD, approximates reciprocals with a relative error of at most 2^-28. This means that the result has 27 ‘correct’ bits in the significand, out of the total of 53. The single precision version will have 23 bits which are ‘correct’, since the result is rounded to single precision format, with 24 bits in the significand.

AVX512ER-capable machines, in particular KNL, support instructions such as VRCP28PD. AVX512F-capable machines (again including KNL) also support reciprocal approximation instructions such as VRCP14PD, which provide results with a maximum relative error of 2^-14.

These instructions support double precision format (IEEE 754-2008 binary64), unlike the earlier RCPPS/RCPSS which support only the single precision format (IEEE 754-2008 binary32).

Fully IEEE-conformant division is also available on KNL, e.g. in the VDIVPD instruction. However, using reciprocal approximation instructions one can implement near-IEEE (i.e. not correctly rounded) division operations which may offer better throughput than VDIVPD for most operands.
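As a minimal sketch of the near-IEEE approach, the following Python emulates a division built from a reciprocal estimate plus Newton-Raphson refinement. The function `approx_recip` is a hypothetical stand-in for a hardware estimate: it simply perturbs 1/x by a fixed relative error of 2^-14, whereas a real instruction's error varies per operand. The other names are illustrative as well.

```python
def approx_recip(x, rel_err=2.0**-14):
    """Hypothetical stand-in for a hardware reciprocal estimate:
    returns 1/x perturbed by a fixed relative error."""
    return (1.0 / x) * (1.0 + rel_err)

def newton_step(x, y):
    """One Newton-Raphson step for y ~ 1/x: y' = y * (2 - x*y)."""
    return y * (2.0 - x * y)

def fast_divide(a, b, steps=2):
    """Approximate a / b as a * (refined estimate of 1/b)."""
    y = approx_recip(b)
    for _ in range(steps):
        y = newton_step(b, y)
    return a * y

def rel_error(approx, exact):
    return abs(approx - exact) / abs(exact)

exact = 7.0 / 3.0
for steps in range(3):
    print(steps, "step(s): relative error",
          rel_error(fast_divide(7.0, 3.0, steps), exact))
```

With this emulated estimate the error squares on each step (roughly 2^-14, then 2^-28, then the limit of double precision), but the last bits are not guaranteed to match a correctly rounded quotient. Special operands such as zero, infinity, or denormals are not handled here.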

So throughput-oriented applications can trade the IEEE-conformant division for numerically-relaxed, higher-throughput implementations available by using the /Qprec-div- or -no-prec-div compiler switches with the Intel Compiler, and/or the more granular -[f|Q]imf-accuracy-bits, max-error, domain-exclusion controls. Note that programmers need to be aware of their applications’ numerical properties, and they should set the math functions accuracy requirements (including for division) accordingly.

The same applies to SQRT.

Note also that you can find reference code which emulates accurately several new approximation instructions (scalar versions only), at this location: https://software.intel.com/en-us/blogs/2016/01/13/compiling-for-the-intel-xeon-phi-processor-x200-and-the-intel-avx-512-isa.

McCalpinJohn · ‎04-02-2016

I just went back and reviewed the latency and throughput of the VDIVPD and VDIVPS instructions over the past several generations, and was surprised to see how much improvement Intel has managed....

Appendix C (Instruction Timings) in the most recent Intel Optimization Reference Manual shows latency and reciprocal throughput for Ivy Bridge, Haswell, Broadwell, and Skylake. I went back to earlier versions of the Optimization Reference Manual to get values for Core 2, Nehalem, Westmere, and Sandy Bridge.

The results are hard to summarize concisely, so I am including two graphs. The first graph shows the throughput for 32-bit FP divides for scalar, 128-bit SSE, and 256-bit AVX (where appropriate) relative to the throughput for 32-bit scalar divides on Sandy Bridge. The second graph shows the same ratios for 64-bit divides (using the throughput for 64-bit scalar divides on Sandy Bridge as the reference).

The graph for the 64-bit results is a bit cleaner. What stands out immediately to me:

- The throughput for 128-bit (2x64) FP divides is the same as the throughput for 256-bit (4x64) FP divides.
- For all processors except Broadwell, the 128-bit/256-bit throughput is almost exactly 2x the scalar throughput.
  - It looks like Broadwell has an improved scalar divide that was not fully implemented for the SIMD instructions (until Skylake).

Looking at the graph for 32-bit results:

- The throughput for 128-bit (4x32) FP divides is approximately the same as the throughput for 256-bit (8x32) FP divides.
  - Note that the reciprocal throughput values are small, so single-cycle differences show up as fairly large changes in the graph.
- For all processors except Broadwell, the 128-bit/256-bit throughput is almost exactly 4x the scalar throughput.
  - For Broadwell, the 128-bit/256-bit throughput is almost exactly 2x the scalar throughput, but that may just be a coincidence.
  - Again it appears that Broadwell was given an improved scalar divide that was not fully implemented for the SIMD instructions (until Skylake).

The improvements in divide throughput are very impressive -- especially given the apparent limitation of the parallelism to only one of the two 128-bit pipelines.

If the excellent 256-bit FP divide performance of the Skylake client parts is carried forward to AVX-512-capable processors, the decision to use VDIVPD vs a vectorized iterative approach will be a tough call in many cases. I should be able to test this on KNL "Real Soon Now".... It will also be interesting to see whether the AVX-512 timings on KNL are significantly different than those on SKX, but that will require a bit more patience.....

TimP · ‎04-02-2016

We've had several CPU models in the past where the scalar and parallel fp divides were fast enough that there was no point in using the iterative methods. Sandy Bridge took a step backwards from the point of view that the AVX256 divide and sqrt showed little gain over AVX128. Even though Ivy Bridge didn't introduce true 256-bit parallelism in divide or sqrt, the performance was improved enough to eliminate much concern over the choice of methods.

I've heard that the Skylake server single CPU could give nearly the performance of a pair of Haswell or Broadwell CPUs. Apparently there may be other places where bottlenecks to 512-bit parallelism were eliminated.

Looking at compiler-generated AVX512 code, I see it using a sequence of vscalefss and vrsqrt28ss, apparently followed by an iterative step, to replace each of a pair of vsqrtfs outside an inner loop, so it still looks as if the compiler team expects the vrsqrtss to need an iterative step to exceed 23-bit precision. A divide is treated in a similar way. If the compiler is not treating the reciprocals as fully 28-bit-accurate instructions, the advertising seems misleading. The compiler makes the same choice for either a KNL or an SKX target.

It will be interesting to see your conclusions for KNL. For KNC, of course, the prec-div and prec-sqrt options give large increases in run time even when there are no such operations in inner loops. The manual posted online doesn't appear to include AVX512 instruction timings, so there is no way to compare the latency of vsqrtss vs. the sequence vscale, vsqrt23, vmul, vfnmadd, vscale.

Nikita_A_Intel · ‎04-04-2016

Tim, regarding the compiler-generated code: what you see is the correctly rounded sqrt sequence, which does require an iterative step (a 28-bit approximation is not enough to get a correctly rounded 24-bit single precision result). Depending on your domain-exclusion settings you may also get some special scaling to handle near-zero values precisely. You may want to specify your accuracy requirements using the -fimf-*** switches.
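To illustrate why an estimate within 2^-28 can still round to the wrong single precision value, here is a small Python sketch. The value `exact` is constructed for the demonstration (it is not an actual sqrt result) to sit just above the midpoint between two adjacent binary32 numbers, and `estimate` is a hypothetical approximation of it with a relative error of 2^-29.

```python
import struct

def round_to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Constructed value just above the midpoint between 1.0 and 1.0 + 2**-23.
exact = 1.0 + 2.0**-24 + 2.0**-40

# Hypothetical estimate whose relative error (2**-29) is inside a 2**-28 bound.
estimate = exact * (1.0 - 2.0**-29)

print(round_to_f32(exact))     # rounds up to 1.0 + 2**-23
print(round_to_f32(estimate))  # rounds down to 1.0
```

The two rounded results differ by one unit in the last place even though the estimate is well inside the 2^-28 bound, so a refinement step (or an equivalent correction) is needed whenever the pre-rounding value can fall this close to a rounding boundary.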