We noticed that the VEX.256 encoding of vsqrtps has the same performance as VEX.128 / SSE. My understanding is that the VEX.256 sqrt is implemented as two VEX.128 operations, so no performance benefit over SSE is expected.
Is there a published list of which V....PS instructions are 256-bit enabled on SNB-EP, and so might provide a performance benefit over SSE? And which are not 256-bit enabled?
Thanks
Quoting ron_bennett@mentor.com
Is there a published list of which V....PS instructions are 256-bit enabled on SNB-EP, so might provide performance benefit over SSE? And which are not 256-bit enabled?
Note, however, that even though these instructions offer no throughput benefit, there's still a register-space benefit: if you replaced each of them with two SSE instructions, you would have to use additional registers, which can lead to spilling. On top of that, the wider form allows for better latency hiding. So you may still observe improved performance over SSE.
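To illustrate the register-pressure point, here is a minimal intrinsics sketch (the function names are hypothetical, chosen for illustration): the 256-bit form keeps each value in a single ymm register, while the equivalent 128-bit pair occupies two xmm values and needs an extra recombining step.

#include <immintrin.h>

/* One VEX.256 instruction, one register. */
__m256 sqrt_avx256(__m256 a)
{
    return _mm256_sqrt_ps(a);   /* vsqrtps ymm, ymm */
}

/* The same work expressed as two 128-bit halves: two intermediate
 * registers plus an insert to recombine the lanes.  In real code the
 * doubled register use can force spills that the 256-bit form avoids. */
__m256 sqrt_avx128_pair(__m256 a)
{
    __m128 lo = _mm_sqrt_ps(_mm256_castps256_ps128(a));      /* low lane  */
    __m128 hi = _mm_sqrt_ps(_mm256_extractf128_ps(a, 1));    /* high lane */
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
}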
I noticed a reference to Agner Fog's instruction timing tables in a previous post, which point to the same conclusion:
http://www.agner.org/optimize/instruction_tables.pdf
Data volume per instruction is doubled, but CPI is also doubled, so no net benefit.
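To make the arithmetic concrete with a symbolic figure: if the 128-bit form retires one instruction every T cycles, that is 4 floats per T cycles; the split 256-bit form retires one instruction every 2T cycles, i.e. 8 floats per 2T cycles, the same sustained rate. (T here is illustrative; see Agner's tables for the measured values.)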
It seems that vdivps and vsqrtps are only partially implemented on SNB, in the sense that the encoding is supported (presumably for software compatibility with future AVX products) but the performance benefit of a single 256-bit operation is not available, other than the ancillary benefits mentioned above.
Is that a fair summary?
Thanks for the quick reply,
Ron
As explained by c0d1f1ed, VEX.256 vsqrtps(pd) and vdivps(pd) are executed as 2 uops on Sandy Bridge.
Hint: you can see the number of uops for any instruction with IACA (http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/), or the throughput/latency figures in the software optimization manual (see the "Intel 64 and IA-32 Architectures Optimization Reference Manual" at http://www.intel.com/products/processor/manuals/).
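For a concrete (if hedged) picture of the IACA workflow, here is a minimal kernel marked up for analysis, assuming the iacaMarks.h header that ships with the tool:

#include <immintrin.h>
#include "iacaMarks.h"   /* provides the IACA_START / IACA_END markers */

void sqrt_kernel(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; i += 8) {
        IACA_START                       /* begin the analyzed region */
        __m256 v = _mm256_loadu_ps(src + i);
        _mm256_storeu_ps(dst + i, _mm256_sqrt_ps(v));
    }
    IACA_END                             /* end the analyzed region */
}

Compile this to an object file and run the analyzer on it (for example, something like iaca -arch SNB sqrt_kernel.o; the exact invocation flags differ between IACA versions, so check the tool's documentation). The report shows the uop breakdown per instruction, which makes the 2-uop split of vsqrtps ymm visible.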
Needless to say, moves to memory are the remaining major case where 256-bit operations are always split into 128-bit halves on SNB and IVB. For unaligned moves, it's more efficient to split explicitly to AVX-128, and all compilers do that automatically for unaligned stores; most split unaligned loads as well.
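As a sketch of that explicit split (the helper name is hypothetical), an unaligned 256-bit load expressed as two 128-bit unaligned loads plus a recombine, roughly what the compilers emit:

#include <immintrin.h>

static inline __m256 loadu_256_split(const float *p)
{
    __m128 lo = _mm_loadu_ps(p);        /* low 4 floats  */
    __m128 hi = _mm_loadu_ps(p + 4);    /* high 4 floats */
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
}

Whether this wins over a single vmovups ymm depends on the actual alignment and the microarchitecture, which is why compilers make the call case by case.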
Shuffles are too complicated to be categorized simply, but there's a similar effect.
As the implications are different for each category of instruction, a simple list might be considered misleading.
Can you say anything about future plans to implement these two instructions as single uops? Should we infer that would be post-SNB?
The SNB alternative to splitting the divide or sqrt takes us back to the iterative method, as implemented by the Intel compilers. It should improve throughput by allowing other operations (or multiple divides and square roots) to proceed in parallel. This is somewhat disappointing, as the IEEE-accurate 128-bit parallel divide instruction became good enough from HTN on that there was little incentive to use the iterative sequences that begin with the half-precision approximation instructions.
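For illustration, a hedged sketch of the iterative idea applied to division, using AVX intrinsics: approximate 1/b with rcpps (about 12 bits), refine with one Newton-Raphson step, then multiply by a. The exact sequence the Intel compilers emit may differ, and the result is not IEEE-exact:

#include <immintrin.h>

static inline __m256 div_nr_approx(__m256 a, __m256 b)
{
    __m256 x0  = _mm256_rcp_ps(b);                 /* ~12-bit 1/b */
    __m256 two = _mm256_set1_ps(2.0f);
    /* One Newton-Raphson step: x1 = x0 * (2 - b * x0), ~23-bit 1/b. */
    __m256 x1  = _mm256_mul_ps(x0, _mm256_sub_ps(two, _mm256_mul_ps(b, x0)));
    return _mm256_mul_ps(a, x1);                   /* a * (1/b) */
}

Because rcpps and the refinement are ordinary pipelined multiplies and subtracts, several of these sequences can be in flight at once, which is where the throughput gain over the divide unit (which is not fully pipelined) comes from.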
OK, that's what we thought. We noticed ICC is using an iterative method, so we're looking into using a Newton-Raphson approximation, depending on the accuracy needed.
Again, Agner Fog has a nice succinct summary. Is this the method you're referring to?
16.17 FSQRT (SSE processors)
A fast way of calculating an approximate square root on processors with SSE is to multiply the reciprocal square root of x by x:
sqrt(x) = x * rsqrt(x)
The instruction RSQRTSS or RSQRTPS gives the reciprocal square root with a precision of 12 bits. You can improve the precision to 23 bits by using the Newton-Raphson formula described in Intel's application note AP-803:
x0 = rsqrtss(a)
x1 = 0.5 * x0 * (3 - (a * x0) * x0)
where x0 is the first approximation to the reciprocal square root of a, and x1 is a better approximation. The order of evaluation is important: you must apply this formula before multiplying by a to get the square root.
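For reference, the same recipe coded with AVX intrinsics (a sketch with a hypothetical function name; the scalar rsqrtss variant is analogous). Note that the evaluation order follows the text: refine the reciprocal square root first, then multiply by a:

#include <immintrin.h>

static inline __m256 sqrt_nr_approx(__m256 a)
{
    __m256 x0    = _mm256_rsqrt_ps(a);             /* ~12-bit 1/sqrt(a) */
    __m256 half  = _mm256_set1_ps(0.5f);
    __m256 three = _mm256_set1_ps(3.0f);
    /* x1 = 0.5 * x0 * (3 - (a * x0) * x0), ~23-bit 1/sqrt(a). */
    __m256 ax0 = _mm256_mul_ps(a, x0);
    __m256 x1  = _mm256_mul_ps(_mm256_mul_ps(half, x0),
                               _mm256_sub_ps(three, _mm256_mul_ps(ax0, x0)));
    return _mm256_mul_ps(a, x1);                   /* sqrt(a) = a * x1 */
}

A production version also needs a fix-up for a == 0, where rsqrtps returns infinity and the final multiply produces NaN instead of 0.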
Yes, that's the description of how to get 1/sqrt(a) (single precision) from rsqrtss (more likely seen as rsqrtps in vectorized code). Similar formulae apply for divide. The Intel compilers turn off these iterative expansions in favor of IEEE-accurate instructions under the -prec-sqrt and -prec-div options.
Double precision is more problematic, since rsqrtps and its relatives would need close to 14 bits of precision (as the instruction had on the original AMD Athlon-32) to reach 52+ bits of precision after 2 iterations.
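A hedged sketch of that double-precision difficulty (the helper name is hypothetical): starting from the ~12-bit single-precision estimate, two Newton-Raphson steps in double precision only reach roughly 46-48 bits, which illustrates why ~14 starting bits would be needed to clear 52:

#include <immintrin.h>

static inline __m128d rsqrt_pd_approx(__m128d a)
{
    /* ~12-bit estimate, borrowed from the single-precision instruction. */
    __m128d x = _mm_cvtps_pd(_mm_rsqrt_ps(_mm_cvtpd_ps(a)));

    const __m128d half  = _mm_set1_pd(0.5);
    const __m128d three = _mm_set1_pd(3.0);
    for (int i = 0; i < 2; ++i) {       /* two NR refinements in double */
        __m128d ax = _mm_mul_pd(a, x);
        x = _mm_mul_pd(_mm_mul_pd(half, x),
                       _mm_sub_pd(three, _mm_mul_pd(ax, x)));
    }
    return x;   /* ~46-48 bit 1/sqrt(a): still short of full double precision */
}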