Is there a published list of which V....PS instructions are 256-bit enabled on SNB-EP, and so might provide a performance benefit over SSE? And which are not 256-bit enabled?
Data volume per instruction is doubled, but CPI is also doubled, so no net benefit.
It seems that vdivps and vsqrtps are partially implemented on SNB in the sense that the syntax is supported (presumably for SW compatibility with future AVX products), but the performance benefit of a single 256-bit operation is not available - other than the ancillary benefits mentioned above.
Is that a fair summary?
Thanks for the quick reply,
Hint: you can see the number of uops for any instruction with IACA (http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/), or look up the throughput/latency figures in the software optimization manual (see "Intel 64 and IA-32 Architectures Optimization Reference Manual" here: http://www.intel.com/products/processor/manuals/).
Shuffles are too complicated to be categorized simply, but there's a similar effect.
As the implications are different for each category of instruction, a simple list might be considered misleading.
Again, Agner Fog has a nice succinct summary. Is this the method you're referring to?
16.17 FSQRT (SSE processors)
A fast way of calculating an approximate square root on processors with SSE is to multiply the reciprocal square root of x by x:
sqrt(x) = x * rsqrt(x)
The instruction RSQRTSS or RSQRTPS gives the reciprocal square root with a precision of 12 bits. You can improve the precision to 23 bits by using the Newton-Raphson formula described in Intel's application note AP-803:
x0 = rsqrtss(a)
x1 = 0.5 * x0 * (3 - (a * x0) * x0)
where x0 is the first approximation to the reciprocal square root of a, and x1 is a better approximation. The order of evaluation is important. You must use this formula before multiplying with a to get the square root.
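For anyone who wants to try it, here is a minimal sketch of the quoted formula in C with SSE intrinsics. The function names rsqrt_nr and sqrt_nr are made up for illustration; only the intrinsics from xmmintrin.h are real. It assumes an x86 target with SSE and a strictly positive, finite input (for a = 0 the estimate is infinity and the final multiply gives NaN).

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_set_ss, _mm_rsqrt_ss, _mm_cvtss_f32 */

/* Hypothetical helper: 1/sqrt(a) from the RSQRTSS estimate (about 12 bits)
 * refined by one Newton-Raphson step:
 *   x1 = 0.5 * x0 * (3 - (a * x0) * x0)
 * The evaluation order follows the quoted formula. */
static float rsqrt_nr(float a)
{
    float x0 = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(a)));  /* first approximation */
    return 0.5f * x0 * (3.0f - (a * x0) * x0);              /* refined approximation */
}

/* Hypothetical helper: sqrt(a) = a * rsqrt(a).
 * The multiply by a comes after the refinement step, as the text requires.
 * Assumes a > 0 and finite. */
static float sqrt_nr(float a)
{
    return a * rsqrt_nr(a);
}
```

Comparing sqrt_nr(a) against sqrtf(a) for a few inputs should show agreement to roughly single precision, though not necessarily to the last bit, so this is not a drop-in replacement where correctly rounded results are required.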
Double precision is more problematic, since rsqrtps etc. would need close to 14 bits of precision (as it had on the original AMD Athlon-32) to get to 52+ bits of precision after 2 iterations.
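A rough back-of-the-envelope check of those bit counts, assuming "b bits" means a relative error of about 2^-b in the starting estimate (the symbols ε and b are introduced here only for this sketch):

```latex
% Start from an estimate with relative error \varepsilon:
x_0 = \frac{1+\varepsilon}{\sqrt{a}}
% One Newton-Raphson step:
x_1 = \tfrac{1}{2}\,x_0\,\bigl(3 - a\,x_0^2\bigr)
    = \frac{1}{\sqrt{a}}\Bigl(1 - \tfrac{3}{2}\varepsilon^2 - \tfrac{1}{2}\varepsilon^3\Bigr)
% So the error goes from \varepsilon to about \tfrac{3}{2}\varepsilon^2, i.e. in bits:
b_{\text{new}} \approx 2b - \log_2\tfrac{3}{2} \approx 2b - 0.58
% 12 bits: 12 \to 23.4 \to 46.2
% 14 bits: 14 \to 27.4 \to 54.2
```

Under that assumption, one step from 12 bits gives the 23 bits quoted above, two steps from 12 bits stop around 46 bits, which is short of the 53-bit double significand, and a 14-bit starting estimate is about the minimum that clears it in two steps. This ignores rounding error in the intermediate arithmetic.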