We noticed that the VEX.256 encoding of vsqrtps has the same performance as VEX.128 / SSE. My understanding is that the VEX.256 sqrt is implemented as two VEX.128 operations, so no performance benefit over SSE is expected.
Is there a published list of which V....PS instructions are 256-bit enabled on SNB-EP, and so might provide a performance benefit over SSE? And which are not 256-bit enabled?
Thanks
Quoting ron_bennett@mentor.com
Is there a published list of which V....PS instructions are 256-bit enabled on SNB-EP, so might provide performance benefit over SSE? And which are not 256-bit enabled?
Note, however, that even though these instructions offer no throughput benefit, there's still a register-space benefit: if you replaced each of them with two SSE instructions, you would have to use additional registers, which can lead to spilling. On top of that, the wider form allows for better latency hiding. So you may still observe improved performance over SSE.
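To illustrate the register-pressure point, here is a minimal intrinsics sketch (the function names are hypothetical, chosen for illustration): the 256-bit form keeps each value in a single ymm register, while the equivalent 128-bit pair occupies two xmm values and needs an extra recombining step.

#include <immintrin.h>

/* One VEX.256 instruction, one register. */
__m256 sqrt_avx256(__m256 a)
{
    return _mm256_sqrt_ps(a);   /* vsqrtps ymm, ymm */
}

/* The same work expressed as two 128-bit halves: two intermediate
 * registers plus an insert to recombine the lanes.  In real code the
 * doubled register use can force spills that the 256-bit form avoids. */
__m256 sqrt_avx128_pair(__m256 a)
{
    __m128 lo = _mm_sqrt_ps(_mm256_castps256_ps128(a));      /* low lane  */
    __m128 hi = _mm_sqrt_ps(_mm256_extractf128_ps(a, 1));    /* high lane */
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
}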
I noticed a reference to Agner Fog's instruction timing tables in a previous post, which point to the same conclusion:
http://www.agner.org/optimize/instruction_tables.pdf
Data volume per instruction is doubled, but CPI is also doubled, so no net benefit.
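To make the arithmetic concrete with a symbolic figure: if the 128-bit form retires one instruction every T cycles, that is 4 floats per T cycles; the split 256-bit form retires one instruction every 2T cycles, i.e. 8 floats per 2T cycles, the same sustained rate. (T here is illustrative; see Agner's tables for the measured values.)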
It seems that vdivps and vsqrtps are only partially implemented on SNB, in the sense that the encoding is supported (presumably for software compatibility with future AVX products) but the performance benefit of a single 256-bit operation is not available, other than the ancillary benefits mentioned above.
Is that a fair summary?
Thanks for the quick reply,
Ron
As explained by c0d1f1ed, VEX.256 vsqrtps(pd) and vdivps(pd) are executed as 2 uops on Sandy Bridge.
Hint: you can see the number of uops for any instruction with IACA (http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/), or the throughput/latency figures in the software optimization manual (see the "Intel 64 and IA-32 Architectures Optimization Reference Manual" at http://www.intel.com/products/processor/manuals/).
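For a concrete (if hedged) picture of the IACA workflow, here is a minimal kernel marked up for analysis, assuming the iacaMarks.h header that ships with the tool:

#include <immintrin.h>
#include "iacaMarks.h"   /* provides the IACA_START / IACA_END markers */

void sqrt_kernel(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; i += 8) {
        IACA_START                       /* begin the analyzed region */
        __m256 v = _mm256_loadu_ps(src + i);
        _mm256_storeu_ps(dst + i, _mm256_sqrt_ps(v));
    }
    IACA_END                             /* end the analyzed region */
}

Compile this to an object file and run the analyzer on it (for example, something like iaca -arch SNB sqrt_kernel.o; the exact invocation flags differ between IACA versions, so check the tool's documentation). The report shows the uop breakdown per instruction, which makes the 2-uop split of vsqrtps ymm visible.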
Needless to say, moves to memory are the remaining major case where 256-bit operations are always split into 128-bit halves on SNB and IVB. For unaligned moves, it's more efficient to split explicitly to AVX-128, and all compilers do that automatically for unaligned stores; most split unaligned loads as well.
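As a sketch of that explicit split (the helper name is hypothetical), an unaligned 256-bit load expressed as two 128-bit unaligned loads plus a recombine, roughly what the compilers emit:

#include <immintrin.h>

static inline __m256 loadu_256_split(const float *p)
{
    __m128 lo = _mm_loadu_ps(p);        /* low 4 floats  */
    __m128 hi = _mm_loadu_ps(p + 4);    /* high 4 floats */
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
}

Whether this wins over a single vmovups ymm depends on the actual alignment and the microarchitecture, which is why compilers make the call case by case.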
Shuffles are too complicated to be categorized simply, but there's a similar effect.
As the implications are different for each category of instruction, a simple list might be considered misleading.
Can you say anything about future plans to implement these two instructions as single uops? Should we infer that would be post-SNB?
The SNB alternative to splitting the divide or sqrt takes us back to the iterative method, as implemented by the Intel compilers. It should improve throughput by allowing other operations (or multiple divides and square roots) to proceed in parallel. This is somewhat disappointing, as the IEEE-accurate 128-bit parallel divide instruction became good enough from HTN on that there was little incentive to use the iterative sequences that begin with the half-precision approximation instructions.
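For illustration, a hedged sketch of the iterative idea applied to division, using AVX intrinsics: approximate 1/b with rcpps (about 12 bits), refine with one Newton-Raphson step, then multiply by a. The exact sequence the Intel compilers emit may differ, and the result is not IEEE-exact:

#include <immintrin.h>

static inline __m256 div_nr_approx(__m256 a, __m256 b)
{
    __m256 x0  = _mm256_rcp_ps(b);                 /* ~12-bit 1/b */
    __m256 two = _mm256_set1_ps(2.0f);
    /* One Newton-Raphson step: x1 = x0 * (2 - b * x0), ~23-bit 1/b. */
    __m256 x1  = _mm256_mul_ps(x0, _mm256_sub_ps(two, _mm256_mul_ps(b, x0)));
    return _mm256_mul_ps(a, x1);                   /* a * (1/b) */
}

Because rcpps and the refinement are ordinary pipelined multiplies and subtracts, several of these sequences can be in flight at once, which is where the throughput gain over the divide unit (which is not fully pipelined) comes from.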
OK, that's what we thought. We noticed ICC is using an iterative method, so we're looking into using a Newton-Raphson approximation, depending on the accuracy needed.
Again, Agner Fog has a nice succinct summary. Is this the method you're referring to?
16.17 FSQRT (SSE processors)
A fast way of calculating an approximate square root on processors with SSE is to multiply the reciprocal square root of x by x:
sqrt(x) = x * rsqrt(x)
The instruction RSQRTSS or RSQRTPS gives the reciprocal square root with a precision of 12 bits. You can improve the precision to 23 bits by using the Newton-Raphson formula described in Intel's application note AP-803:
x0 = rsqrtss(a)
x1 = 0.5 * x0 * (3 - (a * x0) * x0)
where x0 is the first approximation to the reciprocal square root of a, and x1 is a better approximation. The order of evaluation is important: you must apply this formula before multiplying by a to get the square root.
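For reference, the same recipe coded with AVX intrinsics (a sketch with a hypothetical function name; the scalar rsqrtss variant is analogous). Note that the evaluation order follows the text: refine the reciprocal square root first, then multiply by a:

#include <immintrin.h>

static inline __m256 sqrt_nr_approx(__m256 a)
{
    __m256 x0    = _mm256_rsqrt_ps(a);             /* ~12-bit 1/sqrt(a) */
    __m256 half  = _mm256_set1_ps(0.5f);
    __m256 three = _mm256_set1_ps(3.0f);
    /* x1 = 0.5 * x0 * (3 - (a * x0) * x0), ~23-bit 1/sqrt(a). */
    __m256 ax0 = _mm256_mul_ps(a, x0);
    __m256 x1  = _mm256_mul_ps(_mm256_mul_ps(half, x0),
                               _mm256_sub_ps(three, _mm256_mul_ps(ax0, x0)));
    return _mm256_mul_ps(a, x1);                   /* sqrt(a) = a * x1 */
}

A production version also needs a fix-up for a == 0, where rsqrtps returns infinity and the final multiply produces NaN instead of 0.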
Yes, that's the description of how to get 1/sqrt(a) (single precision) from rsqrtss (more likely seen as rsqrtps in vectorized code). Similar formulae apply for divide. The Intel compilers turn off these iterative expansions in favor of IEEE-accurate instructions under the -prec-sqrt and -prec-div options.
Double precision is more problematic, since rsqrtps and its relatives would need close to 14 bits of precision (as the instruction had on the original AMD Athlon-32) to reach 52+ bits of precision after 2 iterations.
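A hedged sketch of that double-precision difficulty (the helper name is hypothetical): starting from the ~12-bit single-precision estimate, two Newton-Raphson steps in double precision only reach roughly 46-48 bits, which illustrates why ~14 starting bits would be needed to clear 52:

#include <immintrin.h>

static inline __m128d rsqrt_pd_approx(__m128d a)
{
    /* ~12-bit estimate, borrowed from the single-precision instruction. */
    __m128d x = _mm_cvtps_pd(_mm_rsqrt_ps(_mm_cvtpd_ps(a)));

    const __m128d half  = _mm_set1_pd(0.5);
    const __m128d three = _mm_set1_pd(3.0);
    for (int i = 0; i < 2; ++i) {       /* two NR refinements in double */
        __m128d ax = _mm_mul_pd(a, x);
        x = _mm_mul_pd(_mm_mul_pd(half, x),
                       _mm_sub_pd(three, _mm_mul_pd(ax, x)));
    }
    return x;   /* ~46-48 bit 1/sqrt(a): still short of full double precision */
}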