
Is there a published list of which V....PS instructions are 256-bit enabled on SNB-EP, so might provide performance benefit over SSE? And which are not 256-bit enabled?

Thanks


8 Replies


Quoting ron_bennett@mentor.com

*Is there a published list of which V....PS instructions are 256-bit enabled on SNB-EP, so might provide performance benefit over SSE? And which are not 256-bit enabled?*

Note however that even though these instructions offer no benefit in throughput, there's still a register space benefit. If you replaced these instructions with two SSE instructions, you would have to use additional registers, which can lead to spilling. And on top of that it allows for better latency hiding. So you may still observe improved performance over SSE.


http://www.agner.org/optimize/instruction_tables.pdf

Data volume per instruction is doubled, but CPI is also doubled, so no net benefit.
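To make that arithmetic concrete, here is a minimal sketch in C. The helper name `elems_per_cycle` and the cycle counts in the comment are illustrative placeholders, not measured figures; substitute the reciprocal-throughput values from the instruction tables for the instruction you care about.

```c
/* Elements retired per cycle for a vector instruction, given how many
 * lanes it processes and its reciprocal throughput (cycles between
 * successive independent instructions of the same kind).
 *
 * Hypothetical example: a 4-lane instruction issuing every 14 cycles and
 * an 8-lane instruction issuing every 28 cycles both deliver 2/7 of an
 * element per cycle, so the wider form gives no throughput gain. */
static double elems_per_cycle(int lanes, double recip_throughput_cycles)
{
    return (double)lanes / recip_throughput_cycles;
}
```

If the 256-bit form had the same reciprocal throughput as the 128-bit form, the same helper would report a 2x gain, which is the case for the instructions that are fully 256-bit enabled.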

It seems that vdivps and vsqrtps are partially implemented on SNB in the sense that the syntax is supported (presumably for SW compatibility with future AVX products), but the performance benefit of a single 256-bit operation is not available - other than the ancillary benefits mentioned above.

Is that a fair summary?

Thanks for the quick reply,

Ron


Hint: you can see the number of uops for any instruction with IACA (http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/), or look up the throughput/latency figures in the software optimization manual (see "Intel 64 and IA-32 Architectures Optimization Reference Manual" at http://www.intel.com/products/processor/manuals/).


Shuffles are too complicated to be categorized simply, but there's a similar effect.

As the implications are different for each category of instruction, a simple list might be considered misleading.


Again, Agner Fog has a nice succinct summary. Is this the method you're referring to?

**16.17 FSQRT (SSE processors)**

A fast way of calculating an approximate square root on processors with SSE is to multiply the reciprocal square root of *x* by *x*:

sqrt(x) = x * rsqrt(x)

The instruction RSQRTSS or RSQRTPS gives the reciprocal square root with a precision of 12 bits. You can improve the precision to 23 bits by using the Newton-Raphson formula described in Intel's application note AP-803:

x0 = rsqrtss(a)

x1 = 0.5 * x0 * (3 - (a * x0) * x0)

where x0 is the first approximation to the reciprocal square root of a, and x1 is a better approximation. The order of evaluation is important. You must use this formula before multiplying with a to get the square root.
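For reference, the quoted recipe might look like this in C with SSE intrinsics. This is a scalar sketch only: the function names are made up for illustration, and a packed version would use `_mm_rsqrt_ps` (or `_mm256_rsqrt_ps` with AVX) with the same arithmetic.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_set_ss, _mm_rsqrt_ss, _mm_cvtss_f32 */

/* Hardware estimate of 1/sqrt(a), good to roughly 12 bits. */
static float rsqrt_seed(float a)
{
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(a)));
}

/* One Newton-Raphson step on the estimate:
 *   x1 = 0.5 * x0 * (3 - (a * x0) * x0)
 * evaluated in the order given in the quoted text. */
static float rsqrt_nr(float a)
{
    float x0 = rsqrt_seed(a);
    return 0.5f * x0 * (3.0f - (a * x0) * x0);
}

/* sqrt(a) = a * rsqrt(a), multiplying by a only after the refinement.
 * Caveat: a == 0 gives 0 * inf = NaN, so zero needs separate handling. */
static float sqrt_via_rsqrt(float a)
{
    return a * rsqrt_nr(a);
}
```

The result is close to, but not guaranteed to be, the correctly rounded value that SQRTSS returns, so this only fits code that tolerates an error in the last bit or so.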


Double precision is more problematic, since the rsqrtps etc. would need close to 14 bits precision (as it had on the original AMD Athlon-32) to get to 52+ bits precision after 2 iterations.
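A rough way to check those bit counts: substituting x0 = (1 + e)/sqrt(a) into the Newton-Raphson step gives a new relative error of about -1.5 * e^2, so each iteration turns b good bits into roughly 2b - log2(1.5), or about 2b - 0.585. The sketch below just iterates that estimate. The function name is hypothetical, and the model ignores rounding error in the arithmetic itself, so treat the results as upper bounds.

```c
/* Estimated bits of precision after `iters` Newton-Raphson steps for
 * 1/sqrt(a), starting from a seed with `seed_bits` good bits.
 * Model: relative error e becomes about 1.5 * e^2 per step, i.e.
 * bits -> 2 * bits - log2(1.5). Rounding in the arithmetic is ignored. */
static double bits_after_newton(double seed_bits, int iters)
{
    const double LOG2_1P5 = 0.5849625;  /* log2(1.5) */
    double bits = seed_bits;
    for (int i = 0; i < iters; ++i)
        bits = 2.0 * bits - LOG2_1P5;
    return bits;
}
```

Under this model a 12-bit seed reaches about 23 bits after one step, which matches the single-precision figure quoted above, but only about 46 bits after two steps. A 14-bit seed reaches about 54 bits after two steps, which is why roughly 14 bits are needed for double precision.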
