Ron_B_

Beginner

08-09-2011
04:52 PM

91 Views

Which AVX instructions are 256-bit enabled on SNB-EP?

Is there a published list of which V....PS instructions are 256-bit enabled on SNB-EP, so might provide performance benefit over SSE? And which are not 256-bit enabled?

Thanks

8 Replies

capens__nicolas

New Contributor I

08-09-2011
10:59 PM

Quoting ron_bennett@mentor.com

Note however that even though these instructions offer no benefit in throughput, there's still a register space benefit. If you replaced these instructions with two SSE instructions, you would have to use additional registers, which can lead to spilling. And on top of that it allows for better latency hiding. So you may still observe improved performance over SSE.

Ron_B_

Beginner

08-10-2011
05:53 PM

http://www.agner.org/optimize/instruction_tables.pdf

Data volume per instruction is doubled, but CPI is also doubled, so no net benefit.

It seems that vdivps and vsqrtps are partially implemented on SNB in the sense that the syntax is supported (presumably for SW compatibility with future AVX products), but the performance benefit of a single 256-bit operation is not available - other than the ancillary benefits mentioned above.
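The arithmetic behind that summary can be sketched in C. The cycle counts used in the usage note (14 and 28) are illustrative placeholders rather than measured SNB-EP figures, and the function names are hypothetical; the instruction tables linked above are the place to look up real values.

```c
/* Elements completed per cycle:
   elements per instruction divided by cycles per instruction
   (reciprocal throughput). */
double elements_per_cycle(int elements_per_instr, double cycles_per_instr)
{
    return elements_per_instr / cycles_per_instr;
}

/* Ratio of 256-bit to 128-bit single-precision throughput.
   A 128-bit register holds 4 floats, a 256-bit register holds 8.
   A result of 1.0 means the wider instruction gives no net benefit. */
double speedup_256_over_128(double cycles_128, double cycles_256)
{
    return elements_per_cycle(8, cycles_256) / elements_per_cycle(4, cycles_128);
}
```

With the placeholder costs, `speedup_256_over_128(14.0, 28.0)` is 1.0: twice the data in twice the cycles. An instruction whose 256-bit form costs the same as its 128-bit form, as in `speedup_256_over_128(1.0, 1.0)`, gives 2.0.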

Is that a fair summary?

Thanks for the quick reply,

Ron

bronxzv

New Contributor II

08-11-2011
12:50 AM

Hint: you can see the number of uops for any instruction with IACA (http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/), or the throughput/latency figures in the software optimization manual (see "Intel 64 and IA-32 Architectures Optimization Reference Manual" at http://www.intel.com/products/processor/manuals/).

TimP

Black Belt

08-11-2011
05:33 AM

Shuffles are too complicated to be categorized simply, but there's a similar effect.

As the implications are different for each category of instruction, a simple list might be considered misleading.

Ron_B_

Beginner

08-11-2011
11:41 AM

TimP

Black Belt

08-11-2011
02:02 PM

Ron_B_

Beginner

08-11-2011
02:40 PM

Again, Agner Fog has a nice, succinct summary. Is this the method you're referring to?

**16.17 FSQRT (SSE processors)**

A fast way of calculating an approximate square root on processors with SSE is to multiply the reciprocal square root of *x* by *x*:

sqrt(x) = x * rsqrt(x)

The instruction RSQRTSS or RSQRTPS gives the reciprocal square root with a precision of 12 bits. You can improve the precision to 23 bits by using the Newton-Raphson formula described in Intel's application note AP-803:

x0 = rsqrtss(a)

x1 = 0.5 * x0 * (3 - (a * x0) * x0)

where x0 is the first approximation to the reciprocal square root of a, and x1 is a better approximation. The order of evaluation is important. You must use this formula before multiplying with a to get the square root.
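A minimal sketch of the quoted method in C, assuming an x86 compiler that provides `<xmmintrin.h>`. The function names are hypothetical; only `_mm_set_ss`, `_mm_rsqrt_ss`, and `_mm_cvtss_f32` are real SSE intrinsics. The refinement is applied first and the multiplication by a comes last, as the text requires.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_set_ss, _mm_rsqrt_ss, _mm_cvtss_f32 */

/* Hardware estimate of 1/sqrt(a) via RSQRTSS (roughly 12 bits of precision). */
float rsqrt_estimate(float a)
{
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(a)));
}

/* One Newton-Raphson step on the estimate:
   x1 = 0.5 * x0 * (3 - (a * x0) * x0)
   The evaluation order follows the quoted formula. */
float rsqrt_refined(float a)
{
    float x0 = rsqrt_estimate(a);
    return 0.5f * x0 * (3.0f - (a * x0) * x0);
}

/* sqrt(a) = a * rsqrt(a), multiplying by a only after the refinement.
   Assumption of this sketch: a is strictly positive and finite.
   For a == 0 the estimate is infinite and the product would be NaN. */
float sqrt_via_rsqrt(float a)
{
    assert(a > 0.0f);
    return a * rsqrt_refined(a);
}
```

For example, `sqrt_via_rsqrt(2.0f)` should agree with 1.41421 to roughly single precision; the exact low-order bits depend on the estimate the particular CPU returns.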

TimP

Black Belt

08-11-2011
04:03 PM

Double precision is more problematic: since each Newton-Raphson iteration roughly doubles the number of correct bits, rsqrtps etc. would need close to 14 bits of precision (as it had on the original AMD Athlon-32) to get to 52+ bits of precision after 2 iterations.
