vsachde

Beginner

07-06-2008
11:50 AM

238 Views

SSE 4.1 instructions - DPPS/EXTRACTPS

x1 = dot-product(y1,z1)

x2 = dot-product(y2,z2)

x3 = dot-product(y3,z3)

x4 = x1/(sqrt(x2)*sqrt(x3))

I can compute x1, x2, and x3 with the DPPS instruction and then use EXTRACTPS, so 3 DPPS with 3 EXTRACTPS. It turns out I did not get any improvement in performance. To use fewer EXTRACTPS, I used BLENDPS.

x1_sse = dpps(y1,z1,241)

x2_sse = dpps(y2,z2,242)

x2_sse = blendps(x1_sse,x2_sse, 2);

x3_sse = dpps(y3,z3, 244)

x3_sse = blendps(x2_sse, x3_sse, 4)

storeps(x3_sse, x3_array)

x1 = x3_array[0]

x2 = x3_array[1]

x3 = x3_array[2]

It turns out there is no improvement from this either, in fact a slight degradation. All loads and stores are aligned. I am using icpc -ipo -xT -O3 -no-prec-div -static -funroll-loops (so -fast without -ipo since -ipo does not work with SSE4.1 instructions). Any comments on how I could do this better, or are these instruction latencies just too long for my use? I guess I am disappointed with the performance of SSE 4.1 so far.
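
For reference, the packing scheme described above could be written with intrinsics roughly as follows. This is only a minimal sketch, not the poster's actual code: the function name `three_dots` is hypothetical, the `target` attribute assumes GCC or Clang on x86, and unaligned loads are used so the sketch does not depend on the alignment mentioned in the post.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_dp_ps, _mm_blend_ps */

/* Hypothetical helper: computes three 4-element dot products and leaves
   them in out[0], out[1], out[2] using a single store. out[3] ends up 0.
   The target attribute (GCC/Clang assumption) enables SSE4.1 for this
   function without a global compiler flag. */
__attribute__((target("sse4.1")))
void three_dots(const float *y1, const float *z1,
                const float *y2, const float *z2,
                const float *y3, const float *z3,
                float out[4])
{
    /* 0xF1 (241): multiply all four lanes, write the sum to lane 0, zero the rest. */
    __m128 d1 = _mm_dp_ps(_mm_loadu_ps(y1), _mm_loadu_ps(z1), 0xF1);
    /* 0xF2 (242): same products, sum written to lane 1. */
    __m128 d2 = _mm_dp_ps(_mm_loadu_ps(y2), _mm_loadu_ps(z2), 0xF2);
    /* 0xF4 (244): same products, sum written to lane 2. */
    __m128 d3 = _mm_dp_ps(_mm_loadu_ps(y3), _mm_loadu_ps(z3), 0xF4);

    /* Blend mask 2 takes lane 1 from d2; mask 4 takes lane 2 from d3. */
    __m128 packed = _mm_blend_ps(d1, d2, 2);
    packed = _mm_blend_ps(packed, d3, 4);

    _mm_storeu_ps(out, packed);
}
```

Because each DPPS already zeroes the lanes it does not write, the two blends could also be replaced by a bitwise OR of the three results; either way the extraction cost is small next to the divide and square roots that follow.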

11 Replies

TimP

Black Belt

07-06-2008
08:39 PM

icpc -xS supports automatic selection of SSE4.1 instructions where the compiler deems them beneficial. Fully unrolled dpps "vectorization" of an inner loop inhibits auto-vectorization of a containing loop, which would seem a likely application of it. In the case where traditional "re-rolling" of a long partially unrolled dot product loop avoids the compiler selection of dpps, that is the better way to full performance.

Much as ad writers love to get paid for writing about new instructions, more significant performance improvements of Penryn CPUs are realized in SSE2 code, for example, by the improved performance of IEEE divide and square root (both serial and parallel versions), and by the higher supported FSB ratings.

ILevi1

Valued Contributor I

07-09-2008
03:41 PM

vsachde,

Could you please clarify if x1, y1, and z1 are scalar or vector values?

From your pseudo-code it looks like the division and square root will be the bottleneck, not the dot product. In my synthetic tests I have measured that DPPS takes 3 clocks compared to 5 clocks for the SSE3 code with horizontal add.

Can you post simple C code equivalent of your loop? Perhaps there is a better way to transform it.

Also, isn't the:

x1 / (sqrt(x2) * sqrt(x3))

The same as:

x1 / sqrt(x2 * x3)

?
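
For what it is worth, the identity holds whenever x2 and x3 are positive (the assumption here is that both are dot products that cannot be negative, which the thread does not state explicitly):

```latex
\[
\frac{x_1}{\sqrt{x_2}\,\sqrt{x_3}} \;=\; \frac{x_1}{\sqrt{x_2\,x_3}}
\qquad \text{for } x_2 > 0,\ x_3 > 0 .
\]
```

In floating point the two forms can differ in the last bits, and the product x2 * x3 can overflow or underflow in cases where the two separate square roots would not, so the single-sqrt form is a trade of range for one fewer square root.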

TimP

Black Belt

07-09-2008
07:25 PM

vsachde

Beginner

07-09-2008
08:56 PM

ILevi1

Valued Contributor I

07-10-2008
05:52 AM

vsachde,

You should restructure your loop so you don't have if-then-else in it. Try to change from:

for (int i = 0; i < count; i++) {
    if (condition) {
        // do something
    } else {
        // do something else
    }
}

To:

if (condition) {
    for (int i = 0; i < count; i++) {
        // do something
    }
} else {
    for (int i = 0; i < count; i++) {
        // do something else
    }
}

If possible of course.

As for the square root estimation, I doubt you will get better performance and you will most certainly lose precision. I definitely wouldn't waste time and effort on that unless you get paid for experimenting as well.

What I would do, though, is make it two or three passes: try to precalculate x2 * x3 and store it into an aligned array, and then use a vectorized sqrt and divide loop on it in the second pass.
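
A rough sketch of that two-pass idea, under stated assumptions: the function names and the array layout (`x1`, `x2`, `x3` already holding the per-element dot products) are hypothetical, since the real loop was never posted. The second pass is written with explicit SSE intrinsics to make the vector sqrt and divide visible; a plain C loop of the same shape would give the compiler the chance to vectorize it on its own. Unaligned loads are used so the sketch does not rely on alignment.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_sqrt_ps, _mm_div_ps */

/* Pass 1 (hypothetical layout): store the products x2[i] * x3[i].
   In the real code this is where the dot products would be computed. */
void pass1_products(const float *x2, const float *x3, float *prod, size_t n)
{
    for (size_t i = 0; i < n; i++)
        prod[i] = x2[i] * x3[i];
}

/* Pass 2: out[i] = x1[i] / sqrt(prod[i]), four elements per iteration. */
void pass2_sqrt_div(const float *x1, const float *prod, float *out, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 num = _mm_loadu_ps(x1 + i);
        __m128 den = _mm_sqrt_ps(_mm_loadu_ps(prod + i));
        _mm_storeu_ps(out + i, _mm_div_ps(num, den));
    }
    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; i++) {
        __m128 den = _mm_sqrt_ss(_mm_set_ss(prod[i]));
        out[i] = _mm_cvtss_f32(_mm_div_ss(_mm_set_ss(x1[i]), den));
    }
}
```

The point of the split is that the expensive sqrt and divide run on four values at once instead of once per iteration of a loop the compiler could not vectorize.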

TimP

Black Belt

07-10-2008
06:10 AM

The compilers default to use of rsqrtps instructions plus iteration for vectorization on account of the slowness of sqrtps on earlier CPU models. Even then, the gain was largely in the ability to pipeline other operations so as to take advantage of the many cycles spent performing sqrt and divide.

As Igor said, the big potential gain is in finding a strategy to vectorize the reciprocal sqrt. If you do that with Intel compilers, you can simply switch compile options between prec-div on and off to try rsqrtps vs IEEE accurate method.

ILevi1

Valued Contributor I

07-10-2008
07:19 AM

That's right, since:

result = x1 / sqrt(x2 * x3)

Is the same as:

result = x1 * (1.0f / sqrt(x2 * x3))

The general idea is to transform that into:

result = x1 * RSQRTPS(x2 * x3)

If it gives you enough precision for your particular case.

Of course, it would be for the best if that transformation is done at the language level (by writing compiler friendly code) just as tim18 said.
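
If the compiler cannot be persuaded to do it, the same transformation could be sketched by hand as below. This is an illustration under assumptions, not a drop-in replacement: the function name `ratio_rsqrt` and the array layout are hypothetical, `n` is assumed to be a multiple of 4, and one Newton-Raphson step is added after RSQRTPS (the "plus iteration" mentioned earlier in the thread) to bring the roughly 12-bit estimate close to single precision.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_rsqrt_ps */

/* out[i] = x1[i] * (1 / sqrt(x2[i] * x3[i])) using the RSQRTPS estimate
   refined by one Newton-Raphson step. Hypothetical helper; assumes n is a
   multiple of 4 and that every product x2[i] * x3[i] is strictly positive
   (a zero product would produce NaN in the refinement step). */
void ratio_rsqrt(const float *x1, const float *x2, const float *x3,
                 float *out, size_t n)
{
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);

    for (size_t i = 0; i + 4 <= n; i += 4) {
        __m128 p = _mm_mul_ps(_mm_loadu_ps(x2 + i), _mm_loadu_ps(x3 + i));
        __m128 r = _mm_rsqrt_ps(p);              /* rough estimate of 1/sqrt(p) */

        /* Newton-Raphson refinement: r = 0.5 * r * (3 - p * r * r) */
        __m128 prr = _mm_mul_ps(_mm_mul_ps(p, r), r);
        r = _mm_mul_ps(_mm_mul_ps(half, r), _mm_sub_ps(three, prr));

        _mm_storeu_ps(out + i, _mm_mul_ps(_mm_loadu_ps(x1 + i), r));
    }
}
```

Results are close to, but not bit-identical with, the IEEE divide and sqrt version, so whether this is acceptable depends on the precision the application needs.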

vsachde

Beginner

07-10-2008
10:35 AM

ILevi1

Valued Contributor I

07-10-2008
11:48 AM

vsachde,

-no-prec-div won't do anything unless you make the division and square root vectorizable, as I suggested. As for the condition, you could also try precalculating x[i] - x[i + 1] and putting the results into another array so that the comparison can be vectorized. If the code is extremely complex, the compiler won't be able to do it, but you may try doing it with intrinsics after you do what I am suggesting. For example, you can calculate both branches of the if-then-else and then blend the results according to a vectorized comparison.

I am sorry, but without seeing the full loop code I simply can't give you more useful suggestions.
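
Since the actual loop was not posted, the following is only a guess at the shape of that suggestion. The function names, the condition (`diff[i] < 0`), and the arrays `a` and `b` holding the already-computed results of the two branches are all hypothetical. The select is done with SSE compare and bitwise operations; on SSE4.1 hardware `_mm_blendv_ps` could replace the and/andnot/or sequence.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_cmplt_ps, _mm_and_ps, _mm_andnot_ps */

/* Step 1: precompute diff[i] = x[i] - x[i + 1] for i = 0 .. n - 2. */
void precompute_diffs(const float *x, float *diff, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++)
        diff[i] = x[i] - x[i + 1];
}

/* Step 2: out[i] = (diff[i] < 0) ? a[i] : b[i], without branching.
   a[] and b[] hold the results of both branches, computed beforehand. */
void select_by_sign(const float *diff, const float *a, const float *b,
                    float *out, size_t n)
{
    const __m128 zero = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* Lane is all-ones where diff < 0, all-zeros otherwise. */
        __m128 mask = _mm_cmplt_ps(_mm_loadu_ps(diff + i), zero);
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        /* (mask & a) | (~mask & b) picks a where the condition holds. */
        __m128 r = _mm_or_ps(_mm_and_ps(mask, va), _mm_andnot_ps(mask, vb));
        _mm_storeu_ps(out + i, r);
    }
    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; i++)
        out[i] = (diff[i] < 0.0f) ? a[i] : b[i];
}
```

Computing both branches costs extra arithmetic, so this only pays off when the branches are cheap compared to the cost of keeping the loop scalar.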

vsachde

Beginner

07-10-2008
11:50 AM

Precalculating x[i] - x[i + 1] and putting the results into another array is what I am going to do now.

lxguy

Beginner

09-12-2008
08:11 AM

Thanks for your detailed explanation. I begin to have some knowledge of it now.
