x1 = dot-product(y1,z1)
x2 = dot-product(y2,z2)
x3 = dot-product(y3,z3)
x4 = x1/(sqrt(x2)*sqrt(x3))
I can compute x1, x2 and x3 with the DPPS instruction and then use EXTRACTPS, so 3 DPPS with 3 EXTRACTPS. It turns out I did not get any improvement in performance. To use fewer EXTRACTPS instructions, I used BLENDPS:
x1_sse = _mm_dp_ps(y1, z1, 0xF1);         /* 0xF1 = 241: dot product into element 0 */
x2_sse = _mm_dp_ps(y2, z2, 0xF2);         /* 0xF2 = 242: dot product into element 1 */
x2_sse = _mm_blend_ps(x1_sse, x2_sse, 2); /* merge element 1 into x1_sse's result */
x3_sse = _mm_dp_ps(y3, z3, 0xF4);         /* 0xF4 = 244: dot product into element 2 */
x3_sse = _mm_blend_ps(x2_sse, x3_sse, 4); /* merge element 2 */
_mm_store_ps(x3_array, x3_sse);           /* one aligned store */
x1 = x3_array[0];
x2 = x3_array[1];
x3 = x3_array[2];
It turns out there is no improvement from this either, in fact a slight degradation. All loads and stores are aligned. I am compiling with icpc -xT -O3 -no-prec-div -static -funroll-loops (that is, -fast without -ipo, since -ipo does not work with SSE4.1 instructions). Any comments on how I could do this better, or are these instruction latencies just too long for my use? I guess I am disappointed with the performance of SSE4.1 so far.
icpc -xS supports automatic selection of SSE4.1 instructions where the compiler deems them beneficial. A fully unrolled DPPS "vectorization" of an inner loop inhibits auto-vectorization of the containing loop, which would seem a likely application here. Where traditional "re-rolling" of a long, partially unrolled dot-product loop avoids the compiler's selection of DPPS, that is the better route to full performance.
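To illustrate (the function and variable names below are only placeholders, not code from this thread), a "re-rolled" dot product is just the plain scalar loop; written this way, the auto-vectorizer is free to decide how to vectorize it, including together with the loop that contains it:
/* Plain, "re-rolled" dot product: leave unrolling and instruction selection to the compiler. */
float dotp(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}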
Much as ad writers love to get paid for writing about new instructions, the more significant performance improvements of Penryn CPUs are realized in SSE2 code, for example by the improved performance of IEEE divide and square root (both serial and parallel versions), and by the higher supported FSB ratings.
vsachde,
Could you please clarify if x1, y1, and z1 are scalar or vector values?
From your pseudo-code it looks like the division and square root will be the bottleneck, not the dot product. In my synthetic tests I have measured that DPPS takes 3 clocks, compared to 5 clocks for SSE3 code with a horizontal add.
Can you post simple C code equivalent of your loop? Perhaps there is a better way to transform it.
Also, isn't:
x1 / (sqrt(x2) * sqrt(x3))
the same as:
x1 / sqrt(x2 * x3)
?
vsachde,
You should restructure your loop so you don't have if-then-else in it. Try to change from:
for (int i = 0; i < count; i++) {
    if (condition) {
        // do something
    } else {
        // do something else
    }
}
To:
if (condition) {
    for (int i = 0; i < count; i++) {
        // do something
    }
} else {
    for (int i = 0; i < count; i++) {
        // do something else
    }
}
If possible of course.
As for the square root estimation, I doubt you will get better performance and you will most certainly lose precision. I definitely wouldn't waste time and effort on that unless you get paid for experimenting as well.
What I would do, though, is make it two or three passes: precalculate x2 * x3 and store it into an aligned array, then run a vectorized sqrt-and-divide loop over it in the second pass.
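A rough sketch of that two-pass idea, assuming the inputs are arrays of 4-float vectors; dot4, N and the array names are placeholders, not your actual code:
#include <math.h>

#define N 1024

/* Placeholder 4-element dot product; the real data layout is not shown in the thread. */
static inline float dot4(const float *a, const float *b)
{
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
}

void two_pass(const float y1[][4], const float z1[][4],
              const float y2[][4], const float z2[][4],
              const float y3[][4], const float z3[][4],
              float result[N])
{
    /* Aligned scratch arrays so the second loop can use aligned vector loads. */
    __attribute__((aligned(16))) float x1[N];
    __attribute__((aligned(16))) float prod[N];

    /* Pass 1: dot products; keep only x1 and the product x2*x3. */
    for (int i = 0; i < N; i++) {
        x1[i]   = dot4(y1[i], z1[i]);
        prod[i] = dot4(y2[i], z2[i]) * dot4(y3[i], z3[i]);
    }

    /* Pass 2: a plain loop over contiguous floats that the compiler can
       vectorize with sqrtps/divps, or rsqrtps plus iteration under -no-prec-div. */
    for (int i = 0; i < N; i++) {
        result[i] = x1[i] / sqrtf(prod[i]);
    }
}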
The compilers default to use of rsqrtps instructions plus iteration for vectorization on account of the slowness of sqrtps on earlier CPU models. Even then, the gain was largely in the ability to pipeline other operations so as to take advantage of the many cycles spent performing sqrt and divide.
As Igor said, the big potential gain is in finding a strategy to vectorize the reciprocal square root. If you do that with the Intel compilers, you can simply switch between prec-div on and off to compare rsqrtps against the IEEE-accurate method.
That's right, since:
result = x1 / sqrt(x2 * x3)
Is the same as:
result = x1 * (1.0f / sqrt(x2 * x3))
The general idea is to transform that into:
result = x1 * RSQRTPS(x2 * x3)
If it gives you enough precision for your particular case.
Of course, it would be best if that transformation is done at the language level (by writing compiler-friendly code), just as tim18 said.
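Purely as an illustration (the function and variable names are not from this thread), the intrinsic form of that transformation with one Newton-Raphson refinement step could look like this; whether a single step gives enough precision depends on your application:
#include <xmmintrin.h>  /* SSE: _mm_rsqrt_ps, _mm_mul_ps, ... */

/* result = x1 * rsqrt(x2 * x3), refined once because RSQRTPS alone
   gives only about 12 bits of precision. */
static inline __m128 scaled_rsqrt(__m128 x1, __m128 x2, __m128 x3)
{
    __m128 p    = _mm_mul_ps(x2, x3);          /* x2 * x3 */
    __m128 r    = _mm_rsqrt_ps(p);             /* ~12-bit estimate of 1/sqrt(p) */
    /* Newton-Raphson step: r = r * (1.5 - 0.5 * p * r * r) */
    __m128 half = _mm_set1_ps(0.5f);
    __m128 th   = _mm_set1_ps(1.5f);
    __m128 rr   = _mm_mul_ps(r, r);
    r = _mm_mul_ps(r, _mm_sub_ps(th, _mm_mul_ps(half, _mm_mul_ps(p, rr))));
    return _mm_mul_ps(x1, r);                  /* x1 * 1/sqrt(x2*x3) */
}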
vsachde,
-no-prec-div won't do anything unless you make the division and square root vectorizable as I suggested. As for the condition, you could also try precalculating x[i] - x[i + 1] and putting the results into another array so that the comparison can be vectorized. If the code is extremely complex the compiler won't be able to do it, but you may try doing it with intrinsics after you do what I am suggesting. For example, you can calculate both branches of the if-then-else and then blend the result according to a vectorized comparison; see the sketch below.
I am sorry, but without seeing the full loop code I simply can't give you more useful suggestions.
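As a hypothetical example of that last suggestion (the condition x[i] > x[i + 1] and the two branch values are stand-ins, not your actual code):
#include <smmintrin.h>  /* SSE4.1: _mm_blendv_ps */

/* Compute both branches for four elements at once, then select per lane
   with a vectorized comparison instead of a scalar if-then-else. */
static inline __m128 select_branch(__m128 xi, __m128 xi1,
                                   __m128 then_val, __m128 else_val)
{
    __m128 mask = _mm_cmpgt_ps(xi, xi1);             /* stand-in condition x[i] > x[i+1] */
    return _mm_blendv_ps(else_val, then_val, mask);  /* take then_val where the mask is set */
}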