Re: SSE 4.1 instructions - DPPS/EXTRACTPS

vsachde · ‎07-06-2008

I have been trying to use the Intel DPPS instruction with either EXTRACTPS or BLENDPS. Essentially I have a loop in which

x1 = dot-product(y1,z1)
x2 = dot-product(y2,z2)
x3 = dot-product(y3,z3)

x4 = x1/(sqrt(x2)*sqrt(x3)

I can do x1,x2,x3 with the DPPS instruction and then use extractps. So 3 DPPS with 3 EXTRACTPS. Turns out I did not get any improvement in performance. To use lesser number of EXTRACTPS, I used BLENDPS.

x1_sse = dpps(y1,z1,241)
x2_sse = dpps(y2,z2,242)
x2_sse = blendps(x1_sse,x2_sse, 2);
x3_sse = dpps(y3,z3, 244)
x3_sse = blendps(x2_sse, x3_sse, 4)

storeps(x3_sse, x3_array)
x1 = x3_array[0]
x2 = x3_array[1]
x3 = x3_array[2]

Turns out there is no improvement from this either, infact a slight degradation. All loads and stores are aligned. I am using icpc -ipo -xT -O3 -no-prec-div -static -funroll-loops (so -fast without -ipo since -ipo does not work with SSE4.1 instructions). Any comments on how I could do this better or are these instruction latencies just too long for my use ? I guess I am dissapointed with the performance of the SSE 4.1 so far.

TimP · ‎07-06-2008

icpc -xS supports automatic selection of SSE4.1 instructions, where the compiler deems them beneficial.dpps fully unrolled "vectorization" of an inner loop inhibits auto-vectorization of a containing loop, which would seem a likely application of it. In the case where traditional "re-rolling" of a long partially unrolled dot product loop avoids the compiler selection of dpps, that is the better way to full performance.

Much as ad writers love to getpaid forwriting about new instructions, more significant performance improvements of Penryn CPUs are realized in SSE2 code, for example, by the improved performance of IEEE divide and square root (both serial and parallel versions), and by the higher supported FSB ratings.

levicki · ‎07-09-2008

vsachde,

Could you please clarify if x1, y1, and z1 are scalar or vector values?

From your pseudo-code it looks like the division and square root will be the bottleneck, not the dot-product. In my syntetic tests I have measured DPPS takes 3 clocks compared to 5 clocks for the SSE3 code with horizonatal add.

Can you post simple C code equivalent of your loop? Perhaps there is a better way to transform it.

Also, isn't the:

x1 / (sqrt(x2) * sqrt(x3))

The same as:

x1 / sqrt(x2 * x3)

?

TimP · ‎07-09-2008

Removing a sqrt surely will gain more than dpps etc. could do, when those other operations pipeline effectively, even if not vectorized. If you have an SSE 4.1 machine, it should have relatively efficient divide and sqrt, but those remain without pipelining. No point in asking the compiler to do strange transformations on divide, when straightforward source code will do.

vsachde · ‎07-09-2008

Igor these are scalar values; I am now just wondering if square-root reciprocal estimate vector (_mm_sqrt_ps or _mm_rsqrt_ps) instruction might be the way to go for me. Ofcourse I will have to change the loops for the SIMD to work. And yes I already had it as sqrt(x2*x3); Another thing I should mention is that this is not really the whole loop but a part of a loop that has if-then-else instructions in its midst too which make complete SIMDization a more difficult job.

levicki · ‎07-10-2008

vsachde,

You should restructure your loop so you don't have if-then-else in it. Try to change from:

	for (int i = 0; i < count; i++) {
		if (condition) {
			// do something
		} else {
			// do something else
		}
	}

To:

	if (condition) {
		for (int i = 0; i < count; i++) {
			// do something
		}
	} else {
		for (int i = 0; i < count; i++) {
			// do something else
		}
	}

If possible of course.

As for the square root estimation, I doubt you will get better performance and you will most certainly lose precision. I definitely wouldn't waste time and effort on that unless you get paid for experimenting as well.

What I would do though is make it two or three pass — try to precalculate x2 * x3 and store it into an aligned array and then use vectorized sqrt and divide loop on it in the second pass.

TimP · ‎07-10-2008

The original idea of rsqrtps was to accelerate applications which required only 11 bit precision sqrt, without putting in the circuitry required for efficient IEEE sqrt.
The compilers default to use of rsqrtps instructions plus iteration for vectorization on account of the slowness of sqrtps on earlier CPU models. Even then, the gain was largely in the ability to pipeline other operations so as to take advantage of the many cycles spent performing sqrt and divide.
As Igor said, the big potential gain is in finding a strategy to vectorize the reciprocal sqrt. If you do that with Intel compilers, you can simply switch compile options between prec-div on and off to try rsqrtps vs IEEE accurate method.

levicki · ‎07-10-2008

That's right, since:

result = x1 / sqrt(x2 * x3)

Is the same as:

result = x1 * (1.0f / sqrt(x2 * x3))

The general idea is to transform that into:

result = x1 * RSQRTPS(x2 * x3)

If it gives you enough precision for your particular case.

Of course, it would be for the best if that transformation is done at the language level (by writing compiler friendly code) just as tim18 said.

vsachde · ‎07-10-2008

Igor and Tim, I did experiment with -no-prec-div switch and turns out it did not make any performance improvement. Also like I said the code is branchy, and condition is not really one condition but more dependant on the values of an array indexed by i, for example if (x - x[i+1] > .05*max_i). that is why the compiler cannot really pick this up that easily. I would probably have to get in there and do it manually, and then make sure that the precision is within acceptable bounds. Thanks for ur replies. If u have any more thoughts, keep them coming.

levicki · ‎07-10-2008

vsachde,

-no-prec-div won't do anything unles you make division and square root vectorizable as I suggested. As for the condition, you could also try precalculating x - x[i + 1] and putting the results into another array so that comparison can be vectorized. If the code is extremely complex compiler won't be able to do it but you may try doing it with intrinsics after you do what I am suggesting. For example, you can calculate both branches of if-then-else and then blend the result according to vectorized comparison.

I am sorry, but without seeing full loop code I simply can't give you more usefull suggestions.

vsachde · ‎07-10-2008

precalculating x - x[i + 1] and putting the results into another array is what i m going to do now.

lxguy · ‎09-12-2008

Thanks for your detailed explanation.I begin to have some knowledge of it now.