What's the best way to sum up the values in an __m128?

missing__zlw · ‎02-10-2011

It seems to me a very common operation.
I have an __m128 xx that contains the values (xx3, xx2, xx1, xx0), all floats,

and I would like to compute xx0 + xx1 + xx2 + xx3.

Right now, I can use:
xx = _mm_hadd_ps(xx, _mm_setzero_ps()); // to get (0, 0, xx3+xx2, xx1+xx0)

xx= _mm_add_ss(xx, _mm_shuffle_ps(xx,xx, _MM_SHUFFLE( 0, 0, 0, 1 )) );
_mm_store_ss(&temp,xx);

Another way :

xx= _mm_add_ps(xx, _mm_movehl_ps(xx, xx));

xx= _mm_add_ss(xx, _mm_shuffle_ps(xx, xx, 1));

_mm_store_ss( &temp, xx );
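
For reference, the second variant can be wrapped into a self-contained function. This is only a minimal sketch: the name hsum_ps_sse1 is made up for illustration, and it returns the result with _mm_cvtss_f32 instead of storing to temp. It needs nothing beyond SSE1.

```c
#include <xmmintrin.h>  /* SSE1: _mm_movehl_ps, _mm_shuffle_ps, _mm_add_ss */

/* hsum_ps_sse1 is a hypothetical helper name, not an Intel intrinsic. */
static inline float hsum_ps_sse1(__m128 xx)
{
    /* Lanes are written high to low: xx = (xx3, xx2, xx1, xx0). */
    __m128 hi   = _mm_movehl_ps(xx, xx);   /* (xx3, xx2, xx3, xx2) */
    __m128 sum2 = _mm_add_ps(xx, hi);      /* lane 1 = xx1+xx3, lane 0 = xx0+xx2 */

    /* Bring lane 1 down to lane 0, then add only the lowest lane. */
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(0, 0, 0, 1));
    return _mm_cvtss_f32(_mm_add_ss(sum2, lane1));
}
```

Called as hsum_ps_sse1(xx), it returns xx0 + xx1 + xx2 + xx3 and leaves the upper lanes of the intermediate results unused.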

Is there a better way? This seems a very common operation. Any plan to make it native?

Also, how about summing up the numbers in more than one register?

Thanks.

jimdempseyatthecove · ‎02-11-2011

Use hadd twice, with the same register as both sources and as the destination:

// xx = { xx3, xx2, xx1, xx0 }
xx=_mm_hadd_ps(xx,xx);
// xx = {xx3+xx2, xx1+xx0, xx3+xx2, xx1+xx0}
xx=_mm_hadd_ps(xx,xx);
// xx = {xx3+xx2+xx1+xx0, xx3+xx2+xx1+xx0, xx3+xx2+xx1+xx0, xx3+xx2+xx1+xx0}

The other sums are superfluous (but do not add additional overhead).

Note: if you are on an older processor, try to insert a load, a store, or other non-SSE instruction(s) between the hadds.
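
As a rough sketch, the two hadd steps above can be packaged as a function, and the same instruction also covers the multi-register part of the original question if the goal is one sum per input register. The helper names hsum_hadd and hsum4_hadd and the HSUM_SSE3 macro are made up for illustration. The target attribute assumes GCC or Clang; when compiling with -msse3 (or with another compiler) it is not needed.

```c
#include <pmmintrin.h>  /* SSE3: _mm_hadd_ps */

/* Assumption: GCC or Clang. The attribute enables SSE3 for the tagged
   functions only, so the file builds without -msse3. */
#if defined(__GNUC__)
#define HSUM_SSE3 __attribute__((target("sse3")))
#else
#define HSUM_SSE3
#endif

/* Hypothetical helper: sum of the four lanes of v. */
HSUM_SSE3 static float hsum_hadd(__m128 v)
{
    v = _mm_hadd_ps(v, v);   /* (v3+v2, v1+v0, v3+v2, v1+v0) */
    v = _mm_hadd_ps(v, v);   /* every lane holds v3+v2+v1+v0 */
    return _mm_cvtss_f32(v); /* read the lowest lane */
}

/* Hypothetical helper: four horizontal sums at once.
   Result, low to high: (sum of a, sum of b, sum of c, sum of d). */
HSUM_SSE3 static __m128 hsum4_hadd(__m128 a, __m128 b, __m128 c, __m128 d)
{
    __m128 ab = _mm_hadd_ps(a, b);  /* low to high: a0+a1, a2+a3, b0+b1, b2+b3 */
    __m128 cd = _mm_hadd_ps(c, d);  /* low to high: c0+c1, c2+c3, d0+d1, d2+d3 */
    return _mm_hadd_ps(ab, cd);
}
```

In the four-register version none of the partial sums are wasted: three hadds produce four results. If instead everything should collapse into a single scalar, adding the registers together with _mm_add_ps first and doing one horizontal sum at the end is the usual approach.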

Jim Dempsey

missing__zlw · ‎02-11-2011

Thank you for your response. Isn't _mm_hadd_ps more expensive than _mm_add_ps? I haven't been able to find out exactly how much more expensive it is, though.

Brandon_H_Intel · ‎02-11-2011

What's "best" likely depends on what architecture you're targeting. For example, if I have a function like:

[cpp]float foo(float in[4]) {
   return __sec_reduce_add(in[:]);
}[/cpp]

(Note I'm using Intel Cilk Plus array notation here to do the summation)

For the default Intel SSE2 (i.e. icc -S -c test.c), the compiler in Intel C++ Composer XE Update 2 generates:

movups    (%rdi), %xmm0          #2.11
movaps    %xmm0, %xmm1           #2.28
movhlps   %xmm0, %xmm1           #2.28
addps     %xmm1, %xmm0           #2.28
movaps    %xmm0, %xmm2           #2.28
shufps    $245, %xmm0, %xmm2     #2.28
addss     %xmm2, %xmm0           #2.28
ret                              #2.11

For Intel SSE4.2 (icc -xSSE4.2 -S -c test.c), I get:

movups    (%rdi), %xmm0          #2.11
haddps    %xmm0, %xmm0           #2.28
ret                              #2.11