missing__zlw
Beginner
990 Views

What's the best way to sum up the values in an __m128?

It seems to me a very common operation.
I have an __m128 xx which contains the values (xx3, xx2, xx1, xx0), all floats,

and I would like to compute xx0 + xx1 + xx2 + xx3.

Right now, I can use:
xx = _mm_hadd_ps(xx, _mm_setzero_ps()); // gives (0, 0, xx3+xx2, xx1+xx0)

xx = _mm_add_ss(xx, _mm_shuffle_ps(xx, xx, _MM_SHUFFLE(0, 0, 0, 1)));
_mm_store_ss(&temp, xx);

Another way:

xx = _mm_add_ps(xx, _mm_movehl_ps(xx, xx));
xx = _mm_add_ss(xx, _mm_shuffle_ps(xx, xx, 1));
_mm_store_ss(&temp, xx);
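
Wrapped as a function, the second variant might look like the sketch below. The name hsum_ps_sse1 is made up for illustration; it uses only SSE intrinsics from <xmmintrin.h>, so it assumes an x86 target with SSE enabled.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_add_ps, _mm_movehl_ps, ... */

/* Hypothetical helper: returns v0 + v1 + v2 + v3 for v = (v3, v2, v1, v0). */
static float hsum_ps_sse1(__m128 v)
{
    /* Copy the upper two floats onto the lower two: (v3, v2, v3, v2). */
    __m128 hi = _mm_movehl_ps(v, v);
    /* Lane 0 = v0 + v2, lane 1 = v1 + v3; the upper lanes are not needed. */
    __m128 sum2 = _mm_add_ps(v, hi);
    /* Bring lane 1 down to lane 0. */
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(0, 0, 0, 1));
    /* Lane 0 = (v0 + v2) + (v1 + v3). */
    __m128 total = _mm_add_ss(sum2, lane1);
    /* Extract lane 0 as a scalar float. */
    return _mm_cvtss_f32(total);
}
```

For example, hsum_ps_sse1(_mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f)) evaluates (1 + 3) + (2 + 4). Note that the pairing differs from a left-to-right scalar loop, so rounding can differ slightly for general inputs.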

Is there a better way? Are there any plans to add a native instruction for it?

Also, what about summing the numbers in more than one register?

Thanks.

3 Replies
jimdempseyatthecove
Black Belt

Use hadd twice, with the same register as both sources and as the destination:

// xx = {xx3, xx2, xx1, xx0}
xx = _mm_hadd_ps(xx, xx);
// xx = {xx3+xx2, xx1+xx0, xx3+xx2, xx1+xx0}
xx = _mm_hadd_ps(xx, xx);
// xx = {xx3+xx2+xx1+xx0, xx3+xx2+xx1+xx0, xx3+xx2+xx1+xx0, xx3+xx2+xx1+xx0}

The sums in the other three lanes are superfluous, but they add no overhead.

Note: if you are on an older processor, try to insert a load, a store, or some non-SSE instruction(s) between the two hadds.
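
For the question about more than one register, the same instruction can put the otherwise superfluous lanes to work. The sketch below reduces four registers with three hadds; the name hsum4_ps_sse3 and the HSUM_TARGET_SSE3 macro are made up for illustration, and the target attribute is a GCC/Clang assumption (with other compilers, enable SSE3 through their own option instead).

```c
#include <pmmintrin.h>  /* SSE3: _mm_hadd_ps */

/* Assumption: GCC or Clang. The attribute lets this one function use SSE3
   even when the file is not compiled with -msse3. */
#if defined(__GNUC__)
#define HSUM_TARGET_SSE3 __attribute__((target("sse3")))
#else
#define HSUM_TARGET_SSE3
#endif

/* Hypothetical helper: returns (sum of d, sum of c, sum of b, sum of a),
   i.e. lane 0 holds a0+a1+a2+a3, lane 1 holds b0+b1+b2+b3, and so on. */
HSUM_TARGET_SSE3
static __m128 hsum4_ps_sse3(__m128 a, __m128 b, __m128 c, __m128 d)
{
    __m128 ab = _mm_hadd_ps(a, b);  /* (b3+b2, b1+b0, a3+a2, a1+a0) */
    __m128 cd = _mm_hadd_ps(c, d);  /* (d3+d2, d1+d0, c3+c2, c1+c0) */
    return _mm_hadd_ps(ab, cd);     /* (sum d, sum c, sum b, sum a) */
}
```

The result can be written out with _mm_storeu_ps, after which element 0 holds the sum of a and element 3 the sum of d. That is three hadds for four sums, compared with two hadds per register when each one is reduced on its own.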

Jim Dempsey
missing__zlw
Beginner

Thank you for your response. Isn't _mm_hadd_ps more expensive than _mm_add_ps? I haven't been able to find out exactly how much more expensive it is, though.
Brandon_H_Intel
Employee

What's "best" likely depends on what architecture you're targeting. For example, if I have a function like:

[cpp]float foo(float in[4]) {
   return __sec_reduce_add(in[:]);
}[/cpp]

(Note that I'm using Intel Cilk Plus array notation here to do the summation.)
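
For compilers without Cilk Plus array notation, roughly the same reduction can be written as a plain loop. This is only a sketch: foo_scalar is a made-up name, and the code a given compiler generates for it may differ from the listings in this post.

```c
#include <stddef.h>  /* size_t */

/* Hypothetical plain-C counterpart of foo(): adds in[0..3] left to right. */
static float foo_scalar(const float in[4])
{
    float sum = 0.0f;
    for (size_t i = 0; i < 4; ++i)
        sum += in[i];  /* strict left-to-right order: ((in0+in1)+in2)+in3 */
    return sum;
}
```

Because floating-point addition is not associative, a compiler has to keep this left-to-right order unless relaxed floating-point semantics are enabled, so the loop is not guaranteed to turn into the pairwise SIMD sequences shown here.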

For the default Intel SSE2 (i.e. icc -S -c test.c), the compiler in Intel C++ Composer XE Update 2 generates:

movups    (%rdi), %xmm0          #2.11
movaps    %xmm0, %xmm1           #2.28
movhlps   %xmm0, %xmm1           #2.28
addps     %xmm1, %xmm0           #2.28
movaps    %xmm0, %xmm2           #2.28
shufps    $245, %xmm0, %xmm2     #2.28
addss     %xmm2, %xmm0           #2.28
ret                              #2.11

For Intel SSE4.2 (icc -xSSE4.2 -S -c test.c), I get:

movups    (%rdi), %xmm0          #2.11
haddps    %xmm0, %xmm0           #2.28
haddps    %xmm0, %xmm0           #2.28
ret                              #2.11
