It seems to me a very common operation.
I have an __m128 xx which contains the values (xx3, xx2, xx1, xx0), all floats,
and I would like to compute xx0 + xx1 + xx2 + xx3.
Right now, I can use:
xx = _mm_hadd_ps(xx, _mm_setzero_ps()); // to get (0, 0, xx3+xx2, xx1+xx0)
xx = _mm_add_ss(xx, _mm_shuffle_ps(xx, xx, _MM_SHUFFLE(0, 0, 0, 1)));
_mm_store_ss(&temp, xx);
Another way:
xx = _mm_add_ps(xx, _mm_movehl_ps(xx, xx));
xx = _mm_add_ss(xx, _mm_shuffle_ps(xx, xx, 1));
_mm_store_ss(&temp, xx);
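For anyone who wants to try the two variants above side by side, here is a minimal sketch that wraps each one in a function. The names `hsum_movehl`, `hsum_hadd_zero` and the `HSUM_TARGET_SSE3` macro are made up for illustration and are not part of any library. It assumes an x86 target; the macro only lets GCC or Clang build the SSE3 variant without passing -msse3.

```c
#include <xmmintrin.h>  /* SSE:  __m128, _mm_movehl_ps, _mm_shuffle_ps, _mm_add_ss */
#include <pmmintrin.h>  /* SSE3: _mm_hadd_ps */

/* Assumption: on GCC/Clang this enables SSE3 for a single function. */
#if defined(__GNUC__)
#define HSUM_TARGET_SSE3 __attribute__((target("sse3")))
#else
#define HSUM_TARGET_SSE3
#endif

/* Hypothetical helper: SSE1-only horizontal sum (the "another way" variant). */
static float hsum_movehl(__m128 xx)
{
    /* Add the high pair onto the low pair: lanes 0,1 = xx0+xx2, xx1+xx3. */
    xx = _mm_add_ps(xx, _mm_movehl_ps(xx, xx));
    /* Bring lane 1 down to lane 0 and add it to lane 0. */
    xx = _mm_add_ss(xx, _mm_shuffle_ps(xx, xx, _MM_SHUFFLE(0, 0, 0, 1)));
    return _mm_cvtss_f32(xx);
}

/* Hypothetical helper: one hadd against zero, then a scalar add. */
HSUM_TARGET_SSE3
static float hsum_hadd_zero(__m128 xx)
{
    /* Result is (0, 0, xx3+xx2, xx1+xx0), highest lane first. */
    xx = _mm_hadd_ps(xx, _mm_setzero_ps());
    xx = _mm_add_ss(xx, _mm_shuffle_ps(xx, xx, _MM_SHUFFLE(0, 0, 0, 1)));
    return _mm_cvtss_f32(xx);
}
```

Both functions return the sum in a float directly through `_mm_cvtss_f32` instead of storing to `temp`. Note that the two variants pair the elements differently, so results can differ in the last bits for inputs that are not exactly representable.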
Is there a better way? This seems a very common operation. Are there any plans to make it a native instruction?
Also, how about summing the numbers held in more than one register?
Thanks.
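On the multi-register part of the question, two common patterns may be worth sketching. These are assumptions about what is wanted, not something stated in the post: if one grand total is needed, add the registers vertically first and reduce once at the end; if a separate sum per register is needed, three `_mm_hadd_ps` calls can reduce four registers at once. The function names and the `HSUM_TARGET_SSE3` macro below are hypothetical.

```c
#include <xmmintrin.h>  /* SSE:  __m128, _mm_add_ps, _mm_movehl_ps, _mm_shuffle_ps */
#include <pmmintrin.h>  /* SSE3: _mm_hadd_ps */

/* Assumption: on GCC/Clang this enables SSE3 for a single function. */
#if defined(__GNUC__)
#define HSUM_TARGET_SSE3 __attribute__((target("sse3")))
#else
#define HSUM_TARGET_SSE3
#endif

/* Hypothetical helper: grand total of all elements in regs[0..count-1]. */
static float hsum_many(const __m128 *regs, int count)
{
    __m128 acc = _mm_setzero_ps();
    int i;
    /* Vertical adds are cheap; keep four running partial sums. */
    for (i = 0; i < count; ++i)
        acc = _mm_add_ps(acc, regs[i]);
    /* A single horizontal reduction at the very end. */
    acc = _mm_add_ps(acc, _mm_movehl_ps(acc, acc));
    acc = _mm_add_ss(acc, _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(0, 0, 0, 1)));
    return _mm_cvtss_f32(acc);
}

/* Hypothetical helper: one horizontal sum per input register.
   Lane 0 = sum of a, lane 1 = sum of b, lane 2 = sum of c, lane 3 = sum of d. */
HSUM_TARGET_SSE3
static __m128 hsum4_hadd(__m128 a, __m128 b, __m128 c, __m128 d)
{
    __m128 ab = _mm_hadd_ps(a, b);  /* lanes: a0+a1, a2+a3, b0+b1, b2+b3 */
    __m128 cd = _mm_hadd_ps(c, d);  /* lanes: c0+c1, c2+c3, d0+d1, d2+d3 */
    return _mm_hadd_ps(ab, cd);
}
```

In the second helper no lane is wasted, which is where hadd tends to be a better fit than in the single-register case. Both helpers add the elements in a different order than a plain scalar loop would, so rounding may differ slightly.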
3 Replies
Use hadd twice, with the same register for both sources and for the destination:
// xx = { xx3, xx2, xx1, xx0 }
xx = _mm_hadd_ps(xx, xx);
// xx = { xx3+xx2, xx1+xx0, xx3+xx2, xx1+xx0 }
xx = _mm_hadd_ps(xx, xx);
// xx = { xx3+xx2+xx1+xx0, xx3+xx2+xx1+xx0, xx3+xx2+xx1+xx0, xx3+xx2+xx1+xx0 }
The other sums are superfluous (but do not add additional overhead).
Note: if you are on an older processor, try to insert a load, a store, or non-SSE instruction(s) between the hadds.
Jim Dempsey
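As a rough, self-contained version of the double-hadd suggestion, something like the following should work. The function name `hsum_hadd2` and the `HSUM_TARGET_SSE3` macro are invented for this sketch; the macro is only an assumption that lets GCC or Clang compile the function without -msse3.

```c
#include <xmmintrin.h>  /* SSE:  __m128, _mm_cvtss_f32 */
#include <pmmintrin.h>  /* SSE3: _mm_hadd_ps */

/* Assumption: on GCC/Clang this enables SSE3 for a single function. */
#if defined(__GNUC__)
#define HSUM_TARGET_SSE3 __attribute__((target("sse3")))
#else
#define HSUM_TARGET_SSE3
#endif

/* Hypothetical helper: horizontal sum using two hadds on the same register. */
HSUM_TARGET_SSE3
static float hsum_hadd2(__m128 xx)
{
    /* After the first hadd: { xx3+xx2, xx1+xx0, xx3+xx2, xx1+xx0 }. */
    xx = _mm_hadd_ps(xx, xx);
    /* After the second hadd every lane holds xx3+xx2+xx1+xx0. */
    xx = _mm_hadd_ps(xx, xx);
    /* Read the total from lane 0. */
    return _mm_cvtss_f32(xx);
}
```

For example, `hsum_hadd2(_mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f))` adds 1 + 2 and 3 + 4 first, then the two partial sums.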
Thank you for your response. Isn't _mm_hadd_ps more expensive than _mm_add_ps? I haven't been able to find out exactly how much more expensive it is, though.
What's "best" likely depends on what architecture you're targeting. For example, if I have a function like:
[cpp]
float foo(float in[4])
{
    return __sec_reduce_add(in[:]);
}
[/cpp]
(Note I'm using Intel Cilk Plus array notation here to do the summation)
For the default Intel SSE2 (i.e. icc -S -c test.c), the compiler in Intel C++ Composer XE Update 2 generates:
movups (%rdi), %xmm0 #2.11
movaps %xmm0, %xmm1 #2.28
movhlps %xmm0, %xmm1 #2.28
addps %xmm1, %xmm0 #2.28
movaps %xmm0, %xmm2 #2.28
shufps $245, %xmm0, %xmm2 #2.28
addss %xmm2, %xmm0 #2.28
ret #2.11
For Intel SSE4.2 (icc -xSSE4.2 -S -c test.c), I get:
movups (%rdi), %xmm0 #2.11
haddps %xmm0, %xmm0 #2.28
haddps %xmm0, %xmm0 #2.28
ret #2.11