It seems to me a very common operation.
I have an __m128 xx which contains the values (xx3, xx2, xx1, xx0),
and I would like to compute xx0 + xx1 + xx2 + xx3.
Right now, I can use:
xx = _mm_hadd_ps(xx, _mm_setzero_ps()); // to get (0, 0, xx3+xx2, xx1+xx0)
xx = _mm_add_ss(xx, _mm_shuffle_ps(xx, xx, _MM_SHUFFLE(0, 0, 0, 1)));
_mm_store_ss(&temp, xx);
Another way:
xx = _mm_add_ps(xx, _mm_movehl_ps(xx, xx));
xx = _mm_add_ss(xx, _mm_shuffle_ps(xx, xx, 1));
_mm_store_ss(&temp, xx);
Is there a better way? This seems like a very common operation. Are there any plans to make it a native instruction?
Also, how about summing up numbers held in more than one register?
Thanks.
1 Reply
Your first hadd implementation requires 5 + 1 + 3 cycles (not counting the setzero and the store).
The second requires 1 + 3 + 1 + 3 cycles (again not counting the store).
Here's another one, if you have xx in memory:
[cpp]xx[0] = (xx[0] + xx[1]) + (xx[2] + xx[3]);[/cpp]
It requires 3 + 1 + 3 cycles (without the loads and stores). But in this case the loads are probably going to make a difference, so one of the above should be faster.
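To make the comparison easier to try out, here is a minimal sketch that wraps the movehl/shuffle variant from the question and the scalar expression above in two functions. The names `hsum_sse` and `hsum_scalar` are made up for illustration, and only SSE1 intrinsics are used so that it should build on x86-64 without extra compiler flags.

```cpp
#include <xmmintrin.h>  // SSE1: __m128, _mm_add_ps, _mm_movehl_ps, _mm_shuffle_ps

// Hypothetical helper: horizontal sum of the four floats in v = (v3, v2, v1, v0).
static inline float hsum_sse(__m128 v) {
    // movehl gives (v3, v2, v3, v2); after the add the two low lanes
    // hold v1+v3 (lane 1) and v0+v2 (lane 0).
    __m128 pairs = _mm_add_ps(v, _mm_movehl_ps(v, v));
    // Bring lane 1 down to lane 0 and add it to lane 0.
    __m128 total = _mm_add_ss(pairs, _mm_shuffle_ps(pairs, pairs, 1));
    // Return the low lane as a float.
    return _mm_cvtss_f32(total);
}

// Hypothetical helper: the scalar expression, for data that is already in memory.
static inline float hsum_scalar(const float x[4]) {
    return (x[0] + x[1]) + (x[2] + x[3]);
}
```

Note that the two functions add the elements in a different order, so for general floating-point inputs the results may differ in the last bits.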
In general, horizontal operations are not what SIMD is for. That's why your last question is so important. When you have more numbers to sum up, you can do as many vertical adds as you have registers. E.g. say you have four __m128 registers a, b, c, and d. Then first you do
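[cpp]_mm_add_ps(_mm_add_ps(a, b), _mm_add_ps(c, d));[/cpp]
and then one of your hadd implementations on the result. This is now much faster than the scalar equivalent: three vertical adds plus a single horizontal reduction replace the 15 scalar additions needed for 16 floats.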
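Put together, the idea might look like the sketch below: vertical adds across the four registers first, then one horizontal reduction at the very end. `sum4` is a hypothetical name, and the reduction step uses the movehl/shuffle variant from the question (any of the horizontal-sum variants would do).

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics only

// Hypothetical helper: sum all 16 floats held in a, b, c and d.
static inline float sum4(__m128 a, __m128 b, __m128 c, __m128 d) {
    // Vertical adds: lane i of v becomes a[i] + b[i] + c[i] + d[i].
    __m128 v = _mm_add_ps(_mm_add_ps(a, b), _mm_add_ps(c, d));
    // Single horizontal reduction of the four partial sums.
    v = _mm_add_ps(v, _mm_movehl_ps(v, v));      // low lanes: v1+v3, v0+v2
    v = _mm_add_ss(v, _mm_shuffle_ps(v, v, 1));  // lane 0: sum of all four lanes
    return _mm_cvtss_f32(v);
}
```

With more data, the same pattern extends to a loop that keeps accumulating into one or more __m128 registers and only reduces horizontally once after the loop.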