__m128 xx which contains the values (xx3, xx2, xx1, xx0)
and I would like to compute xx0 + xx1 + xx2 + xx3.
Right now, I can use:
xx = _mm_hadd_ps(xx, _mm_setzero_ps()); // to get (0, 0, xx3+xx2, xx1+xx0)
xx = _mm_add_ss(xx, _mm_shuffle_ps(xx, xx, _MM_SHUFFLE(0, 0, 0, 1)));
Another way:
xx = _mm_add_ps(xx, _mm_movehl_ps(xx, xx));
xx = _mm_add_ss(xx, _mm_shuffle_ps(xx, xx, 1));
_mm_store_ss(&temp, xx);
Is there a better way? This seems like a very common operation. Are there any plans to make it native?
Also, how about summing numbers held in more than one register?
The second requires 1 + 3 + 1 + 3 cycles (again without the store).
Here's another one, if you have xx in memory:
[cpp](xx[0] + xx[1]) + (xx[2] + xx[3])[/cpp]It requires 3 + 1 + 3 cycles (without the loads and stores). But in this case the loads are probably going to make a difference, so one of the above should be faster.
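For reference, the register-only variant (movehl followed by shuffle) can be wrapped into a small function. This is only a sketch: the name `hsum_ps` is made up for illustration, and it sticks to SSE1 intrinsics so it does not depend on `_mm_hadd_ps` (SSE3) being available.

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics

// Hypothetical helper: returns v0 + v1 + v2 + v3 for v = (v3, v2, v1, v0).
static inline float hsum_ps(__m128 v) {
    // movehl copies the upper two lanes into the lower two, so after the add
    // lane 0 holds v0 + v2 and lane 1 holds v1 + v3.
    __m128 sums = _mm_add_ps(v, _mm_movehl_ps(v, v));
    // Move lane 1 into lane 0 and add the two partial sums (scalar add).
    __m128 total = _mm_add_ss(sums, _mm_shuffle_ps(sums, sums, _MM_SHUFFLE(0, 0, 0, 1)));
    // Extract lane 0 as a float.
    return _mm_cvtss_f32(total);
}
```

Note that the summation order here is (v0 + v2) + (v1 + v3), so rounding may differ slightly from a strict left-to-right scalar sum.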
In general, horizontal operations are not what SIMD is for. That's why your last question is so important. When you have more numbers to sum up, you can do as many vertical adds as you have registers. E.g. you have four __m128 registers a, b, c, and d. Then first you do
[cpp]_mm_add_ps(_mm_add_ps(a, b), _mm_add_ps(c, d));[/cpp]and then one of your hadd implementations. This is now much faster than the scalar equivalent.
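Putting the two steps together, a minimal sketch for summing 16 floats could look like the following. The function name `sum16` and the use of unaligned loads are assumptions made for illustration; the horizontal step is the movehl/shuffle variant from the question.

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics

// Hypothetical helper: sums 16 consecutive floats starting at p.
static inline float sum16(const float* p) {
    // Load four registers of four floats each (unaligned loads, to be safe).
    __m128 a = _mm_loadu_ps(p);
    __m128 b = _mm_loadu_ps(p + 4);
    __m128 c = _mm_loadu_ps(p + 8);
    __m128 d = _mm_loadu_ps(p + 12);
    // Vertical adds: each lane now holds the sum of four inputs.
    __m128 v = _mm_add_ps(_mm_add_ps(a, b), _mm_add_ps(c, d));
    // A single horizontal sum at the very end.
    v = _mm_add_ps(v, _mm_movehl_ps(v, v));
    v = _mm_add_ss(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 1)));
    return _mm_cvtss_f32(v);
}
```

Only one horizontal reduction is paid for all 16 values, which is where the gain over the scalar version comes from.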