xx = _mm_add_ss(xx, _mm_shuffle_ps(xx, xx, _MM_SHUFFLE(0, 0, 0, 1)));
Another way:
xx = _mm_add_ps(xx, _mm_movehl_ps(xx, xx));
xx = _mm_add_ss(xx, _mm_shuffle_ps(xx, xx, 1));
_mm_store_ss(&temp, xx);
Is there a better way? This seems like a very common operation. Are there any plans to make it native?
Also, how about summing the numbers in more than one register?
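For reference, the second variant above (movehl followed by a shuffle) can be wrapped into a self-contained function. This is only a sketch: the name `hsum_ps` is made up for illustration, and it returns the result with `_mm_cvtss_f32` rather than storing it through `_mm_store_ss`.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper (name chosen for this example only):
// returns x0 + x1 + x2 + x3 for the four floats held in xx.
static inline float hsum_ps(__m128 xx)
{
    // Add the upper two lanes onto the lower two:
    // lane 0 = x0 + x2, lane 1 = x1 + x3.
    xx = _mm_add_ps(xx, _mm_movehl_ps(xx, xx));
    // Move lane 1 into lane 0 and add: lane 0 = (x0 + x2) + (x1 + x3).
    xx = _mm_add_ss(xx, _mm_shuffle_ps(xx, xx, 1));
    // Extract lane 0 as a scalar float.
    return _mm_cvtss_f32(xx);
}
```

Note that the additions are grouped as (x0 + x2) + (x1 + x3), so for general inputs the rounding can differ slightly from a left-to-right scalar sum.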
[cpp]xx = (xx + xx) + (xx + xx)[/cpp]
It requires 3 + 1 + 3 cycles (not counting the loads and stores). In this case, though, the loads are probably going to make a difference, so one of the approaches above should be faster.
[cpp]_mm_add_ps(_mm_add_ps(a, b), _mm_add_ps(c, d));[/cpp]
and then apply one of your hadd implementations to the result. This is now much faster than the scalar equivalent.
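Put together, the multi-register case might look like the sketch below. The function name `hsum4_ps` and the choice of the movehl/shuffle reduction for the final step are assumptions made for illustration; any of the horizontal-add variants from the question would work in its place.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper (name chosen for this example only):
// sums all 16 floats held in the four registers a, b, c and d.
static inline float hsum4_ps(__m128 a, __m128 b, __m128 c, __m128 d)
{
    // Vertical adds first: three packed additions reduce 16 floats to 4.
    __m128 s = _mm_add_ps(_mm_add_ps(a, b), _mm_add_ps(c, d));
    // Then a single horizontal reduction of the remaining register.
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));      // lanes 0,1 += lanes 2,3
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));  // lane 0 += lane 1
    return _mm_cvtss_f32(s);
}
```

The point of this ordering is that the shuffle-based horizontal step is paid once per group of four registers instead of once per register.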