I would like to do a full horizontal sum. I have two variables, a and b, of type __m256d. I want to compute a[0]+a[1]+a[2]+a[3]+b[0]+b[1]+b[2]+b[3] and store the result.
In SSE I could easily do this with a = _mm_add_pd(a, b) followed by a = _mm_hadd_pd(a, unused).
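For reference, a minimal sketch of the SSE version described above. The names hsum_sse, hsum_sse_demo and SSE3_FN are hypothetical, and the target attribute is a GCC/Clang-specific assumption that lets the snippet build without -msse3; with another compiler, drop it and enable SSE3 on the command line.

```c
#include <pmmintrin.h>  /* SSE3: _mm_hadd_pd */

/* Assumption: GCC/Clang. Enables SSE3 for these functions only. */
#define SSE3_FN __attribute__((target("sse3")))

/* Returns a[0] + a[1] + b[0] + b[1]. */
static SSE3_FN double hsum_sse(__m128d a, __m128d b)
{
    __m128d s = _mm_add_pd(a, b);   /* { a0+b0, a1+b1 } */
    s = _mm_hadd_pd(s, s);          /* low lane = (a0+b0) + (a1+b1) */
    return _mm_cvtsd_f64(s);        /* extract the low lane */
}

/* Hypothetical wrapper that loads from plain arrays of 2 doubles. */
SSE3_FN double hsum_sse_demo(const double *pa, const double *pb)
{
    return hsum_sse(_mm_loadu_pd(pa), _mm_loadu_pd(pb));
}
```

Here the second operand of _mm_hadd_pd is simply s again, since only the low lane of the result is read.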
The next question is how to store only one value from an AVX register. In SSE I just used _mm_store_sd. I could cast from __m256d to __m128d and use the SSE instruction, but the Intel® Architecture Code Analyzer tells me this will incur a penalty because of the switch from AVX to SSE.
Is it correct that in AVX, _mm256_maskstore_pd with the correct mask replaces _mm_store_sd? Is there a possibility to use a fixed mask? So far I declare a variable and use it, but since the mask is static, I consider this unnecessary overhead:
__m256i storeMask = _mm256_set_epi32(0, 1<<31, 0, 0, 0, 0, 0, 0);
_mm256_maskstore_pd(&res, storeMask, a);
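One possible way to avoid the mask altogether, as a sketch under assumptions rather than a definitive answer: reduce the 256-bit register to 128 bits first, then finish with 128-bit intrinsics and _mm_store_sd. _mm256_castpd256_pd128 only reinterprets the register, and compilers normally emit the VEX-encoded forms of 128-bit intrinsics when AVX is enabled, so this should not be the legacy-SSE transition the analyzer warns about; that is worth confirming in the generated assembly. The names hsum_pd_store, hsum_pd_demo and AVX_FN are hypothetical, and the target attribute is a GCC/Clang-specific assumption (otherwise compile with AVX enabled and remove it).

```c
#include <immintrin.h>

/* Assumption: GCC/Clang. Enables AVX for these functions only. */
#define AVX_FN __attribute__((target("avx")))

/* Stores a[0]+a[1]+a[2]+a[3] + b[0]+b[1]+b[2]+b[3] to *res. */
static AVX_FN void hsum_pd_store(double *res, __m256d a, __m256d b)
{
    __m256d s  = _mm256_add_pd(a, b);           /* s[i] = a[i] + b[i] */
    __m128d lo = _mm256_castpd256_pd128(s);     /* { s0, s1 } */
    __m128d hi = _mm256_extractf128_pd(s, 1);   /* { s2, s3 } */
    __m128d t  = _mm_add_pd(lo, hi);            /* { s0+s2, s1+s3 } */
    t = _mm_add_sd(t, _mm_unpackhi_pd(t, t));   /* low lane = total */
    _mm_store_sd(res, t);                       /* write one double */
}

/* Hypothetical wrapper that loads from plain arrays of 4 doubles. */
AVX_FN double hsum_pd_demo(const double *pa, const double *pb)
{
    double res;
    hsum_pd_store(&res, _mm256_loadu_pd(pa), _mm256_loadu_pd(pb));
    return res;
}
```

The additions are grouped as (s0+s2) + (s1+s3), so the rounding can differ slightly from a strictly left-to-right scalar sum.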
One last thing: how could I achieve the same thing for __m256 (float)? There are shuffle and permute functions, but all in all it's hard to get what I want. I haven't found a way so far.
I would appreciate any tips and hints.
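For the float case, the same narrowing idea might look like the sketch below: fold the upper 128 bits onto the lower, then the upper 64 bits onto the lower, then lane 1 onto lane 0. The names hsum_ps, hsum_ps_demo and AVX_FN are hypothetical, and the target attribute is again a GCC/Clang-specific assumption.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang. Enables AVX for these functions only. */
#define AVX_FN __attribute__((target("avx")))

/* Returns v[0] + v[1] + ... + v[7]. */
static AVX_FN float hsum_ps(__m256 v)
{
    __m128 lo = _mm256_castps256_ps128(v);          /* v0..v3 */
    __m128 hi = _mm256_extractf128_ps(v, 1);        /* v4..v7 */
    __m128 s  = _mm_add_ps(lo, hi);                 /* four partial sums */
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));         /* lanes 0,1: s0+s2, s1+s3 */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 0x55));  /* 0x55 selects lane 1; lane 0 = total */
    return _mm_cvtss_f32(s);                        /* extract lane 0 */
}

/* Hypothetical wrapper: sums all 8 floats of pa and all 8 floats of pb. */
AVX_FN float hsum_ps_demo(const float *pa, const float *pb)
{
    __m256 s = _mm256_add_ps(_mm256_loadu_ps(pa), _mm256_loadu_ps(pb));
    return hsum_ps(s);
}
```

Each step halves the number of live lanes, so eight floats need three additions after the initial a + b.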