Christian_M_2 · ‎06-25-2015

Hello,

I am trying to achieve a dynamic shift. Well, let me explain the task. I process data with SSE, AVX. Data gets loaded, worked with and later results are stored. To support arbitrary lengths, I need some kind of maskload, but also for SSE.

Suppose my length is 9 elements, and I work with int32 and SSE. The first and second loads are fine. The third load is also fine as far as memory bounds are concerned; that is no problem. But only element 0 in the vector register is valid; the others need to be zero. How do I best achieve this?

I get the rest count by: length AND (vectorelements - 1). This would be 1 for the case with 9 elements. So I would need some shift with variable count: start with a register filled with ones, shift in the right number of zeros, and AND the resulting mask with the loaded data. But are there any shifts with variable count? I did not find them. Another idea would be to fill a register with ascending values 0,1,2,3 and do a less-than compare with the rest.

0,1,2,3 LT 1,1,1,1 = 1,0,0,0

This would be the correct mask. But I have trouble doing this in AVX, as even AVX2 has no full set of compare instructions. So basically I want a convenient way to implement a kind of masked load for SSE and AVX, for int32 and float. The code would be allowed to load all data, that is no problem. For AVX there is a maskload, but how do I create a mask for my problem?
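The compare idea from the question can be sketched with SSE2 alone. Only a greater-than compare is needed, because "lane < rest" is the same as "rest > lane" with the operands swapped. The helper name `tail_mask_cmp` is hypothetical. AVX2 also provides `_mm256_cmpgt_epi32`, so the same operand swap should carry over to 8 lanes, though that variant is not shown here.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: lanes 0..rest-1 become all ones (-1),
   the remaining lanes become 0. */
static __m128i tail_mask_cmp(int rest)
{
    assert(rest >= 0 && rest <= 4);
    const __m128i lane = _mm_setr_epi32(0, 1, 2, 3);  /* ascending lane index */
    const __m128i r    = _mm_set1_epi32(rest);        /* rest in every lane */
    /* "lane < rest" written as "rest > lane" */
    return _mm_cmpgt_epi32(r, lane);
}
```

With rest = 1 this gives the 1,0,0,0 pattern from the example above, with "1" meaning a lane of all-one bits.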

Vladimir_Sedach · ‎06-25-2015

Hi Christian,

Could it be something like this:

__m128i   mask[4] =
{
   _mm_setr_epi32(-1, -1, -1, -1), //not used
   _mm_setr_epi32(-1, 0, 0, 0),
   _mm_setr_epi32(-1, -1, 0, 0),
   _mm_setr_epi32(-1, -1, -1, 0)
};

__m128i mask1 = mask[length % 4];

and mask[8] for AVX.
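For reference, a minimal sketch of how the table could be wrapped into a masked tail load for both int32 and float. The helper names (`tail_mask_lut`, `load_tail`, `load_tail_ps`) are made up for illustration, and the sketch assumes, as stated in the question, that reading the full 16 bytes is allowed. Unaligned loads are used so that no alignment assumption is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: mask for the last, partially filled vector. */
static __m128i tail_mask_lut(unsigned length)
{
    const __m128i mask[4] = {
        _mm_setr_epi32(-1, -1, -1, -1),  /* rest 0: full vector */
        _mm_setr_epi32(-1,  0,  0,  0),
        _mm_setr_epi32(-1, -1,  0,  0),
        _mm_setr_epi32(-1, -1, -1,  0)
    };
    return mask[length % 4u];
}

/* Load four int32 values and zero the lanes beyond the rest. */
static __m128i load_tail(const int *p, unsigned length)
{
    assert(p != NULL);
    __m128i x = _mm_loadu_si128((const __m128i *)p);
    return _mm_and_si128(x, tail_mask_lut(length));
}

/* Same for float: reinterpret the integer mask as float bits. */
static __m128 load_tail_ps(const float *p, unsigned length)
{
    assert(p != NULL);
    return _mm_and_ps(_mm_loadu_ps(p), _mm_castsi128_ps(tail_mask_lut(length)));
}
```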

Christian_M_2 · ‎06-26-2015

Hello,

Wow, that seems like a cool solution. I will test how fast it performs. And the modulo will be replaced by an AND, as vector lengths are luckily powers of two.

Vladimir_Sedach · ‎06-26-2015

Christian,

Nowadays compilers are smart enough to replace (x % 2^n) by AND :)
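A small sketch of the equivalence, with a hypothetical helper `rest_count`. For unsigned values, `length % lanes` and `length & (lanes - 1)` give the same result whenever `lanes` is a power of two. For signed types, `%` can yield negative results in C, so the compiler has to emit a little more than a single AND; using an unsigned length avoids that.

```c
#include <assert.h>

/* Hypothetical helper: rest count for a vector of `lanes` elements.
   lanes must be a power of two (4 for SSE int32, 8 for AVX int32). */
static unsigned rest_count(unsigned length, unsigned lanes)
{
    assert(lanes != 0 && (lanes & (lanes - 1)) == 0);
    return length & (lanes - 1);  /* same value as length % lanes */
}
```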

Christian_M_2 · ‎07-13-2015

Yes, you are right!

I even optimized a little bit more for matrices: the rest is the same for each row, so before the loop I calculate the rest and get the mask. In the loop I process everything normally, plus one additional if block that handles the rest and uses the mask. I think the one if is OK, as branch prediction should realize that it always evaluates to the same result.
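Roughly, the matrix variant described above might look like the following sketch. The function name `matrix_sum`, the row-major layout, and the use of a sum as the per-element work are assumptions for illustration. It also assumes the buffer has at least 3 ints of padding after the last element, so the final 16-byte load stays inside the allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 */

/* Sum all elements of a rows x cols int32 matrix stored row by row.
   Assumption: at least 3 ints of readable padding follow the data. */
static int matrix_sum(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    const size_t rest = cols & 3;   /* identical for every row */
    const size_t full = cols - rest;
    /* all ones in the first `rest` lanes, computed once before the loops */
    const __m128i mask = _mm_cmpgt_epi32(_mm_set1_epi32((int)rest),
                                         _mm_setr_epi32(0, 1, 2, 3));
    __m128i sum = _mm_setzero_si128();

    for (size_t r = 0; r < rows; ++r) {
        const int *row = m + r * cols;
        size_t i;
        for (i = 0; i < full; i += 4)
            sum = _mm_add_epi32(sum, _mm_loadu_si128((const __m128i *)(row + i)));
        if (rest) {  /* evaluates the same way on every row */
            __m128i x = _mm_loadu_si128((const __m128i *)(row + i));
            sum = _mm_add_epi32(sum, _mm_and_si128(x, mask));
        }
    }

    int lanes[4];
    _mm_storeu_si128((__m128i *)lanes, sum);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

The tail load of one row reads into the start of the next row (or into the padding for the last row); the mask zeroes those lanes before they are added.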

Vladimir_Sedach · ‎07-13-2015

Hi Christian,

You can do even better moving the check out of all loops:

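       if (!(n % 4))
           // n is a multiple of 4: full vectors only, no mask needed
           for (i1 = 0; i1 < 100000000; i1++)
           {
               for (i = 0; i < n; i += 4)
                   sum = _mm_add_epi32(sum, *(__m128i *)(a + i));
           }
       else
           // n has a rest: full vectors first, then one masked tail vector
           for (i1 = 0; i1 < 100000000; i1++)
           {
               for (i = 0; i <= n - 4; i += 4)
                   sum = _mm_add_epi32(sum, *(__m128i *)(a + i));

               // i now points at the tail; mask1 zeroes the invalid lanes
               x = _mm_load_si128((__m128i *)(a + i));
               x = _mm_and_si128(x, mask1);
               sum = _mm_add_epi32(sum, x);
           }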

Christian_M_2 · ‎07-14-2015

You are right, then it's optimal for multiples of inc and does not require an if in the other case.

Dynamic Shift