Intel® ISA Extensions

Dynamic Shift

Christian_M_2
Beginner
1,691 Views

Hello,

I am trying to implement a dynamic shift. Let me explain the task: I process data with SSE and AVX. Data is loaded, processed, and the results are stored later. To support arbitrary lengths, I need some kind of maskload, but for SSE as well.

Suppose my length is 9 elements and I work with int32 and SSE. The first and second loads are fine. The third load is also fine with respect to memory bounds; that is not a problem. But only element 0 of the vector register is valid; the others need to be zero. What is the best way to achieve this?

I get the rest count by: length AND (vectorelements - 1). This is 1 for the case with 9 elements. So I would need some shift with a variable count: start with a register filled with ones, shift in the right number of zeros, and AND this mask with the loaded data. But are there any shifts with a variable count? I did not find any. Another idea would be to fill a register with the ascending values 0,1,2,3 and do a less-than compare with the rest.

0,1,2,3 LT 1,1,1,1 = 1,0,0,0

This would be the correct mask. But I have trouble doing this in AVX, as even AVX2 does not have a full set of compare instructions. So basically I want a convenient way to implement a kind of masked load for SSE and AVX, for int32 and float. The code is allowed to load all the data; that is no problem. For AVX there is a maskload, but how do I create a mask for my problem?
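The compare idea from the question can be sketched with SSE2 alone. SSE2 has a signed greater-than compare for int32, and "index < rest" is the same test as "rest > index" with the operands swapped. The helper names `tail_mask_epi32` and `load_tail_epi32` are hypothetical and chosen only for this illustration; the load still reads all 16 bytes, which the question states is acceptable.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: lanes whose index is < rest become all-ones,
   the remaining lanes become zero. */
static __m128i tail_mask_epi32(int rest)
{
    assert(rest >= 0 && rest <= 4);
    const __m128i lane_index = _mm_setr_epi32(0, 1, 2, 3);
    const __m128i count = _mm_set1_epi32(rest);
    /* lane_index < count  is the same as  count > lane_index */
    return _mm_cmpgt_epi32(count, lane_index);
}

/* Hypothetical helper: load four int32 values and zero the lanes at or
   beyond rest. All 16 bytes at p are read, so they must be readable. */
static __m128i load_tail_epi32(const int *p, int rest)
{
    const __m128i x = _mm_loadu_si128((const __m128i *)p);
    return _mm_and_si128(x, tail_mask_epi32(rest));
}
```

The same pattern should carry over to 256-bit registers, since AVX2 also provides a greater-than compare for int32 (`_mm256_cmpgt_epi32`), and the resulting mask has the sign-bit form that the AVX maskload instructions expect.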

6 Replies
Vladimir_Sedach
New Contributor I

Hi Christian,

Could it be something like this:

__m128i    mask[4] =
{
    _mm_setr_epi32(-1, -1, -1, -1), //not used
    _mm_setr_epi32(-1, 0, 0, 0),
    _mm_setr_epi32(-1, -1, 0, 0),
    _mm_setr_epi32(-1, -1, -1, 0)
};

__m128i    mask1 = mask[length % 4];

and mask[8] for AVX.
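A variation on the table above, not taken from this thread, is to keep a single array of ones followed by zeros and load the mask from a shifted position with an unaligned load. This is only a sketch: the names `mask_window` and `window_mask_epi32` are hypothetical, and unlike the table above, a rest of 0 yields an all-zero mask.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Four all-ones entries followed by four zero entries. */
static const int32_t mask_window[8] = { -1, -1, -1, -1, 0, 0, 0, 0 };

/* Hypothetical helper: the first `rest` lanes are all-ones, the others
   zero, for rest in 0..4. */
static __m128i window_mask_epi32(int rest)
{
    assert(rest >= 0 && rest <= 4);
    /* Starting at index 4 - rest picks up exactly `rest` ones,
       followed by zeros. */
    return _mm_loadu_si128((const __m128i *)(mask_window + 4 - rest));
}
```

For 256-bit registers the same idea would need a 16-entry array (eight ones, eight zeros) instead of eight separate mask vectors.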

 

Christian_M_2
Beginner

Hello,

Wow, that seems like a cool solution. I will test how fast it performs. And the modulo will be replaced by an AND, as vector lengths are happily a power of two.

Vladimir_Sedach
New Contributor I

Christian,

Nowadays compilers are happily smart enough to replace (x % 2^n) with an AND :)

Christian_M_2
Beginner

Yes, you are right!

I even optimized a little more for matrices: the rest is the same for each row, so before the loop I calculate the rest and get the mask. In the loop I process everything normally, with one additional if block that handles the rest and uses the mask. I think the one if is OK, as branch prediction should recognize that it always evaluates to the same result.
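A minimal sketch of that matrix variant, under assumptions that are not from the thread: a row-major int matrix, a hypothetical function name `row_sums`, and a `stride` (elements between row starts) padded to at least the next multiple of 4 so that the tail load stays inside the allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical example: write the sum of each row of m into out. */
static void row_sums(const int *m, int rows, int cols, int stride, int *out)
{
    /* Assumption: every row has readable storage up to a multiple of 4. */
    assert(stride >= ((cols + 3) & ~3));

    const int rest = cols & 3;      /* same for every row */
    const int full = cols - rest;   /* elements covered by full vectors */
    const __m128i masks[4] = {
        _mm_setzero_si128(),            /* rest == 0: not used */
        _mm_setr_epi32(-1, 0, 0, 0),
        _mm_setr_epi32(-1, -1, 0, 0),
        _mm_setr_epi32(-1, -1, -1, 0)
    };
    const __m128i mask = masks[rest];   /* looked up once, before the loop */

    for (int r = 0; r < rows; r++) {
        const int *row = m + (size_t)r * (size_t)stride;
        __m128i sum = _mm_setzero_si128();

        for (int i = 0; i < full; i += 4)
            sum = _mm_add_epi32(sum, _mm_loadu_si128((const __m128i *)(row + i)));

        if (rest) {  /* evaluates the same way for every row */
            __m128i x = _mm_loadu_si128((const __m128i *)(row + full));
            sum = _mm_add_epi32(sum, _mm_and_si128(x, mask));
        }

        /* Horizontal add of the four lanes. */
        int lanes[4];
        _mm_storeu_si128((__m128i *)lanes, sum);
        out[r] = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    }
}
```

The padding values in each row can be arbitrary, because the mask zeroes them before they reach the sum.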

 

Vladimir_Sedach
New Contributor I

Hi Christian,

You can do even better moving the check out of all loops:

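        // a must be 16-byte aligned; mask1 = mask[n % 4] is computed once beforehand
        if (!(n % 4))
            // n is a multiple of 4: full vectors only, no tail handling
            for (i1 = 0; i1 < 100000000; i1++)
            {
                for (i = 0; i < n; i += 4)
                    sum = _mm_add_epi32(sum, *(__m128i *)(a + i));
            }
        else
            for (i1 = 0; i1 < 100000000; i1++)
            {
                // full vectors
                for (i = 0; i <= n - 4; i += 4)
                    sum = _mm_add_epi32(sum, *(__m128i *)(a + i));

                // tail: load a whole vector and zero the invalid lanes
                x = _mm_load_si128((__m128i *)(a + i));
                x = _mm_and_si128(x, mask1);
                sum = _mm_add_epi32(sum, x);
            }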

Christian_M_2
Beginner

You are right. Then it is optimal for lengths that are a multiple of the increment, and it does not require an if in the other case.
