Hello,

I am trying to achieve a dynamic shift. Let me explain the task. I process data with SSE and AVX: data is loaded, processed, and the results are stored later. To support arbitrary lengths, I need some kind of maskload, but for SSE as well.

Suppose my length is 9 elements, and I work with int32 and SSE. The first and second loads are fine. The third load is also fine with respect to memory bounds; that is no problem. But only element 0 in the vector register is valid; the others need to be zero. What is the best way to achieve this?

I get the remainder count with: length AND (vectorelements - 1). For 9 elements this gives 1. So I would need a shift with a variable count: start with a register filled with ones, shift in the right number of zeros, and AND that mask with the loaded data. But are there any shifts with a variable count? I did not find them. Another idea would be to fill a register with the ascending values 0,1,2,3 and do a less-than compare against the remainder.

0,1,2,3 LT 1,1,1,1 = 1,0,0,0

This would be the correct mask. But I have trouble doing this in AVX, as even AVX2 does not have a full set of compare instructions. So basically I want a convenient way to implement a kind of masked load for SSE and AVX, for int32 and float. The code is allowed to load all the data; that is no problem. For AVX there is a maskload, but how do I create a mask for my problem?
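The compare idea described above can be sketched for SSE2, which provides a signed less-than compare for int32 (`_mm_cmplt_epi32`). This is only a minimal sketch: the helper names `tail_count`, `tail_mask_cmp`, and `tail_mask_bits` are hypothetical and not part of any library.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: leftover elements when processing 4 int32 lanes at a time. */
static int tail_count(int length)
{
    return length & 3;  /* equals length % 4 for non-negative length */
}

/* Hypothetical helper: lanes [0, rest) are all ones, the remaining lanes are zero. */
static __m128i tail_mask_cmp(int rest)
{
    assert(rest >= 0 && rest <= 4);
    const __m128i lane = _mm_setr_epi32(0, 1, 2, 3);      /* 0,1,2,3 */
    return _mm_cmplt_epi32(lane, _mm_set1_epi32(rest));   /* lane < rest */
}

/* Hypothetical helper for inspection: bit i is set when lane i is selected. */
static int tail_mask_bits(int rest)
{
    return _mm_movemask_ps(_mm_castsi128_ps(tail_mask_cmp(rest)));
}
```

The resulting mask can be ANDed with loaded int32 data, or reinterpreted with `_mm_castsi128_ps` for float data. With AVX2, the same comparison should be expressible with `_mm256_cmpgt_epi32` by swapping the operands, since a < b is the same as b > a.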

- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing

Hi Christian,

Could it be something like this:

__m128i mask[4] =
{
    _mm_setr_epi32(-1, -1, -1, -1), // not used
    _mm_setr_epi32(-1,  0,  0,  0), // 1 valid element
    _mm_setr_epi32(-1, -1,  0,  0), // 2 valid elements
    _mm_setr_epi32(-1, -1, -1,  0)  // 3 valid elements
};

__m128i mask1 = mask[length % 4];

and a corresponding mask[8] table for AVX.
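A related variant, which is not the table from this reply but a different technique, stores a run of ones followed by a run of zeros and loads the mask from a sliding offset. The sketch below assumes SSE2; the names `mask_window`, `tail_mask_window`, and `tail_mask_window_bits` are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Four all-ones entries followed by four zero entries. */
static const int32_t mask_window[8] = { -1, -1, -1, -1, 0, 0, 0, 0 };

/* Hypothetical helper: the first `rest` lanes are all ones, the others zero. */
static __m128i tail_mask_window(int rest)
{
    assert(rest >= 0 && rest <= 4);
    /* Starting at index 4 - rest puts exactly `rest` ones in the low lanes. */
    return _mm_loadu_si128((const __m128i *)(mask_window + 4 - rest));
}

/* Hypothetical helper for inspection: bit i is set when lane i is selected. */
static int tail_mask_window_bits(int rest)
{
    return _mm_movemask_ps(_mm_castsi128_ps(tail_mask_window(rest)));
}
```

The same layout would presumably extend to AVX with eight ones followed by eight zeros and a 256-bit unaligned load, which keeps the table at 16 integers instead of eight full vectors.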

Hello,

Wow, that looks like a cool solution. I will test how fast it performs. The modulo will be replaced by an AND, since vector lengths are happily a power of two.

Christian,

Nowadays compilers are happily smart enough to replace (x % 2^n) with an AND :)
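As a small illustration of the equivalence being discussed, the sketch below computes the remainder with an AND for an unsigned length. The helper name `rest_of` is hypothetical. One caveat worth keeping in mind: for negative signed values the two forms differ in C, because `%` truncates toward zero.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: remainder of length by a power-of-two lane count. */
static uint32_t rest_of(uint32_t length, uint32_t lanes)
{
    /* lanes must be a non-zero power of two for the AND form to be valid */
    assert(lanes != 0 && (lanes & (lanes - 1)) == 0);
    return length & (lanes - 1);  /* same value as length % lanes */
}
```

For lengths that can never be negative, using an unsigned type (or simply trusting the compiler with `%`) avoids the signed special case.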

Yes, you are right!

I even optimized a little more for matrices: the remainder is the same for each row, so before the loop I calculate the remainder and get the mask. In the loop I process everything normally, plus one additional if block that handles the remainder and uses the mask. I think the one if is OK, as branch prediction should recognize that it always evaluates to the same result.
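The matrix variant described above might look roughly like the following sketch. It is not the poster's actual code: the function name `matrix_sum`, the row-major layout with a `stride`, and the use of unaligned loads are all assumptions. It also assumes every row may safely be read up to the next multiple of 4 columns, as the thread allows.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: sums a rows x cols int32 matrix with `stride` elements
   per row. Each row must be readable up to cols rounded up to a multiple of 4. */
static int32_t matrix_sum(const int32_t *m, int rows, int cols, int stride)
{
    assert(rows >= 0 && cols >= 0 && stride >= ((cols + 3) & ~3));
    const int rest = cols & 3;       /* identical for every row */
    const int full = cols - rest;    /* columns covered by full vectors */
    /* Mask computed once, before the row loop: lanes [0, rest) are all ones. */
    const __m128i mask = _mm_cmplt_epi32(_mm_setr_epi32(0, 1, 2, 3),
                                         _mm_set1_epi32(rest));
    __m128i sum = _mm_setzero_si128();
    for (int r = 0; r < rows; r++) {
        const int32_t *row = m + (size_t)r * (size_t)stride;
        int c;
        for (c = 0; c < full; c += 4)
            sum = _mm_add_epi32(sum, _mm_loadu_si128((const __m128i *)(row + c)));
        if (rest) {  /* loop-invariant, so it should predict well */
            __m128i x = _mm_loadu_si128((const __m128i *)(row + c));
            sum = _mm_add_epi32(sum, _mm_and_si128(x, mask));
        }
    }
    int32_t lanes[4];  /* horizontal add of the four partial sums */
    _mm_storeu_si128((__m128i *)lanes, sum);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

For example, a 2 x 5 matrix stored with a stride of 8 is summed with one full vector and one masked vector per row.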

Hi Christian,

You can do even better by moving the check out of all loops:

if (!(n % 4))
    // n is a multiple of 4: full vectors only, no mask needed
    for (i1 = 0; i1 < 100000000; i1++)
    {
        for (i = 0; i < n; i += 4)
            sum = _mm_add_epi32(sum, *(__m128i *)(a + i));
    }
else
    for (i1 = 0; i1 < 100000000; i1++)
    {
        // full vectors first
        for (i = 0; i <= n - 4; i += 4)
            sum = _mm_add_epi32(sum, *(__m128i *)(a + i));
        // tail: load a full vector and zero the lanes past n
        x = _mm_load_si128((__m128i *)(a + i));
        x = _mm_and_si128(x, mask1);
        sum = _mm_add_epi32(sum, x);
    }
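A self-contained version of the loop above could be sketched as follows. The function names `array_sum` and `hsum_epi32` are hypothetical, the outer repetition loop (which looks like a benchmark harness) is left out, and unaligned loads are used so the sketch does not depend on 16-byte alignment. As in the thread, the buffer is assumed to be readable up to the next multiple of 4 elements.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: adds the four int32 lanes of v. */
static int32_t hsum_epi32(__m128i v)
{
    int32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, v);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* Hypothetical sketch: sums a[0..n-1]; a must be readable up to n rounded up to 4. */
static int32_t array_sum(const int32_t *a, int n)
{
    assert(n >= 0);
    __m128i sum = _mm_setzero_si128();
    int i;
    if (!(n % 4)) {
        /* n is a multiple of 4: full vectors only, no mask needed */
        for (i = 0; i < n; i += 4)
            sum = _mm_add_epi32(sum, _mm_loadu_si128((const __m128i *)(a + i)));
    } else {
        /* mask table as suggested in the thread; entry 0 is never used here */
        const __m128i mask[4] = {
            _mm_setr_epi32(-1, -1, -1, -1),
            _mm_setr_epi32(-1,  0,  0,  0),
            _mm_setr_epi32(-1, -1,  0,  0),
            _mm_setr_epi32(-1, -1, -1,  0)
        };
        const __m128i mask1 = mask[n % 4];
        for (i = 0; i <= n - 4; i += 4)
            sum = _mm_add_epi32(sum, _mm_loadu_si128((const __m128i *)(a + i)));
        /* tail: load a full vector and zero the lanes past n */
        __m128i x = _mm_loadu_si128((const __m128i *)(a + i));
        sum = _mm_add_epi32(sum, _mm_and_si128(x, mask1));
    }
    return hsum_epi32(sum);
}
```

With a 12-element buffer holding 1 through 9 followed by padding, summing the first 9 elements takes two full vectors plus one masked tail vector.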

You are right. Then it is optimal for lengths that are a multiple of the increment, and the other case does not require an if either.
