Christian_M_2

Beginner

06-25-2015
03:59 AM

Dynamic Shift

Hello,

I am trying to implement a dynamic shift. Let me explain the task: I process data with SSE and AVX. Data is loaded, processed, and the results are stored later. To support arbitrary lengths, I need some kind of masked load, and I need it for SSE as well.

Suppose my length is 9 elements, and I work with int32 and SSE. The first and second loads are fine. The third load is also fine as far as memory bounds go; that is not a problem. But only element 0 of the vector register is valid, and the others need to be zero. What is the best way to achieve this?

I get the rest count by: length AND (vectorelements - 1). This would be 1 for the case with 9 elements. So I would need some shift with a variable count: start with a register filled with ones, shift in the right number of zeros, and AND this mask with the loaded data. But are there any shifts with a variable count? I did not find any. Another idea would be to fill a register with the ascending values 0,1,2,3 and do a less-than compare against the rest.

0,1,2,3 LT 1,1,1,1 = 1,0,0,0

This would be the correct mask. But I have trouble doing this in AVX, as even AVX2 does not have a full set of compare instructions. So basically I want a convenient way to implement a kind of masked load for SSE and AVX, for int32 and float. The code is allowed to load all the data; that is no problem. For AVX there is a maskload, but how do I create a mask for my problem?
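The compare idea from the question can be sketched for SSE2 roughly as follows. This is only an illustration: the names `tail_mask_cmp` and `store_lanes` are hypothetical helpers, not something from the thread, and `rest` is assumed to be the number of valid int32 elements in the last vector.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: build a mask whose first `rest` 32-bit lanes are
   all ones and whose remaining lanes are zero (rest in 0..4). */
static __m128i tail_mask_cmp(int rest)
{
    assert(rest >= 0 && rest <= 4);
    const __m128i lane_index = _mm_setr_epi32(0, 1, 2, 3);
    const __m128i count      = _mm_set1_epi32(rest);
    /* Lane i becomes 0xFFFFFFFF where i < rest:
       0,1,2,3 LT rest,rest,rest,rest */
    return _mm_cmplt_epi32(lane_index, count);
}

/* Hypothetical helper for inspection: copy the four lanes to an array. */
static void store_lanes(__m128i v, int32_t out[4])
{
    _mm_storeu_si128((__m128i *)out, v);
}
```

For float data the same mask can be reinterpreted with `_mm_castsi128_ps` and applied with `_mm_and_ps`. On AVX2 only greater-than and equality compares exist for 32-bit integers, but swapping the operands of `_mm256_cmpgt_epi32` should give the same less-than result, and that mask can then be passed to a maskload.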

6 Replies

Vladimir_Sedach

New Contributor I

06-25-2015
08:02 AM

Hi Christian,

Could it be something like this:

__m128i mask[4] =
{
    _mm_setr_epi32(-1, -1, -1, -1), // not used
    _mm_setr_epi32(-1,  0,  0,  0),
    _mm_setr_epi32(-1, -1,  0,  0),
    _mm_setr_epi32(-1, -1, -1,  0)
};

__m128i mask1 = mask[length % 4];

and a corresponding mask[8] table for AVX.
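To show how such a table might be used end to end, here is a minimal sketch that sums an int32 array of arbitrary length. The function name `sum_with_tail_mask` is hypothetical, and the sketch assumes the buffer is readable up to the next multiple of four elements, as the question allows. Unaligned loads are used so that no alignment assumption is needed.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: sum `length` int32 values.
   Assumption: `a` is readable up to the next multiple of 4 elements. */
static int32_t sum_with_tail_mask(const int32_t *a, size_t length)
{
    assert(a != NULL);
    /* Lookup table indexed by the number of valid lanes in the last vector. */
    const __m128i mask[4] = {
        _mm_setr_epi32(-1, -1, -1, -1), /* not used: no partial vector */
        _mm_setr_epi32(-1,  0,  0,  0),
        _mm_setr_epi32(-1, -1,  0,  0),
        _mm_setr_epi32(-1, -1, -1,  0)
    };
    const size_t rest = length % 4;
    const size_t full = length - rest;

    __m128i sum = _mm_setzero_si128();
    for (size_t i = 0; i < full; i += 4)
        sum = _mm_add_epi32(sum, _mm_loadu_si128((const __m128i *)(a + i)));

    if (rest != 0) {
        /* Load a whole vector, then zero the lanes past the end. */
        __m128i x = _mm_loadu_si128((const __m128i *)(a + full));
        sum = _mm_add_epi32(sum, _mm_and_si128(x, mask[rest]));
    }

    /* Horizontal add of the four lanes. */
    int32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, sum);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

The table is declared inside the function here because in C a file-scope array cannot be initialized with intrinsic calls; in C++ a global table would also work.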

Christian_M_2

Beginner

06-26-2015
01:28 AM

Hello,

Wow, that seems like a cool solution. I will test how fast it performs. The modulo will be replaced by an AND, as the vector lengths are conveniently powers of two.

Vladimir_Sedach

New Contributor I

06-26-2015
01:54 AM

Christian,

Nowadays compilers are smart enough to replace (x % 2^n) with an AND :)
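The equivalence being discussed can be written out as a small sketch. The helper name `rest_count` is hypothetical, and it assumes the lane count is a power of two and the length is unsigned.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of leftover elements when `length` is split
   into vectors of `lanes` elements. Assumes `lanes` is a power of two. */
static size_t rest_count(size_t length, size_t lanes)
{
    assert(lanes != 0 && (lanes & (lanes - 1)) == 0); /* power of two */
    return length & (lanes - 1); /* same value as length % lanes */
}
```

For example, `rest_count(9, 4)` gives 1, matching the 9-element case from the question. The single AND only matches `%` exactly for unsigned values: for a signed negative `x`, C's `%` yields a negative remainder, so it is probably simplest to keep lengths unsigned.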

Christian_M_2

Beginner

07-13-2015
02:12 AM

Yes, you are right!

I even optimized a little further for matrices: the rest is the same for every row, so before the loop I calculate the rest and get the mask. Inside the loop I process everything as usual, plus one additional if block that handles the rest using the mask. I think the single if is OK, as branch prediction should pick up that it always evaluates to the same result.
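The matrix variant described here might look roughly like the following sketch. The function `row_sums` and its parameters are hypothetical; in particular it assumes each stored row is padded to a `stride` that is a multiple of four elements, so that loading a full vector at the end of a row stays inside that row's storage.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: sum each row of a row-major int32 matrix.
   Assumption: `stride` (elements per stored row) is a multiple of 4
   and at least `cols`. */
static void row_sums(const int32_t *m, size_t rows, size_t cols,
                     size_t stride, int32_t *out)
{
    assert(stride % 4 == 0 && stride >= cols);
    const size_t rest = cols & 3;   /* identical for every row */
    const size_t full = cols - rest;
    /* Mask computed once, before the loop: lane i is all ones where i < rest. */
    const __m128i mask = _mm_cmplt_epi32(_mm_setr_epi32(0, 1, 2, 3),
                                         _mm_set1_epi32((int)rest));

    for (size_t r = 0; r < rows; ++r) {
        const int32_t *row = m + r * stride;
        __m128i sum = _mm_setzero_si128();
        for (size_t i = 0; i < full; i += 4)
            sum = _mm_add_epi32(sum, _mm_loadu_si128((const __m128i *)(row + i)));
        if (rest != 0) {            /* same outcome on every row */
            __m128i x = _mm_loadu_si128((const __m128i *)(row + full));
            sum = _mm_add_epi32(sum, _mm_and_si128(x, mask));
        }
        int32_t lanes[4];
        _mm_storeu_si128((__m128i *)lanes, sum);
        out[r] = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    }
}
```

With a 2 x 5 matrix stored at a stride of 8, the padding values never reach the sums because the mask zeroes them before the add.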

Vladimir_Sedach

New Contributor I

07-13-2015
03:36 AM

Hi Christian,

You can do even better by moving the check out of all loops:

if (!(n % 4))
    // n is a multiple of 4: no partial vector, no mask needed
    for (i1 = 0; i1 < 100000000; i1++)
    {
        for (i = 0; i < n; i += 4)
            sum = _mm_add_epi32(sum, *(__m128i *)(a + i));
    }
else
    for (i1 = 0; i1 < 100000000; i1++)
    {
        // full vectors first
        for (i = 0; i <= n - 4; i += 4)
            sum = _mm_add_epi32(sum, *(__m128i *)(a + i));

        // partial last vector: zero the invalid lanes before adding
        x = _mm_load_si128((__m128i *)(a + i));
        x = _mm_and_si128(x, mask1);
        sum = _mm_add_epi32(sum, x);
    }

Christian_M_2

Beginner

07-14-2015
04:45 AM
