I'm comparing two programs, one written using SSE and the other using AVX. My aim is to show that the AVX version runs twice as fast, but I'm losing something like 20% to some shift operations.
I frequently need to rotate an AVX vector one byte to the left. It seems all the instructions I need will only be available with AVX2.
Currently I'm splitting the source __m256i vector into two __m128i halves, but this way I'm losing performance. Is there any other way to perform this operation? Why were shift operations not included in the AVX instruction set?
Thanks in advance for your help. Here's the current version of my code:
[cpp]
__m128i a1, a2, b1, b2;
a1 = _mm256_castsi256_si128( _source );       // low 128-bit lane
a2 = _mm256_extractf128_si256( _source, 1 );  // high 128-bit lane
b1 = _mm_slli_si128( a1, 1 );                 // low lane shifted left one byte
b2 = _mm_slli_si128( a2, 1 );                 // high lane shifted left one byte
a1 = _mm_srli_si128( a1, 15 );                // old top byte of low lane -> byte 0
a2 = _mm_srli_si128( a2, 15 );                // old top byte of high lane -> byte 0
_dest = _mm256_castsi128_si256( _mm_or_si128( b1, a2 ) );            // new low lane
_dest = _mm256_insertf128_si256( _dest, _mm_or_si128( b2, a1 ), 1 ); // new high lane
[/cpp]
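For reference, a scalar version of the same one-byte left rotate is handy for checking the intrinsic code against (a minimal sketch; the helper name is illustrative, not from my actual code):
[cpp]
#include <stdint.h>
#include <string.h>

// Reference rotate of a 256-bit value left by one byte:
// byte 0 takes the old top byte, every other byte moves up by one.
static void rotl256_1byte_ref(uint8_t dst[32], const uint8_t src[32])
{
    dst[0] = src[31];
    memcpy(dst + 1, src, 31);
}
[/cpp]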
So if I want to left-shift a __m256 floating-point array, there is no way to do it using AVX instructions?
I was wondering if a combination of _mm256_shuffle_ps and _mm256_permute_ps would do it. Is that possible? If so, I just could not understand the meaning of the _MM_SHUFFLE macro in the context of _mm256_permute_ps; how can I use it?
Or, if you still think using 128-bit is better, which 128-bit intrinsics should I use for that?
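As an aside on _MM_SHUFFLE: it just packs four 2-bit source indices into one immediate, _MM_SHUFFLE(d, c, b, a) == (d << 6) | (c << 4) | (b << 2) | a, and _mm256_permute_ps applies those indices within each 128-bit lane independently. A minimal sketch (values chosen purely for illustration):
[cpp]
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256 v = _mm256_setr_ps(0, 1, 2, 3, 4, 5, 6, 7);
    // Per lane: dst[0]=src[3], dst[1]=src[0], dst[2]=src[1], dst[3]=src[2]
    __m256 p = _mm256_permute_ps(v, _MM_SHUFFLE(2, 1, 0, 3));

    float out[8];
    _mm256_storeu_ps(out, p);
    for (int i = 0; i < 8; i++)
        printf("%.0f ", out[i]);   // prints: 3 0 1 2 7 4 5 6
    printf("\n");
    return 0;
}
[/cpp]
Note the per-lane behavior: elements never cross the 128-bit boundary, which is exactly why a shuffle/permute alone cannot implement the full 256-bit shift.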
Sergey Kostrov wrote:
Is it a cyclical operation or not? Does it mean that, in the case of a vector of N elements, element[0] should be moved to element[N-1]?
No, in my case it is not cyclical; element[0] is not needed any more.
rmendes.silva::
This shifts left by 4 bytes:
[cpp]
__m256i r = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8), r1, r2;
r1 = _mm256_slli_si256(r, 4);
r2 = _mm256_srli_si256(r, 12);
r2 = _mm256_permute2f128_si256(r2, r2, 0x08);
r = _mm256_or_si256(r1, r2);
[/cpp]
With AVX2:
[cpp]
r1 = _mm256_permute2f128_si256(r, r, 0x08);
r = _mm256_alignr_epi8(r, r1, 12);
[/cpp]
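A quick harness to check the 4-byte shift (a sketch of mine, assuming an AVX2-capable compiler, since the 256-bit integer intrinsics carry the instruction-set caveat discussed below):
[cpp]
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256i r = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8), r1, r2;
    r1 = _mm256_slli_si256(r, 4);                  // per-lane shift left 4 bytes
    r2 = _mm256_srli_si256(r, 12);                 // isolate top dword of each lane
    r2 = _mm256_permute2f128_si256(r2, r2, 0x08);  // move low lane up, zero low lane
    r = _mm256_or_si256(r1, r2);                   // stitch the lanes together

    int out[8];
    _mm256_storeu_si256((__m256i *)out, r);
    for (int i = 0; i < 8; i++)
        printf("%d ", out[i]);                     // prints: 0 1 2 3 4 5 6 7
    printf("\n");
    return 0;
}
[/cpp]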
OK Vladimir, but this is for integer values. What about floating point? I didn't find anything to do the same with floats.
rmendes.silva::
It's almost the same with floats:
AVX:
[cpp]
__m256 r, r1, r2;
r1 = _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(r), 4));
r2 = _mm256_castsi256_ps(_mm256_srli_si256(_mm256_castps_si256(r), 12));
r2 = _mm256_permute2f128_ps(r2, r2, 0x08);
r = _mm256_or_ps(r1, r2);
[/cpp]
AVX2:
[cpp]
r1 = _mm256_permute2f128_ps(r, r, 0x08);
r = _mm256_castsi256_ps(_mm256_alignr_epi8(_mm256_castps_si256(r), _mm256_castps_si256(r1), 12));
[/cpp]
If you need doubles, replace 4, 12 and 12 by 8, 8 and 8.
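Spelling out that double variant with the integer/double casts included (a sketch of what the 8/8/8 substitution looks like; the 256-bit integer shifts here have the same instruction-set caveat discussed just below):
[cpp]
__m256d r, r1, r2;
r1 = _mm256_castsi256_pd(_mm256_slli_si256(_mm256_castpd_si256(r), 8));
r2 = _mm256_castsi256_pd(_mm256_srli_si256(_mm256_castpd_si256(r), 8));
r2 = _mm256_permute2f128_pd(r2, r2, 0x08);  // move low lane up, zero low lane
r = _mm256_or_pd(r1, r2);
[/cpp]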
In the Intel documentation _mm256_slli_si256 appears only in the AVX2 section, so I think it is not available in AVX, is it?
You're right. Perhaps it's not a good idea to do it with AVX at all.
I'm afraid this version won't be too fast:
[cpp]
__m256 r;
__m128 r0, r1;
r0 = _mm256_castps256_ps128(r);                         // low lane
r1 = _mm256_extractf128_ps(r, 1);                       // high lane
r1 = _mm_insert_ps(r1, r0, 0xF0);                       // r1[3] = r0[3]
r0 = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(r0), 4));
r1 = _mm_shuffle_ps(r1, r1, _MM_SHUFFLE(2, 1, 0, 3));   // rotate
r = _mm256_insertf128_ps(_mm256_castps128_ps256(r0), r1, 1);
[/cpp]
If you don't care about r[0]:
[cpp]
r = _mm256_shuffle_ps(r, r, _MM_SHUFFLE(2, 1, 0, 3));   // rotate each lane
r0 = _mm256_castps256_ps128(r);
r1 = _mm256_extractf128_ps(r, 1);
r1 = _mm_move_ss(r1, r0);                               // r1[0] = r0[0]
r = _mm256_insertf128_ps(_mm256_castps128_ps256(r0), r1, 1);
[/cpp]
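A quick way to sanity-check the first version above (a minimal sketch): load 1..8, shift left by one element, and expect 0 1 2 3 4 5 6 7:
[cpp]
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256 r = _mm256_setr_ps(1, 2, 3, 4, 5, 6, 7, 8);
    __m128 r0 = _mm256_castps256_ps128(r);
    __m128 r1 = _mm256_extractf128_ps(r, 1);
    r1 = _mm_insert_ps(r1, r0, 0xF0);                       // r1[3] = r0[3]
    r0 = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(r0), 4));
    r1 = _mm_shuffle_ps(r1, r1, _MM_SHUFFLE(2, 1, 0, 3));   // rotate
    r = _mm256_insertf128_ps(_mm256_castps128_ps256(r0), r1, 1);

    float out[8];
    _mm256_storeu_ps(out, r);
    for (int i = 0; i < 8; i++)
        printf("%.0f ", out[i]);                            // prints: 0 1 2 3 4 5 6 7
    printf("\n");
    return 0;
}
[/cpp]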
Yes, you're right about AVX. It doesn't seem to be a good idea to do it with AVX if we want performance. I will try it, but I'm also afraid it won't be that fast. Thank you.
Maybe you have the chance to organize the data in memory correctly for AVX before processing?
You mentioned it is not cyclic, so this might be an option.
I've considered this, Christian, but I can't see how to do that without moving things around, which is exactly what I want to avoid.
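For what it's worth, one layout that avoids moving anything at run time: if the vectors are loaded from memory anyway, allocating the buffer with one zeroed element of front padding turns the left shift into a plain unaligned load starting one element early. A minimal sketch, assuming the data layout permits such padding:
[cpp]
#include <immintrin.h>

// Assumes "data" points one element past a zeroed pad, i.e. data[-1] == 0.
static inline __m256 shifted_load(const float *data)
{
    return _mm256_loadu_ps(data - 1);   // element i holds data[i - 1]
}
[/cpp]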
rmendes.silva,
SLL_256() shifts left by an arbitrary number of elements.
All of the "if" checks are removed by optimization, since "offs" is a constant.
It can be used as-is with __m256; for other types you only need to replace the casts.
Please let me know whether it's fast or slow in your case.
It would also be nice to see a snippet of your code that uses the shift; perhaps a faster approach could be found.
[cpp]
// r: result
// a: src vector
// offs: number of elements to shift
// elem_n: number of elements in vector (8 for float)
#define SLL_256(r, a, offs, elem_n) \
{ \
    __m128 r0, r1; \
    const int size = sizeof(a) / elem_n; \
\
    if (!offs) \
        r = a; \
    else if (offs == elem_n / 2) \
        r = _mm256_permute2f128_ps(a, a, 0x08); \
    else if (offs >= elem_n) \
        r = _mm256_setzero_ps(); \
    else if (offs < elem_n / 2) \
    { \
        r0 = _mm256_castps256_ps128(a); \
        r1 = _mm256_extractf128_ps(a, 1); \
        r1 = _mm_castsi128_ps(_mm_alignr_epi8(_mm_castps_si128(r1), _mm_castps_si128(r0), (elem_n / 2 - offs) * size)); \
        r0 = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(r0), offs * size)); \
        r = _mm256_insertf128_ps(_mm256_castps128_ps256(r0), r1, 1); \
    } \
    else \
    { \
        r0 = _mm256_castps256_ps128(a); \
        r0 = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(r0), (offs - elem_n / 2) * size)); \
        r = _mm256_permute2f128_ps(_mm256_castps128_ps256(r0), _mm256_castps128_ps256(r0), 0x08); \
    } \
}
[/cpp]
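Hypothetical usage (a sketch of mine, not from the post): shifting a float vector left by two elements with the macro above should give 0 0 1 2 3 4 5 6:
[cpp]
__m256 v = _mm256_setr_ps(1, 2, 3, 4, 5, 6, 7, 8);
__m256 res;
SLL_256(res, v, 2, 8);   // takes the offs < elem_n / 2 branch
[/cpp]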
Vladimir,
Great post. This should be a starting point for the Intel compiler intrinsics developers to offer an official _mm256_... intrinsic function, so that whenever an improvement in the AVX design is made, we mere users need not revisit our code.
Jim Dempsey
Jim,
I really appreciate your words, thanks.
I've added a similar SRL_256() to shift right.
This time the macros accept 256-bit vectors of any type.
I'm just a bit worried that this version could be slower with a not-very-smart compiler.
[cpp]
// r: result
// a: src vector
// offs: number of elements to shift (must be a const)
// elem_n: number of elements in vector (8 for "float")
#define SLL_256(r, a, offs, elem_n) \
{ \
    __m256i *pr = (__m256i *)&r; \
    __m256i *pa = (__m256i *)&a; \
    __m128i r0, r1; \
    const int size = sizeof(a) / elem_n; \
\
    if (!offs) \
        *pr = *pa; \
    else if (offs == elem_n / 2) \
        *pr = _mm256_permute2f128_si256(*pa, *pa, 0x08); \
    else if (offs >= elem_n) \
        *pr = _mm256_setzero_si256(); \
    else if (offs < elem_n / 2) \
    { \
        r0 = _mm256_castsi256_si128(*pa); \
        r1 = _mm256_extractf128_si256(*pa, 1); \
        r1 = _mm_alignr_epi8(r1, r0, (elem_n / 2 - offs) * size); \
        r0 = _mm_slli_si128(r0, offs * size); \
        *pr = _mm256_insertf128_si256(_mm256_castsi128_si256(r0), r1, 1); \
    } \
    else \
    { \
        r0 = _mm256_castsi256_si128(*pa); \
        r0 = _mm_slli_si128(r0, (offs - elem_n / 2) * size); \
        *pr = _mm256_permute2f128_si256(_mm256_castsi128_si256(r0), _mm256_castsi128_si256(r0), 0x08); \
    } \
}

#define SRL_256(r, a, offs, elem_n) \
{ \
    __m256i *pr = (__m256i *)&r; \
    __m256i *pa = (__m256i *)&a; \
    __m128i r0, r1; \
    const int size = sizeof(a) / elem_n; \
\
    if (!offs) \
        *pr = *pa; \
    else if (offs == elem_n / 2) \
        *pr = _mm256_permute2f128_si256(*pa, *pa, 0x81); \
    else if (offs >= elem_n) \
        *pr = _mm256_setzero_si256(); \
    else if (offs < elem_n / 2) \
    { \
        r0 = _mm256_castsi256_si128(*pa); \
        r1 = _mm256_extractf128_si256(*pa, 1); \
        r0 = _mm_alignr_epi8(r1, r0, offs * size); \
        r1 = _mm_srli_si128(r1, offs * size); \
        *pr = _mm256_insertf128_si256(_mm256_castsi128_si256(r0), r1, 1); \
    } \
    else \
    { \
        r1 = _mm256_extractf128_si256(*pa, 1); \
        r1 = _mm_srli_si128(r1, (offs - elem_n / 2) * size); \
        *pr = _mm256_permute2f128_si256(_mm256_castsi128_si256(r1), _mm256_castsi128_si256(r1), 0x80); \
    } \
}
[/cpp]
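And a matching sketch for the right shift (again, the example values are mine): shifting floats right by three elements should give 4 5 6 7 8 0 0 0:
[cpp]
__m256 v = _mm256_setr_ps(1, 2, 3, 4, 5, 6, 7, 8);
__m256 res;
SRL_256(res, v, 3, 8);   // takes the offs < elem_n / 2 branch
[/cpp]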
Hi Vladimir,
I will try this approach and will post here whether it gets faster or not. Thanks.