I'm comparing two programs, one written using SSE and the other using AVX. My aim is to show that the AVX version runs twice as fast, but I'm losing something like 20% to some shift operations.
I frequently need to rotate an AVX vector one byte to the left. It seems all the instructions I need will only be available with AVX2.
Currently I'm splitting the source __m256i vector into two __m128i halves, but this way I'm losing performance. Is there any other way to perform this operation? Why were shift operations not included in the AVX instruction set?
Thanks in advance for your help. Here's the current version of my code:
[cpp]
__m128i a1, a2, b1, b2;
a1 = _mm256_castsi256_si128( _source );       // low 128-bit lane
a2 = _mm256_extractf128_si256( _source, 1 );  // high 128-bit lane
b1 = _mm_slli_si128( a1, 1 );                 // low lane shifted left one byte
b2 = _mm_slli_si128( a2, 1 );                 // high lane shifted left one byte
a1 = _mm_srli_si128( a1, 15 );                // old top byte of low lane -> byte 0
a2 = _mm_srli_si128( a2, 15 );                // old top byte of high lane -> byte 0
_dest = _mm256_castsi128_si256( _mm_or_si128( b1, a2 ) );            // new low lane
_dest = _mm256_insertf128_si256( _dest, _mm_or_si128( b2, a1 ), 1 ); // new high lane
[/cpp]
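For reference, a scalar version of the same one-byte left rotate is handy for checking the intrinsic code against (a minimal sketch; the helper name is illustrative, not from my actual code):
[cpp]
#include <stdint.h>
#include <string.h>

// Reference rotate of a 256-bit value left by one byte:
// byte 0 takes the old top byte, every other byte moves up by one.
static void rotl256_1byte_ref(uint8_t dst[32], const uint8_t src[32])
{
    dst[0] = src[31];
    memcpy(dst + 1, src, 31);
}
[/cpp]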
So if I want to left-shift a __m256 floating-point array, there is no way to do it using AVX instructions?
I was wondering if a combination of _mm256_shuffle_ps and _mm256_permute_ps would do it. Is that possible? If so, I just could not understand the meaning of the _MM_SHUFFLE macro in the context of _mm256_permute_ps; how can I use it?
Or, if you still think using 128-bit is better, which 128-bit intrinsics should I use for that?
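As an aside on _MM_SHUFFLE: it just packs four 2-bit source indices into one immediate, _MM_SHUFFLE(d, c, b, a) == (d << 6) | (c << 4) | (b << 2) | a, and _mm256_permute_ps applies those indices within each 128-bit lane independently. A minimal sketch (values chosen purely for illustration):
[cpp]
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256 v = _mm256_setr_ps(0, 1, 2, 3, 4, 5, 6, 7);
    // Per lane: dst[0]=src[3], dst[1]=src[0], dst[2]=src[1], dst[3]=src[2]
    __m256 p = _mm256_permute_ps(v, _MM_SHUFFLE(2, 1, 0, 3));

    float out[8];
    _mm256_storeu_ps(out, p);
    for (int i = 0; i < 8; i++)
        printf("%.0f ", out[i]);   // prints: 3 0 1 2 7 4 5 6
    printf("\n");
    return 0;
}
[/cpp]
Note the per-lane behavior: elements never cross the 128-bit boundary, which is exactly why a shuffle/permute alone cannot implement the full 256-bit shift.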
Sergey Kostrov wrote:
Is it a cyclical operation or not? Does it mean that, in the case of a vector of N elements, element[0] should be moved to element[N-1]?
No, in my case it is not cyclical; element[0] is not needed any more.
rmendes.silva::
This shifts left by 4 bytes:
[cpp]
__m256i r = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8), r1, r2;
r1 = _mm256_slli_si256(r, 4);
r2 = _mm256_srli_si256(r, 12);
r2 = _mm256_permute2f128_si256(r2, r2, 0x08);
r = _mm256_or_si256(r1, r2);
[/cpp]
With AVX2:
[cpp]
r1 = _mm256_permute2f128_si256(r, r, 0x08);
r = _mm256_alignr_epi8(r, r1, 12);
[/cpp]
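A quick harness to check the 4-byte shift (a sketch of mine, assuming an AVX2-capable compiler, since the 256-bit integer intrinsics carry the instruction-set caveat discussed below):
[cpp]
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256i r = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8), r1, r2;
    r1 = _mm256_slli_si256(r, 4);                  // per-lane shift left 4 bytes
    r2 = _mm256_srli_si256(r, 12);                 // isolate top dword of each lane
    r2 = _mm256_permute2f128_si256(r2, r2, 0x08);  // move low lane up, zero low lane
    r = _mm256_or_si256(r1, r2);                   // stitch the lanes together

    int out[8];
    _mm256_storeu_si256((__m256i *)out, r);
    for (int i = 0; i < 8; i++)
        printf("%d ", out[i]);                     // prints: 0 1 2 3 4 5 6 7
    printf("\n");
    return 0;
}
[/cpp]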
OK Vladimir, but this is for integer values. What about floating point? I didn't find anything to do the same with floats.
rmendes.silva::
It's almost the same with floats:
AVX:
[cpp]
__m256 r, r1, r2;
r1 = _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(r), 4));
r2 = _mm256_castsi256_ps(_mm256_srli_si256(_mm256_castps_si256(r), 12));
r2 = _mm256_permute2f128_ps(r2, r2, 0x08);
r = _mm256_or_ps(r1, r2);
[/cpp]
AVX2:
[cpp]
r1 = _mm256_permute2f128_ps(r, r, 0x08);
r = _mm256_castsi256_ps(_mm256_alignr_epi8(_mm256_castps_si256(r), _mm256_castps_si256(r1), 12));
[/cpp]
If you need doubles, replace 4, 12 and 12 by 8, 8 and 8.
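Spelling out that double variant with the integer/double casts included (a sketch of what the 8/8/8 substitution looks like; the 256-bit integer shifts here have the same instruction-set caveat discussed just below):
[cpp]
__m256d r, r1, r2;
r1 = _mm256_castsi256_pd(_mm256_slli_si256(_mm256_castpd_si256(r), 8));
r2 = _mm256_castsi256_pd(_mm256_srli_si256(_mm256_castpd_si256(r), 8));
r2 = _mm256_permute2f128_pd(r2, r2, 0x08);  // move low lane up, zero low lane
r = _mm256_or_pd(r1, r2);
[/cpp]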
In the Intel documentation _mm256_slli_si256 appears only in the AVX2 section, so I think it is not available in AVX, is it?
You're right. Perhaps it's not a good idea to do it with AVX at all.
I'm afraid this version won't be too fast:
[cpp]
__m256 r;
__m128 r0, r1;
r0 = _mm256_castps256_ps128(r);                         // low lane
r1 = _mm256_extractf128_ps(r, 1);                       // high lane
r1 = _mm_insert_ps(r1, r0, 0xF0);                       // r1[3] = r0[3]
r0 = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(r0), 4));
r1 = _mm_shuffle_ps(r1, r1, _MM_SHUFFLE(2, 1, 0, 3));   // rotate
r = _mm256_insertf128_ps(_mm256_castps128_ps256(r0), r1, 1);
[/cpp]
If you don't care about r[0]:
[cpp]
r = _mm256_shuffle_ps(r, r, _MM_SHUFFLE(2, 1, 0, 3));   // rotate each lane
r0 = _mm256_castps256_ps128(r);
r1 = _mm256_extractf128_ps(r, 1);
r1 = _mm_move_ss(r1, r0);                               // r1[0] = r0[0]
r = _mm256_insertf128_ps(_mm256_castps128_ps256(r0), r1, 1);
[/cpp]
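A quick way to sanity-check the first version above (a minimal sketch): load 1..8, shift left by one element, and expect 0 1 2 3 4 5 6 7:
[cpp]
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256 r = _mm256_setr_ps(1, 2, 3, 4, 5, 6, 7, 8);
    __m128 r0 = _mm256_castps256_ps128(r);
    __m128 r1 = _mm256_extractf128_ps(r, 1);
    r1 = _mm_insert_ps(r1, r0, 0xF0);                       // r1[3] = r0[3]
    r0 = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(r0), 4));
    r1 = _mm_shuffle_ps(r1, r1, _MM_SHUFFLE(2, 1, 0, 3));   // rotate
    r = _mm256_insertf128_ps(_mm256_castps128_ps256(r0), r1, 1);

    float out[8];
    _mm256_storeu_ps(out, r);
    for (int i = 0; i < 8; i++)
        printf("%.0f ", out[i]);                            // prints: 0 1 2 3 4 5 6 7
    printf("\n");
    return 0;
}
[/cpp]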
Yes, you're right about AVX. It doesn't seem to be a good idea to do it with AVX if we want performance. I will try it, but I'm also afraid it won't be that fast. Thank you.
Maybe you have the chance to organize the data in memory correctly for AVX before processing?
You mentioned it is not cyclic, so this might be an option.
I've considered this, Christian, but I can't see how to do that without moving things around, which is exactly what I want to avoid.
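For what it's worth, one layout that avoids moving anything at run time: if the vectors are loaded from memory anyway, allocating the buffer with one zeroed element of front padding turns the left shift into a plain unaligned load starting one element early. A minimal sketch, assuming the data layout permits such padding:
[cpp]
#include <immintrin.h>

// Assumes "data" points one element past a zeroed pad, i.e. data[-1] == 0.
static inline __m256 shifted_load(const float *data)
{
    return _mm256_loadu_ps(data - 1);   // element i holds data[i - 1]
}
[/cpp]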
rmendes.silva,
SLL_256() shifts left by an arbitrary number of elements.
All of the "if" checks are removed by optimization, since "offs" is a constant.
It can be used as-is with __m256; for other types you only need to replace the casts.
Please let me know whether it's fast or slow in your case.
It would also be nice to see a snippet of your code that uses the shift; perhaps a faster approach could be found.
[cpp]
// r: result
// a: src vector
// offs: number of elements to shift
// elem_n: number of elements in vector (8 for float)
#define SLL_256(r, a, offs, elem_n) \
{ \
    __m128 r0, r1; \
    const int size = sizeof(a) / elem_n; \
\
    if (!offs) \
        r = a; \
    else if (offs == elem_n / 2) \
        r = _mm256_permute2f128_ps(a, a, 0x08); \
    else if (offs >= elem_n) \
        r = _mm256_setzero_ps(); \
    else if (offs < elem_n / 2) \
    { \
        r0 = _mm256_castps256_ps128(a); \
        r1 = _mm256_extractf128_ps(a, 1); \
        r1 = _mm_castsi128_ps(_mm_alignr_epi8(_mm_castps_si128(r1), _mm_castps_si128(r0), (elem_n / 2 - offs) * size)); \
        r0 = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(r0), offs * size)); \
        r = _mm256_insertf128_ps(_mm256_castps128_ps256(r0), r1, 1); \
    } \
    else \
    { \
        r0 = _mm256_castps256_ps128(a); \
        r0 = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(r0), (offs - elem_n / 2) * size)); \
        r = _mm256_permute2f128_ps(_mm256_castps128_ps256(r0), _mm256_castps128_ps256(r0), 0x08); \
    } \
}
[/cpp]
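Hypothetical usage (a sketch of mine, not from the post): shifting a float vector left by two elements with the macro above should give 0 0 1 2 3 4 5 6:
[cpp]
__m256 v = _mm256_setr_ps(1, 2, 3, 4, 5, 6, 7, 8);
__m256 res;
SLL_256(res, v, 2, 8);   // takes the offs < elem_n / 2 branch
[/cpp]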
Vladimir,
Great post. This should be a starting point for the Intel compiler intrinsics developers to offer an official _mm256_... intrinsic function, so that whenever an improvement in the AVX design is made, we mere users need not revisit our code.
Jim Dempsey
Jim,
I really appreciate your words, thanks.
I've added a similar SRL_256() to shift right.
This time the macros accept 256-bit vectors of any type.
I'm just a bit worried that this version could be slower with a not-very-smart compiler.
[cpp]
// r: result
// a: src vector
// offs: number of elements to shift (must be a const)
// elem_n: number of elements in vector (8 for "float")
#define SLL_256(r, a, offs, elem_n) \
{ \
    __m256i *pr = (__m256i *)&r; \
    __m256i *pa = (__m256i *)&a; \
    __m128i r0, r1; \
    const int size = sizeof(a) / elem_n; \
\
    if (!offs) \
        *pr = *pa; \
    else if (offs == elem_n / 2) \
        *pr = _mm256_permute2f128_si256(*pa, *pa, 0x08); \
    else if (offs >= elem_n) \
        *pr = _mm256_setzero_si256(); \
    else if (offs < elem_n / 2) \
    { \
        r0 = _mm256_castsi256_si128(*pa); \
        r1 = _mm256_extractf128_si256(*pa, 1); \
        r1 = _mm_alignr_epi8(r1, r0, (elem_n / 2 - offs) * size); \
        r0 = _mm_slli_si128(r0, offs * size); \
        *pr = _mm256_insertf128_si256(_mm256_castsi128_si256(r0), r1, 1); \
    } \
    else \
    { \
        r0 = _mm256_castsi256_si128(*pa); \
        r0 = _mm_slli_si128(r0, (offs - elem_n / 2) * size); \
        *pr = _mm256_permute2f128_si256(_mm256_castsi128_si256(r0), _mm256_castsi128_si256(r0), 0x08); \
    } \
}

#define SRL_256(r, a, offs, elem_n) \
{ \
    __m256i *pr = (__m256i *)&r; \
    __m256i *pa = (__m256i *)&a; \
    __m128i r0, r1; \
    const int size = sizeof(a) / elem_n; \
\
    if (!offs) \
        *pr = *pa; \
    else if (offs == elem_n / 2) \
        *pr = _mm256_permute2f128_si256(*pa, *pa, 0x81); \
    else if (offs >= elem_n) \
        *pr = _mm256_setzero_si256(); \
    else if (offs < elem_n / 2) \
    { \
        r0 = _mm256_castsi256_si128(*pa); \
        r1 = _mm256_extractf128_si256(*pa, 1); \
        r0 = _mm_alignr_epi8(r1, r0, offs * size); \
        r1 = _mm_srli_si128(r1, offs * size); \
        *pr = _mm256_insertf128_si256(_mm256_castsi128_si256(r0), r1, 1); \
    } \
    else \
    { \
        r1 = _mm256_extractf128_si256(*pa, 1); \
        r1 = _mm_srli_si128(r1, (offs - elem_n / 2) * size); \
        *pr = _mm256_permute2f128_si256(_mm256_castsi128_si256(r1), _mm256_castsi128_si256(r1), 0x80); \
    } \
}
[/cpp]
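And a matching sketch for the right shift (again, the example values are mine): shifting floats right by three elements should give 4 5 6 7 8 0 0 0:
[cpp]
__m256 v = _mm256_setr_ps(1, 2, 3, 4, 5, 6, 7, 8);
__m256 res;
SRL_256(res, v, 3, 8);   // takes the offs < elem_n / 2 branch
[/cpp]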
Hi Vladimir,
I will try this approach and will post here whether it gets faster or not. Thanks.