topic Your right. Perhaps, it's not in Intel® ISA Extensions

AVX - Vector shifts

ale3 — Sun, 02 Dec 2012 04:25:12 GMT

I'm comparing two programmes, one is written using SSE and the other one AVX. My aim is to show that the avx version is running 2 times faster but I'm loosing something like 20 % with some shift operations.

I need to perform quite often a shift operation to rotate an Avx Vector 1 byte on the left. It seems like all the instructions I need will only be available with AVX2.

Actually I'm splitting the source _m256i vector into 2 _128i but this way I'm loosing performances. Is there any other way to perform this operation? Why shifting operation were not included in avx instruction set?

Thanks in advance for your help, here's the current version on my code

[cpp]

a1 = _mm256_castsi256_si128( _source );
a2 = _mm256_extractf128_si256 ( _source,1 );

b1 = _mm_slli_si128( a1,1);
b2 = _mm_slli_si128( a2,1);
a1 = _mm_srli_si128( a1,15);
a2 = _mm_srli_si128( a2,15);

_dest = _mm256_castsi128_si256 ( _mm_or_si128(b1,a2) );
_dest = _mm256_insertf128_ps ( _dest, _mm_or_si128(b2,a1), 1 );

[/cpp]

Yes, Intel AVX supports only

Thomas_W_Intel — Wed, 05 Dec 2012 22:12:01 GMT

Yes, Intel AVX supports only FP instructions for 256bit. If you need other instructions, like the shift that you are describing, it is better to use the 128bit instructions. You might save some instructions because of the non-destructive source that comes with AVX, but that's about it. For integer instructions with 256bit registers, you will have to wait for AVX2.

So if I want to left shift an

Silva__Rafael — Fri, 31 Jan 2014 00:38:23 GMT

So if I want to left shift an _m256 float point array, there is no way to do it using AVX instructions?

I was wondering if a combination of __mm256_shuffle_ps and __mm256_permute_ps would make it. Is that possible? If yes, I just could not understand the meaning of __MM_SHUFFLE macro on the context of __mm256_permute_ps function, how can I use it?

Or if you still think to use 128bit is better, what intrinsic 128 bit functions should I use for that?

>>...I need to perform quite

SergeyKostrov — Fri, 31 Jan 2014 21:54:00 GMT

>>...I need to perform quite often a shift operation to rotate an Avx Vector 1 byte on the left... Is it a cyclical operation or No? Does it mean that in case of a vector of N elements an element[0] should be moved to an element[N-1]? And of course all the rest vector elements are shifted to the left.

Quote:Sergey Kostrov wrote:

Silva__Rafael — Sat, 01 Feb 2014 17:19:05 GMT

Sergey Kostrov wrote:

Is it a cyclical operation or No? Does it mean that in case of a vector of N elements an element[0] should be moved to an element[N-1]?

Not in my case, it is not cyclical. In my case the element[0] is not necessary any more.

rmendes.silva::

Vladimir_Sedach — Mon, 03 Feb 2014 09:14:43 GMT

rmendes.silva::

Shifts left by 4 bytes:
__m256i r = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8), r1, r2;
r1 = _mm256_slli_si256(r, 4);
r2 = _mm256_srli_si256(r, 12);
r2 = _mm256_permute2f128_si256(r2, r2, 0x08);
r = _mm256_or_si256(r1, r2);

With AVX2:

r1 = _mm256_permute2f128_si256(r, r, 0x08);
r = _mm256_alignr_epi8(r, r1, 12);

Ok Vladimir, but this is for

Silva__Rafael — Tue, 04 Feb 2014 15:24:33 GMT

Ok Vladimir, but this is for integer values. What about float point? I didn't found anything to do the same with float points.

rmendes.silva::

Vladimir_Sedach — Tue, 04 Feb 2014 19:54:32 GMT

rmendes.silva::

It's almost the same with floats:
__m256 r, r1, r2;

AVX:
r1 = _mm256_slli_si256(_mm256_castps_si256(r), 4);
r2 = _mm256_srli_si256(_mm256_castps_si256(r), 12);
r2 = _mm256_permute2f128_ps(r2, r2, 0x08);
r = _mm256_or_ps(r1, r2);

AVX2:
r1 = _mm256_permute2f128_ps(r, r, 0x08);
r = _mm256_alignr_epi8(_mm256_castps_si256(r), _mm256_castps_si256(r1), 12);

If you need doubles, replace 4, 12 and 12 by 8, 8 and 8.

In Intel documentation _mm256

Silva__Rafael — Tue, 04 Feb 2014 20:25:31 GMT

In Intel documentation _mm256_slli_si256 is only included on AVX2 documentation, so I think it is not available for AVX, is it?

Your right. Perhaps, it's not

Vladimir_Sedach — Tue, 04 Feb 2014 23:03:00 GMT

Your right. Perhaps, it's not a good idea to do it with AVX at all.
I'm afraid this version wont be too fast.

__m256 r;
__m128 r0, r1;

r0 = _mm256_castps256_ps128(r);
r1 = _mm256_extractf128_ps(r, 1);
r1 = _mm_insert_ps(r1, r0, 0xF0); //r1[3] = r0[3]
r0 = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(r0), 4));
r1 = _mm_shuffle_ps(r1, r1, _MM_SHUFFLE(2, 1, 0, 3)); //rotate
r = _mm256_insertf128_ps(_mm256_castps128_ps256(r0), r1, 1);

If you don't care about r[0]:
r = _mm256_shuffle_ps(r, r, _MM_SHUFFLE(2, 1, 0, 3));
r0 = _mm256_castps256_ps128(r);
r1 = _mm256_extractf128_ps(r, 1);
r1 = _mm_move_ss(r1, r0);
r = _mm256_insertf128_ps(_mm256_castps128_ps256(r0), r1, 1);

Yes, you're right about AVX.

Silva__Rafael — Wed, 05 Feb 2014 10:46:20 GMT

Yes, you're right about AVX. Doesn't seem to be a good idea do it with AVX, if we want performance. I will try, but I'm also afraid that would not be so fast. Thank you.

Maybe you have the chance to

Christian_M_2 — Thu, 27 Feb 2014 11:59:16 GMT

Maybe you have the chance to organise the data in memory right for AVX before processing?

You mentioned it is not cyclic so this might be an option.

I've considered this

Silva__Rafael — Mon, 03 Mar 2014 18:42:27 GMT

I've considered this Christian, but I can't see how to do that without moving things around, which is exactly what I want to avoid.

rmendes.silva,

Vladimir_Sedach — Tue, 04 Mar 2014 12:39:00 GMT

rmendes.silva,

SLL_256() shifts left by an arbitrary number of elements.
All the "if" checks are removed by optimization since "offs" is a const.
It can be used as is with __m256, and needs just to replace all the casts otherwise.

Please let me know if it's fast/slow in your case.
It also would be nice to see a snippet of your code that uses the shift.
Perhaps a faster approach could be found.

// r: result
// a: src vector
// offs: number of elements to shift
// elem_n: number of  elements in vector (8 for float)
   #define SLL_256(r, a, offs, elem_n) \
   { \
       __m128   r0, r1; \
       const int   size = sizeof(a) / elem_n; \
\
       if (!offs) \
           r = a; \
       else if (offs == elem_n / 2) \
           r = _mm256_permute2f128_ps(a, a, 0x08); \
       else if (offs >= elem_n) \
           r = _mm256_setzero_ps(); \
       else if (offs < elem_n / 2) \
       { \
           r0 = _mm256_castps256_ps128(a); \
           r1 = _mm256_extractf128_ps(a, 1); \
           r1 = _mm_castsi128_ps(_mm_alignr_epi8(_mm_castps_si128(r1), _mm_castps_si128(r0), (elem_n / 2 - offs) * size)); \
           r0 = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(r0), offs * size)); \
           r = _mm256_insertf128_ps(_mm256_castps128_ps256(r0), r1, 1); \
       } \
       else \
       { \
           r0 = _mm256_castps256_ps128(a); \
           r0 = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(r0), (offs - elem_n / 2) * size)); \
           r = _mm256_permute2f128_ps(_mm256_castps128_ps256(r0), _mm256_castps128_ps256(r0), 0x08); \
       } \
   }

Vladimir,

jimdempseyatthecove — Tue, 04 Mar 2014 13:27:57 GMT

Vladimir,

Great post. This should be a starting point for Intel compiler intrinsic developers to offer an official _mm256_... intrinsic function. Such that whenever an improvement in AVX design is made, that we mere users need not revisit our code.

Jim Dempsey

Jim,

Vladimir_Sedach — Tue, 04 Mar 2014 15:40:38 GMT

Jim,

I really appreciate your words, thanks.
Added a similar SRL_256() to shift right.
This time they accept 256-bit vectors of any type.
Just anxious a bit this version could be slower with a not very smart compiler.

// r: result
// a: src vector
// offs: number of elements to shift (must be a const)
// elem_n: number of  elements in vector (8 for "float")
#define SLL_256(r, a, offs, elem_n) \
{ \
   __m256i   *pr = (__m256i *)&r; \
   __m256i   *pa = (__m256i *)&a; \
   __m128i   r0, r1; \
   const int   size = sizeof(a) / elem_n; \
\
   if (!offs) \
       *pr = *pa; \
   else if (offs == elem_n / 2) \
       *pr = _mm256_permute2f128_si256(*pa, *pa, 0x08); \
   else if (offs >= elem_n) \
       *pr = _mm256_setzero_si256(); \
   else if (offs < elem_n / 2) \
   { \
       r0 = _mm256_castsi256_si128(*pa); \
       r1 = _mm256_extractf128_si256(*pa, 1); \
       r1 = _mm_alignr_epi8(r1, r0, (elem_n / 2 - offs) * size); \
       r0 = _mm_slli_si128(r0, offs * size); \
       *pr = _mm256_insertf128_si256(_mm256_castsi128_si256(r0), r1, 1); \
   } \
   else \
   { \
       r0 = _mm256_castsi256_si128(*pa); \
       r0 = _mm_slli_si128(r0, (offs - elem_n / 2) * size); \
       *pr = _mm256_permute2f128_si256(_mm256_castsi128_si256(r0), _mm256_castsi128_si256(r0), 0x08); \
   } \
}

#define SRL_256(r, a, offs, elem_n) \
{ \
   __m256i   *pr = (__m256i *)&r; \
   __m256i   *pa = (__m256i *)&a; \
   __m128i   r0, r1; \
   const int   size = sizeof(a) / elem_n; \
\
   if (!offs) \
       *pr = *pa; \
   else if (offs == elem_n / 2) \
       *pr = _mm256_permute2f128_si256(*pa, *pa, 0x81); \
   else if (offs >= elem_n) \
       *pr = _mm256_setzero_si256(); \
   else if (offs < elem_n / 2) \
   { \
       r0 = _mm256_castsi256_si128(*pa); \
       r1 = _mm256_extractf128_si256(*pa, 1); \
       r0 = _mm_alignr_epi8(r1, r0, offs * size); \
       r1 = _mm_srli_si128(r1, offs * size); \
       *pr = _mm256_insertf128_si256(_mm256_castsi128_si256(r0), r1, 1); \
   } \
   else \
   { \
       r1 = _mm256_extractf128_si256(*pa, 1); \
       r1 = _mm_srli_si128(r1, (offs - elem_n / 2) * size); \
       *pr = _mm256_permute2f128_si256(_mm256_castsi128_si256(r1), _mm256_castsi128_si256(r1), 0x80); \
   } \
}

Hi Vladimir,

Silva__Rafael — Wed, 05 Mar 2014 20:23:45 GMT

Hi Vladimir,

I will try this approach an will post here it gets faster of not. Thanks.