
AVX - Vector shifts

ale3
Beginner

I'm comparing two programs, one written using SSE and the other using AVX. My aim is to show that the AVX version runs twice as fast, but I'm losing something like 20% to some shift operations.

I frequently need to perform a shift operation that rotates an AVX vector 1 byte to the left. It seems that all the instructions I need will only be available with AVX2.

At the moment I'm splitting the source __m256i vector into two __m128i vectors, but this way I'm losing performance. Is there any other way to perform this operation? Why were shift operations not included in the AVX instruction set?

Thanks in advance for your help. Here's the current version of my code:

[cpp]
__m128i a1, a2, b1, b2;

// Split the 256-bit source into its two 128-bit lanes
a1 = _mm256_castsi256_si128( _source );
a2 = _mm256_extractf128_si256( _source, 1 );

// Shift each lane left by 1 byte
b1 = _mm_slli_si128( a1, 1 );
b2 = _mm_slli_si128( a2, 1 );

// Isolate the byte shifted out of the top of each lane
a1 = _mm_srli_si128( a1, 15 );
a2 = _mm_srli_si128( a2, 15 );

// Recombine: the byte that leaves each lane wraps into the other lane
_dest = _mm256_castsi128_si256( _mm_or_si128( b1, a2 ) );
_dest = _mm256_insertf128_si256( _dest, _mm_or_si128( b2, a1 ), 1 );
[/cpp]

Thomas_W_Intel
Employee
Yes, Intel AVX supports only FP instructions at 256 bits. If you need other instructions, like the shift you are describing, it is better to use the 128-bit instructions. You might save some instructions thanks to the non-destructive source operands that come with AVX, but that's about it. For integer instructions on 256-bit registers, you will have to wait for AVX2.
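
For illustration, this is roughly the sort of saving (a sketch; the actual instruction selection is up to the compiler):

[cpp]
// SSE encoding: pslldq is destructive, so the compiler must copy the
// source first if it is still needed:
//     movdqa  xmm1, xmm0
//     pslldq  xmm1, 1
// With the AVX (VEX) encoding, the same intrinsic can compile to a
// single non-destructive instruction:
//     vpslldq xmm1, xmm0, 1
__m128i b1 = _mm_slli_si128( a1, 1 );  // a1 is left intact
[/cpp]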
Silva__Rafael
Beginner

So if I want to left-shift an __m256 floating-point vector, there is no way to do it using AVX instructions?

I was wondering if a combination of _mm256_shuffle_ps and _mm256_permute_ps would do it. Is that possible? If so, I just could not understand the meaning of the _MM_SHUFFLE macro in the context of _mm256_permute_ps; how can I use it?

Or, if you still think using 128-bit instructions is better, which 128-bit intrinsics should I use for that?

SergeyKostrov
Valued Contributor II
>>...I need to perform quite often a shift operation to rotate an Avx Vector 1 byte on the left...

Is it a cyclic operation or not? Does it mean that, in the case of a vector of N elements, element[0] should be moved to element[N-1], with all the other elements shifted to the left?
Silva__Rafael
Beginner

Sergey Kostrov wrote:

Is it a cyclic operation or not? Does it mean that, in the case of a vector of N elements, element[0] should be moved to element[N-1]?

No, in my case it is not cyclic; element[0] is not needed any more.

Vladimir_Sedach
New Contributor I

rmendes.silva:

This shifts left by 4 bytes:

[cpp]
__m256i r = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8), r1, r2;

r1 = _mm256_slli_si256(r, 4);                  // shift each 128-bit lane left by 4 bytes
r2 = _mm256_srli_si256(r, 12);                 // element carried out of each lane
r2 = _mm256_permute2f128_si256(r2, r2, 0x08);  // low lane's carry -> high lane, low lane zeroed
r = _mm256_or_si256(r1, r2);
[/cpp]

With AVX2:

[cpp]
r1 = _mm256_permute2f128_si256(r, r, 0x08);  // (zero, low lane)
r = _mm256_alignr_epi8(r, r1, 12);           // per-lane concatenate-and-shift
[/cpp]
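
With the vector above, both variants should produce (0, 1, 2, 3, 4, 5, 6, 7). A quick check (hypothetical test harness; assumes <immintrin.h> and <stdio.h>):

[cpp]
int out[8];
_mm256_storeu_si256((__m256i *)out, r);
for (int i = 0; i < 8; i++)
    printf("%d ", out[i]);  // prints: 0 1 2 3 4 5 6 7
[/cpp]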
Silva__Rafael
Beginner

OK Vladimir, but this is for integer values. What about floating point? I didn't find anything to do the same with floating-point values.

Vladimir_Sedach
New Contributor I

rmendes.silva:

It's almost the same with floats:

[cpp]
__m256 r, r1, r2;

// AVX:
r1 = _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(r), 4));
r2 = _mm256_castsi256_ps(_mm256_srli_si256(_mm256_castps_si256(r), 12));
r2 = _mm256_permute2f128_ps(r2, r2, 0x08);
r = _mm256_or_ps(r1, r2);

// AVX2:
r1 = _mm256_permute2f128_ps(r, r, 0x08);
r = _mm256_castsi256_ps(_mm256_alignr_epi8(_mm256_castps_si256(r), _mm256_castps_si256(r1), 12));
[/cpp]

If you need doubles, replace 4, 12 and 12 by 8, 8 and 8.
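
A sketch of the __m256d version with that substitution (same approach, 8-byte elements):

[cpp]
__m256d d, d1, d2;

d1 = _mm256_castsi256_pd(_mm256_slli_si256(_mm256_castpd_si256(d), 8));
d2 = _mm256_castsi256_pd(_mm256_srli_si256(_mm256_castpd_si256(d), 8));
d2 = _mm256_permute2f128_pd(d2, d2, 0x08);
d = _mm256_or_pd(d1, d2);
[/cpp]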
Silva__Rafael
Beginner

In the Intel documentation, _mm256_slli_si256 is only listed under AVX2, so I think it is not available with AVX, is it?

Vladimir_Sedach
New Contributor I

You're right. Perhaps it's not a good idea to do it with AVX at all.
I'm afraid this version won't be too fast.

[cpp]
__m256 r;
__m128 r0, r1;

r0 = _mm256_castps256_ps128(r);    // low lane
r1 = _mm256_extractf128_ps(r, 1);  // high lane
r1 = _mm_insert_ps(r1, r0, 0xF0);  // r1[3] = r0[3]
r0 = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(r0), 4));
r1 = _mm_shuffle_ps(r1, r1, _MM_SHUFFLE(2, 1, 0, 3));  // rotate
r = _mm256_insertf128_ps(_mm256_castps128_ps256(r0), r1, 1);
[/cpp]

If you don't care about r[0]:

[cpp]
r = _mm256_shuffle_ps(r, r, _MM_SHUFFLE(2, 1, 0, 3));  // rotate within each lane
r0 = _mm256_castps256_ps128(r);
r1 = _mm256_extractf128_ps(r, 1);
r1 = _mm_move_ss(r1, r0);  // fix up element 4
r = _mm256_insertf128_ps(_mm256_castps128_ps256(r0), r1, 1);
[/cpp]
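
Regarding your earlier question about _MM_SHUFFLE: the macro just packs four 2-bit source indices into an immediate, listed from the highest result element down to the lowest (and _mm256_permute_ps applies the same selector within each 128-bit lane):

[cpp]
// _MM_SHUFFLE(z, y, x, w) == (z << 6) | (y << 4) | (x << 2) | w
// With _mm_shuffle_ps(a, a, _MM_SHUFFLE(2, 1, 0, 3)):
//   result[3] = a[2], result[2] = a[1], result[1] = a[0], result[0] = a[3]
// which is the one-element rotate used above.
[/cpp]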

Silva__Rafael
Beginner

Yes, you're right about AVX. It doesn't seem to be a good idea to do it with AVX if we want performance. I will try, but I'm also afraid it will not be so fast. Thank you.

Christian_M_2
Beginner

Maybe you have the chance to organize the data in memory so it is already laid out right for AVX before processing?

You mentioned it is not cyclic, so this might be an option; see the sketch below.
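
If the vector comes straight from an array, an unaligned load at an offset already gives the shifted view, with no register shuffling at all (a sketch, assuming src points into a float array and src - 1 is valid to read):

[cpp]
// Hypothetical: element i of 'shifted' is src[i - 1], i.e. the
// left-shifted view of the vector loaded from src.
__m256 shifted = _mm256_loadu_ps(src - 1);
[/cpp]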

Silva__Rafael
Beginner

I've considered this, Christian, but I can't see how to do that without moving things around, which is exactly what I want to avoid.

Vladimir_Sedach
New Contributor I

rmendes.silva,

SLL_256() shifts left by an arbitrary number of elements.
All the "if" checks are removed by the optimizer, since "offs" is a constant.
It can be used as is with __m256; for other types you just need to replace the casts.

Please let me know if it's fast or slow in your case.
It would also be nice to see a snippet of your code that uses the shift.
Perhaps a faster approach could be found.

[cpp]
// r: result
// a: src vector
// offs: number of elements to shift
// elem_n: number of elements in the vector (8 for float)
#define SLL_256(r, a, offs, elem_n) \
{ \
    __m128    r0, r1; \
    const int    size = sizeof(a) / elem_n; \
\
    if (!offs) \
        r = a; \
    else if (offs == elem_n / 2) \
        r = _mm256_permute2f128_ps(a, a, 0x08); \
    else if (offs >= elem_n) \
        r = _mm256_setzero_ps(); \
    else if (offs < elem_n / 2) \
    { \
        r0 = _mm256_castps256_ps128(a); \
        r1 = _mm256_extractf128_ps(a, 1); \
        r1 = _mm_castsi128_ps(_mm_alignr_epi8(_mm_castps_si128(r1), _mm_castps_si128(r0), (elem_n / 2 - offs) * size)); \
        r0 = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(r0), offs * size)); \
        r = _mm256_insertf128_ps(_mm256_castps128_ps256(r0), r1, 1); \
    } \
    else \
    { \
        r0 = _mm256_castps256_ps128(a); \
        r0 = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(r0), (offs - elem_n / 2) * size)); \
        r = _mm256_permute2f128_ps(_mm256_castps128_ps256(r0), _mm256_castps128_ps256(r0), 0x08); \
    } \
}
[/cpp]
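
A hypothetical usage example (names are illustrative):

[cpp]
__m256 v = _mm256_setr_ps(1, 2, 3, 4, 5, 6, 7, 8), res;

SLL_256(res, v, 3, 8);  // res = (0, 0, 0, 1, 2, 3, 4, 5)
[/cpp]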

jimdempseyatthecove
Honored Contributor III

Vladimir,

Great post. This should be a starting point for Intel compiler developers to offer an official _mm256_... intrinsic, so that whenever an improvement in the AVX design is made, we mere users need not revisit our code.

Jim Dempsey

Vladimir_Sedach
New Contributor I

Jim,

I really appreciate your words, thanks.
I've added a similar SRL_256() to shift right.
This time the macros accept 256-bit vectors of any type.
I'm just a bit anxious that this version could be slower with a not-very-smart compiler.

[cpp]
// r: result
// a: src vector
// offs: number of elements to shift (must be a const)
// elem_n: number of elements in the vector (8 for "float")
#define SLL_256(r, a, offs, elem_n) \
{ \
    __m256i    *pr = (__m256i *)&r; \
    __m256i    *pa = (__m256i *)&a; \
    __m128i    r0, r1; \
    const int    size = sizeof(a) / elem_n; \
\
    if (!offs) \
        *pr = *pa; \
    else if (offs == elem_n / 2) \
        *pr = _mm256_permute2f128_si256(*pa, *pa, 0x08); \
    else if (offs >= elem_n) \
        *pr = _mm256_setzero_si256(); \
    else if (offs < elem_n / 2) \
    { \
        r0 = _mm256_castsi256_si128(*pa); \
        r1 = _mm256_extractf128_si256(*pa, 1); \
        r1 = _mm_alignr_epi8(r1, r0, (elem_n / 2 - offs) * size); \
        r0 = _mm_slli_si128(r0, offs * size); \
        *pr = _mm256_insertf128_si256(_mm256_castsi128_si256(r0), r1, 1); \
    } \
    else \
    { \
        r0 = _mm256_castsi256_si128(*pa); \
        r0 = _mm_slli_si128(r0, (offs - elem_n / 2) * size); \
        *pr = _mm256_permute2f128_si256(_mm256_castsi128_si256(r0), _mm256_castsi128_si256(r0), 0x08); \
    } \
}

#define SRL_256(r, a, offs, elem_n) \
{ \
    __m256i    *pr = (__m256i *)&r; \
    __m256i    *pa = (__m256i *)&a; \
    __m128i    r0, r1; \
    const int    size = sizeof(a) / elem_n; \
\
    if (!offs) \
        *pr = *pa; \
    else if (offs == elem_n / 2) \
        *pr = _mm256_permute2f128_si256(*pa, *pa, 0x81); \
    else if (offs >= elem_n) \
        *pr = _mm256_setzero_si256(); \
    else if (offs < elem_n / 2) \
    { \
        r0 = _mm256_castsi256_si128(*pa); \
        r1 = _mm256_extractf128_si256(*pa, 1); \
        r0 = _mm_alignr_epi8(r1, r0, offs * size); \
        r1 = _mm_srli_si128(r1, offs * size); \
        *pr = _mm256_insertf128_si256(_mm256_castsi128_si256(r0), r1, 1); \
    } \
    else \
    { \
        r1 = _mm256_extractf128_si256(*pa, 1); \
        r1 = _mm_srli_si128(r1, (offs - elem_n / 2) * size); \
        *pr = _mm256_permute2f128_si256(_mm256_castsi128_si256(r1), _mm256_castsi128_si256(r1), 0x80); \
    } \
}
[/cpp]
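
As with SLL_256, a hypothetical usage example:

[cpp]
__m256 v = _mm256_setr_ps(1, 2, 3, 4, 5, 6, 7, 8), res;

SRL_256(res, v, 3, 8);  // res = (4, 5, 6, 7, 8, 0, 0, 0)
[/cpp]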

Silva__Rafael
Beginner

Hi Vladimir,

I will try this approach and will post here whether it gets faster or not. Thanks.
