STEPHEN_H_3
Beginner

Will a vector version of rol be supported in the future?


Hi

I am trying to vectorise code that uses mainly integer instructions (add, rol, xor), but I cannot get the compiler to vectorise it.

My understanding is there is no vector version of rol. Will this be supported in the future?

I have tried on Westmere, Sandy Bridge and Haswell with both SSE and AVX. With AVX the rol is replaced by shld, but there is no gain.

I seem to be able to get the code to unroll, but no vector instructions are inserted (according to the disassembler). There is a slight speedup (~10%), but I believe this is due to better use of multiple ALUs from having more independent instructions.

Any guidance would be welcome.

Thanks

Note: using Intel 16.0 icc, Linux, SB/HSW

 

 


jimdempseyatthecove
Black Belt

Please show the scalar code that performs what you want to do. We do not know the size of your integers, nor whether you intend to perform a single rol.

For a single rol:

a) use one of the compares (signed or unsigned) to produce -1's in the lanes where the msb was set, and store this into a register. Then
b) add the vector register to itself to rol each lane left by one bit. Then
c) subtract the saved value from a) above
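The three steps above might be sketched as follows - a minimal SSE2 sketch; the function name `rol1_epi32` is illustrative, and `_mm_setzero_si128` stands in for whatever method you use to produce the zero vector:

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// Rotate four packed 32-bit lanes left by one bit using steps a)-c).
static inline __m128i rol1_epi32(__m128i v)
{
    __m128i zero = _mm_setzero_si128();
    // a) lanes whose msb is set compare as "less than zero" -> all-ones (-1)
    __m128i neg = _mm_cmpgt_epi32(zero, v);
    // b) adding the register to itself shifts every lane left by one bit
    // c) subtracting -1 puts the carried-out msb back in as bit 0
    return _mm_sub_epi32(_mm_add_epi32(v, v), neg);
}
```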

rol and ror (asr and asl) would be nice too, maybe later.

Jim Dempsey

andysem
New Contributor III

Whether the compiler is able to vectorize the code or not depends on many factors - most importantly, the data access pattern and dependencies. The rol instruction itself is easy to emulate with a couple of shifts and an or, so it may not be the culprit. If that piece of code is performance critical, I suggest you try to vectorize it manually - at the very least the problems the compiler is facing should become apparent, and at best you will end up with faster code than what the compiler would have produced.
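The shifts-and-or emulation can be sketched in plain C++ (a minimal example; `rotl32` is an illustrative name, valid for rotation counts 0 < r < 32):

```cpp
#include <cstdint>

// Rotate a 32-bit value left by r bits (0 < r < 32): two shifts and an OR.
// Most compilers recognize this pattern and emit a single rol in scalar code.
static inline std::uint32_t rotl32(std::uint32_t x, unsigned r)
{
    return (x << r) | (x >> (32u - r));
}
```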

STEPHEN_H_3
Beginner
Hi, thanks for the answers. As andysem points out, the lack of a vectorised rol turns out not to be the blocker to vectorisation. I have now managed to vectorise, and yes, two shifts and an OR are used by the compiler. FYI, I've noticed that a vectorised rol exists in AVX-512.
 
I have not yet got a working intrinsics version, but I will try the suggested approach to see if I can speed things up further.
 
One thing I'm wondering: if, say, the compiler can unroll a loop (no dependencies) and vectorise most instructions, but one instruction has no vector counterpart, will the compiler still "vectorise" the loop - with that one instruction unrolled? Or would that block vectorisation?
 

Many Thanks for the responses

jimdempseyatthecove
Black Belt

Depends on the code. You cannot issue scalar instructions on vector registers. Therefore, the unrolled scalar section (if placed inside otherwise vectorised code) would require a store to RAM -> scalar loop over RAM -> load from RAM. If the optimizer deems this too costly, it will run the outer loop in scalar; if not too costly, it may insert the scalar code inside the vectorized loop.

Sketch of a vectorized rol (left by one bit) of 32-bit integers using AVX2:

__m256i zero;
zero = _mm256_xor_si256(zero,zero);

for(int i=0;i<count;i+=8) {  // 8 x 32-bit integers per iteration
  __m256i temp = _mm256_load_si256((__m256i const*)&array[i]);
  __m256i NegCarry = _mm256_cmpgt_epi32(zero,temp);
  _mm256_store_si256((__m256i*)&array[i], _mm256_sub_epi32(_mm256_add_epi32(temp, temp), NegCarry));
}

Note, there are (at least) two posters asking essentially the same question, is this for a CS course test?

Load, compare, add, sub, Store (5 instructions/8 integers)

Jim Dempsey


andysem
New Contributor III

> Note, there are (at least) two posters asking essentially the same question, is this for a CS course test?

I suspect this could be related to the recent spam attack. The other two are probably produced by bots.

STEPHEN_H_3
Beginner
Thanks for this, Jim. At the moment I'm working on unsigned integers, rotating by given rotation constants. When you asked whether I am doing a single rol, did you mean rotating by one bit?
 
I am working on pairs of values; this is the general sequence:

x0^=x1
rotate x1 left by a given constant
x1+=x0
 
Your method looks great; I will have a go at adapting it to my code, though I don't quite understand all of it.
 
e.g. I'm not sure what this is for:
zero = _mm256_xor_si256(zero,zero);
 

The other posts look to be identical copies of mine!

jimdempseyatthecove
Black Belt
__m256i zero; // *** at this point zero has junk (uninitialized data) ***
zero = _mm256_xor_si256(zero,zero); // *** xor junk with self produces 0

Jim Dempsey

jimdempseyatthecove
Black Belt

For multi-bit unsigned data you could use the ..._div_... intrinsic to shift the upper bits right, and the ..._mul_... intrinsic to shift the lower bits left...

*** However, AVX integer div intrinsic creates a sequence of two or more instructions, which may perform worse than a native instruction.

Jim Dempsey

JWong19
Beginner
I think it is simple to rotate left:
template <int nCount>
__m128i _mm_rol_epi32(__m128i const & epi32A)
{
    __m128i const epi32H = _mm_slli_epi32(epi32A, nCount);
    __m128i const epi32L = _mm_srli_epi32(epi32A, 32 - nCount);
    return _mm_or_si128(epi32H, epi32L);
}

 


andysem
New Contributor III

jimdempseyatthecove wrote:

__m256i zero; // *** at this point zero has junk (uninitialized data) ***
zero = _mm256_xor_si256(zero,zero); // *** xor junk with self produces 0

You can use _mm256_setzero_si256 for this.

 

jimdempseyatthecove
Black Belt

That intrinsic uses the PXOR instruction, so the two statements resolve to the same code. setzero is more descriptive as to what you are doing - but then again, anyone programming with SSE/AVX/AVX2/AVX-512 should know what an XOR with self does.

Jim Dempsey

 

andysem
New Contributor III

Yes, but _mm256_setzero_si256 has the advantage that it doesn't involve uninitialized data, which may trigger compiler warnings. The compiler may also be more inclined to recognize this call as a constant and merge multiple calls to _mm256_setzero_si256 into a single ymm with zero content.

 

jimdempseyatthecove
Black Belt

The compiler should recognize

int X;

X ^= X;

And zero X without reading X (provided X is not volatile, which it is not above). It should do the same with the registerized zero in the earlier post. Your issue is valid should zero not be held in a register, in which case you would want to explicitly regenerate the 0.

Jim Dempsey

andysem
New Contributor III

At least gcc 5.4 does emit a warning for this code.

#include <immintrin.h>

int main()
{
    __m256i zero;
    zero = _mm256_xor_si256(zero, zero);
    return 0;
}
g++ -Wall -O3 -mavx2 zero_ymm.cpp -o zero_ymm
zero_ymm.cpp: In function ‘int main()’:
zero_ymm.cpp:6:40: warning: ‘zero’ is used uninitialized in this function [-Wuninitialized]
     zero = _mm256_xor_si256(zero, zero);
                                        ^

If you compile this code without optimization and look at the disassembly, you will see what is going on: the compiler creates two temporaries for zero and then XORs them, which is what actually happens at the language level when you call _mm256_xor_si256. From that perspective you are applying the operation to two pieces of uninitialized data, which is why the warning is legitimate. At this point it doesn't matter that the result of the operation will always be zero.

 

jimdempseyatthecove
Black Belt

Then ..._setzero_... is the better choice.

Note, when optimization is enabled in an actual program, your program structure may be such that the variable zero is registerized. When it is not registerized you may experience: a read of uninitialized data, pxor, and (possibly, though not normally) a write to RAM at zero. The compiler optimization should be smart enough to reduce this to a pxor of the register assigned to shadow zero (though the compiler warning may still be emitted).

Thanks in any event for pointing out _mm..._setzero_...

Jim Dempsey
