Intel® ISA Extensions

Question about latency

Alexander_L_1
Beginner

Hi,

years ago I read and heard various mystical things about latencies caused by register choice when using AVX, and about why it is better to use AVX instead of SSE - non-destructive operations would supposedly perform better than destructive ones. That may have changed over the years. In short:

Assume there are simply some values (signed bytes, words, etc. - it does not matter), say in xmm0, xmm1, xmm2, xmm3.
We want to calculate a sum or max or min - which one does not matter (?).
And we are free to use other registers too, so we use something like AVXOP R1, R2, R3 (here AVXOP stands in for a real instruction).
Now we can do the following:

1. (always use a different register as the destination operand)
AVXOP xmm4, xmm0, xmm1
AVXOP xmm5, xmm2, xmm3
AVXOP xmm6, xmm4, xmm5 ; xmm6 is the result; 3 additional registers used in total

2. (always use the same register as the destination operand, but minimize "cross use" of a potentially unready source register from one operation as the destination register of a following operation)
AVXOP xmm0, xmm0, xmm1
AVXOP xmm2, xmm2, xmm3
AVXOP xmm4, xmm0, xmm2 ; xmm4 is the result; 1 additional register used in total

3. (always use the same register as the destination operand, and don't minimize "cross use" of a potentially unready source register from one operation as the destination register of a following operation)
AVXOP xmm0, xmm0, xmm1
AVXOP xmm0, xmm0, xmm2
AVXOP xmm0, xmm0, xmm3 ; xmm0 is the result; no additional registers used

If they are all equal, the next question would be: it looks better to use SSE instead, because the instructions are shorter and have less latency, so code like:

SSEOP xmm0, xmm1
SSEOP xmm0, xmm2
SSEOP xmm0, xmm3 ; xmm0 is the result; no additional registers used

will beat the three variants above.

One more question: has Intel done anything to reduce the stalls when SSE and AVX (AVX2) instructions are intermixed?

And the last question (now with a real code snippet):

    __m128i xmm0  = _mm_load_si128((__m128i*)(rsi + rdx * 2 - 0x00000010));
    __m128i xmm5  = _mm_load_si128((__m128i*)(rsi + rdx * 2));
    __m128i xmm6  = _mm_load_si128((__m128i*)(rsi + rdx * 2 + 0x00000010));
    __m128i xmm7  = _mm_load_si128((__m128i*)(rsi + rdx * 2 + 0x00000020));
    __m128i xmm11 = _mm_load_si128((__m128i*)(rsi + rdx * 2 + 0x00000030));

    // [X - 2, Y - 2 .. X + 2, Y - 2] ==> XMM0..XMM4
    __m128i xmm1 = _mm_alignr_epi8(xmm6, xmm5, 1);
    __m128i xmm2 = _mm_alignr_epi8(xmm6, xmm5, 2);
    __m128i xmm3 = _mm_alignr_epi8(xmm5, xmm0, 15);
    __m128i xmm4 = _mm_alignr_epi8(xmm5, xmm0, 14);

    xmm0 = _mm_max_epi8(xmm1, xmm2);
    xmm2 = _mm_max_epi8(xmm3, xmm4);
    xmm3 = _mm_max_epi8(xmm0, xmm5);
    __m128i xmm8 = _mm_max_epi8(xmm3, xmm2); // MAX(XMM0..XMM4) ==> result in XMM8

    // ... xmm7 and xmm11 are used from here on.

Question - is it a good idea to start loading as early as possible (all 5 values) in order to fetch them from memory (with high probability the values are not cached), or is it better to interleave the code like this:

    __m128i xmm0 = _mm_load_si128((__m128i*)(rsi + rdx * 2 - 0x00000010));
    __m128i xmm5 = _mm_load_si128((__m128i*)(rsi + rdx * 2));
    __m128i xmm6 = _mm_load_si128((__m128i*)(rsi + rdx * 2 + 0x00000010));

    // [X - 2, Y - 2 .. X + 2, Y - 2] ==> XMM0..XMM4
    __m128i xmm1 = _mm_alignr_epi8(xmm6, xmm5, 1);
    __m128i xmm2 = _mm_alignr_epi8(xmm6, xmm5, 2);
    __m128i xmm3 = _mm_alignr_epi8(xmm5, xmm0, 15);
    __m128i xmm4 = _mm_alignr_epi8(xmm5, xmm0, 14);

    __m128i xmm7  = _mm_load_si128((__m128i*)(rsi + rdx * 2 + 0x00000020));
    __m128i xmm11 = _mm_load_si128((__m128i*)(rsi + rdx * 2 + 0x00000030));

    xmm0 = _mm_max_epi8(xmm1, xmm2);
    xmm2 = _mm_max_epi8(xmm3, xmm4);
    xmm3 = _mm_max_epi8(xmm0, xmm5);
    __m128i xmm8 = _mm_max_epi8(xmm3, xmm2); // MAX(XMM0..XMM4) ==> result in XMM8

    // ... xmm7 and xmm11 are used from here on.

Or is it all the same, handled by the prefetcher anyway (using prefetch instructions or something like register preloading is not the question here)?

Thanks!

Alex

andysem
New Contributor III

Regarding the first part of your question, cases #1 and #2 are equivalent from a performance perspective. This is because of the register renaming that happens in the CPU. Even when your destination register is one of the input registers, the CPU internally saves the result of the instruction into a new physical register, which subsequent instructions then see under the destination register's name. In other words:

AVXOP xmm0, xmm0, xmm1

is equivalent to

AVXOP xmm0', xmm0, xmm1

[rename xmm0' to xmm0, so that the following instructions refer to the result of this instruction]

Case #3 behaves differently and performs much worse, because it forms a data dependency chain spanning all three instructions. This means that while in cases #1 and #2 the first two instructions could potentially execute in parallel (provided the CPU is capable of that), in case #3 all three instructions must execute sequentially.
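
For illustration, here is a minimal sketch of the two shapes. I'm using _mm_max_epi8 (which requires SSE4.1) as a stand-in for AVXOP - an assumption on my part; any two-input vector operation behaves the same way:

    #include <immintrin.h>

    // Tree reduction (cases #1/#2): the first two operations are mutually
    // independent and can execute in parallel; the critical path is only
    // 2 instructions deep.
    static inline __m128i max4_tree(__m128i a, __m128i b, __m128i c, __m128i d)
    {
        __m128i ab = _mm_max_epi8(a, b); // no dependency on cd
        __m128i cd = _mm_max_epi8(c, d); // no dependency on ab
        return _mm_max_epi8(ab, cd);     // waits for both
    }

    // Linear chain (case #3): each operation consumes the previous result,
    // so the critical path is 3 instructions deep no matter how many
    // execution ports are available.
    static inline __m128i max4_chain(__m128i a, __m128i b, __m128i c, __m128i d)
    {
        __m128i r = _mm_max_epi8(a, b);
        r = _mm_max_epi8(r, c);
        r = _mm_max_epi8(r, d);
        return r;
    }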

Regarding SSE vs. AVX encodings and non-destructive instructions: the latter make it possible to eliminate the movdqa register-to-register copies that are otherwise required to achieve the same effect. This reduces code size and also relieves the instruction decoder. In many cases, just recompiling the same code for AVX can provide a noticeable speedup.
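
For example (a sketch; the assembly in the comments is what a typical compiler would be expected to emit, not verbatim output):

    #include <immintrin.h>

    // 'a' is an input to the min and is used again afterwards, so it must
    // survive the first operation.
    __m128i min_plus_a(__m128i a, __m128i b)
    {
        // SSE (destructive pminsw) needs a copy to preserve 'a':
        //     movdqa xmm2, xmm0
        //     pminsw xmm2, xmm1
        //     paddw  xmm2, xmm0
        // AVX (non-destructive VEX encoding) does not:
        //     vpminsw xmm2, xmm0, xmm1
        //     vpaddw  xmm2, xmm2, xmm0
        __m128i m = _mm_min_epi16(a, b);
        return _mm_add_epi16(m, a);
    }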

If they are all equal, the next question would be: it looks better to use SSE instead, because the instructions are shorter and have less latency

Equivalent SSE and AVX instructions have the same latency and throughput, AFAIK. At least, I haven't seen a case where they are not. Also, I don't think there is any significant difference in instruction size between SSE and AVX (well, between SSE2 and AVX2 anyway). In both cases, instructions take about 4 bytes on average, when memory operands and immediate constants are not involved.

One more question: has Intel done anything to reduce the stalls when SSE and AVX (AVX2) instructions are intermixed?

I assume you mean the penalties caused by mixing 256-bit vector instructions from AVX/AVX2 with 128-bit vector instructions from SSE. You can already avoid those penalties by issuing vzeroupper or vzeroall instructions. Recent Intel architectures also reduced the penalties, but they are far from zero.
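
For example, a sketch of where the instruction goes (_mm256_zeroupper() is the intrinsic form of vzeroupper):

    #include <immintrin.h>

    void avx_then_sse(float *dst, const float *src)
    {
        __m256 v = _mm256_loadu_ps(src);               // 256-bit AVX work
        _mm256_storeu_ps(dst, _mm256_add_ps(v, v));
        _mm256_zeroupper(); // clear the upper halves of the ymm registers so
                            // that following legacy-SSE code pays no
                            // transition penalty
        // ... legacy 128-bit SSE code can follow here safely ...
    }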

Question - is it a good idea to start loading as early as possible (all 5 values) in order to fetch them from memory (with high probability the values are not cached), or is it better to interleave the code

It is generally a good idea to start loading data early, but I wouldn't state that as universal advice. There are many things to consider. First, your compiler may reorder code as it sees fit, so there may actually be no difference in how you arrange it. Second, there is instruction reordering in the CPU, with a rather large window in recent architectures, so the exact instruction order matters less these days. Third, by issuing early loads you keep registers occupied that could have been used for other purposes; that could in turn cause register spills and hurt performance more than the early loads gain. Finally, consider the number of load ports your target CPU has: it cannot issue more loads in parallel than it has ports for, no matter how you order the code. In general, you should profile your code to see whether one way or the other is beneficial in your case.
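
If you do want a quick A/B check, something like this rough sketch can work (variant_a/variant_b are hypothetical placeholders for your two code arrangements; __rdtsc() needs many repetitions to be trustworthy, and a real profiler is better):

    #include <stdint.h>
    #include <stdio.h>
    #ifdef _MSC_VER
    #include <intrin.h>     // __rdtsc on MSVC
    #else
    #include <x86intrin.h>  // __rdtsc on GCC/Clang
    #endif

    // Placeholders: put the "early loads" and "interleaved" kernels here.
    static void variant_a(void) { /* early loads version */ }
    static void variant_b(void) { /* interleaved version */ }

    static uint64_t time_it(void (*fn)(void))
    {
        uint64_t best = UINT64_MAX;
        for (int i = 0; i < 100000; ++i) { // many runs, keep the minimum
            uint64_t t0 = __rdtsc();
            fn();
            uint64_t t1 = __rdtsc();
            if (t1 - t0 < best) best = t1 - t0;
        }
        return best;
    }

    int main(void)
    {
        printf("early loads: %llu cycles\n", (unsigned long long)time_it(variant_a));
        printf("interleaved: %llu cycles\n", (unsigned long long)time_it(variant_b));
        return 0;
    }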

 

Alexander_L_1
Beginner

Hello Andy,

first of all, many thanks for useful information!

Second, I write directly in assembly language because the VS compiler often produces crazily inefficient code - saving/restoring values even when there are enough registers, and so on.

I had completely forgotten about internal register renaming - too much WPF, WCF and other distractions :)

But... would the processor always use this technique? What if all registers are already in use? Are more registers internally available to a single core? And why is it not possible to do the same in case #3 (the dependency situation is clear) - decode the instructions in parallel, start them in parallel inside the pipeline AND use the internal result of the previous operation without waiting for the result to reach the real XMM register? I remember something about that, but I'm not sure whether Intel actually implemented it or whether that information was completely wrong.

Really good to know that VZERO... can avoid the penalties :)

What about the last question? The code is written in assembler. Should I start the 6 loads (MOV(NT)DQA) per core sequentially, or intermix them with other instructions as shown above? Is it possible that the CPU cannot start 6 loads back to back, making that a penalty? Once again, the compiler produces really bad code (as shown in the other topic here), so I need to write in assembler.

Alex

TimP
Honored Contributor III

Interesting exchange, but I'm perplexed about which compiler you're complaining about. I would much prefer to select the most efficient of several available compilers rather than optimize asm for a specific, unspecified CPU.

Shadow registers seem to be a nearly inexhaustible resource if you optimize the number of threads per core and let your compiler do its work. If you mix SSE and AVX intrinsics, Intel C++ will try to avoid the transition stalls by promoting SSE to AVX or by adding vzeroupper at function returns.

 

Alexander_L_1
Beginner

Hello Tim,

you are correct about compilers and assembler, but there is a big problem. The chief wants the cheapest tools, the cheapest work and the cheapest staff (where 2/3 don't know what to do at all, and 1/3 must also do the work of the other 2/3), and nothing to invest - but they want to be the "big number one" by selling the cheapest systems. Simply crazy. P.S. I will no longer work for that company because of such conditions :)

The compiler is the VS 2015 C++ compiler, and the code was written with intrinsics. As mentioned in the other thread, the compiler produces significantly underperforming code. For example, there are exactly 16 __m128i variables, so all 16 xmm registers could be used without "caching" (spilling) - but spilling happens anyway. Another problem: var1, var2, var3 are calculated in that order and should be stored in that order, but the compiler rearranges the stores into var3, var2, var1 order - the result is a pipeline stall. And the next really crazy thing: 4x _mm_stream_si128() gets reordered so that the resulting MOVNTDQ instructions are intermixed with other instructions, which defeats full-cache-line write-combining and kills performance (see the sketch below for the layout I want). I have also noticed that rewriting the code in assembler gives a speedup factor of 1.5-2.5 AND sometimes prevents cache pollution, so other algorithms speed up too.
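
For write-combining to work, the four 16-byte streaming stores of one 64-byte cache line have to stay together, something like this sketch (dst is assumed to be 16-byte aligned, ideally the start of a cache line):

    #include <immintrin.h>

    // Keep the four movntdq stores of one 64-byte cache line adjacent so the
    // write-combining buffer can flush the full line in one shot, instead of
    // being broken up by unrelated instructions in between.
    static inline void stream_line(__m128i *dst, __m128i a, __m128i b,
                                   __m128i c, __m128i d)
    {
        _mm_stream_si128(dst + 0, a);
        _mm_stream_si128(dst + 1, b);
        _mm_stream_si128(dst + 2, c);
        _mm_stream_si128(dst + 3, d);
    }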

The next problem: the vectorization must be done manually, because there is a bunch of "cheapest" code. The number of threads must be balanced manually too, because there are many other threads, so additional threads can slow down the overall performance.

One question: what do you mean by "Shadow registers seem to be a nearly inexhaustible resource if you optimize number of threads per core"? How many shadow registers per core are available? Another interesting question - if there are a lot of shadow registers, why is the number of real registers so much smaller?

Alex

andysem
New Contributor III

Would the processor always use this technique? What if all registers are already in use? Are more registers internally available to a single core?

AFAIK, register renaming works on any write to a register, probably even when not strictly required. There are many more internal registers in the CPU than are exposed through xmm/ymm names, so there is never a case when there is no spare internal register. For example, this article (http://www.realworldtech.com/haswell-cpu/3/) states there are 144 vector registers in Sandy Bridge and 168 in Haswell, and only 16 of them are exposed to the code.

If there are a lot of shadow registers, why is the number of real registers so much smaller?

Because exposing more registers would require more space in the instruction encoding.

And why is it not possible to do the same in case #3 (the dependency situation is clear) - decode the instructions in parallel, start them in parallel inside the pipeline AND use the internal result of the previous operation without waiting for the result to reach the real XMM register?

In order to execute an instruction, all its input values have to be ready. That means that the previous results have to be written to a (renamed) register. That's what forces these instructions to go sequentially. Renaming a register is cheap, and this step is probably indivisible from executing the instruction.

There are exactly 16 __m128i variables, so all 16 xmm registers could be used without "caching" (spilling) - but spilling happens anyway.

Having 16 __m128i variables does not guarantee that this many registers are needed. In fact, the correspondence between variables and registers is rather loose. The compiler may rearrange the code so that some of the variables are eliminated, which frees registers for other purposes. Additional registers may be required for temporary results or for implementing non-destructive operations, and loop unrolling also significantly increases register consumption (see the sketch below). So, while I'm not of a very high opinion of MSVC, it may have had valid reasons to spill registers.
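
For instance, a hypothetical sketch of how unrolling inflates register pressure - the rolled loop keeps 2 vector values live, while the 4x-unrolled version keeps 8:

    #include <stddef.h>
    #include <immintrin.h>

    // 4x unrolled sum: four accumulators plus four loaded values are live
    // at the same time, even though the source only ever "sees" a couple
    // of variables per line. (Tail elements are omitted for brevity.)
    __m128i sum_unrolled(const __m128i *p, size_t n)
    {
        __m128i s0 = _mm_setzero_si128(), s1 = _mm_setzero_si128();
        __m128i s2 = _mm_setzero_si128(), s3 = _mm_setzero_si128();
        for (size_t i = 0; i + 4 <= n; i += 4) {
            s0 = _mm_add_epi8(s0, _mm_load_si128(p + i + 0));
            s1 = _mm_add_epi8(s1, _mm_load_si128(p + i + 1));
            s2 = _mm_add_epi8(s2, _mm_load_si128(p + i + 2));
            s3 = _mm_add_epi8(s3, _mm_load_si128(p + i + 3));
        }
        return _mm_add_epi8(_mm_add_epi8(s0, s1), _mm_add_epi8(s2, s3));
    }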

 

Alexander_L_1
Beginner

Many thanks for the very useful information, andysem.

An instruction may (but does not always) need one or two extra bytes to encode when more real registers are in use, but that can significantly reduce the total number of instructions executed and can additionally prevent register spilling, so the real speedup clearly outweighs the slightly bigger code (OK, the methods are small and fit into the I-cache) :)

About the MSVC compiler - I've seen the produced assembly code and done several benchmarks. That's why I decided to (re)write in assembly. See my other posts here; you will be surprised by the instruction order.

P.S. Is Andy your real name?

Alex

Todd_W_
Beginner

FWIW, my experience is that MSVC 2015.3 can be inefficient at register allocation, spilling 2-3 registers sooner than necessary. I've seen no measurable impact from instruction ordering or from below-threshold register use in my cases, but the slowdown is predictably precipitous when VC decides to spill two registers instead of fitting a loop into 15 registers like it should. I've also found that providing register hints tends to result in worse spilling rather than getting VC to understand it's being told the optimal register allocation. I'd build with another compiler that can figure out the obvious solution before dropping to assembly, but if it has to be VC - well, yeah.

Alexander_L_1
Beginner

Hi Todd,

many thanks for sharing your experience! I have exactly the same problem :( Even register hinting does not help. That's why I rewrite some important pieces of code in assembler (in the absence of better compilers).

Todd_W_
Beginner

You're welcome, mate.  Hopefully things'll improve some as Microsoft continues modernizing their compiler code base.  My expectations are low, though.
