I have yet to figure out the point of these instructions. I have tested them in many, many cases, mostly in attempts to replace the following structure:
[cpp]movq xmmreg1, memory1
movq xmmreg2, memory2
punpcklbw xmmreg1, zeroreg
punpcklbw xmmreg2, zeroreg
psubw xmmreg1, xmmreg2[/cpp]
with
[cpp]pmovzxbw xmmreg1, memory1
pmovzxbw xmmreg2, memory2
psubw xmmreg1, xmmreg2[/cpp]
However, it is almost universally slower or, at best, the same speed, despite being fewer instructions and despite mubench listing pmovzx as a 1-latency, 1-throughput instruction.
What is the intended use of this instruction if it is slower than movq/punpck? Or, if it's not supposed to be slower than movq/punpck, why might it be slower in these cases?
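For anyone following along in C rather than assembly, the two sequences correspond roughly to the following intrinsics (a minimal sketch for illustration, not code from the project; _mm_cvtepu8_epi16 is the intrinsic form of pmovzxbw):
[cpp]#include <emmintrin.h>  /* SSE2 */
#include <smmintrin.h>  /* SSE4.1 */

/* movq + punpcklbw + psubw: zero-extend 8 bytes to 8 words, then subtract */
__m128i diff_punpck(const void* p1, const void* p2)
{
    __m128i zero = _mm_setzero_si128();
    __m128i a = _mm_loadl_epi64((const __m128i*)p1);  /* movq */
    __m128i b = _mm_loadl_epi64((const __m128i*)p2);  /* movq */
    a = _mm_unpacklo_epi8(a, zero);                   /* punpcklbw */
    b = _mm_unpacklo_epi8(b, zero);                   /* punpcklbw */
    return _mm_sub_epi16(a, b);                       /* psubw */
}

/* pmovzxbw + psubw: the same result in fewer instructions */
__m128i diff_pmovzx(const void* p1, const void* p2)
{
    __m128i a = _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i*)p1)); /* pmovzxbw */
    __m128i b = _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i*)p2)); /* pmovzxbw */
    return _mm_sub_epi16(a, b);                                         /* psubw */
}[/cpp]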
I have tried it on both Penryn and Nehalem and it is no better on Nehalem either.
Posting a code snippet is rather complicated, as most of the code is in extremely heavily macro'd Yasm syntax. The basic macro is as follows, though:
[cpp]%macro LOAD_DIFF 5
movh %1, %4      ; load packed bytes (movd/movq)
punpcklbw %1, %3 ; zero-extend bytes to words (%3 is a zero register)
movh %2, %5
punpcklbw %2, %3
psubw %1, %2     ; word differences into %1
%endmacro[/cpp]
This macro takes three xmm regs and two memory arguments with byte data. It unpacks the byte input from the two memory arguments, takes the difference, and stores that in the first xmmreg argument. The third xmmreg is a zero register. "movh" is movd when in MMX mode and movq when in XMM mode. The macro is a bit more complicated than that in reality--mainly to deal with the fact that a zero register isn't always available.
I have tried replacing this with pmov (only in XMM mode, of course)--it never, ever helps in any function that uses this macro. The overall speed loss tends to be about 10-15 clocks for functions that call LOAD_DIFF 32 times. The issue doesn't seem to be one of code alignment, either.
If you want to look into this in more detail, this patch replaces the code above with the pmov equivalent, for the codebase available from here. The method of benchmarking is the built-in checkasm utility ("make checkasm"), which, using the --bench argument, measures the speed of all tested functions over thousands of repeated runs using RDTSC.
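For anyone without the codebase handy, a bare-bones version of this kind of RDTSC measurement looks something like the sketch below (my own illustration, far cruder than checkasm; it reuses the diff_pmovzx sketch from above as a stand-in for the function under test). A real harness needs to be much more careful about call overhead and outliers, which matters at the 10-15 clock scale discussed here.
[cpp]#include <stdint.h>
#include <stdio.h>
#include <emmintrin.h>
#include <x86intrin.h>  /* __rdtsc() on GCC/Clang; MSVC has it in <intrin.h> */

/* stand-in for the function under test */
extern __m128i diff_pmovzx(const void* p1, const void* p2);

int main(void)
{
    static unsigned char buf1[16], buf2[16];
    enum { RUNS = 100000 };
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < RUNS; i++)
        diff_pmovzx(buf1, buf2);
    uint64_t t1 = __rdtsc();
    printf("avg cycles per call: %.2f\n", (double)(t1 - t0) / RUNS);
    return 0;
}[/cpp]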
Is this really faster than punpck? I would think it would be at best the same speed as punpcklbw, and unlike punpck it would require an actual mask from memory rather than a zero register, which can be created with a pxor.
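For reference, the pshufb variant presumably under discussion does the zero-extension with a shuffle-control mask: control bytes with the high bit set make pshufb write a zero. A sketch in intrinsics (my own illustration):
[cpp]#include <tmmintrin.h>  /* SSSE3 */

/* Zero-extend the low 8 bytes of v to 8 words using pshufb. */
__m128i zext_bytes_to_words(__m128i v)
{
    /* control pattern 0,0x80,1,0x80,...: source byte i goes to the low
       half of word i, a forced zero (0x80 flag) to the high half */
    const __m128i mask = _mm_setr_epi8(0, -128, 1, -128, 2, -128, 3, -128,
                                       4, -128, 5, -128, 6, -128, 7, -128);
    return _mm_shuffle_epi8(v, mask);  /* pshufb */
}[/cpp]
In hand-written asm the mask can live in memory and be used directly as pshufb's memory operand, avoiding a dedicated register.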
My opinion is that pmovzxbw is probably not so different from unpacking against zero. On the other hand, pmovzxbd + cvtdq2ps can convert a dword (an rgba color) into floats pretty fast, with no extra regs needed. Generally I try to avoid pshufb, just because reserving one xmm reg for the mask can be expensive in tight situations.
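In intrinsics, that dword-to-floats conversion is just the following (a minimal sketch; _mm_cvtepu8_epi32 is pmovzxbd and _mm_cvtepi32_ps is cvtdq2ps):
[cpp]#include <smmintrin.h>  /* SSE4.1 */

/* Expand one packed RGBA dword into four floats; no zero or mask register needed. */
__m128 rgba_to_floats(unsigned rgba)
{
    __m128i v = _mm_cvtsi32_si128((int)rgba);  /* movd */
    v = _mm_cvtepu8_epi32(v);                  /* pmovzxbd: 4 bytes -> 4 dwords */
    return _mm_cvtepi32_ps(v);                 /* cvtdq2ps: 4 dwords -> 4 floats */
}[/cpp]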
That is why I use pshufb with a memory operand.
At least where I use it, I don't even need an extra register to use the unpacking method.
[cpp]%macro LOAD_DIFF 5
%ifidn %3, none      ; no zero register available
movh %1, %4
movh %2, %5
punpcklbw %1, %2     ; word i = %4[i] + 256*%5[i]
punpcklbw %2, %2     ; word i = %5[i] + 256*%5[i]
psubw %1, %2         ; word i = %4[i] - %5[i]
%else
movh %1, %4
punpcklbw %1, %3     ; zero-extend against the zero register
movh %2, %5
punpcklbw %2, %3
psubw %1, %2
%endif
%endmacro[/cpp]
This macro loads two sets of 8-bit values (%4, %5), unpacks them to registers (%1, %2), and subtracts to get word results, without ever using a temporary register for an unpack mask: after the interleaves, each word of %1 holds a + 256*b and each word of %2 holds b + 256*b, so the word subtraction leaves exactly a - b.
Slightly better performance is gained when using a third register, due to better dependency chains, which is why the alternate form of this macro (the second branch) is available.
Both methods are considerably faster than the three-instruction version (pmov, pmov, psub).
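The same zero-register-free trick in intrinsics, for anyone who wants to verify the arithmetic (my own sketch):
[cpp]#include <emmintrin.h>  /* SSE2 */

/* Word-sized difference of two 8-byte loads without a zero register:
   unpack a against b, unpack b against itself, subtract. */
__m128i load_diff_nozero(const void* pa, const void* pb)
{
    __m128i a = _mm_loadl_epi64((const __m128i*)pa);
    __m128i b = _mm_loadl_epi64((const __m128i*)pb);
    __m128i x = _mm_unpacklo_epi8(a, b); /* word i = a[i] + 256*b[i] */
    __m128i y = _mm_unpacklo_epi8(b, b); /* word i = b[i] + 256*b[i] */
    return _mm_sub_epi16(x, y);          /* word i = a[i] - b[i]     */
}[/cpp]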
[cpp]/* test function1 using MOVQ + PUNPCKLBW */
__int64 tst_punpck1_asm(const char* a, const char* b) /* hypothetical name, chosen to parallel tst_mvzx1_asm below */
{
__int64 xi;
_asm{
xor eax, eax
mov esi, a
pxor xmm7, xmm7
xor edx, edx
mov edi, b
movq xmm1, [esi+edx]   ; load 8 bytes of a
movq xmm2, [edi+edx]   ; load 8 bytes of b
punpcklbw xmm1, xmm7   ; zero-extend to words (xmm7 = 0)
punpcklbw xmm2, xmm7
psubw xmm1, xmm2       ; first 8 word differences
movq xmm3, [esi+edx+8]
movq xmm4, [edi+edx+8]
punpcklbw xmm3, xmm7
punpcklbw xmm4, xmm7
psubw xmm3, xmm4       ; second 8 word differences
packsswb xmm1, xmm3    ; saturate all 16 diffs back to signed bytes
movq xi, xmm1          ; return the low 8 packed results
}
return xi;
}
/* test function2 using PMOVZXBW */
__int64 tst_mvzx1_asm(const char* a, const char* b)
{
__int64 xi;
_asm{
xor eax, eax
mov esi, a
xor edx, edx
mov edi, b
pmovzxbw xmm1, [esi+edx]   ; load + zero-extend in one instruction
pmovzxbw xmm2, [edi+edx]
psubw xmm1, xmm2
pmovzxbw xmm3, [esi+edx+8]
pmovzxbw xmm4, [edi+edx+8]
psubw xmm3, xmm4
packsswb xmm1, xmm3
movq xi, xmm1
}
return xi;
}[/cpp]
---------------------------------------------------
[cpp]mov edi, ebx ;297.26
and edi, 7 ;297.26
add edi, ebp ;297.14
mov DWORD PTR [esp+24], edi ;299.10
mov esi, ebx ;298.39
and esi, 29 ;298.39
lea esi, DWORD PTR [ebp+esi+64] ;298.15
mov DWORD PTR [esp+28], esi ;299.10
; LOE ecx ebx ebp
$B4$6: ; Preds $B4$5
; Begin ASM
xor eax, eax ;299.10
mov esi, DWORD PTR [esp+24] ;299.10
pxor xmm7, xmm7 ;299.10
xor edx, edx ;299.10
mov edi, DWORD PTR [esp+28] ;299.10
movq xmm1, QWORD PTR [esi+edx] ;299.10
movq xmm2, QWORD PTR [edi+edx] ;299.10
punpcklbw xmm1, xmm7 ;299.10
punpcklbw xmm2, xmm7 ;299.10
psubw xmm1, xmm2 ;299.10
movq xmm3, QWORD PTR [esi+edx+8] ;299.10
movq xmm4, QWORD PTR [edi+edx+8] ;299.10
punpcklbw xmm3, xmm7 ;299.10
punpcklbw xmm4, xmm7 ;299.10
psubw xmm3, xmm4 ;299.10
packsswb xmm1, xmm3 ;299.10
movq QWORD PTR [esp+16], xmm1 ;299.10
; End ASM
; LOE ecx ebx ebp
$B4$7: ; Preds $B4$6
mov esi, DWORD PTR [esp+16] ;299.10
mov edx, DWORD PTR _inner_loop_cnt ;296.18
add DWORD PTR _accumulator, esi ;300.4
add ebx, 1 ;296.27
cmp ebx, edx ;296.18
jb $B4$5 ; Prob 82% ;296.18[/cpp]
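Reading the address arithmetic at the top of the listing, the surrounding benchmark loop appears to be roughly the following C (a speculative reconstruction from the listing; 'base', 'bench_loop', and the exact types are my assumptions, while accumulator and inner_loop_cnt appear in the listing itself):
[cpp]extern unsigned int inner_loop_cnt;
extern int accumulator;
extern __int64 tst_punpck1_asm(const char* a, const char* b);

void bench_loop(const char* base)
{
    for (unsigned int i = 0; i < inner_loop_cnt; i++) {
        const char* a = base + (i & 7);       /* and edi, 7          */
        const char* b = base + (i & 29) + 64; /* and esi, 29 / lea +64 */
        accumulator += (int)tst_punpck1_asm(a, b);
    }
}[/cpp]
Note in the listing how the result travels through memory ([esp+16] is written by the MOVQ at the end of the ASM block and immediately reloaded to feed the ADD): exactly the kind of caller-consumer dependency discussed below.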
Interesting thinking. However, if you need to go into so much analysis to fabricate a contrived example in which a new instruction may be half a cycle faster than the combination of two older ones, then in my opinion the silicon spent on that instruction could have been put to better use.
I do understand that there is a need for such an instruction from the instruction-set orthogonality standpoint, and it is good that it was added, but the developer's guide or the optimization manual should have clearly stated that we should not expect it to be faster than the code it replaces, so we do not waste our time implementing and testing something that isn't going to work better, at least on the current generation of hardware.
I think you misinterpreted the correlation between (a) my contrived test code (contrived because I don't know the caller/consumer code that prompted the original question, but I had to make something up for this experiment), and (b) how much perf benefit a given instruction could bring in a specific situation.
As the familiar idiom goes, your mileage will vary, depending on numerous factors. Amdahl's law will certainly be one of them, and each workload/test setup will introduce different variables and weighting factors.
For the specific situation I tested, the baseline with a working set that fits in L1 was already sustaining a retirement throughput of more than 3 micro-ops per cycle; the dominant performance bottleneck was a certain load operation (not the MOVQ preceding the PUNPCKLBW) being blocked. So it is actually not surprising, in hindsight, that replacing MOVQ+PUNPCKLBW with PMOVZXBW didn't change performance.
Even after I removed the accumulator portion of the test code (which is contrived because I don't know the functional requirements of the caller), it appears the dominant performance bottleneck is still some load operations (which may be outside the producer function1 or function2) being blocked. Nevertheless, the point was that the impact of using PMOVZXBW becomes visible, not that half a cycle is the most one can expect to gain from using PMOVZXBW in general. It is conceivable that PMOVSX may replace slightly more instructions in some situations, so it is reasonable to expect different weighting and a different impact.
As a side remark, the read-modify dependency issues between a caller and a hand-tuned producer function whose result it consumes can manifest in many other situations. It may be inadequate to assume that by dropping in a hand-tuned function (or macro), the perf gain of the hand-tuned producer will naturally propagate into the larger body of code.
On the other hand, I believe there is nothing to misinterpret here -- you haven't shown us a case where PMOVZX... is substantially faster. That alone is not a problem in my opinion; the much bigger problem is that however hard you try, you might not find such a case, because that instruction simply isn't performing better on the current hardware.
As I already said earlier, I tested the usefulness of PMOVZX... on January 21st, 2008 (the earliest date I could get my hands on a Penryn CPU here in Serbia), as I always do with all new and potentially interesting instructions, and in real code there was no improvement. As you just said yourself, there can't be any, because the bottleneck isn't the zero-extend operation, whichever way you perform it.
