I have yet to figure out the point of these instructions. I have tested them in many, many cases, mostly in attempts to replace the following structure:
[cpp]movq xmmreg1, memory1
movq xmmreg2, memory2
punpcklbw xmmreg1, zeroreg
punpcklbw xmmreg2, zeroreg
psubw xmmreg1, xmmreg2[/cpp]
with
[cpp]pmovzxbw xmmreg1, memory1
pmovzxbw xmmreg2, memory2
psubw xmmreg1, xmmreg2[/cpp]
However, it is almost universally slower or, at best, the same speed, despite being fewer instructions and despite mubench listing pmovzx as a 1-latency, 1-throughput instruction.
What is the intended use of this instruction if it is slower than movq/punpck? Or, if it's not supposed to be slower than movq/punpck, why might it be slower in these cases?
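For anyone following along in C rather than assembly, the two sequences correspond roughly to the following intrinsics (a minimal sketch for illustration, not code from the project; _mm_cvtepu8_epi16 is the intrinsic form of pmovzxbw):
[cpp]#include <emmintrin.h>  /* SSE2 */
#include <smmintrin.h>  /* SSE4.1 */

/* movq + punpcklbw + psubw: zero-extend 8 bytes to 8 words, then subtract */
__m128i diff_punpck(const void* p1, const void* p2)
{
    __m128i zero = _mm_setzero_si128();
    __m128i a = _mm_loadl_epi64((const __m128i*)p1);  /* movq */
    __m128i b = _mm_loadl_epi64((const __m128i*)p2);  /* movq */
    a = _mm_unpacklo_epi8(a, zero);                   /* punpcklbw */
    b = _mm_unpacklo_epi8(b, zero);                   /* punpcklbw */
    return _mm_sub_epi16(a, b);                       /* psubw */
}

/* pmovzxbw + psubw: the same result in fewer instructions */
__m128i diff_pmovzx(const void* p1, const void* p2)
{
    __m128i a = _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i*)p1)); /* pmovzxbw */
    __m128i b = _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i*)p2)); /* pmovzxbw */
    return _mm_sub_epi16(a, b);                                         /* psubw */
}[/cpp]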
I have tried it on both Penryn and Nehalem and it is no better on Nehalem either.
Posting a code snippet is rather complicated, as most of the code is in extremely heavily macro'd Yasm syntax. The basic macro is as follows, though:
[cpp]%macro LOAD_DIFF 5
movh %1, %4      ; load packed bytes (movd/movq)
punpcklbw %1, %3 ; zero-extend bytes to words (%3 is a zero register)
movh %2, %5
punpcklbw %2, %3
psubw %1, %2     ; word differences into %1
%endmacro[/cpp]
This macro takes three xmm regs and two memory arguments with byte data. It unpacks the byte input from the two memory arguments, takes the difference, and stores that in the first xmmreg argument. The third xmmreg is a zero register. "movh" is movd when in MMX mode and movq when in XMM mode. The macro is a bit more complicated than that in reality--mainly to deal with the fact that a zero register isn't always available.
I have tried replacing this with pmov (only in XMM mode, of course)--it never, ever helps in any function that uses this macro. The overall speed loss tends to be about 10-15 clocks for functions that call LOAD_DIFF 32 times. The issue doesn't seem to be one of code alignment, either.
If you want to look into this in more detail, this patch replaces the code above with the pmov equivalent, for the codebase available from here. The method of benchmarking is the built-in checkasm utility ("make checkasm"), which, using the --bench argument, measures the speed of all tested functions over thousands of repeated runs using RDTSC.
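For anyone without the codebase handy, a bare-bones version of this kind of RDTSC measurement looks something like the sketch below (my own illustration, far cruder than checkasm; it reuses the diff_pmovzx sketch from above as a stand-in for the function under test). A real harness needs to be much more careful about call overhead and outliers, which matters at the 10-15 clock scale discussed here.
[cpp]#include <stdint.h>
#include <stdio.h>
#include <emmintrin.h>
#include <x86intrin.h>  /* __rdtsc() on GCC/Clang; MSVC has it in <intrin.h> */

/* stand-in for the function under test */
extern __m128i diff_pmovzx(const void* p1, const void* p2);

int main(void)
{
    static unsigned char buf1[16], buf2[16];
    enum { RUNS = 100000 };
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < RUNS; i++)
        diff_pmovzx(buf1, buf2);
    uint64_t t1 = __rdtsc();
    printf("avg cycles per call: %.2f\n", (double)(t1 - t0) / RUNS);
    return 0;
}[/cpp]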
Is this really faster than punpck? I would think it would be at best the same speed as punpcklbw, and unlike punpck it would require an actual mask from memory rather than a zero register, which can be created with a pxor.
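For reference, the pshufb variant presumably under discussion does the zero-extension with a shuffle-control mask: control bytes with the high bit set make pshufb write a zero. A sketch in intrinsics (my own illustration):
[cpp]#include <tmmintrin.h>  /* SSSE3 */

/* Zero-extend the low 8 bytes of v to 8 words using pshufb. */
__m128i zext_bytes_to_words(__m128i v)
{
    /* control pattern 0,0x80,1,0x80,...: source byte i goes to the low
       half of word i, a forced zero (0x80 flag) to the high half */
    const __m128i mask = _mm_setr_epi8(0, -128, 1, -128, 2, -128, 3, -128,
                                       4, -128, 5, -128, 6, -128, 7, -128);
    return _mm_shuffle_epi8(v, mask);  /* pshufb */
}[/cpp]
In hand-written asm the mask can live in memory and be used directly as pshufb's memory operand, avoiding a dedicated register.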
My opinion is that pmovzxbw is probably not so different from unpacking against zero. On the other hand, pmovzxbd + cvtdq2ps can convert a dword (an rgba color) into floats pretty fast, with no extra regs needed. Generally I try to avoid pshufb, just because reserving one xmm reg for the mask can be expensive in tight situations.
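In intrinsics, that dword-to-floats conversion is just the following (a minimal sketch; _mm_cvtepu8_epi32 is pmovzxbd and _mm_cvtepi32_ps is cvtdq2ps):
[cpp]#include <smmintrin.h>  /* SSE4.1 */

/* Expand one packed RGBA dword into four floats; no zero or mask register needed. */
__m128 rgba_to_floats(unsigned rgba)
{
    __m128i v = _mm_cvtsi32_si128((int)rgba);  /* movd */
    v = _mm_cvtepu8_epi32(v);                  /* pmovzxbd: 4 bytes -> 4 dwords */
    return _mm_cvtepi32_ps(v);                 /* cvtdq2ps: 4 dwords -> 4 floats */
}[/cpp]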
That is why I use pshufb with a memory operand.
At least where I use it, I don't even need an extra register to use the unpacking method.
[cpp]%macro LOAD_DIFF 5
%ifidn %3, none      ; no zero register available
movh %1, %4
movh %2, %5
punpcklbw %1, %2     ; word i = %4[i] + 256*%5[i]
punpcklbw %2, %2     ; word i = %5[i] + 256*%5[i]
psubw %1, %2         ; word i = %4[i] - %5[i]
%else
movh %1, %4
punpcklbw %1, %3     ; zero-extend against the zero register
movh %2, %5
punpcklbw %2, %3
psubw %1, %2
%endif
%endmacro[/cpp]
This macro loads two sets of 8-bit values (%4, %5), unpacks them to registers (%1, %2), and subtracts to get word results, without ever using a temporary register for an unpack mask: after the interleaves, each word of %1 holds a + 256*b and each word of %2 holds b + 256*b, so the word subtraction leaves exactly a - b.
Slightly better performance is gained when using a third register, due to better dependency chains, which is why the alternate form of this macro (the second branch) is available.
Both methods are considerably faster than the three-instruction version (pmov, pmov, psub).
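The same zero-register-free trick in intrinsics, for anyone who wants to verify the arithmetic (my own sketch):
[cpp]#include <emmintrin.h>  /* SSE2 */

/* Word-sized difference of two 8-byte loads without a zero register:
   unpack a against b, unpack b against itself, subtract. */
__m128i load_diff_nozero(const void* pa, const void* pb)
{
    __m128i a = _mm_loadl_epi64((const __m128i*)pa);
    __m128i b = _mm_loadl_epi64((const __m128i*)pb);
    __m128i x = _mm_unpacklo_epi8(a, b); /* word i = a[i] + 256*b[i] */
    __m128i y = _mm_unpacklo_epi8(b, b); /* word i = b[i] + 256*b[i] */
    return _mm_sub_epi16(x, y);          /* word i = a[i] - b[i]     */
}[/cpp]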
[cpp]/* test function1 using MOVQ + PUNPCKLBW */
__int64 tst_punpck1_asm(const char* a, const char* b) /* hypothetical name, chosen to parallel tst_mvzx1_asm below */
{
__int64 xi;
_asm{
xor eax, eax
mov esi, a
pxor xmm7, xmm7
xor edx, edx
mov edi, b
movq xmm1, [esi+edx]   ; load 8 bytes of a
movq xmm2, [edi+edx]   ; load 8 bytes of b
punpcklbw xmm1, xmm7   ; zero-extend to words (xmm7 = 0)
punpcklbw xmm2, xmm7
psubw xmm1, xmm2       ; first 8 word differences
movq xmm3, [esi+edx+8]
movq xmm4, [edi+edx+8]
punpcklbw xmm3, xmm7
punpcklbw xmm4, xmm7
psubw xmm3, xmm4       ; second 8 word differences
packsswb xmm1, xmm3    ; saturate all 16 diffs back to signed bytes
movq xi, xmm1          ; return the low 8 packed results
}
return xi;
}
/* test function2 using PMOVZXBW */
__int64 tst_mvzx1_asm(const char* a, const char* b)
{
__int64 xi;
_asm{
xor eax, eax
mov esi, a
xor edx, edx
mov edi, b
pmovzxbw xmm1, [esi+edx]   ; load + zero-extend in one instruction
pmovzxbw xmm2, [edi+edx]
psubw xmm1, xmm2
pmovzxbw xmm3, [esi+edx+8]
pmovzxbw xmm4, [edi+edx+8]
psubw xmm3, xmm4
packsswb xmm1, xmm3
movq xi, xmm1
}
return xi;
}[/cpp]
---------------------------------------------------
[cpp]mov edi, ebx ;297.26
and edi, 7 ;297.26
add edi, ebp ;297.14
mov DWORD PTR [esp+24], edi ;299.10
mov esi, ebx ;298.39
and esi, 29 ;298.39
lea esi, DWORD PTR [ebp+esi+64] ;298.15
mov DWORD PTR [esp+28], esi ;299.10
; LOE ecx ebx ebp
$B4$6: ; Preds $B4$5
; Begin ASM
xor eax, eax ;299.10
mov esi, DWORD PTR [esp+24] ;299.10
pxor xmm7, xmm7 ;299.10
xor edx, edx ;299.10
mov edi, DWORD PTR [esp+28] ;299.10
movq xmm1, QWORD PTR [esi+edx] ;299.10
movq xmm2, QWORD PTR [edi+edx] ;299.10
punpcklbw xmm1, xmm7 ;299.10
punpcklbw xmm2, xmm7 ;299.10
psubw xmm1, xmm2 ;299.10
movq xmm3, QWORD PTR [esi+edx+8] ;299.10
movq xmm4, QWORD PTR [edi+edx+8] ;299.10
punpcklbw xmm3, xmm7 ;299.10
punpcklbw xmm4, xmm7 ;299.10
psubw xmm3, xmm4 ;299.10
packsswb xmm1, xmm3 ;299.10
movq QWORD PTR [esp+16], xmm1 ;299.10
; End ASM
; LOE ecx ebx ebp
$B4$7: ; Preds $B4$6
mov esi, DWORD PTR [esp+16] ;299.10
mov edx, DWORD PTR _inner_loop_cnt ;296.18
add DWORD PTR _accumulator, esi ;300.4
add ebx, 1 ;296.27
cmp ebx, edx ;296.18
jb $B4$5 ; Prob 82% ;296.18[/cpp]
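Reading the address arithmetic at the top of the listing, the surrounding benchmark loop appears to be roughly the following C (a speculative reconstruction from the listing; 'base', 'bench_loop', and the exact types are my assumptions, while accumulator and inner_loop_cnt appear in the listing itself):
[cpp]extern unsigned int inner_loop_cnt;
extern int accumulator;
extern __int64 tst_punpck1_asm(const char* a, const char* b);

void bench_loop(const char* base)
{
    for (unsigned int i = 0; i < inner_loop_cnt; i++) {
        const char* a = base + (i & 7);       /* and edi, 7          */
        const char* b = base + (i & 29) + 64; /* and esi, 29 / lea +64 */
        accumulator += (int)tst_punpck1_asm(a, b);
    }
}[/cpp]
Note in the listing how the result travels through memory ([esp+16] is written by the MOVQ at the end of the ASM block and immediately reloaded to feed the ADD): exactly the kind of caller-consumer dependency discussed below.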
Interesting thinking. However, if you need to go into so much analysis to fabricate a contrived example in which a new instruction may be half a cycle faster than the combination of two older ones, then in my opinion the silicon spent on that instruction could have been put to better use.
I do understand that there is a need for such an instruction from the instruction-set orthogonality standpoint, and it is good that it was added, but the developer's guide or the optimization manual should have clearly stated that we should not expect it to be faster than the code it replaces, so we do not waste our time implementing and testing something that isn't going to work better, at least on the current generation of hardware.
I think you misinterpreted the correlation between (a) my contrived test code (contrived because I don't know the caller/consumer code that prompted the original question, but I had to make something up for this experiment), and (b) how much perf benefit a given instruction could bring in a specific situation.
As the familiar idiom goes, your mileage will vary, depending on numerous factors. Amdahl's law will certainly be one of them, and each workload/test setup will introduce different variables and weighting factors.
For the specific situation I tested, the baseline with a working set that fits in L1 was already sustaining a retirement throughput of more than 3 micro-ops per cycle; the dominant performance bottleneck was a certain load operation (not the MOVQ preceding the PUNPCKLBW) being blocked. So it is actually not surprising, in hindsight, that replacing MOVQ+PUNPCKLBW with PMOVZXBW didn't change performance.
Even after I removed the accumulator portion of the test code (which is contrived because I don't know the functional requirements of the caller), it appears the dominant performance bottleneck is still some load operations (which may be outside the producer function1 or function2) being blocked. Nevertheless, the point was that the impact of using PMOVZXBW becomes visible, not that half a cycle is the most one can expect to gain from using PMOVZXBW in general. It is conceivable that PMOVSX may replace slightly more instructions in some situations, so it is reasonable to expect different weighting and a different impact.
As a side remark, the read-modify dependency issues between a caller and a hand-tuned producer function whose result it consumes can manifest in many other situations. It may be inadequate to assume that by dropping in a hand-tuned function (or macro), the perf gain of the hand-tuned producer will naturally propagate into the larger body of code.
On the other hand, I believe there is nothing to misinterpret here -- you haven't shown us a case where PMOVZX... is substantially faster. That alone is not a problem in my opinion; the much bigger problem is that however hard you try, you might not find such a case, because that instruction simply isn't performing better on the current hardware.
As I already said earlier, I tested the usefulness of PMOVZX... on January 21st, 2008 (the earliest date I could get my hands on a Penryn CPU here in Serbia), as I always do with all new and potentially interesting instructions, and in real code there was no improvement. As you just said yourself, there can't be any, because the bottleneck isn't the zero-extend operation, whichever way you perform it.
