PTEST improvement?

Matthias_Kretz · ‎11-24-2009

Hi,

another benchmark: after testing compare performance, my next step was to look at branching on compare results, so I wanted to show the impact of ptest versus pmovmskb followed by cmp. But my results show that ptest is slower in almost all cases. See the first page of compare.pdf for the results. I would understand ptest and pmovmskb showing the same speed if both instructions count as being in the "integer domain" and therefore both incur the same 1-cycle penalty for domain crossing (is this correct?).

Do I understand correctly that, in principle, ptest and pmovmskb execute equally fast, and that the cmp-jump can be optimized via macro-fusion, so that both vector-branching implementations really are equivalent (except for the one additional GPR that the pmovmskb version requires)? Where, then, could the difference come from?

(Yes, I will have to try out the simulator. I have not found the time to try it yet.)

For reference:
(float_v::operator<).isFull():

with ptest:
cmpltps %xmm1,%xmm0
ptest %xmm2,%xmm0
jae
(where xmm2 has all bits set, 0xfffff...)

without ptest:
cmpltps %xmm1,%xmm0
pmovmskb %xmm0,%ecx
cmp $0xffff,%ecx
je

!(float_v::operator<).isEmpty():

with ptest:
cmpltps %xmm1,%xmm3
ptest %xmm3,%xmm3
je

without ptest:
cmpltps %xmm1,%xmm3
pmovmskb %xmm3,%ecx
test %ecx,%ecx
je
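
For readers who prefer intrinsics to raw listings, the four sequences above can be sketched roughly as follows. This is only an illustration: the function names are hypothetical (they are not taken from float_v), and the `SKETCH_SSE41` macro is an assumption used so that GCC and Clang accept the SSE4.1 PTEST intrinsics without `-msse4.1`. A compiler is free to emit different instructions than the ones shown in the listings.

```cpp
#include <xmmintrin.h>  // SSE: _mm_cmplt_ps
#include <emmintrin.h>  // SSE2: _mm_castps_si128, _mm_movemask_epi8, _mm_set1_epi32
#include <smmintrin.h>  // SSE4.1: _mm_testc_si128, _mm_testz_si128

// Hypothetical helper macro: enables SSE4.1 per function on GCC/Clang.
#if defined(__GNUC__)
#define SKETCH_SSE41 __attribute__((target("sse4.1")))
#else
#define SKETCH_SSE41
#endif

// isFull with PTEST: CF is set when (~mask & ones) == 0, i.e. all bits of mask are 1.
SKETCH_SSE41 inline bool is_full_ptest(__m128 a, __m128 b) {
    const __m128i mask = _mm_castps_si128(_mm_cmplt_ps(a, b));
    const __m128i ones = _mm_set1_epi32(-1);
    return _mm_testc_si128(mask, ones) != 0;
}

// isEmpty with PTEST: ZF is set when (mask & mask) == 0, i.e. no bit of mask is 1.
SKETCH_SSE41 inline bool is_empty_ptest(__m128 a, __m128 b) {
    const __m128i mask = _mm_castps_si128(_mm_cmplt_ps(a, b));
    return _mm_testz_si128(mask, mask) != 0;
}

// isFull with PMOVMSKB: one bit per byte, so a full mask is 0xffff.
inline bool is_full_movmskb(__m128 a, __m128 b) {
    const __m128i mask = _mm_castps_si128(_mm_cmplt_ps(a, b));
    return _mm_movemask_epi8(mask) == 0xffff;
}

// isEmpty with PMOVMSKB: an empty mask is 0.
inline bool is_empty_movmskb(__m128 a, __m128 b) {
    const __m128i mask = _mm_castps_si128(_mm_cmplt_ps(a, b));
    return _mm_movemask_epi8(mask) == 0;
}
```

Both variants of each predicate return the same result for any input; the only difference is which instruction sequence the compiler is nudged toward.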

Max_L · ‎12-01-2009

Hello Matthias,

Please use MOVMSKPS instead of PMOVMSKB, even though it will not change your current performance results, as both instructions are implemented very similarly in the current generation of processors.
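
One practical consequence of that switch is that the constant in the comparison changes: MOVMSKPS yields one bit per 32-bit lane (4 bits), whereas PMOVMSKB yields one bit per byte (16 bits). A minimal sketch with intrinsics, where the helper names are hypothetical:

```cpp
#include <xmmintrin.h>  // SSE: _mm_cmplt_ps, _mm_movemask_ps
#include <emmintrin.h>  // SSE2: _mm_castps_si128, _mm_movemask_epi8

// MOVMSKPS: sign bit of each float lane, so a full mask is 0xf.
inline int lane_mask_ps(__m128 a, __m128 b) {
    return _mm_movemask_ps(_mm_cmplt_ps(a, b));
}

// PMOVMSKB: top bit of each byte, so a full mask is 0xffff.
inline int byte_mask_epi8(__m128 a, __m128 b) {
    return _mm_movemask_epi8(_mm_castps_si128(_mm_cmplt_ps(a, b)));
}

// isFull written against the 4-bit MOVMSKPS mask.
inline bool is_full_movmskps(__m128 a, __m128 b) {
    return lane_mask_ps(a, b) == 0xf;
}
```

So the `cmp $0xffff,%ecx` in the original listing would become a compare against `$0xf` once MOVMSKPS is used.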

In short, your measurements are correct for the code provided. PTEST is decoded into 2 uops and has a 3-cycle latency when it receives its sources from an FP instruction (like CMPLTPS in your case); in addition, an integer instruction that receives its input (flags) from PTEST has +1 cycle added to its latency. So the sequence CMPLTPS->PTEST->Jcc has 1+3+1+1=6 cycles of latency and 4 uops.

MOVMSKPS, MOVMSKPD and PMOVMSKB are 1-uop, 1-cycle-latency instructions when they receive their sources from an FP instruction, and an integer instruction that receives its source from them has +2 cycles added to its latency. So the sequence CMPLTPS->MOVMSKPS->CMP/Jcc (fused) has 1+1+2+1=5 cycles of latency and 3 uops.
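
The two totals can be put side by side in a small compile-time tally. This is just the arithmetic from the figures quoted above; the struct and constant names are hypothetical, and only the terms named in the text are labeled in the comments.

```cpp
// Cost of one compare-and-branch chain, using the figures quoted in the post.
struct ChainCost {
    int latency_cycles;
    int uops;
};

// CMPLTPS -> PTEST -> Jcc:
// the 3 is PTEST fed by an FP instruction, the following 1 is the
// extra cycle for the integer instruction consuming PTEST's flags.
constexpr ChainCost kPtestChain{1 + 3 + 1 + 1, 4};

// CMPLTPS -> MOVMSKPS -> CMP/Jcc (fused):
// the second 1 is MOVMSKPS fed by an FP instruction, the 2 is the
// extra latency for the integer instruction consuming its result.
constexpr ChainCost kMovmskChain{1 + 1 + 2 + 1, 3};

static_assert(kPtestChain.latency_cycles == 6, "PTEST chain: 6 cycles");
static_assert(kMovmskChain.latency_cycles == 5, "MOVMSKPS chain: 5 cycles");
```

By these numbers the PTEST chain is one cycle longer and one uop larger than the MOVMSKPS chain.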

Results may also be affected by how you are testing. If you just have this tiny piece of code in a loop, you may be penalized by front-end/decoding bottlenecks that do not normally appear when compiled code executes. So, to make the measurements more realistic, please unroll the loops (and make sure your branches are not taken). And then, of course, latency and throughput measurements of what is "faster" differ, depending on how the dependency chains are formed.
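
As a rough illustration of the unrolling advice, here is a sketch of a loop body with four compare-and-branch sequences per iteration. The function name and the unroll factor of 4 are assumptions made for the example, not something taken from the benchmark in question. Each early return is a branch that stays not-taken as long as no vector is "full", so feeding it data where a < b never holds for all four lanes keeps every branch on the fall-through path. Whether the compiler emits exactly MOVMSKPS plus a fused CMP/Jcc should be checked in the generated assembly.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE: _mm_loadu_ps, _mm_cmplt_ps, _mm_movemask_ps

// Returns the index of the first 4-float vector for which a < b holds in
// all lanes, or n_vectors if there is none. Manually unrolled by 4.
inline std::size_t find_first_full_unrolled4(const float* a, const float* b,
                                             std::size_t n_vectors) {
    // True when all four lanes of x are less than the matching lanes of y.
    auto full = [](const float* x, const float* y) {
        return _mm_movemask_ps(_mm_cmplt_ps(_mm_loadu_ps(x), _mm_loadu_ps(y))) == 0xf;
    };

    std::size_t i = 0;
    for (; i + 4 <= n_vectors; i += 4) {
        if (full(a + 4 * i, b + 4 * i)) return i;
        if (full(a + 4 * (i + 1), b + 4 * (i + 1))) return i + 1;
        if (full(a + 4 * (i + 2), b + 4 * (i + 2))) return i + 2;
        if (full(a + 4 * (i + 3), b + 4 * (i + 3))) return i + 3;
    }
    // Remainder: fewer than four vectors left.
    for (; i < n_vectors; ++i) {
        if (full(a + 4 * i, b + 4 * i)) return i;
    }
    return n_vectors;
}
```

Timing such a function over a large buffer mostly measures throughput; a latency measurement would instead need each compare to depend on the previous result.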

Hope this explains it,

-Max