; Instruction                                                            Hits (%)
PXOR    xmm3, xmm3               ; xmm3 = 0                              0 (0%)
PXOR    xmm2, xmm2               ; xmm2 = 0                              4 (0.01%)
MOVDQU  xmm0, XMMWORD PTR[eax]   ; Get next 5 pixels                     1 (0%)
PCMPGTB xmm3, xmm0               ; 0 > xmm0? (is val byte negative)      2752 (6.74%)
PCMPGTB xmm2, xmm1               ; 0 > xmm1? (is thresh byte negative)   103 (0.25%)
MOVDQA  xmm4, xmm1               ; Copy thresholds                       1 (0%)
PCMPGTB xmm4, xmm0               ; Test it! Did we exceed threshold?     2 (0%)
PXOR    xmm4, xmm7               ;                                       437 (1.07%)

Two things seem odd to me:
These are just my observations and suggestions.
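Sampling methods tend to have an artifact known as "skidding": the address recorded for a sample often lands one or more instructions after the one that actually caused the event. So the hits charged to the PCMPGTB right after the MOVDQU may really belong to the load, and it is likely that unaligned loads are constraining the performance you observe.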
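To get a feel for how often an unaligned 16-byte load could straddle two cache lines, here is a minimal sketch in C. The 64-byte line size matches Intel Core microarchitecture; the function names and the 15-byte stride in the usage note are my own assumptions for illustration, not something taken from your code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if a load of `width` bytes starting at `addr` touches two
   different cache lines of `line_size` bytes, 0 otherwise.
   `line_size` must be a power of two. Hypothetical helper. */
int load_splits_line(uintptr_t addr, size_t width, size_t line_size)
{
    assert(width > 0 && line_size > 0 && (line_size & (line_size - 1)) == 0);
    uintptr_t first = addr & ~(uintptr_t)(line_size - 1);               /* line of first byte */
    uintptr_t last  = (addr + width - 1) & ~(uintptr_t)(line_size - 1); /* line of last byte  */
    return first != last;
}

/* Counts how many of `n` consecutive loads split a cache line when the
   pointer advances by `stride` bytes per iteration. Hypothetical helper. */
size_t count_split_loads(uintptr_t base, size_t stride, size_t n,
                         size_t width, size_t line_size)
{
    size_t splits = 0;
    for (size_t i = 0; i < n; ++i)
        splits += (size_t)load_splits_line(base + i * stride, width, line_size);
    return splits;
}
```

For example, if the pointer advanced by 15 bytes per iteration (an assumed stride, say 5 pixels of 3 bytes each), `count_split_loads(0, 15, 64, 16, 64)` reports that 15 of every 64 loads split a line, while a 16-byte stride from an aligned base gives none.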
The fragment you've shown has one load, one register move and five SIMD ALU operations, with a relatively short dependency chain. I would expect you could verify its performance on a four-wide Intel Core microarchitecture machine with some simple measurements using performance counters, in terms of CPI. The performance counters can also tell you how often the MOVDQU loads are split, or whether your workload is instead limited by cache locality or other issues. You can take a look at Appendix B of the software manual at http://developer.intel.com/products/processor/manuals/index.htm
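As a rough sketch of the CPI check, the arithmetic looks like this in C. The helper names are hypothetical, and the cycle and instruction counts would have to come from whatever counter tool you use; a four-wide machine has a best-case CPI of 0.25, so the ratio tells you how far the loop is from that bound.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles per instruction from two raw counter readings
   (unhalted core cycles and instructions retired). */
double cpi(uint64_t cycles, uint64_t instructions)
{
    assert(instructions > 0);
    return (double)cycles / (double)instructions;
}

/* Best-case CPI for a machine that can retire `width` instructions
   per cycle, e.g. 0.25 for a four-wide core. */
double ideal_cpi(unsigned width)
{
    assert(width > 0);
    return 1.0 / (double)width;
}

/* How many times slower than the best case the measured code runs. */
double slowdown_vs_ideal(uint64_t cycles, uint64_t instructions, unsigned width)
{
    return cpi(cycles, instructions) / ideal_cpi(width);
}
```

A measured CPI close to 0.25 would suggest the loop is already near the issue limit. A CPI of 1.0 or more on the same code (a slowdown of 4x or worse against the bound) points at stalls such as split loads or cache misses, which the other counters can then narrow down.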