I've been working on a vision processing system written in C++. It works great, but I wasn't pleased with the CPU usage, so I began running analysis tools on the program to see why it was taking up 20% of the CPU time. I found one segment of code that the compiler refused to optimize (and I understand why -- it would have to make an assumption that would be unsafe to make, though I know it to be true). So I decided to write the offending function in assembly and optimize it myself with SSE (I'm talking about processing a ton of pixels, so SSE offers the capability I'm looking for).
It worked great, and brought my CPU usage from 20% down to about 4% idle, 10% active, which I consider acceptable for my application, but I'd like to go further. So I analyzed it again. I was originally concerned about the "MOVDQU" instruction I was using, because I know I could take a severe penalty for it, so I expected that to show up. However, my results suggested otherwise. Here is a block of the code:
                                                                          ; Hits (%)
PXOR    xmm3, xmm3                 ; xmm3 = 0                             ; 0 (0%)
PXOR    xmm2, xmm2                 ; xmm2 = 0                             ; 4 (0.01%)
MOVDQU  xmm0, XMMWORD PTR[eax]     ; Get next 5 pixels                    ; 1 (0%)
PCMPGTB xmm3, xmm0                 ; 0 > xmm0? (is val byte negative)     ; 2752 (6.74%)
PCMPGTB xmm2, xmm1                 ; 0 > xmm1? (is thresh byte negative)  ; 103 (0.25%)
MOVDQA  xmm4, xmm1                 ; Copy thresholds                      ; 1 (0%)
PCMPGTB xmm4, xmm0                 ; Test it! Did we exceed threshold?    ; 2 (0%)
PXOR    xmm4, xmm7                 ;                                      ; 437 (1.07%)
Two things seem odd to me: Firstly, the MOVDQU instruction doesn't have much (or any) penalty like I thought it would. Secondly, the instruction immediately after the MOVDQU instruction is taking a significant portion of time. To me, that means one of two things: Either PCMPGTB is more expensive than I thought, or the penalty that MOVDQU is incurring appears to the analysis tool to be the next instruction. I believe it is the latter because the other PCMPGTB instructions don't take nearly as much time.
I'm using a timer-based sampling system, so the program counter is sampled every 0.5ms. That means these results are subject to statistical randomness, but I think a difference as large as the one shown above should suggest something.
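As a rough sanity check on the noise question, the hit counts can be compared against a simple binomial model. The sketch below assumes each sample is independent and lands on a given instruction with a fixed probability, which timer-based sampling only approximates; the function names are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Estimate the total number of samples in the profile from one line's
// hit count and the fraction the tool reported for it.
inline double estimate_total_samples(double hits, double fraction) {
    assert(fraction > 0.0 && fraction <= 1.0);
    return hits / fraction;
}

// Expected scatter (one standard deviation) of a line's hit count,
// ASSUMING independent samples, i.e. a binomial distribution with
// `total_samples` trials and per-sample probability `p`.
inline double binomial_stddev(double total_samples, double p) {
    assert(p >= 0.0 && p <= 1.0);
    return std::sqrt(total_samples * p * (1.0 - p));
}
```

With the numbers from the listing, estimate_total_samples(2752, 0.0674) comes out near 40,800 samples, and the expected scatter on the 2752-hit line is on the order of 50 hits. Under this model the gap between 2752 and 103 hits is far larger than sampling noise.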
Am I right in thinking that MOVDQU is the offender? If not, what is it about that first PCMPGTB instruction that is causing such a huge penalty? (I understand you may not be able to tell me the answer to the second question, but any suggestions would be appreciated.)
Thanks for any suggestions,
-- Matthew P. Del Buono
These are just my observations and suggestions.
Sampling methods tend to have an artifact known as "skidding" that limits their ability to pinpoint the correct instruction address: a sample is often attributed to an instruction shortly after the one that actually stalled. So it is likely that the unaligned loads are what constrain the performance you observe.
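One way to get a feel for how often an unaligned 16-byte load is likely to be expensive is to check how many of the loads straddle a cache-line boundary. This is only a sketch: the 64-byte line size is an assumption (typical for x86, but check your CPU), and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// ASSUMPTION: 64-byte cache lines. Adjust for the actual target CPU.
constexpr std::size_t kCacheLineBytes = 64;

// True if `addr` is 16-byte aligned, i.e. MOVDQA could be used instead of MOVDQU.
inline bool is_aligned16(std::uintptr_t addr) {
    return addr % 16 == 0;
}

// True if a load of `size` bytes starting at `addr` touches two cache lines.
inline bool crosses_cache_line(std::uintptr_t addr, std::size_t size = 16,
                               std::size_t line = kCacheLineBytes) {
    assert(line != 0);
    return (addr % line) + size > line;
}

// Count the 16-byte loads that split a cache line when walking `bytes`
// bytes of data starting at `base`, advancing `stride` bytes per load.
inline std::size_t count_split_loads(std::uintptr_t base, std::size_t bytes,
                                     std::size_t stride = 16) {
    assert(stride != 0);
    std::size_t splits = 0;
    for (std::size_t off = 0; off + 16 <= bytes; off += stride) {
        if (crosses_cache_line(base + off)) {
            ++splits;
        }
    }
    return splits;
}
```

For example, with a buffer that starts 8 bytes past a line boundary and a 16-byte stride, count_split_loads(8, 128) reports 2 of the 8 loads as splits, i.e. one in four. A 16-byte-aligned buffer with the same stride never splits.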
The fragment you've shown has 1 load, one register move, and 5 SIMD ALU operations, with a relatively short dependency chain. I would expect you could verify its performance on a four-wide Intel Core microarchitecture machine with some simple measurements using performance counters, in terms of CPI. The performance counters can also tell you how many of the MOVDQU loads are hitting splits, or whether your workload has to deal with cache locality or other issues. Take a look at Appendix B of the software manual at http://developer.intel.com/products/processor/manuals/index.htm
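For reference, CPI is just unhalted cycles divided by instructions retired, and a four-wide machine can at best approach 0.25. A minimal sketch of the arithmetic follows; the function names are hypothetical, and how you actually read the two counters depends on your profiling tool.

```cpp
#include <cassert>
#include <cstdint>

// Cycles per instruction from two raw counter readings taken over the
// same interval. Returns 0.0 if no instructions were retired.
inline double cycles_per_instruction(std::uint64_t cycles,
                                     std::uint64_t instructions_retired) {
    if (instructions_retired == 0) {
        return 0.0;
    }
    return static_cast<double>(cycles) /
           static_cast<double>(instructions_retired);
}

// Best-case CPI for a core that can retire `issue_width` instructions
// per cycle (4 for the four-wide machine mentioned above).
inline double ideal_cpi(unsigned issue_width) {
    assert(issue_width != 0);
    return 1.0 / static_cast<double>(issue_width);
}
```

If the measured CPI for the loop is several times ideal_cpi(4), the code is stalling on something (split loads, cache misses, or similar) rather than being limited by the number of instructions.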