The processor I'm using is an Intel Core Duo for Centrino, running at 1.67 GHz.
It has 3 ALUs in each core, so I'm running a test to see how well it executes instructions in parallel.
Below, the macro testmac is the set of instructions to be executed in parallel, and testmac1000 in the middle of the code is another macro that contains 1000 copies of testmac.
When I run the following code (on WinXP, with the process at realtime priority), the time it takes to finish the loop varies between 0.203 and 0.219 seconds (minimum 1.13 cycles/loop).
The strange thing is, when I remove instruction 3, the time varies between 0.187 and 0.204 seconds (minimum 1.04 cycles/loop),
and when I remove instruction 2 as well, the time varies between 0.172 and 0.188 seconds (minimum 0.957 cycles/loop).
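As a sanity check on the figures above, here is a minimal Python sketch of the time-to-cycles conversion. The count of 300,000,000 testmac executions is not stated in the post; it is an assumption inferred from the reported times and cycle counts. The 1.67 GHz clock is taken at face value, and the function name is made up for illustration.

```python
CLOCK_HZ = 1.67e9               # clock speed as stated in the post
ASSUMED_ITERATIONS = 300_000_000  # assumption: inferred from the numbers, not stated


def cycles_per_iteration(seconds, clock_hz=CLOCK_HZ, iterations=ASSUMED_ITERATIONS):
    """Convert a wall-clock time for the whole run into cycles per testmac."""
    return seconds * clock_hz / iterations


# Minimum times reported for 3, 2 and 1 instructions per testmac.
for label, seconds in [("3 adds", 0.203), ("2 adds", 0.187), ("1 add", 0.172)]:
    print(f"{label}: {cycles_per_iteration(seconds):.3f} cycles per testmac")

# Cycles per testmac that correspond to a timing step of 1/64 s.
print(f"1/64 s step: {cycles_per_iteration(1 / 64):.3f} cycles per testmac")
```

The reported times all fall close to multiples of 1/64 s (about 15.6 ms), which may point to a coarse timer. If that is the case, one timer step is worth a bit under 0.09 cycles per testmac under the same assumptions, which is about the size of the differences in question.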
Why should there be a difference?
The core has 4 decoders, so decoding shouldn't be the factor causing the extra latency, right?
It can also retire up to 4 instructions per cycle, and my code only requires it to retire 3 per cycle, so retirement can't be the problem either, right?
So what is causing the extra latency? Is it due to some instruction/decoding caching mechanism?
add eax, 1 ;instruction 1
add ebx, 1 ;instruction 2
add edx, 1 ;instruction 3
testmac1000 ; this is just a macro containing 1000 testmac
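The expectation behind the question can be written down as a rough lower bound. This is only a sketch under stated assumptions: 3 ALUs as the post says, a 1-cycle latency for add, each add depending on the same register's add in the previous testmac, and no loop overhead or front-end limits. The function name is hypothetical.

```python
def min_cycles_per_testmac(independent_adds, alus=3, add_latency=1):
    """Rough lower bound on cycles per testmac under the assumptions above."""
    # Throughput limit: how fast the ALUs can issue this many adds.
    throughput_bound = independent_adds / alus
    # Latency limit: each register forms one dependency chain across
    # consecutive testmac copies, so a chain advances one add per latency.
    latency_bound = add_latency
    return max(throughput_bound, latency_bound)


for adds in (3, 2, 1):
    print(adds, "adds:", min_cycles_per_testmac(adds), "cycles per testmac")
```

Under these assumptions the bound is 1 cycle per testmac in all three cases, so this simple model does not predict any difference between them.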
Maybe branch prediction? Can you test the following rewrite?
On AMD, branch prediction works the following way by default: it assumes that the branch will not be taken. If the branch is taken, even once, the processor will then try to fetch and execute the code located at @@loop and the code located at @@exit at the same time. So I always reorder my loops to have a conditional jump for the exit path (which is rarely taken) and an unconditional jump for looping. I don't know if Intel processors use the same technique, but it's easy to test.
Branch prediction, of course, is an issue only for the first 2 or 3 times and the last time through the loop; with loop counts of 1000 and no changes in path, prediction accuracy should be about 99.5%. Both of you might want to read up on the Loop Stream Detector, which attempts to minimize the need for unrolling to maintain performance. According to the results presented here, it hasn't been fully successful. Of course, no one would optimize hardware for such a simple, useless loop.
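To put a rough number on that estimate, here is a small Python simulation of a textbook 2-bit saturating counter applied to a loop-back branch. This is a toy model, not the predictor of any actual Intel or AMD core, and the function name and starting state are assumptions chosen for illustration.

```python
def simulate_two_bit_predictor(iterations, initial_state=0):
    """Count mispredictions of a loop-back branch with a 2-bit counter.

    States 0..3; the branch is predicted taken when the state is 2 or 3.
    The branch is taken on every iteration except the last one (loop exit).
    """
    state = initial_state
    mispredictions = 0
    for i in range(iterations):
        taken = i < iterations - 1
        predicted_taken = state >= 2
        if predicted_taken != taken:
            mispredictions += 1
        # Saturating update: move toward 3 when taken, toward 0 otherwise.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredictions


misses = simulate_two_bit_predictor(1000)
print(f"{misses} mispredictions, accuracy {1 - misses / 1000:.1%}")
# prints "3 mispredictions, accuracy 99.7%"
```

Starting from "strongly not taken", the model misses the first two iterations and the loop exit, which is in line with the "first 2 or 3 and last time" estimate.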