- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The processor i'm using is Intel CoreTMDuo for Centrino, it's 1.67Ghz
it has 3 ALUs in one core, so i'm doing a test to see its ability of executing instructions in parallel
in below, the macro testmac is the set of instructions to be executed in parallel, and testmac1000 in the middle of the code is another macro containing 1000 testmac
when i execute the following code, (in WinXP) giving the process realtime priority, the time it takes to finish to loop varies between 0.203 to 0.219 second (minimum 1.13 cycle/loop)
the strange thing is, when I remove instruction 3, the time varies between 0.187 to 0.204 second (minimum 1.04 cycle/loop)
and when i remove instruction 2 as well, the time varies between 0.172 to 0.188 second (minimum 0.957 cycle/loop)
Why should there be a difference?
the core has 4 decoders, so decoding shouldn't be the factor that's resulting the extra latency right?
and it's capable of retiring up to 4 instructions per cycle, my code only requires it to retire 3 instructions per cycle, so retirement also can't be the problem right?
so what's the thing that's causing the extra latency??? is it due to some instruction/decoding caching mechanism?
testmac macro
add eax, 1 ;instruction 1
add ebx, 1 ;instruction 2
add edx, 1 ;instruction 3
endm
movecx, 100000
moveax, 0
movebx, 0
movedx, 0
align 16
@@loop:
testmac1000 ; this is just a macro container 1000 testmac
testmac1000
testmac1000
subecx, 1
jnz@@loop
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The processor i'm using is Intel CoreTM Duo for Centrino, it's 1.67Ghz
it has 3 ALUs in one core, so i'm doing a test to see its ability of executing instructions in parallel
[...]
so what's the thing that's causing the extra latency??? is it due to some instruction/decoding caching mechanism?
testmac macro
add eax, 1 ;instruction 1
add ebx, 1 ;instruction 2
add edx, 1 ;instruction 3
endm
movecx, 100000
moveax, 0
movebx, 0
movedx, 0
align 16
@@loop:
testmac1000 ; this is just a macro container 1000 testmac
testmac1000
testmac1000
subecx, 1
jnz@@loop
May be branch prediction? Can you test the following rewrite?
align 16
@@loop:
testmac1000
testmac1000
testmac1000
dec ecx
jz @@exit
jmp @@loop
@@exit:
On AMD by default branch prediction works the following way: it assumes that the branch will not be taken, if the branch is taken, even once, then the processor will try to fetch and execute at the same time the code located at @@loop and the code located at @@exit. So I always reorder my loops to have a conditionnal jump for the exit path (not often true) and an inconditionnal jump for looping. I don't know if Intel processors use the same technique, but it's easy to test.
Best regards
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Branch prediction, of course, is an issue only for the first 2 or 3 and last time through the loop; with loop counts of 1000 and no changes in path, branch prediction should be 99.5%. Both of you might want to read up on Loop Stream Detector, which attempts to minimize the need for unrolling to maintain performance. According to the results presented here, it hasn't been fully successful. Of course, no one would optimize hardware for such a simple useless loop.
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page