Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Question regarding parallel execution of instructions

houyunqing

The processor I'm using is an Intel Core™ Duo for Centrino, running at 1.67 GHz.
It has 3 ALUs per core, so I'm running a test to see how well it executes instructions in parallel.

Below, the macro testmac is the set of instructions to be executed in parallel, and testmac1000 in the middle of the code is another macro that expands to 1000 copies of testmac.

When I execute the following code (in WinXP, giving the process realtime priority), the time it takes to finish the loop varies between 0.203 and 0.219 seconds (minimum 1.13 cycles per testmac).
The strange thing is, when I remove instruction 3, the time varies between 0.187 and 0.204 seconds (minimum 1.04 cycles per testmac),
and when I remove instruction 2 as well, the time varies between 0.172 and 0.188 seconds (minimum 0.957 cycles per testmac).
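
For reference, the cycle figures quoted with the timings can be reproduced from the best-case times. This is only a sketch: it assumes the nominal 1.67 GHz clock and counts one unit of work as a single testmac expansion (100000 loop iterations times 3 × 1000 expansions); the names are illustrative.

```python
# Convert the best-case wall-clock times into cycles per testmac expansion.
CLOCK_HZ = 1.67e9                   # nominal clock from the post; the real clock may differ slightly
LOOP_ITERATIONS = 100_000           # mov ecx, 100000
TESTMACS_PER_ITERATION = 3 * 1000   # three testmac1000 expansions per loop iteration


def cycles_per_testmac(seconds: float) -> float:
    """Return the average number of clock cycles spent per testmac expansion."""
    total_cycles = seconds * CLOCK_HZ
    total_testmacs = LOOP_ITERATIONS * TESTMACS_PER_ITERATION
    return total_cycles / total_testmacs


if __name__ == "__main__":
    for label, best_seconds in [("3 adds", 0.203), ("2 adds", 0.187), ("1 add", 0.172)]:
        print(f"{label}: {cycles_per_testmac(best_seconds):.3f} cycles per testmac")
```

The loop overhead (sub/jnz) is ignored here because it runs once per 3000 expansions.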

Why should there be a difference?
The core has 4 decoders, so decoding shouldn't be the factor causing the extra latency, right?
And it's capable of retiring up to 4 instructions per cycle, while my code only requires it to retire 3 instructions per cycle, so retirement can't be the problem either, right?
So what is causing the extra latency? Is it due to some instruction/decoding caching mechanism?
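
To make the expectation behind the question explicit, here is a rough bottleneck model. It is only a sketch under stated assumptions: the decoder, ALU and retirement widths are the figures given in this post (not checked against Intel documentation), the 1-cycle add latency is an assumption, and the function name is made up for illustration.

```python
def ideal_cycles_per_testmac(n_adds: int,
                             decode_width: int = 4,   # figure from the post, unverified
                             alu_count: int = 3,      # figure from the post, unverified
                             retire_width: int = 4,   # figure from the post, unverified
                             add_latency: int = 1) -> float:  # assumed latency of add reg, imm
    """Lower bound on cycles per testmac under a simple throughput/latency model.

    Each add depends on the previous add to the same register, so every
    register chain can advance at most once per add_latency cycles.
    """
    return float(max(n_adds / decode_width,
                     n_adds / alu_count,
                     n_adds / retire_width,
                     add_latency))


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} add(s) per testmac: {ideal_cycles_per_testmac(n):.2f} cycles")
```

Under these assumptions the bound is the same for one, two or three adds, which is why the measured difference looks surprising.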

testmac macro
add eax, 1 ; instruction 1
add ebx, 1 ; instruction 2
add edx, 1 ; instruction 3
endm
mov ecx, 100000
mov eax, 0
mov ebx, 0
mov edx, 0
align 16
@@loop:
testmac1000 ; this is just a macro containing 1000 testmac
testmac1000
testmac1000
sub ecx, 1
jnz @@loop

fb251

Maybe branch prediction? Can you test the following rewrite?

align 16
@@loop:
testmac1000
testmac1000
testmac1000
dec ecx
jz @@exit
jmp @@loop
@@exit:

On AMD, branch prediction works the following way by default: it assumes that a branch will not be taken. If the branch is taken, even once, the processor will then try to fetch and execute the code located at @@loop and the code located at @@exit at the same time. So I always reorder my loops to have a conditional jump for the exit path (rarely taken) and an unconditional jump for looping. I don't know if Intel processors use the same technique, but it's easy to test.

Best regards

TimP

Branch prediction, of course, is an issue only for the first 2 or 3 iterations and the last one; with a loop count of 1000 and no changes in path, the prediction rate should be 99.5%. Both of you might want to read up on the Loop Stream Detector, which attempts to minimize the need for unrolling to maintain performance. According to the results presented here, it hasn't been fully successful. Of course, no one would optimize hardware for such a simple, useless loop.
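
The point about mispredictions being confined to the first few and the last iteration can be illustrated with a toy model. This sketch assumes a single textbook 2-bit saturating counter starting in the "strongly not taken" state; it is a hypothetical predictor for illustration, not a description of what any Intel or AMD core actually implements.

```python
def loop_branch_accuracy(iterations: int, counter: int = 0) -> float:
    """Simulate one 2-bit saturating counter on a loop-closing branch.

    counter values 0-1 predict "not taken", 2-3 predict "taken".
    The branch is taken on every iteration except the last one.
    Returns the fraction of correctly predicted branches.
    """
    correct = 0
    for i in range(iterations):
        taken = i < iterations - 1
        predicted_taken = counter >= 2
        if predicted_taken == taken:
            correct += 1
        # Move the counter toward the actual outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / iterations


if __name__ == "__main__":
    for n in (1000, 100_000):
        print(f"{n} iterations: {loop_branch_accuracy(n):.5f} predicted correctly")
```

In this model only the first two iterations and the final one are mispredicted, so the accuracy for 1000 iterations lands in the same range as the figure quoted above, and the cost becomes negligible for the 100000 iterations used in the original test.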
