Re: Question regarding parallel execution of instructions (Intel® Moderncode for Parallel Architectures)

Question regarding parallel execution of instructions

houyunqing — Sat, 25 Oct 2008 13:30:34 GMT

The processor I'm using is an Intel Core™ Duo for Centrino, running at 1.67 GHz.
It has 3 ALUs in one core, so I'm running a test to see how well it executes instructions in parallel.

Below, the macro testmac is the set of instructions to be executed in parallel, and testmac1000 in the middle of the code is another macro containing 1000 copies of testmac.

When I execute the following code (in WinXP, giving the process realtime priority), the time it takes to finish the loop varies between 0.203 and 0.219 seconds (minimum 1.13 cycles per testmac).
The strange thing is, when I remove instruction 3, the time varies between 0.187 and 0.204 seconds (minimum 1.04 cycles per testmac),
and when I remove instruction 2 as well, the time varies between 0.172 and 0.188 seconds (minimum 0.957 cycles per testmac).

Why should there be a difference?
The core has 4 decoders, so decoding shouldn't be the factor causing the extra latency, right?
And it's capable of retiring up to 4 instructions per cycle, while my code only requires it to retire 3 instructions per cycle, so retirement can't be the problem either, right?
So what is causing the extra latency? Is it due to some instruction/decoding caching mechanism?

testmac macro
add eax, 1 ;instruction 1
add ebx, 1 ;instruction 2
add edx, 1 ;instruction 3
endm
mov ecx, 100000
mov eax, 0
mov ebx, 0
mov edx, 0
align 16
@@loop:
testmac1000 ; this is just a macro containing 1000 testmac
testmac1000
testmac1000
sub ecx, 1
jnz @@loop
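
As a sanity check on the cycle figures quoted above, here is a minimal Python sketch of the arithmetic. It assumes the 1.67 GHz clock stated in the post, 100000 loop iterations, and 3000 testmac expansions per iteration (three testmac1000 invocations). The function names are made up for illustration, and the sub/jnz overhead is ignored because it is only 2 instructions per 3000 expansions.

```python
# Assumptions: 1.67 GHz clock (as stated in the post), 100000 loop
# iterations, and 3 x testmac1000 = 3000 testmac expansions per iteration.
CLOCK_HZ = 1.67e9
ITERATIONS = 100_000
TESTMACS_PER_ITERATION = 3 * 1000


def cycles_per_testmac(seconds):
    """Convert a wall-clock time for the whole loop into cycles per testmac."""
    total_testmacs = ITERATIONS * TESTMACS_PER_ITERATION
    return seconds * CLOCK_HZ / total_testmacs


def instructions_per_cycle(seconds, adds_per_testmac):
    """Average number of add instructions completed per clock cycle."""
    return adds_per_testmac / cycles_per_testmac(seconds)


# Best-case times from the post, paired with the number of adds left in testmac.
for seconds, adds in [(0.203, 3), (0.187, 2), (0.172, 1)]:
    print(f"{adds} add(s): {cycles_per_testmac(seconds):.3f} cycles/testmac, "
          f"{instructions_per_cycle(seconds, adds):.2f} instructions/cycle")
```

Under these assumptions the three best-case times come out at roughly 1.13, 1.04 and 0.96 cycles per testmac, which matches the figures given in the post.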

Re: Question regarding parallel execution of instructions

fb251 — Sat, 25 Oct 2008 14:51:10 GMT

Quoting - houyunqing

The processor I'm using is an Intel Core™ Duo for Centrino, running at 1.67 GHz.
It has 3 ALUs in one core, so I'm running a test to see how well it executes instructions in parallel.
[...]
So what is causing the extra latency? Is it due to some instruction/decoding caching mechanism?

testmac macro
add eax, 1 ;instruction 1
add ebx, 1 ;instruction 2
add edx, 1 ;instruction 3
endm
mov ecx, 100000
mov eax, 0
mov ebx, 0
mov edx, 0
align 16
@@loop:
testmac1000 ; this is just a macro containing 1000 testmac
testmac1000
testmac1000
sub ecx, 1
jnz @@loop

Maybe branch prediction? Can you test the following rewrite?

align 16
@@loop:
testmac1000
dec ecx
jz @@exit
jmp @@loop
@@exit:

On AMD, branch prediction works the following way by default: it assumes that the branch will not be taken. If the branch is taken, even once, the processor will then try to fetch and execute the code located at @@loop and the code located at @@exit at the same time. So I always reorder my loops to have a conditional jump for the exit path (rarely taken) and an unconditional jump for looping. I don't know if Intel processors use the same technique, but it's easy to test.

Best regards

Re: Question regarding parallel execution of instructions

TimP — Sat, 25 Oct 2008 15:32:43 GMT

Branch prediction, of course, is an issue only for the first 2 or 3 iterations and the last time through the loop; with loop counts of 1000 and no changes in path, branch prediction accuracy should be about 99.5%. Both of you might want to read up on the Loop Stream Detector, which attempts to minimize the need for unrolling to maintain performance. According to the results presented here, it hasn't been fully successful. Of course, no one would optimize hardware for such a simple, useless loop.
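
The prediction-rate estimate above can be reproduced with a few lines of Python. This is only a rough sketch: the count of 3 warm-up mispredictions plus 1 at loop exit is an assumption taken from the "first 2 or 3 and last time" wording, and `prediction_rate` is a hypothetical helper, not a measurement of any real predictor.

```python
def prediction_rate(iterations, mispredicted):
    """Fraction of loop-branch executions that were predicted correctly."""
    return (iterations - mispredicted) / iterations


# Assumption: 3 mispredictions while the predictor warms up, plus 1 at loop exit.
MISPREDICTED = 3 + 1

# 1000 is the loop count used in the estimate; 100000 is the count in the posted code.
for iterations in (1000, 100_000):
    rate = prediction_rate(iterations, MISPREDICTED)
    print(f"{iterations} iterations: {rate:.4%} predicted correctly")
```

With 1000 iterations this gives 99.6%, in the same ballpark as the figure quoted above. With the 100000 iterations used in the posted code the rate is even closer to 100%, so the loop branch alone is unlikely to explain the timing differences.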