optimal ordering of instructions

tthsqe · ‎08-10-2009

I'm wondering why the performance of the following loop is not improved by interleaving the last 6 instructions of the loop body (the add/adc group) with the first 10:

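mov r8,0 ; r9:r8 = (2k)^2 as a 128-bit value, starts at 0^2
mov r9,0
mov r10,1 ; r11:r10 = (2k+1)^2 as a 128-bit value, starts at 1^2
mov r11,0
mov r12,4 ; step to the next even square: (2k+2)^2 - (2k)^2 = 8k+4
mov r13,8 ; step to the next odd square: (2k+3)^2 - (2k+1)^2 = 8k+8
mov rax,0 ; running sum of the parities
mov rcx,0x80000000 ; 2^31 iterations, two values of i per iteration
loop:
mov r14,r9
mov r15,r11
xor r14,r8 ; fold high and low halves together; parity is preserved
xor r15,r10
popcnt r14,r14
popcnt r15,r15
and r14,1 ; low bit of the popcount = parity of the square
and r15,1
lea rax,[rax+r14]
lea rax,[rax+r15]
add r8,r12 ; advance the even square, carrying into the high half
adc r9,0
add r10,r13 ; advance the odd square, carrying into the high half
adc r11,0
add r12,8 ; both steps grow by 8 per iteration
add r13,8
dec rcx
jnz loop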

I thought I could keep the processor (an i7 920) busier by moving some of the adds into the dependency chain, but every reordering I tried resulted in slower execution times. Can anyone find a reason for this, or possibly get it to go even faster? Are the adds already getting executed at the same time as instructions towards the beginning of the loop? It's quite a big leap....
I am a little surprised. This was the order I put the instructions in at first glance, with the intention of rearranging them later for more speed. Little did I know!
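
To make the question concrete, here is a sketch of one legal interleaving of the loop body. It is only an illustration of the kind of reordering meant above, not necessarily the exact sequence that was timed. The label `again` is a made-up name (chosen to avoid the `loop` instruction mnemonic), and the register setup before the loop is assumed to be unchanged.

```asm
; Sketch only: same registers and setup as the original listing.
again:
mov r14,r9
mov r15,r11
xor r14,r8 ; last read of the old r8
xor r15,r10 ; last read of the old r10
add r8,r12 ; safe to update r8 now
adc r9,0 ; must see the carry from the add directly above
popcnt r14,r14 ; writes flags, so it cannot sit between add and adc
popcnt r15,r15
add r10,r13
adc r11,0 ; same add/adc pairing for the odd square
and r14,1
and r15,1
add r12,8 ; after add r8,r12, which reads the old r12
add r13,8 ; after add r10,r13, which reads the old r13
lea rax,[rax+r14] ; lea leaves the flags alone
lea rax,[rax+r15]
dec rcx ; sets ZF for the branch
jnz again
```

Three constraints limit how far the adds can move: each adc has to consume the carry from its own add, so no flag-writing instruction (xor, popcnt, and) may sit between the two; the add to r8 or r10 has to stay after the xor that reads the old value; and the add to r12 or r13 has to stay after the add that uses it as a step.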

If you're wondering what the code does,
it computes the sum over i of the bit parity of i^2, that is, popcount(i^2) mod 2,
where i^2 can be a 128-bit integer.