optimal ordering of instructions

I'm wondering why the performance of the following loop is not improved by interleaving the last 6 instructions of the loop with the first 10:

mov r8,0
mov r9,0
mov r10,1
mov r11,0
mov r12,4
mov r13,8
mov rax,0
mov rcx,0x80000000 ; 2^31
loop:
mov r14,r9
mov r15,r11
xor r14,r8
xor r15,r10
popcnt r14,r14
popcnt r15,r15
and r14,1
and r15,1
lea rax,[rax+r14]
lea rax,[rax+r15]
add r8,r12
adc r9,0
add r10,r13
adc r11,0
add r12,8
add r13,8
dec rcx
jnz loop

I thought I could keep the processor (i7 920) busier by putting some of the adds into the dependency chain, but every rearrangement I tried resulted in slower execution times. Can anyone find a reason for this, or possibly get it to go even faster? Are the adds already being executed at the same time as instructions towards the beginning of the loop? It's quite a big leap....
I am a little surprised. This was the order I put the instructions in at first glance, with the intention of rearranging them later for more speed. Little did I know!

If you're wondering what the code does,
it is the sum of (parity(i^2) mod 2) over i,
where i^2 can be a 128-bit integer.
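
For anyone who wants to check results against the assembly, here is a rough C sketch of the same computation. It assumes GCC or Clang (for `__builtin_popcountll` and `unsigned __int128`), and the function names are made up for illustration. `parity_sum_incremental` follows the register layout of the loop above (one even and one odd square per iteration, updated by running differences), while `parity_sum_direct` recomputes each square from scratch as a cross-check.

```c
#include <assert.h>
#include <stdint.h>

/* Parity (popcount mod 2) of a 128-bit value held as two 64-bit halves.
   XORing the halves first preserves the parity of the total bit count. */
static uint64_t parity128(uint64_t lo, uint64_t hi) {
    return (uint64_t)__builtin_popcountll(lo ^ hi) & 1u;
}

/* Mirrors the assembly loop: iteration k handles i = 2k and i = 2k + 1. */
uint64_t parity_sum_incremental(uint64_t iterations) {
    uint64_t even_lo = 0, even_hi = 0;     /* r8, r9:   (2k)^2     */
    uint64_t odd_lo = 1, odd_hi = 0;       /* r10, r11: (2k + 1)^2 */
    uint64_t even_step = 4, odd_step = 8;  /* r12, r13: differences */
    uint64_t sum = 0;                      /* rax */

    for (uint64_t k = 0; k < iterations; ++k) {
        sum += parity128(even_lo, even_hi);
        sum += parity128(odd_lo, odd_hi);

        uint64_t t = even_lo + even_step;  /* add r8,r12 */
        even_hi += (t < even_lo);          /* adc r9,0   */
        even_lo = t;

        t = odd_lo + odd_step;             /* add r10,r13 */
        odd_hi += (t < odd_lo);            /* adc r11,0   */
        odd_lo = t;

        even_step += 8;                    /* add r12,8 */
        odd_step += 8;                     /* add r13,8 */
    }
    return sum;
}

/* Straightforward reference: sum the parity of i*i for i in [0, count). */
uint64_t parity_sum_direct(uint64_t count) {
    uint64_t sum = 0;
    for (uint64_t i = 0; i < count; ++i) {
        unsigned __int128 sq = (unsigned __int128)i * i;
        sum += parity128((uint64_t)sq, (uint64_t)(sq >> 64));
    }
    return sum;
}

/* Both versions should agree: n iterations cover i = 0 .. 2n - 1. */
void self_check(uint64_t iterations) {
    assert(parity_sum_incremental(iterations) ==
           parity_sum_direct(2 * iterations));
}
```

Calling `self_check` with a small iteration count is a quick sanity test before timing anything. This says nothing about instruction scheduling; it is only meant to pin down what the loop computes.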