I'm wondering why the performance of the following loop is not improved by interleaving the last 6 instructions of the loop with the first 10:
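mov r8,0 ; r9:r8 = (2k)^2, starting at 0^2 = 0
mov r9,0
mov r10,1 ; r11:r10 = (2k+1)^2, starting at 1^2 = 1
mov r11,0
mov r12,4 ; increment for the even squares: 4, 12, 20, ...
mov r13,8 ; increment for the odd squares: 8, 16, 24, ...
mov rax,0 ; running sum of parities
mov rcx,0x80000000 ; 2^31 iterations, two values of i per iteration
loop:
mov r14,r9
mov r15,r11
xor r14,r8 ; fold high and low halves; the parity is unchanged
xor r15,r10
popcnt r14,r14
popcnt r15,r15
and r14,1 ; parity = popcount mod 2
and r15,1
lea rax,[rax+r14] ; accumulate into the running sum
lea rax,[rax+r15]
add r8,r12 ; advance to the next even square (128-bit add)
adc r9,0
add r10,r13 ; advance to the next odd square (128-bit add)
adc r11,0
add r12,8 ; both increments grow by 8 each iteration
add r13,8
dec rcx
jnz loop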
I thought I could keep the processor (i7 920) busier by putting some of the adds into the dependency chain, but every rearrangement I tried resulted in slower execution times. Can anyone find a reason for this, or possibly get it to go even faster? Are they getting executed at the same time as instructions towards the beginning of the loop? It's quite a big leap....
I am a little surprised. This was the order I put the instructions in at first glance, with the intention of rearranging them later for more speed. Little did I know!
If you're wondering what the code does,
it is the sum of (parity(i^2) mod 2) over i,
where i^2 can be a 128-bit integer.
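For cross-checking the value left in rax, the loop can be modeled in C. This is only a minimal sketch, not the original code: the names parity64, u128, add128 and parity_sum_squares are made up for illustration, and the parity is computed by xor-folding rather than popcnt so that it stays portable.

```c
#include <stdint.h>

/* Parity (popcount mod 2) of a 64-bit word, computed by xor-folding. */
static unsigned parity64(uint64_t x)
{
    x ^= x >> 32;
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return (unsigned)(x & 1u);
}

/* 128-bit value kept as two 64-bit halves, like r9:r8 and r11:r10. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128;

/* Add a 64-bit increment and propagate the carry (the add / adc 0 pair). */
static void add128(u128 *v, uint64_t inc)
{
    v->lo += inc;
    v->hi += (v->lo < inc); /* wraparound of the low half means a carry */
}

/* Sum of parity(i^2) for i = 0 .. 2*pairs - 1.
 * Each iteration handles one even and one odd value of i, as the
 * assembly loop does; pairs plays the role of rcx (hypothetical name). */
uint64_t parity_sum_squares(uint64_t pairs)
{
    u128 even = {0, 0};      /* (2k)^2, starts at 0 */
    u128 odd = {1, 0};       /* (2k+1)^2, starts at 1 */
    uint64_t d_even = 4;     /* (2k+2)^2 - (2k)^2 = 8k + 4 */
    uint64_t d_odd = 8;      /* (2k+3)^2 - (2k+1)^2 = 8k + 8 */
    uint64_t sum = 0;

    for (uint64_t k = 0; k < pairs; k++) {
        /* xor of the halves has the same parity as the full 128 bits */
        sum += parity64(even.lo ^ even.hi);
        sum += parity64(odd.lo ^ odd.hi);
        add128(&even, d_even);
        add128(&odd, d_odd);
        d_even += 8;
        d_odd += 8;
    }
    return sum;
}
```

Calling parity_sum_squares(0x80000000) mirrors the rcx count above and covers i from 0 to 2^32 - 1. Small arguments are easy to check by hand: parity_sum_squares(4) covers i = 0..7, whose squares have parities 0, 1, 1, 0, 1, 1, 0, 1, so it returns 5.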