Hi, I've just been going through the Intel Optimisation Manual and I'm having a semi-hard time making sense of it in relation to my problem. Here's the code I'm trying to optimise as much as possible:
tool:
    cmp r15, isbasetool          ; isbasetool is an equate
    jae exTool
    jmp qword ptr [r11+r15*8]    ; jumps to up to around 100 different locations.
                                 ; All the destination code are small pieces of code one
                                 ; after the other with "ret" instructions at the end of each.
exTool:
    sub r8, 32
    mov [r8], rax
    mov [r8+8], rbx
    mov [r8+16], rcx
    mov [r8+24], rdx
@@:
    push r15
    mov r15, [rdi+r15*8]
    call tool
    pop r15
    inc r15
    jmp @b
Now, I've done jump tables and whatnot, but they have side effects that make the project too much of an issue to manage, so I ended up with this algorithm, which fits perfectly except for one thing: the conditional jump.
"tool:" executes VERY frequently and needs to be as efficient as I can get it. I read that putting a ud2 instruction after the indirect jump would help, and that aligning exTool to 16 bytes would also help. Are there any other ideas? I'm still looking for a version without the conditional jump, but that isn't the purpose of this thread. I'm interested in how I can optimise this piece of code so the conditional jump works well with the prefetch system. Note that the conditional jump is unpredictable: it can take either branch just as often. The indirect jump can go to over 100 locations.
The issue is the conditional jump prediction. I don't think I can do much about the indirect jump.
Thanks for any help anyone can give.
Hmmm. Never mind. I got rid of the jump this way:
tool equ $
    mov r14, exTool
    cmp r15, isbasetool
    cmovb r14, [r11+r15*8]
    jmp r14
exTool equ $
    sub r8, 32
    mov [r8], rax
    mov [r8+8], rbx
    mov [r8+16], rcx
    mov [r8+24], rdx
@@:
    push r15
    mov r15, [rdi+r15*8]
    call tool
    pop r15
    inc r15
    jmp @b
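One thing to watch in the cmovb version: a cmov with a memory source performs the load whether or not the condition holds, so [r11+r15*8] is still read when r15 is at or above isbasetool. If r15 can be far out of range, that read can land on an unmapped page. A possible variant, only a sketch, selects the index instead of the target. It assumes the table at r11 gets one extra entry at index isbasetool holding the address of exTool; that extra slot is hypothetical and is not part of the code above.

```asm
; Sketch only. Assumes the table at r11 has one extra entry at index
; isbasetool that holds the address of exTool (hypothetical layout).
tool equ $
    mov   r14, isbasetool          ; default index: the extra exTool slot
    cmp   r15, isbasetool
    cmovb r14, r15                 ; in range (unsigned below): use the real index
    jmp   qword ptr [r11+r14*8]    ; single indirect jump, load always inside the table
```

Like the version above, this clobbers r14 and leaves a single indirect jump whose target still has to be predicted, so whether it is actually faster would need measuring.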
If anyone can come up with more optimisation techniques, I'd appreciate it. I need it to be fast. Would aligning all the destination addresses to 16 bytes have any effect?