Trying to optimise this code

David_K_10 · 01-05-2016

Hi, I've just been going through the Intel Optimisation Manual and I'm having a hard time relating it to my problem. Here's the code I'm trying to optimise as much as possible:

tool:
	cmp r15, isbasetool          ; isbasetool is an equate
	jae exTool
	jmp qword ptr [r11+r15*8]    ; jumps to one of around 100 different locations.
	                             ; The destinations are small pieces of code, one
	                             ; after the other, each ending with a "ret" instruction.
exTool:
	sub r8, 32
	mov [r8], rax
	mov [r8+8], rbx
	mov [r8+16], rcx
	mov [r8+24], rdx
@@:
	push r15
	mov r15, [rdi+r15*8]
	call tool
	pop r15
	inc r15
	jmp @b

Now, I've tried jump tables and whatnot, but they have side effects that make the project too hard to manage, so I ended up with this algorithm. It fits perfectly except for one thing: the conditional jump.

"tool:" executes VERY frequently and needs to be as efficient as I can get it. I read that putting a ud2 instruction after the indirect jump would help, and that aligning exTool to 16 bytes would also help. Are there any other ideas? I'm still looking for a version without the conditional jump, but that isn't the purpose of this thread. I'm interested in how I can optimise this piece of code so that the conditional jump works well with the prefetch system. Note that the conditional jump is unpredictable: both branches can execute just as often. The indirect jump can go to over 100 locations.
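To make those two ideas concrete, this is roughly what they would look like in MASM. It is an untested sketch that only adds the ud2 and the align directive to the code above; I haven't measured whether either one makes a difference here.

```asm
tool:
	cmp r15, isbasetool          ; isbasetool is an equate
	jae exTool
	jmp qword ptr [r11+r15*8]
	ud2                          ; never executed; meant to stop decoding down the fall-through path
	align 16                     ; pad so that exTool starts on a 16-byte boundary
exTool:
	sub r8, 32
	; ... rest of exTool unchanged ...
```

The padding that align 16 emits sits after the ud2, so it is never reached by normal execution; exTool is only entered through the jae.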

The issue is the conditional jump prediction. I don't think I can do much about the indirect jump.

Thanks for any help anyone can give.

David_K_10 · 01-05-2016

Hmmm. Never mind. I got rid of the conditional jump this way:

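tool	equ	$
	mov r14, exTool              ; default target: the exTool path
	cmp r15, isbasetool          ; isbasetool is an equate
	cmovb r14, [r11+r15*8]       ; below isbasetool: take the target from the table instead.
	                             ; Note: cmov with a memory source always performs the load,
	                             ; even when the condition is false, so [r11+r15*8] has to be
	                             ; readable for every value of r15 that reaches this point.
	jmp r14                      ; one indirect jump, no conditional branch

exTool	equ	$
	sub r8, 32                   ; make room for four qwords below r8
	mov [r8], rax
	mov [r8+8], rbx
	mov [r8+16], rcx
	mov [r8+24], rdx
@@:
	push r15
	mov r15, [rdi+r15*8]
	call tool
	pop r15
	inc r15
	jmp @b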

If anyone can come up with more optimisation techniques, I'd appreciate it. I need it to be fast. Would aligning all the destination addresses to 16 bytes have any effect?
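By aligning the destinations I mean something like the sketch below. The handler names tool_0 and tool_1 and the table name toolTable are made up for illustration; in the real code r11 holds the address of the table. This is untested, and whether it helps would have to be measured.

```asm
	align 16
tool_0:                          ; hypothetical handler name
	; ... small body ...
	ret

	align 16
tool_1:                          ; hypothetical handler name
	; ... small body ...
	ret

	align 8
toolTable	dq	tool_0, tool_1   ; hypothetical table; its address would be loaded into r11
```

The cost is at most 15 bytes of padding per handler, so with around 100 handlers that is at most about 1.5 KB of extra code.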

SergeyKostrov · 02-11-2016

>> jmp qword ptr [r11+r15*8] ; jumps to up to around 100 different locations.

I don't consider this a problem unless some of these jumps cause a cache miss (I mean the L1 instruction cache, not the data cache). If the main code, the jump table and the processing routines all fit into the instruction cache together, then everything should run very fast. On almost all recent Intel CPUs the L1 instruction cache is about 32 KB per core, and that is a lot when the processing is implemented in assembler. It is not clear from your original post whether there is any significant performance impact.