Intel® ISA Extensions

New SIMDLEA Instruction?

w0wtiger
Beginner

I know LEA can scale an index by 1, 2, 4, or 8, but there is an alignment mismatch with MMX, SSE, and even AVX: MMX data is usually 8-byte aligned, and SSE data is always 16-byte aligned, so the *2 and *4 scale factors go unused for SIMD data.

So could Intel add a new SIMDLEA instruction, or change the ModR/M encoding?

ModR/M example:
GPR | reg*1 | reg*2  | reg*4  | reg*8 |
MMX | reg*1 | reg*2  | reg*4  | reg*8 |
SSE | reg*1 | reg*16 | reg*32 | reg*8 |  (or AVX)
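For context, here is a rough sketch of what indexing an array of 16-byte SSE elements looks like today (register names are only illustrative, with the array base in rsi and the element index in rcx): because ModR/M scale factors stop at *8, the *16 has to come from a separate shift.

mov    rax, rcx           ; copy the element index
shl    rax, 4             ; scale by 16 manually, since no *16 scale exists in ModR/M
movaps xmm0, [rsi + rax]  ; load one 16-byte SSE element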

capens__nicolas
New Contributor I
Quoting - w0wtiger

I know LEA can scale an index by 1, 2, 4, or 8, but there is an alignment mismatch with MMX, SSE, and even AVX: MMX data is usually 8-byte aligned, and SSE data is always 16-byte aligned, so the *2 and *4 scale factors go unused for SIMD data.

So could Intel add a new SIMDLEA instruction, or change the ModR/M encoding?

ModR/M example:
GPR | reg*1 | reg*2  | reg*4  | reg*8 |
MMX | reg*1 | reg*2  | reg*4  | reg*8 |
SSE | reg*1 | reg*16 | reg*32 | reg*8 |  (or AVX)


Hi w0wtiger,

Nowadays a shift or even a multiply is practically as fast as an LEA instruction. Also, in most cases you can move the shift to a less critical path of the code. For instance, when you have a loop iterating over an array, don't multiply the index on every iteration; use the pointer itself as the counter, incrementing it by 16 (for SSE) and changing the exit condition to the address where the array ends.
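Here is a minimal sketch of that pointer-as-counter idea with SSE, assuming a 16-byte-aligned float array with its base in rsi and its length in bytes (a multiple of 16) in rcx; the register names, the label, and the addps stand-in for the real work are only illustrative:

lea    rdx, [rsi + rcx]   ; one-past-the-end address, computed once before the loop
process:
movaps xmm0, [rsi]        ; load 4 floats (16-byte aligned)
addps  xmm0, xmm0         ; ... whatever work the loop actually does ...
movaps [rsi], xmm0        ; store the result back
add    rsi, 16            ; advance the pointer itself, no scaled index needed
cmp    rsi, rdx           ; exit condition is a plain address comparison
jb     process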

Or did you have a very specific scenario in mind where an extended LEA instruction would be of significant benefit?

Take care,

Nicolas
Max_L
Employee

Hi, thanks for the idea, but something like that will not be done any time soon.

That kind of address scaling would add implementation cost and would not be as efficient as the alternatives, since you would be repeating the same calculation several times.

A more performance-efficient way to do the address calculation in a loop is:

mov addr1_reg, ...           ; base address of the first array (source operand left unspecified here)
mov addr2_reg, ...           ; base address of the second array
mov size_reg, ...            ; number of 32-byte vectors to process
shl size_reg, 5              ; mul by 32: convert the vector count to a byte count
xor index_reg, index_reg     ; byte index starts at 0

loop:
vmovups ymm0, [addr1_reg + index_reg]        ; load 8 floats from the first array
vmulps ymm1, ymm0, [addr2_reg + index_reg]   ; multiply by 8 floats from the second array
...

add index_reg, 32            ; advance by one 32-byte AVX vector
cmp index_reg, size_reg
jl loop


If you have more than one LOAD+OP operation reusing the same base and index registers (e.g. when you unroll), you may benefit from converting one of the streams to not use the index register at all and instead adding to its base register once at the end of the loop, like this:

mov addr1_reg, ...           ; base address of the first array
mov addr2_reg, ...           ; base address of the second array
mov size_reg, ...            ; number of 32-byte vectors to process
shl size_reg, 5              ; mul by 32: convert the vector count to a byte count
xor index_reg, index_reg

loop:
vmovups ymm0, [addr1_reg + index_reg]
vmovups ymm1, [addr1_reg + index_reg + 32]
vmulps ymm0, ymm0, [addr2_reg]        ; second LOAD+OP stream reuses the same base address in the loop...
vmulps ymm1, ymm1, [addr2_reg + 32]   ; ...and its base register is bumped once at the end of the loop instead of using the index
...

add addr2_reg, 64            ; advance the second array's base by two vectors
add index_reg, 64            ; advance the byte index for the first array by two vectors
cmp index_reg, size_reg
jl loop

The same optimization technique applies not just to AVX in the future, but to today's Nehalem processors as well.
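For instance, the same pattern on Nehalem with SSE would simply step by 16 bytes and use xmm registers; a sketch with the same illustrative register names, assuming size_reg has already been converted to a byte count:

loop_sse:
movups xmm0, [addr1_reg + index_reg]   ; load 4 floats
movups xmm1, [addr2_reg + index_reg]
mulps xmm0, xmm1                       ; same LOAD+OP pattern, 16 bytes at a time
...

add index_reg, 16                      ; one 16-byte SSE vector per iteration
cmp index_reg, size_reg
jl loop_sse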

-Max
