Intel® ISA Extensions

## New SIMDLEA Instruction?

Beginner

As I understand it, LEA can scale an index by 1, 2, 4 or 8,
but that does not match the alignment used by MMX, SSE, or even AVX.
MMX data is usually 8-byte aligned,
and SSE data is 16-byte aligned,
so the scale factors 2 and 4 go unused.

So could Intel add a new SIMDLEA instruction, or change the ModR/M encoding?

ModR/M example:
GPR | reg*1 | reg*2  | reg*4  | reg*8 |
MMX | reg*1 | reg*2  | reg*4  | reg*8 |
SSE | reg*1 | reg*16 | reg*32 | reg*8 | (or AVX)
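
For readers less familiar with the encoding in question: the scale factor in x86 addressing is a 2-bit field (it lives in the SIB byte that follows ModR/M), so it can only select 1, 2, 4 or 8. A minimal C sketch of that arithmetic is below. The helper names `effective_address` and `sse_element_address` are hypothetical, made up for illustration, and the model ignores segments and address-size details.

```c
#include <stdint.h>

/* Hypothetical model of x86 effective-address arithmetic:
 * base + index * 2^scale_bits + disp, where scale_bits is the 2-bit field (0..3). */
static uint64_t effective_address(uint64_t base, uint64_t index,
                                  unsigned scale_bits, int32_t disp)
{
    return base + (index << (scale_bits & 3u)) + (uint64_t)(int64_t)disp;
}

/* Address of the i-th 16-byte (SSE-sized) element. The hardware scale tops
 * out at 8, so the index is pre-shifted by 1 (times 2) and the scale field
 * supplies the remaining factor of 8. */
static uint64_t sse_element_address(uint64_t base, uint64_t i)
{
    return effective_address(base, i << 1, 3, 0);
}
```

The extra shift in `sse_element_address` is the step the proposed SIMDLEA would fold into the addressing mode.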

2 Replies
New Contributor I
Hi w0wtiger,

Nowadays a shift or even a multiply is practically as fast as a LEA instruction. Also, in most cases you can move the shift off the critical path of the code. For instance, when a loop iterates over an array, don't multiply the index on every iteration; use the pointer itself as the counter, incrementing it by 16 (for SSE), and change the exit condition to compare against the address where the array ends.
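
A rough C sketch of that pointer-as-counter idea follows, assuming the element count is a multiple of 4 floats (16 bytes, one SSE vector). The function name `sum_by_pointer` is hypothetical, and the four scalar adds stand in for a single packed SSE add.

```c
#include <stddef.h>

/* Sums n floats; n is assumed to be a multiple of 4 (one 16-byte SSE vector). */
static float sum_by_pointer(const float *data, size_t n)
{
    const float *end = data + n;              /* address where the array stops */
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};  /* one accumulator per vector lane */

    /* The pointer itself is the loop counter: no index * 16 inside the loop. */
    for (const float *p = data; p != end; p += 4) {   /* 4 floats = 16 bytes */
        acc[0] += p[0];
        acc[1] += p[1];
        acc[2] += p[2];
        acc[3] += p[3];
    }
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}
```

The only per-iteration address work left is one add and one compare against `end`.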

Or did you have a very specific scenario in mind where an extended LEA instruction would be of significant benefit?

Take care,

Nicolas
Employee

Hi, thanks for the idea, but something like that will not be done any time soon.

That kind of address scaling would incur additional implementation cost and is not as efficient as the alternatives, since you would repeat the same calculation several times.

A more efficient way to calculate addresses in a loop is:

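mov size_reg,
shl size_reg, 5 ; mul by 32: size_reg now holds a size in bytes (one YMM vector = 32 bytes)
xor index_reg, index_reg ; byte offset starts at 0

loop:
vmulps ymm1, ymm0, [addr2_reg + index_reg] ; index_reg is already a byte offset, so no scale factor is needed
...

cmp index_reg, size_reg
jl loop ; repeat until the byte offset reaches the scaled size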
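
The same idea can be sketched in C: scale the size once before the loop and use a byte offset as the index. This is only an illustration under stated assumptions. `mul_blocks`, `VEC_BYTES` and `VEC_FLOATS` are hypothetical names, and the inner scalar loop stands in for a single `vmulps` on one 32-byte block.

```c
#include <stddef.h>

#define VEC_BYTES  32u                            /* one YMM register */
#define VEC_FLOATS (VEC_BYTES / sizeof(float))    /* 8 floats per block */

/* dst = a * b, element-wise, over `vectors` blocks of 8 floats each. */
static void mul_blocks(float *dst, const float *a, const float *b, size_t vectors)
{
    size_t size_bytes = vectors << 5;   /* scale once, outside the loop: mul by 32 */

    /* `off` is a byte offset, so the address is just base + offset. */
    for (size_t off = 0; off < size_bytes; off += VEC_BYTES) {
        const float *pa = (const float *)((const char *)a + off);
        const float *pb = (const float *)((const char *)b + off);
        float *pd = (float *)((char *)dst + off);

        for (size_t k = 0; k < VEC_FLOATS; ++k)
            pd[k] = pa[k] * pb[k];
    }
}
```

Because `off` already counts bytes, no address in the loop needs a scale factor larger than 1.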

If you have more than one LOAD+OP operation reusing the same base and index registers (e.g. when you unroll), you may benefit from not using an index register at all and instead adding to the base register once at the end of the loop, like this:

mov size_reg,
shl size_reg, 5 ; mul by 32
xor index_reg, index_reg

loop: