If the code uses LEA instructions with three operands (base, index, offset), then as far as I know on Sandy Bridge and later cores they can only issue on port 1 (the simpler forms can also use port 5) and have a 3-cycle latency instead of 1. That adds port pressure, especially inside deeply nested loops. I don't think it makes sense to modify the (inline?) assembly code directly; I recommend using the Intel(R) C/C++ compiler with advanced options such as -O2, -xHost, etc.
At the source-code level, you may want to review the following:
1. Reduce indexed accesses inside the loop, if possible
2. Consider data alignment
3. Reduce branches inside the loop
4. Avoid dependencies between loop iterations
5. Others I missed
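
As a rough illustration of points 1 to 4, here is a minimal C sketch. It is not taken from the original code: the function names (sum_positive_indexed, sum_positive_ptr, alloc_aligned_i32) are hypothetical, and the 64-byte alignment assumes a typical cache-line size. Whether the compiler actually avoids the three-operand LEA, emits a conditional move, or vectorizes the loop depends on the compiler and options, so check the generated assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Baseline (hypothetical): every access is base + index + offset,
 * and the loop body contains a data-dependent branch. */
static int64_t sum_positive_indexed(const int32_t *a, size_t n, size_t off)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i + off] > 0)
            sum += a[i + off];
    }
    return sum;
}

/* Reworked sketch of the same computation. */
static int64_t sum_positive_ptr(const int32_t *a, size_t n, size_t off)
{
    const int32_t *p = a + off;   /* point 1: fold the offset into the base once */
    int64_t s0 = 0, s1 = 0;       /* point 4: two independent accumulators */
    size_t pairs = n / 2;

    for (size_t k = 0; k < pairs; k++, p += 2) {
        /* point 3: a select instead of an if; the compiler may emit cmov */
        s0 += (p[0] > 0) ? p[0] : 0;
        s1 += (p[1] > 0) ? p[1] : 0;
    }
    if (n & 1)                    /* leftover element when n is odd */
        s0 += (p[0] > 0) ? p[0] : 0;

    return s0 + s1;
}

/* Point 2: allocate n int32_t values on a 64-byte boundary (C11 aligned_alloc).
 * Returns NULL on failure; release with free(). */
static int32_t *alloc_aligned_i32(size_t n)
{
    size_t bytes = n * sizeof(int32_t);
    size_t rounded = (bytes + 63) / 64 * 64;  /* size must be a multiple of 64 */
    if (rounded == 0)
        rounded = 64;
    int32_t *p = aligned_alloc(64, rounded);
    assert(p == NULL || ((uintptr_t)p % 64) == 0);
    return p;
}
```

Both functions return the same result for the same input, so you can time them against each other on your own data before deciding whether the rewrite is worth it. Note that aligned_alloc is C11 and is not available with every C runtime; _mm_malloc is a common alternative with the Intel compiler.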