I'm confused by a passage in the Intel Architecture Optimization Manual about load latencies:
L1 DCache - Loads
The common load latency is five cycles. When using a simple addressing mode, base plus offset
that is smaller than 2048, the load latency can be four cycles.
Data Type / Addressing Mode    Base + Offset > 2048;      Base + Offset < 2048
                               Base + Index [+ Offset]
Integer                        5                          4
MMX, SSE, 128-bit AVX          6                          5
X87                            7                          6
256-bit AVX                    7                          7
I'm not sure how to interpret this. Adding some parentheses for clarity, is the faster case ((Base + Offset) < 2048), a condition that user code is unlikely to achieve, or (Base + (Offset < 2048)), something that can often be accommodated?
Sorry to take so long to reply... end/start of year deadlines, etc.
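I think this section of the manual is differentiating instructions like 'mov edx,[eax+4]' and 'mov edx,[eax+4096]'. The +4 case (displacement == 4) should load in 4 cycles. The +4096 case (displacement == 4096) should load in 5 cycles. In other words, I read it as your second interpretation: the limit applies to the offset (displacement) alone, not to the computed address.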
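To make that reading concrete, here is a small Python sketch that looks up the latency from the table quoted in the question. It is only an illustration of my interpretation, not anything from the manual: the function name, the data type keys, and the has_index parameter are made up for the example. The table does not say what happens at a displacement of exactly 2048 or with a negative displacement, so the sketch simply assumes both take the slower path.

```python
# Latencies in cycles, transcribed from the table quoted in the question.
# Each entry is (slow_path, fast_path). The key names are hypothetical.
LOAD_LATENCY = {
    "integer": (5, 4),
    "mmx_sse_avx128": (6, 5),
    "x87": (7, 6),
    "avx256": (7, 7),
}


def expected_load_latency(data_type, displacement, has_index=False):
    """Return the table's load latency for one addressing mode.

    Assumes the 2048 limit applies to the displacement alone, not to the
    computed address. Displacements of exactly 2048 and negative
    displacements are not covered by the table; this sketch treats them
    as the slow path (an assumption).
    """
    slow, fast = LOAD_LATENCY[data_type]
    simple_mode = (not has_index) and 0 <= displacement < 2048
    return fast if simple_mode else slow


print(expected_load_latency("integer", 4))     # mov edx,[eax+4]    -> 4
print(expected_load_latency("integer", 4096))  # mov edx,[eax+4096] -> 5
```

Note that the value in the base register never enters the calculation, which is the whole point of this interpretation: user code can usually keep structure offsets below 2048 even though the final address is far larger.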
Hope this helps... someone... if it is not too late for you.