Clarification: Sandy Bridge Load Latency


I'm confused by a passage in the Intel Architecture Optimization Manual about load latencies:

L1 DCache - Loads

The common load latency is five cycles. When using a simple addressing mode, base plus offset that is smaller than 2048, the load latency can be four cycles.

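Table 2.12

Data Type/Addressing Mode | Base + Offset > 2048;    | Base + Offset < 2048
                          | Base + Index [+ Offset]  |
--------------------------+--------------------------+---------------------
Integer                   | 5                        | 4
MMX, SSE, 128-bit AVX     | 6                        | 5
X87                       | 7                        | 6
256-bit AVX               | 7                        | 7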

I'm not sure how to interpret this. Adding some parentheses for clarity: is the faster case ((Base + Offset) < 2048), a condition that user code is unlikely to meet, or (Base + (Offset < 2048)), something that can often be accommodated?


Hello Nathan,

Sorry to take so long to reply... end/start of year deadlines, etc.

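I think this section of the manual is differentiating between instructions like 'mov edx,[eax+4]' and 'mov edx,[eax+4096]'. The +4 case (displacement of 4, below 2048) should load in 4 cycles. The +4096 case (displacement of 4096, above 2048) should load in 5 cycles. So I believe the 2048 limit applies to the displacement alone, which is your second reading: (Base + (Offset < 2048)).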
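
To make that reading concrete, here is a small Python sketch that looks up the latencies from the table quoted in the question, assuming the 2048 limit applies to the displacement alone. The names `LOAD_LATENCY` and `expected_load_latency` are made up for illustration. Treating a displacement of exactly 2048 as the slow case and rejecting negative displacements are my assumptions, since the table covers neither. The sketch only restates the table and does not measure anything on real hardware.

```python
# Latencies in cycles, copied from Table 2.12 as quoted in the question.
# Each entry is (slow case, fast case).
LOAD_LATENCY = {
    "integer": (5, 4),
    "mmx_sse_avx128": (6, 5),
    "x87": (7, 6),
    "avx256": (7, 7),
}


def expected_load_latency(data_type, displacement, has_index=False):
    """Return the table latency under the reading 'Base + (Offset < 2048)'.

    Assumptions not stated in the table: negative displacements are
    rejected, and a displacement of exactly 2048 counts as the slow case.
    """
    if displacement < 0:
        raise ValueError("negative displacements are not covered by the table")
    slow, fast = LOAD_LATENCY[data_type]
    # The fast case needs a plain base + displacement address (no index
    # register) and a displacement below 2048.
    uses_fast_case = (not has_index) and displacement < 2048
    return fast if uses_fast_case else slow


print(expected_load_latency("integer", 4))     # mov edx,[eax+4]    -> 4
print(expected_load_latency("integer", 4096))  # mov edx,[eax+4096] -> 5
```

Note that under this reading the value held in the base register does not matter, so the fast case is easy to hit in ordinary code: keep frequently loaded fields within the first 2 KB of a structure and avoid an index register where you can.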

Hope this helps... someone... if it is not too late for you.

