L1D Latency Breakdown

Andreas_S_2 — Fri, 19 Apr 2013 12:45:55 GMT

Hi,

It is usually stated that the L1D latency is around 4 cycles. How are those 4 cycles utilized?

Is it:

1c: Calculate effective address
1c: Send request from core to cache
1c: Do cache access
1c: Send data back to core pipeline

Is there any available information on that?

Patrick_F_Intel1 — Fri, 19 Apr 2013 14:03:40 GMT

Hello Andreas,

The 4-cycle count is based on standard latency measurements. By 'standard' I mean load-to-use latency, measured with a dependent-load chain over an array that fits in L1D. The hardware prefetchers need to be disabled in the BIOS (at least if you use a stride, such as 64 bytes, that the prefetchers can latch on to). Turbo mode also needs to be disabled (again, you can do this from the BIOS) to get the 4-cycle count.

So the 4-cycle count is 'load to use': the time from when the load is issued until its data is received. 'Dependent load' means that the address of the next line to be read is contained in the data currently being fetched.
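To make the dependent-load idea concrete, here is a minimal sketch in C of a pointer-chasing kernel. It is not the Calibrator code or the kernel from my tools; the names build_chain, chase and ns_per_load are made up for illustration, and the 256 slots x 64 bytes (16 KB) used in the note below is an assumed size chosen to fit in a typical 32 KB L1D.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Build a cyclic chain of n_slots pointers spaced `stride` bytes apart.
   Each slot holds the address of the next slot; the last points back to
   the first. Returns the first slot (release with free()) or NULL. */
void **build_chain(size_t n_slots, size_t stride)
{
    /* Every slot must be able to hold a correctly aligned pointer. */
    assert(n_slots > 0);
    assert(stride >= sizeof(void *) && stride % sizeof(void *) == 0);

    char *buf = malloc(n_slots * stride);
    if (buf == NULL)
        return NULL;

    for (size_t i = 0; i < n_slots; i++) {
        void **slot = (void **)(buf + i * stride);
        *slot = buf + ((i + 1) % n_slots) * stride;
    }
    return (void **)buf;
}

/* Follow the chain for n_loads steps. Each load's address comes from the
   previous load's data, so the loads cannot overlap. The final pointer is
   returned so the compiler has to keep the loop. */
void **chase(void **p, size_t n_loads)
{
    for (size_t i = 0; i < n_loads; i++)
        p = (void **)*p;
    return p;
}

/* Average time per dependent load in nanoseconds, using the standard C
   clock(). Its resolution is coarse, so n_loads should be large (hundreds
   of millions) for a stable figure. */
double ns_per_load(void **start, size_t n_loads, void ***end_out)
{
    clock_t t0 = clock();
    void **end = chase(start, n_loads);
    clock_t t1 = clock();

    *end_out = end; /* hand the result back so the chase stays live */
    return (double)(t1 - t0) * 1e9 / CLOCKS_PER_SEC / (double)n_loads;
}
```

You would call build_chain(256, 64), pass the result to ns_per_load with a large n_loads, and multiply the nanoseconds by the core frequency in GHz to get cycles per load. That conversion only works if the frequency is fixed, which is why turbo needs to be off. At an assumed fixed 3 GHz, 4 cycles is about 1.33 ns. Build with optimization so the loop body is essentially one load; loop overhead can otherwise inflate the number.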

In my tools, I use latency kernels based on Calibrator by Stefan Manegold: http://www.cwi.nl/~manegold/

Your 4 steps correspond to the load-to-use scenario.

Pat
