Sandy Bridge cache architecture

oliver_cscs · ‎09-06-2012

Hi

I am preparing a short presentation on Sandy Bridge's cache architecture. My primary reference is the "Intel(R) 64 and IA-32 Architectures Optimization Reference Manual, April 2012", where I found the following note on the L1 DCache prefetchers (2.1.5.4 Data Prefetching):

Two hardware prefetchers load data to the L1 DCache:
• Data cache unit (DCU) prefetcher. This prefetcher, also known as the
streaming prefetcher, is triggered by an ascending access to very recently loaded
data. The processor assumes that this access is part of a streaming algorithm
and automatically fetches the next line.
• Instruction pointer (IP)-based stride prefetcher. This prefetcher keeps
track of individual load instructions. If a load instruction is detected to have a
regular stride, then a prefetch is sent to the next address which is the sum of the
current address and the stride. This prefetcher can prefetch forward or backward
and can detect strides of up to 2K bytes.
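
As a rough illustration of the two prefetchers quoted above, here is a toy Python model. It is only a sketch: the per-IP table, the rule that a stride seen twice in a row counts as "regular", and the reading of "very recently loaded data" as "the same cache line" are all assumptions made for the example, not documented behavior of the hardware. The names `IPStridePrefetcher` and `dcu_next_line` are hypothetical.

```python
# Toy model of the two L1D prefetchers described in the manual excerpt.
# The table layout, the confidence rule, and all names are assumptions.

LINE_BYTES = 64     # cache line size in bytes
MAX_STRIDE = 2048   # "strides of up to 2K bytes" per the excerpt


class IPStridePrefetcher:
    """Tracks the last address and stride seen per load instruction (IP)."""

    def __init__(self):
        self.table = {}  # ip -> (last_addr, last_stride)

    def access(self, ip, addr):
        """Return the address to prefetch, or None if no regular stride yet."""
        prefetch = None
        if ip in self.table:
            last_addr, last_stride = self.table[ip]
            stride = addr - last_addr
            # Assumption: the same non-zero stride twice in a row is "regular".
            if stride != 0 and stride == last_stride and abs(stride) <= MAX_STRIDE:
                prefetch = addr + stride  # forward or backward, by sign of stride
            self.table[ip] = (addr, stride)
        else:
            self.table[ip] = (addr, 0)
        return prefetch


def dcu_next_line(prev_addr, addr):
    """Streaming (DCU) prefetcher sketch.

    Assumption: an ascending access within the same line as the previous
    access triggers a fetch of the next line. Returns that line's base
    address, or None if the prefetcher is not triggered.
    """
    same_line = prev_addr // LINE_BYTES == addr // LINE_BYTES
    if same_line and addr > prev_addr:
        return (addr // LINE_BYTES + 1) * LINE_BYTES
    return None
```

For example, a load instruction that touches addresses 1000, 1256, 1512 would make this model request 1768 on the third access, while two ascending accesses inside line 0 would make `dcu_next_line` request the line at address 64.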

My question now is where these hardware prefetchers read the data from: is it simply an access to L2, or can they bypass lower levels such as L2 or even the LLC?

Regards,

Oliver

TimP · ‎09-07-2012

Is your question on whether the L1 prefetcher can trigger prefetch in L2 or LLC in a case where the prefetchers haven't already gone to work in LLC? I don't even know whether such a situation is realistically possible. Is it about automatically triggered hardware prefetch or about software prefetch?

oliver_cscs · ‎09-10-2012

Hello Tim,

a. I would like to know more precisely how the data flow works when the hardware prefetchers are active. Could you point out the ways in which a cache line can get from memory to the L1D cache? Can the last-level cache be bypassed somehow?

b. I have read that L2 is non-inclusive/non-exclusive to L1D. Does this mean that a cache line can be transported from memory to the L1D cache without being written to L2?

c. Concerning software prefetching: what is the effect of PREFETCHNTA when used with Sandy Bridge?

Thank you!

Oliver

McCalpinJohn · ‎09-25-2012

Oliver,

You should probably assume that the L1 hardware prefetchers create transactions that are very similar to the transactions created by an ordinary L1 cache miss. The hardware does keep track of the fact that they are prefetches rather than demand misses, but under most circumstances they behave the same way.

The prefetch engine produces an address, and the address is looked up in the L1 cache, the L2 cache, and the L3 cache. If all of these miss, a global snoop and a DRAM read are issued. No matter where the data is found, it is sent to the (private) L1, (private) L2, and (shared) L3 caches.

The L1, L2, and L3 caches each select a victim line for the new entry to replace. The algorithms for choosing the victim are independent for each of the cache levels. The L3 is inclusive, so if the L3 victim is a line that is also contained in one or more L1 or L2 caches on the chip, it will be invalidated in all of those caches. The L2 is non-inclusive, so if the L2 victim is a line that is also contained in the corresponding L1 cache, the line in the L1 is *not* invalidated.

I don't know what happens on the writeback path if the L1 victim is dirty and no longer has a corresponding entry in the local L2. The implementation might be simpler if the L1 writeback just bypassed the L2 and went straight to the L3 (where there is guaranteed to be a copy of the line by the inclusivity property), but I have not seen any discussion of this in the Intel literature. The other option, of course, is that the L2 chooses a *second* victim line to be replaced by the dirty writeback from the L1 cache.

Prefetches generated by the L2 prefetchers are completely different.
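
The lookup, fill, and victim behavior described in this reply can be sketched as a small Python model. This is a simplification under stated assumptions, not a description of the real hardware: it models a single core, uses fully associative LRU caches with tiny made-up capacities, and ignores dirty lines and writebacks entirely. The class names `Cache` and `Core` are hypothetical.

```python
from collections import OrderedDict


class Cache:
    """Fully associative LRU cache of line addresses (a simplifying assumption)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # oldest entry first

    def lookup(self, line):
        if line in self.lines:
            self.lines.move_to_end(line)  # mark as most recently used
            return True
        return False

    def fill(self, line):
        """Insert a line and return the evicted victim, or None."""
        if line in self.lines:
            self.lines.move_to_end(line)
            return None
        victim = None
        if len(self.lines) >= self.capacity:
            victim, _ = self.lines.popitem(last=False)  # evict the LRU line
        self.lines[line] = True
        return victim

    def invalidate(self, line):
        self.lines.pop(line, None)


class Core:
    """One core with private L1/L2 and an (here unshared) inclusive L3."""

    def __init__(self, l1_lines=2, l2_lines=4, l3_lines=8):
        self.l1 = Cache(l1_lines)
        self.l2 = Cache(l2_lines)
        self.l3 = Cache(l3_lines)

    def load(self, line):
        """Return where the line was found; on an L1 miss, fill all levels."""
        if self.l1.lookup(line):
            return "L1"
        if self.l2.lookup(line):
            source = "L2"
        elif self.l3.lookup(line):
            source = "L3"
        else:
            source = "DRAM"
        # Each level picks its own victim independently.
        l3_victim = self.l3.fill(line)
        if l3_victim is not None:
            # Inclusive L3: its victim is invalidated in the inner caches.
            self.l1.invalidate(l3_victim)
            self.l2.invalidate(l3_victim)
        self.l2.fill(line)  # non-inclusive L2: its victim may remain in L1
        self.l1.fill(line)
        return source
```

With `Core(l1_lines=2, l2_lines=2, l3_lines=4)`, loading lines 0, 0, 1, 0, 2 leaves line 0 in L1 but not in L2, which is the non-inclusive case described above. Continuing with loads of 0, 3, 0, 4 evicts line 0 from the L3, and the model then removes it from L1 as well even though it was recently used there.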