Assume there is a load instruction whose cache line is not present in L1 or L2, but can be found in L3.
According to Intel's documentation, L2 and L3 are non-inclusive. So, on an L1 miss, is the cache line looked up in both L2 and L3 at the same time?
Or is it looked up only in L2, and sent to L3 only if L2 misses?
If the line is looked up in both L2 and L3, and L2 misses but L3 hits, is the cache line transferred to L1 directly or through L2? Will L2 keep a copy of it in case of a later reference, to avoid the long latency of accessing L3?
In general, the caches will be checked sequentially.
In this particular case, the L2 is private and the L3 is shared -- potentially across many cores (up to 40 in Ice Lake Xeon) -- so filtering out L2 hits is critical for keeping the L3 access rate low enough to be serviceable.
Some processors have done lookups in parallel -- AMD's Family 10h processors are an example. The L1 miss would generate a request to the L2 and a request to the L3. If the request hit in the L2, a "cancel" message would be sent to the L3. This allowed the L3 to drop the original request if it was still in a queue waiting to be processed.
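The difference between the two lookup policies can be captured in a toy latency model. All cycle counts below are invented for illustration; they are not real Intel or AMD numbers, and the function names are my own.

```python
# Toy latency model: sequential L2-then-L3 lookup vs. a parallel lookup
# that sends a "cancel" to the L3 on an L2 hit (Family 10h style).
# All cycle counts are made up for illustration.

L2_LOOKUP = 12   # cycles to determine L2 hit/miss and return data
L3_LOOKUP = 40   # cycles for an L3 access

def sequential(hit_in_l2: bool) -> int:
    """Check L2 first; only on a miss is a request sent to the L3."""
    return L2_LOOKUP if hit_in_l2 else L2_LOOKUP + L3_LOOKUP

def parallel_with_cancel(hit_in_l2: bool) -> int:
    """Send requests to L2 and L3 concurrently.
    An L2 hit cancels the L3 request (saving L3 bandwidth after the fact);
    an L2 miss pays only the longer of the two latencies."""
    return L2_LOOKUP if hit_in_l2 else max(L2_LOOKUP, L3_LOOKUP)

for hit in (True, False):
    print("L2 hit" if hit else "L2 miss",
          sequential(hit), parallel_with_cancel(hit))
```

With these numbers, the parallel scheme turns a 52-cycle L2-miss/L3-hit into a 40-cycle one, at the cost of sending every L1 miss to the shared L3 unless the cancel arrives in time.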
In some ways, L2 hardware prefetching on Intel processors serves the same function as concurrent lookup to L2 and L3. When the L2 cache sees two or more accesses within a 4KiB (aligned) block of memory, it computes the stride between the accesses and generates one or more prefetches for addresses in the (assumed) sequence. If the processor actually performs loads on these predicted addresses, the values may be found in the L2 or L3 caches -- even if those addresses had not been in any caches when the sequence of loads started.
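A minimal sketch of the behavior described above: detect a stride between two accesses within the same 4 KiB page and issue a few prefetches along the assumed sequence, stopping at the page boundary. The class name, the two-prefetch depth, and the tracking structure are my own simplifications, not Intel's actual design.

```python
# Sketch of a stride-detecting prefetcher confined to a 4 KiB page.
# Thresholds and structure are invented for illustration.

PAGE = 4096

class StridePrefetcher:
    def __init__(self):
        self.last_addr = {}   # page base -> last address seen in that page

    def access(self, addr):
        """Record an access; return any prefetch addresses it triggers."""
        page = addr & ~(PAGE - 1)
        prev = self.last_addr.get(page)
        self.last_addr[page] = addr
        if prev is None:
            return []                      # first access: no stride yet
        stride = addr - prev
        if stride == 0:
            return []
        # Issue up to two prefetches along the assumed sequence, stopping
        # at the 4 KiB boundary (hardware prefetchers do not cross pages,
        # since the next virtual page may map anywhere in physical memory).
        out = []
        nxt = addr + stride
        while len(out) < 2 and page <= nxt < page + PAGE:
            out.append(nxt)
            nxt += stride
        return out

pf = StridePrefetcher()
pf.access(0x1000)                           # first access: no prediction
print([hex(a) for a in pf.access(0x1080)])  # ['0x1100', '0x1180']
```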
Intel processors have more and more dynamically adaptive behaviors, based on current load and recent history. These mechanisms can change the timing of cache transactions in ways that are difficult to predict, control, or measure. For example, Intel processors have a "HITME" cache that looks for cache lines that are being rapidly exchanged between two or more cores and (if I understand correctly) modifies the specific cache transactions used to attempt to handle these transfers more efficiently.
In some cases an implementation may have more "steps" in a cache lookup than are immediately obvious. Every cache transaction will involve a tag check to compare the requested address to the addresses in the cache. In the event of a cache hit, the cache data array must be read to obtain the data values. The tag and data array accesses can also be sequential or concurrent. At the very lowest levels there can be "partially-overlapped" implementations -- e.g., while the tag array is being checked, the data array circuits are being "warmed up". This costs a little extra energy in the case of a miss, but overlaps part of the data array access latency in case of a hit.
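The serial-vs-overlapped tradeoff above is just arithmetic: overlap hides part of the data-array latency on a hit but spends "warm-up" energy that is wasted on a miss. The numbers below are invented for illustration.

```python
# Back-of-envelope model of serial vs. partially-overlapped tag/data
# access. All cycle and energy numbers are made up for illustration.

TAG = 2          # cycles to check the tag array
DATA = 4         # cycles to read the data array
OVERLAP = 2      # cycles of data-array warm-up hidden under the tag check

def hit_latency(overlapped: bool) -> int:
    # Overlap shaves the warmed-up portion off the hit latency.
    return TAG + DATA - (OVERLAP if overlapped else 0)

def energy(overlapped: bool, hit: bool) -> int:
    # A miss under the overlapped scheme wastes the warm-up energy;
    # the purely serial scheme never touches the data array on a miss.
    tag_e, warmup_e, read_e = 1, 1, 3
    e = tag_e
    if overlapped:
        e += warmup_e
    if hit:
        e += read_e
    return e

print(hit_latency(False), hit_latency(True))              # 6 4
print(energy(False, hit=False), energy(True, hit=False))  # 1 2
```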
Implementations differ on whether newly read cache lines are installed in multiple levels of the cache -- or whether they are even allowed to be installed in multiple levels of the cache. Intel processors *typically* install lines in both the L1 and L2 caches on an L2 miss, but this is not required. For example, if a prefetch hint or an adaptive hardware prefetcher thinks that a cache line will only be used once, it makes sense to put it only in the L1 and leave the L2 free to handle more of the data that will be re-used.