Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

SW & HW Prefetch LFB usage in Gracemont

ShaiHulud
Novice

Hello!

 

I have a few questions regarding prefetch behavior specific to the Gracemont microarchitecture and how different prefetch mechanisms interact with the Line Fill Buffer (LFB).

I’ve been studying Intel’s whitepaper on hardware prefetch controls for Atom cores (link), and I’d appreciate clarification on a few points:

1. Do all flavors of software prefetch instructions pass through the LFB? Does this include prefetches that target the L2 cache specifically?

2. Do both L1 and L2 hardware prefetchers utilize the LFB? From what I gather in the whitepaper, this seems to be the case, but I’d like confirmation.

3. Are software prefetches discarded when there are no available slots in the LFB? 
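
For concreteness, these are the flavors I have in mind, as a minimal sketch using the SSE prefetch intrinsics (the per-hint comments reflect my reading of the documentation for mainstream cores, not verified Gracemont behavior):

    #include <xmmintrin.h>   /* _mm_prefetch and the _MM_HINT_* constants */

    void prefetch_flavors(const char *p)
    {
        _mm_prefetch(p, _MM_HINT_T0);   /* PREFETCHT0:  into all cache levels        */
        _mm_prefetch(p, _MM_HINT_T1);   /* PREFETCHT1:  into L2 and higher           */
        _mm_prefetch(p, _MM_HINT_T2);   /* PREFETCHT2:  L3/L2, implementation-varies */
        _mm_prefetch(p, _MM_HINT_NTA);  /* PREFETCHNTA: non-temporal hint            */
    }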

 

Thanks in advance for any insights or technical clarifications on these questions!

4 Replies
ShaiHulud
Novice

I realized I should clarify the terminology here. Where I’ve referred to the Line Fill Buffer (LFB), it would be more accurate to use L2 Queue (L2Q) instead. Both the whitepaper and Intel’s 64 and IA-32 Architectures Optimization Reference Manual (August 2023, Figure 4-1) label it this way. Apologies for any confusion!

McCalpinJohn
Honored Contributor III

I have not spent time with the Atom series of processors lately, but typically the LFB must handle:

  • all L1 Data Cache misses for all demand accesses (read, RFO, and Uncached or WriteCombining Stores), and
  • all SW prefetch accesses that will return data to the L1, and
  • all L1 Data HW prefetch accesses
    • In all the Intel systems I have seen the L1 HW prefetch operations all bring data into the L1 caches, so the operations must be tracked by the LFB.
    • This includes the core's Next-Page Prefetcher (which is often excluded from documentation of the other HW prefetchers)

The LFB is not directly involved in prefetches generated by the L2 HW prefetchers.  These are not directly generated by the core and do not access the L1 Data Caches of the cores sharing the L2 Cache and L2 HW Prefetchers.
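
As an aside, if you want to check which prefetchers are active while testing, the controls described in the whitepaper are exposed through an MSR. A minimal sketch of reading it through the Linux msr driver follows -- I am assuming MSR 0x1A4 here, as on the mainstream cores I know; check the whitepaper for the Gracemont-specific bit assignments before interpreting (or especially writing) any bits:

    /* Minimal sketch: read the hardware-prefetch control MSR through the
     * Linux msr driver.  Needs root and 'modprobe msr'.  The address 0x1A4
     * matches mainstream cores; verify the Gracemont layout in the
     * whitepaper before relying on it. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/cpu/0/msr", O_RDONLY);
        if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

        uint64_t val;
        if (pread(fd, &val, sizeof(val), 0x1A4) != sizeof(val)) {
            perror("pread MSR 0x1A4");
            return 1;
        }
        printf("MSR 0x1A4 = 0x%016llx\n", (unsigned long long)val);

        close(fd);
        return 0;
    }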

The L2Q must handle

  • copies of the transactions coming from the LFBs of the cores, and
  • transactions coming from the L2 HW prefetchers that target the L2 cache (MLC_Streamer, Adaptive Multi-Path, and L2 Next Line Prefetch)

The XQ must handle

  • copies of transactions in the L2Q that miss in the L2 cache, and
  • LLC_Prefetch transactions from the LLC Streamer (one of the L2 HW Prefetchers)

Transactions will occupy their queue entry for different periods of time:

  • A snoop request that does not need a data return may only occupy the queue for a few cycles.
  • A data request that hits in the next-level cache will remain in the queue a bit longer.
  • Data requests that have to retrieve values from memory will remain in the queue the longest.
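
A rough Little's Law illustration of the difference (all numbers invented for the example): average occupancy equals latency times request rate, so sustaining 64-Byte lines at 20 GB/s with ~100 ns memory latency needs (20e9 / 64) x 100e-9 ≈ 31 entries in flight at once, while an L2 hit at ~20 cycles on a 3 GHz core holds its entry for under 7 ns, and a snoop with no data return for even less.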

To make it more confusing, transactions can change type while occupying a queue entry.  A common case is an L2Q entry allocated for an L2 Hardware Prefetch transaction that is still pending when a demand miss for the line arrives from the cores.  This changes the transaction type, which may change the way it is counted by certain hardware performance counters.

 

I have seen conflicting answers on whether Intel processors drop SW prefetches when the LFB is full. This may indicate that the exact behavior is model-dependent.

ShaiHulud
Novice

Thank you, this is most helpful!

If I may, I would like to pick your brain a bit more.

You mentioned that typically the LFB must handle "all SW prefetch accesses that will return data to the L1".
PREFETCHNTA and PREFETCHT0 produce such prefetch accesses.

But what about PREFETCHT1?

I found the following figure in https://dl.acm.org/doi/10.1145/3662010.3663451 (not specific to the Atom),
where the authors imply that PREFETCHT1 goes through the LFB:

[Attached figure: lifecycle-sw-prefs.png]

McCalpinJohn
Honored Contributor III

Good point on the PREFETCHT1/PREFETCHT2 instructions.  It is possible that these are implemented in a way that avoids holding an LFB entry, but that would have to be tested.  (Assuming that the Atom processors have the necessary events -- on the mainstream Xeon processors I use L1D_PEND_MISS.PENDING and related events to compute LFB occupancy.)
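
If someone wants to run that test, here is a rough sketch of a kernel to run under a PMU tool (all sizes illustrative): flush a region, issue a PREFETCHT1 burst far larger than the handful of LFB entries, then demand-load the same lines and compare the demand misses at the L2 against a run with the burst removed. Similar demand-miss counts in both runs would suggest that prefetches issued into a full LFB were dropped.

    /* Rough sketch: does a dense PREFETCHT1 burst survive a full LFB?
     * Run under a PMU tool and compare L2 demand misses for the final
     * loop with and without the prefetch burst.  Sizes illustrative. */
    #include <immintrin.h>   /* _mm_prefetch, _mm_clflush, _mm_mfence */
    #include <stdlib.h>
    #include <string.h>

    #define LINES 1024       /* 64 KiB: misses L1D, fits easily in L2 */

    int main(void)
    {
        char *buf = aligned_alloc(64, LINES * 64);
        volatile long sum = 0;

        memset(buf, 1, LINES * 64);

        for (int i = 0; i < LINES; i++)      /* evict the region */
            _mm_clflush(buf + i * 64);
        _mm_mfence();

        for (int i = 0; i < LINES; i++)      /* burst: far more than the LFB can track at once */
            _mm_prefetch(buf + i * 64, _MM_HINT_T1);

        for (int i = 0; i < LINES; i++)      /* demand loads; count their L2 misses */
            sum += buf[i * 64];

        free(buf);
        return (int)sum & 1;
    }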
