<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: SW &amp; HW Prefetch LFB usage in Gracemont in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/SW-amp-HW-Prefetch-LFB-usage-in-Gracemont/m-p/1646934#M8481</link>
    <description>&lt;P&gt;Good point on the PREFETCHT1/PREFETCHT2 instructions. &amp;nbsp;It is possible that these might be implemented in a way that avoids holding an LFB entry, but this would have to be tested. &amp;nbsp;(Assuming that the Atom processors have the necessary events -- on the mainstream Xeon processors I use events like L1D_PEND_MISS.PENDING and related events to compute LFB occupancy.)&lt;/P&gt;</description>
    <pubDate>Mon, 02 Dec 2024 20:47:12 GMT</pubDate>
    <dc:creator>McCalpinJohn</dc:creator>
    <dc:date>2024-12-02T20:47:12Z</dc:date>
    <item>
      <title>SW &amp; HW Prefetch LFB usage in Gracemont</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/SW-amp-HW-Prefetch-LFB-usage-in-Gracemont/m-p/1642705#M8468</link>
      <description>&lt;P&gt;Hello!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have a few questions regarding prefetch behavior specific to the &lt;STRONG&gt;Gracemont microarchitecture&lt;/STRONG&gt; and how different prefetch mechanisms interact with the &lt;STRONG&gt;Line Fill Buffer (LFB)&lt;/STRONG&gt;.&lt;BR /&gt;&lt;BR /&gt;I’ve been studying Intel’s whitepaper on hardware prefetch controls for Atom cores (&lt;A href="https://www.intel.com/content/www/us/en/content-details/795247/hardware-prefetch-controls-for-intel-atom-cores.html" target="_blank" rel="noopener"&gt;link&lt;/A&gt;), and I’d appreciate clarification on a few points:&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;1. &lt;/SPAN&gt;&lt;STRONG&gt;Do all flavors of software prefetch instructions pass through the LFB?&lt;/STRONG&gt; Does this include prefetches that target the L2 cache specifically?&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;2. &lt;/SPAN&gt;&lt;STRONG&gt;Do both L1 and L2 hardware prefetchers utilize the LFB?&lt;/STRONG&gt; From what I gather in the whitepaper, this seems to be the case, but I’d like confirmation.&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;3. &lt;/SPAN&gt;&lt;STRONG&gt;Are software prefetches discarded when there are no available slots in the LFB?&lt;/STRONG&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks in advance for any insights or technical clarifications on these questions!&lt;/P&gt;</description>
      <pubDate>Tue, 12 Nov 2024 08:29:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/SW-amp-HW-Prefetch-LFB-usage-in-Gracemont/m-p/1642705#M8468</guid>
      <dc:creator>ShaiHulud</dc:creator>
      <dc:date>2024-11-12T08:29:14Z</dc:date>
    </item>
    <item>
      <title>Re: SW &amp; HW Prefetch LFB usage in Gracemont</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/SW-amp-HW-Prefetch-LFB-usage-in-Gracemont/m-p/1642721#M8469</link>
      <description>&lt;P&gt;I realized I should clarify the terminology here. Where I’ve referred to the &lt;STRONG&gt;Line Fill Buffer (LFB)&lt;/STRONG&gt;, it would be more accurate to use &lt;STRONG&gt;L2 Queue (L2Q)&lt;/STRONG&gt; instead. Both the whitepaper and Intel’s &lt;I&gt;Architectures Optimization Reference Manual&lt;/I&gt; (August 2023, Figure 4-1) label it this way. Apologies for any confusion!&lt;/P&gt;</description>
      <pubDate>Tue, 12 Nov 2024 09:45:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/SW-amp-HW-Prefetch-LFB-usage-in-Gracemont/m-p/1642721#M8469</guid>
      <dc:creator>ShaiHulud</dc:creator>
      <dc:date>2024-11-12T09:45:44Z</dc:date>
    </item>
    <item>
      <title>Re: SW &amp; HW Prefetch LFB usage in Gracemont</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/SW-amp-HW-Prefetch-LFB-usage-in-Gracemont/m-p/1645624#M8477</link>
      <description>&lt;P&gt;I have not spent time with the Atom series of processors lately, but typically the LFB must handle:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;all L1 Data Cache misses for all demand accesses (read, RFO, and Uncached or WriteCombining Stores), and&lt;/LI&gt;&lt;LI&gt;all SW prefetch accesses that will return data to the L1, and&lt;/LI&gt;&lt;LI&gt;all L1 Data HW prefetch accesses&lt;UL&gt;&lt;LI&gt;In all the Intel systems I have seen, the L1 HW prefetch operations all bring data into the L1 caches, so the operations must be tracked by the LFB.&lt;/LI&gt;&lt;LI&gt;This includes the core's Next-Page Prefetcher (which is often excluded from documentation of the other HW prefetchers)&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;The LFB is not directly involved in prefetches generated by the L2 HW prefetchers. &amp;nbsp;These are not directly generated by the core and do not access the L1 Data Caches of the cores sharing the L2 Cache and L2 HW Prefetchers.&lt;/P&gt;&lt;P&gt;The L2Q must handle:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;copies of the transactions coming from the LFBs of the cores, and&lt;/LI&gt;&lt;LI&gt;transactions coming from the L2 HW prefetchers that target the L2 cache (MLC_Streamer, Adaptive Multi-Path, and L2 Next Line Prefetch)&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;The XQ must handle:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;copies of transactions in the L2Q that miss in the L2 cache, and&lt;/LI&gt;&lt;LI&gt;LLC_Prefetch transactions from the LLC Streamer (one of the L2 HW Prefetchers)&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Transactions will occupy their queue entry for different periods of time:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;A snoop request that does not need a data return may only occupy the queue for a few cycles.&lt;/LI&gt;&lt;LI&gt;A data request that hits in the next-level cache will remain in the queue a bit longer.&lt;/LI&gt;&lt;LI&gt;Data requests that have to retrieve values from memory will remain in the queue the longest.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;To make it more confusing, transactions can change type while occupying a queue entry. &amp;nbsp;A common case is an L2Q entry allocated for an L2 Hardware Prefetch transaction that is still pending when a demand miss for the line arrives from the cores. &amp;nbsp;This changes the transaction type, which may change the way it is counted by certain hardware performance counters.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have seen conflicting answers on whether Intel processors drop SW prefetches when the LFB is full. This may indicate that the exact behavior is model-dependent.&lt;/P&gt;</description>
      <pubDate>Mon, 25 Nov 2024 18:58:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/SW-amp-HW-Prefetch-LFB-usage-in-Gracemont/m-p/1645624#M8477</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2024-11-25T18:58:24Z</dc:date>
    </item>
    <item>
      <title>Re: SW &amp; HW Prefetch LFB usage in Gracemont</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/SW-amp-HW-Prefetch-LFB-usage-in-Gracemont/m-p/1646701#M8480</link>
      <description>&lt;P&gt;Thank you, this is most helpful!&lt;BR /&gt;&lt;BR /&gt;If I may, I would like to pick your brain a bit more &lt;LI-EMOJI id="lia_slightly-smiling-face" title=":slightly_smiling_face:"&gt;&lt;/LI-EMOJI&gt;&lt;BR /&gt;&lt;BR /&gt;You mentioned that typically the LFB must handle "all SW prefetch accesses that will return data to the L1".&lt;BR /&gt;prefetchNTA and prefetchT0 produce such prefetch accesses.&lt;BR /&gt;&lt;BR /&gt;But what about prefetchT1?&lt;BR /&gt;&lt;BR /&gt;I found the following figure in&amp;nbsp;&lt;A href="https://dl.acm.org/doi/10.1145/3662010.3663451" target="_blank"&gt;https://dl.acm.org/doi/10.1145/3662010.3663451&lt;/A&gt;&amp;nbsp;(not specific to the Atom),&lt;BR /&gt;where the authors imply that prefetchT1 goes through the LFB:&lt;BR /&gt;&lt;BR /&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="lifecycle-sw-prefs.png" style="width: 503px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/60722i1D379A02A55BEB4F/image-dimensions/503x177/is-moderation-mode/true?v=v2&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" width="503" height="177" role="button" title="lifecycle-sw-prefs.png" alt="lifecycle-sw-prefs.png" /&gt;&lt;/span&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 01 Dec 2024 19:19:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/SW-amp-HW-Prefetch-LFB-usage-in-Gracemont/m-p/1646701#M8480</guid>
      <dc:creator>ShaiHulud</dc:creator>
      <dc:date>2024-12-01T19:19:28Z</dc:date>
    </item>
    <item>
      <title>Re: SW &amp; HW Prefetch LFB usage in Gracemont</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/SW-amp-HW-Prefetch-LFB-usage-in-Gracemont/m-p/1646934#M8481</link>
      <description>&lt;P&gt;Good point on the PREFETCHT1/PREFETCHT2 instructions. &amp;nbsp;It is possible that these might be implemented in a way that avoids holding an LFB entry, but this would have to be tested. &amp;nbsp;(Assuming that the Atom processors have the necessary events -- on the mainstream Xeon processors I use events like L1D_PEND_MISS.PENDING and related events to compute LFB occupancy.)&lt;/P&gt;</description>
      <pubDate>Mon, 02 Dec 2024 20:47:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/SW-amp-HW-Prefetch-LFB-usage-in-Gracemont/m-p/1646934#M8481</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2024-12-02T20:47:12Z</dc:date>
    </item>
  </channel>
</rss>

