<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic [Sandy-bridge loop buffer]  in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-loop-buffer/m-p/774581#M242</link>
    <description>Zakaria,&lt;BR /&gt;&lt;BR /&gt;you are welcome. The Optimization Reference Guide is full of gems, but they are very easy to miss.&lt;BR /&gt;&lt;BR /&gt;Kind regards&lt;BR /&gt;Thomas</description>
    <pubDate>Fri, 16 Dec 2011 15:01:45 GMT</pubDate>
    <dc:creator>Thomas_W_Intel</dc:creator>
    <dc:date>2011-12-16T15:01:45Z</dc:date>
    <item>
      <title>[Sandy-bridge loop buffer]</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-loop-buffer/m-p/774578#M239</link>
      <description>Hello all,&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;I have a question about branch prediction on the Sandy Bridge platform.&lt;BR /&gt;&lt;BR /&gt;It is known that the latest Intel architectures have a loop buffer after the decoder, which turns off the pipeline's front end if the loop size doesn't exceed a certain bound.&lt;BR /&gt;&lt;BR /&gt;My question is: what happens if the loop fits into the loop buffer and has a branch inside? Will the buffer stay active only until the branch is encountered, or is it coupled with the branch predictor?&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Zakaria</description>
      <pubDate>Fri, 16 Dec 2011 10:03:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-loop-buffer/m-p/774578#M239</guid>
      <dc:creator>zakaria-bendifallah</dc:creator>
      <dc:date>2011-12-16T10:03:45Z</dc:date>
    </item>
    <item>
      <title>[Sandy-bridge loop buffer]</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-loop-buffer/m-p/774579#M240</link>
      <description>Zakaria,&lt;BR /&gt;&lt;BR /&gt;there can be branches inside a loop that is executed by the loop stream detector.&lt;BR /&gt;&lt;BR /&gt;The &lt;A href="http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html"&gt;Intel 64 and IA-32 Architectures Optimization Reference Manual &lt;/A&gt;lists the necessary conditions in section 2.1.2:&lt;BR /&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;&lt;EM&gt;The loops with the following attributes qualify for LSD/micro-op queue replay:&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt; Up to eight chunk fetches of 32-instruction-bytes&lt;BR /&gt;&lt;/EM&gt;&lt;EM&gt; Up to 28 micro-ops (~28 instructions)&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt; All micro-ops are also resident in the Decoded ICache&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt; Can contain no more than eight taken branches and none of them can be a CALL or RET&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt; Cannot have mismatched stack operations. For example, more PUSH than POP instructions&lt;/EM&gt;.&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;BR /&gt;Kind regards&lt;BR /&gt;Thomas</description>
      <pubDate>Fri, 16 Dec 2011 13:40:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-loop-buffer/m-p/774579#M240</guid>
      <dc:creator>Thomas_W_Intel</dc:creator>
      <dc:date>2011-12-16T13:40:22Z</dc:date>
    </item>
    <item>
      <title>[Sandy-bridge loop buffer]</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-loop-buffer/m-p/774580#M241</link>
      <description>Hi Thomas,&lt;BR /&gt;&lt;BR /&gt;Sorry, I forgot to check the manual.&lt;BR /&gt;Well, up to 8 taken branches, this is just wonderful :)&lt;BR /&gt;&lt;BR /&gt;Thanks a lot.&lt;BR /&gt;&lt;BR /&gt;Best regards,&lt;BR /&gt;Zakaria</description>
      <pubDate>Fri, 16 Dec 2011 14:51:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-loop-buffer/m-p/774580#M241</guid>
      <dc:creator>zakaria-bendifallah</dc:creator>
      <dc:date>2011-12-16T14:51:20Z</dc:date>
    </item>
    <item>
      <title>[Sandy-bridge loop buffer]</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-loop-buffer/m-p/774581#M242</link>
      <description>Zakaria,&lt;BR /&gt;&lt;BR /&gt;you are welcome. The Optimization Reference Guide is full of gems, but they are very easy to miss.&lt;BR /&gt;&lt;BR /&gt;Kind regards&lt;BR /&gt;Thomas</description>
      <pubDate>Fri, 16 Dec 2011 15:01:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-loop-buffer/m-p/774581#M242</guid>
      <dc:creator>Thomas_W_Intel</dc:creator>
      <dc:date>2011-12-16T15:01:45Z</dc:date>
    </item>
    <item>
      <title>[Sandy-bridge loop buffer]</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-loop-buffer/m-p/774582#M243</link>
      <description>One gem I missed is the question of whether Loop Stream detection differs between Nehalem and Sandy Bridge. The micro-op cache on Sandy Bridge is intended to supplement Loop Stream detection, as I understand it.&lt;BR /&gt;For GNU compilers, I have found the options -funroll-loops --param max-unroll-times=4 effective with the Loop Stream implementation from Nehalem on. Small loops may be unrolled by 4 to good effect and still hit the loop stream detector or the micro-op cache (even when there is an if block), eliminating the need for more aggressive unrolling.</description>
      <pubDate>Mon, 19 Dec 2011 12:21:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-loop-buffer/m-p/774582#M243</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2011-12-19T12:21:37Z</dc:date>
    </item>
  </channel>
</rss>

