<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Hi Iiya &amp; Tim, in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Slow-code-execution/m-p/974344#M2902</link>
    <description>&lt;P&gt;Hi Iiya &amp;amp; Tim,&lt;/P&gt;

&lt;P&gt;I hadn't heard about partial flag stalls, but I tried it on a Core i7 and the results aren't much different: both functions run faster, but there's still a noticeable difference between the execution times of fn1() and fn2().&lt;/P&gt;

&lt;P&gt;&lt;A href="http://pastie.org/8694561" target="_blank"&gt;http://pastie.org/8694561&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;I have included the relevant parts. gcc seems to optimize /10 and %10 by using mul instead of div.&lt;/P&gt;</description>
    <pubDate>Mon, 03 Feb 2014 15:59:59 GMT</pubDate>
    <dc:creator>mlf_c_</dc:creator>
    <dc:date>2014-02-03T15:59:59Z</dc:date>
    <item>
      <title>Slow code execution</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Slow-code-execution/m-p/974339#M2897</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;When I run the following code, which consists of fn1() and fn2(), on my Intel Penryn ULV 1.4 Core 2 Duo:&lt;/P&gt;

&lt;P&gt;&lt;A href="http://paste.org/70232" target="_blank"&gt;http://paste.org/70232&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;fn1() is visibly slower than fn2(). Inspecting the .s assembly output from gcc -S, I noticed that fn1() basically loops a decl instruction ~64 times, while fn2() consists of ~23 instructions, including 2 mul instructions, which are repeated 10 times in this example. Despite this, fn1() is ~3 times slower. (I compiled without -O; otherwise gcc applies optimizations that alter the nature of fn1().)&lt;/P&gt;

&lt;P&gt;Would someone be so kind as to explain why fn1() executes more slowly?&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;thanks,&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;M&lt;/P&gt;</description>
      <pubDate>Sat, 01 Feb 2014 23:18:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Slow-code-execution/m-p/974339#M2897</guid>
      <dc:creator>mlf_c_</dc:creator>
      <dc:date>2014-02-01T23:18:20Z</dc:date>
    </item>
    <item>
      <title>While looking at source code</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Slow-code-execution/m-p/974340#M2898</link>
      <description>&lt;P&gt;While looking at source code it seems that fn2() should be slower because of modulo operation and division-assignment operation.Upon closer inspection of fn2() variable i &amp;nbsp;is not used and optimizing compiler can exclude this line of code from the compilation.First function has 64 decrement operations and backward conditional jumps.&lt;/P&gt;

&lt;P&gt;I suppose that during the looped execution of both functions inside main(), fn2() could be further optimized by the compiler once it realizes that fn2() performs the same operation on every loop iteration.&lt;/P&gt;</description>
      <pubDate>Sun, 02 Feb 2014 10:43:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Slow-code-execution/m-p/974340#M2898</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2014-02-02T10:43:01Z</dc:date>
    </item>
    <item>
      <title>@mlf.c</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Slow-code-execution/m-p/974341#M2899</link>
      <description>&lt;P&gt;@mlf.c&lt;/P&gt;

&lt;P&gt;Can you post disassembled code?&lt;/P&gt;</description>
      <pubDate>Sun, 02 Feb 2014 10:44:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Slow-code-execution/m-p/974341#M2899</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2014-02-02T10:44:52Z</dc:date>
    </item>
    <item>
      <title>Are you trying to verify past</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Slow-code-execution/m-p/974342#M2900</link>
      <description>&lt;P&gt;Are you trying to verify past research about Penryn partial flag stalls?&lt;/P&gt;

&lt;P&gt;Do you remember how Intel worked to get compilers changed to use addl -1 in place of decl, and the world refused to use special options to handle this?&lt;/P&gt;

&lt;P&gt;Are you tied to some specific combination of gcc version and -mtune options?&lt;/P&gt;</description>
      <pubDate>Sun, 02 Feb 2014 13:58:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Slow-code-execution/m-p/974342#M2900</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2014-02-02T13:58:35Z</dc:date>
    </item>
    <item>
      <title>@Tim</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Slow-code-execution/m-p/974343#M2901</link>
      <description>&lt;P&gt;@Tim&lt;/P&gt;

&lt;P&gt;Do you mean partial flag merge stalls?&lt;/P&gt;</description>
      <pubDate>Mon, 03 Feb 2014 08:11:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Slow-code-execution/m-p/974343#M2901</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2014-02-03T08:11:06Z</dc:date>
    </item>
    <item>
      <title>Hi Iiya &amp; Tim,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Slow-code-execution/m-p/974344#M2902</link>
      <description>&lt;P&gt;Hi Iiya &amp;amp; Tim,&lt;/P&gt;

&lt;P&gt;I hadn't heard about partial flag stalls, but I tried it on a Core i7 and the results aren't much different: both functions run faster, but there's still a noticeable difference between the execution times of fn1() and fn2().&lt;/P&gt;

&lt;P&gt;&lt;A href="http://pastie.org/8694561" target="_blank"&gt;http://pastie.org/8694561&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;I have included the relevant parts. gcc seems to optimize /10 and %10 by using mul instead of div.&lt;/P&gt;</description>
      <pubDate>Mon, 03 Feb 2014 15:59:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Slow-code-execution/m-p/974344#M2902</guid>
      <dc:creator>mlf_c_</dc:creator>
      <dc:date>2014-02-03T15:59:59Z</dc:date>
    </item>
    <item>
      <title>Hello mlf,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Slow-code-execution/m-p/974345#M2903</link>
      <description>&lt;P&gt;Hello mlf,&lt;/P&gt;

&lt;P&gt;Are these 2 sections of code important to&amp;nbsp;a real&amp;nbsp;application or are you just curious?&lt;/P&gt;

&lt;P&gt;When I compile with optimization turned on in VC12, both routines get optimized away, since they don't return a value and don't change any non-local variable.&lt;/P&gt;

&lt;P&gt;Assuming this is not just idle curiosity or a homework assignment: you don't really have any timer info around the routines, so it is hard to say how many instructions per clock tick each function is executing.&lt;/P&gt;
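
&lt;P&gt;As a rough sketch of what timer info around the routines could look like (Python for brevity): fn1 and fn2 below are hypothetical stand-ins, not the posted code, and best_time is an assumed helper name. Python timings say nothing about instructions per clock tick in the compiled C. The sketch only shows the shape of the harness, with each routine returning a value so its work has an observable result.&lt;/P&gt;

```python
import time


def fn1(start=64):
    """Hypothetical stand-in for fn1(): one decrement per loop iteration."""
    i = start
    steps = 0
    while i >= 0:
        i -= 1
        steps += 1
    return steps  # returned so the work has an observable result


def fn2(n=1234567890):
    """Hypothetical stand-in for fn2(): peel off decimal digits with % and //."""
    digit_sum = 0
    while n > 0:
        digit_sum += n % 10
        n //= 10
    return digit_sum


def best_time(func, repeats=5, calls=10000):
    """Smallest wall-clock time in seconds for `calls` invocations of func."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        for _ in range(calls):
            func()
        best = min(best, time.perf_counter() - t0)
    return best


if __name__ == "__main__":
    for f in (fn1, fn2):
        print(f.__name__, "result:", f(), "best seconds:", best_time(f))
```

&lt;P&gt;Taking the minimum over several repeats reduces noise from other activity on the machine.&lt;/P&gt;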

&lt;P&gt;Pat&lt;/P&gt;</description>
      <pubDate>Mon, 03 Feb 2014 17:09:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Slow-code-execution/m-p/974345#M2903</guid>
      <dc:creator>Patrick_F_Intel1</dc:creator>
      <dc:date>2014-02-03T17:09:55Z</dc:date>
    </item>
    <item>
      <title>Hi Pat,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Slow-code-execution/m-p/974346#M2904</link>
      <description>&lt;P&gt;Hi Pat,&lt;/P&gt;

&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;

&lt;DIV&gt;To pinpoint the problem, I removed all the parts of the code that didn't seem to affect the speed of execution and ended up with this simple piece of code. Using clock() does show that fn1() is much slower, although it's not very precise. From looking at the assembly code posted above, I assume movl, addl, subl, sall, shrl, cmp and the jumps are still one-clock instructions (I haven't been coding for a while :) so there are 22 instructions + 2 mull instructions repeated 10 times, as opposed to the slower subl, cmp, jns repeated 65 times.&lt;/DIV&gt;</description>
      <pubDate>Mon, 03 Feb 2014 19:38:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Slow-code-execution/m-p/974346#M2904</guid>
      <dc:creator>mlf_c_</dc:creator>
      <dc:date>2014-02-03T19:38:35Z</dc:date>
    </item>
    <item>
      <title>@mlf.c</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Slow-code-execution/m-p/974347#M2905</link>
      <description>&lt;P&gt;@mlf.c&lt;/P&gt;

&lt;P&gt;Maybe the presence of the shrl instruction causes the aforementioned flag merge stalls?&lt;/P&gt;

&lt;P&gt;Can you run VTune analysis on your code?&lt;/P&gt;</description>
      <pubDate>Wed, 05 Feb 2014 05:40:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Slow-code-execution/m-p/974347#M2905</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2014-02-05T05:40:51Z</dc:date>
    </item>
    <item>
      <title>Actually cmp jmp branch</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Slow-code-execution/m-p/974348#M2906</link>
      <description>&lt;P&gt;Actually cmp jmp branch instruction can be executed in parallel with variable decrement instruction,although dec instruction uop must wait probably for the result of branch instruction.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 06 Feb 2014 04:53:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Slow-code-execution/m-p/974348#M2906</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2014-02-06T04:53:53Z</dc:date>
    </item>
  </channel>
</rss>

