<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Question regarding parallel execution of instructions in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Question-regarding-parallel-execution-of-instructions/m-p/860019#M2261</link>
    <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/406443"&gt;houyunqing&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt;
&lt;P&gt;&lt;SPAN style="font-family: verdana; font-size: 13px;"&gt;The processor I'm using is an Intel Core&lt;SUP&gt;TM&lt;/SUP&gt; Duo for Centrino, running at 1.67 GHz.&lt;BR /&gt;It has 3 ALUs in one core, so I'm running a test to see how well it executes instructions in parallel.&lt;BR /&gt;[...]&lt;BR /&gt;So what is causing the extra latency? Is it due to some instruction/decoding caching mechanism?&lt;BR /&gt;&lt;BR /&gt;testmac macro&lt;BR /&gt; add eax, 1  ;instruction 1&lt;BR /&gt; add ebx, 1  ;instruction 2&lt;BR /&gt; add edx, 1  ;instruction 3&lt;BR /&gt;endm&lt;BR /&gt; mov ecx, 100000&lt;BR /&gt; mov eax, 0&lt;BR /&gt; mov ebx, 0&lt;BR /&gt; mov edx, 0&lt;BR /&gt;align 16&lt;BR /&gt; @@loop:&lt;BR /&gt; testmac1000 ; this is just a macro containing 1000 testmac&lt;BR /&gt; testmac1000&lt;BR /&gt; testmac1000&lt;BR /&gt; sub ecx, 1&lt;BR /&gt; jnz @@loop&lt;/SPAN&gt;&lt;/P&gt;
&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;Maybe it's branch prediction? Can you test the following rewrite?&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;align 16&lt;/P&gt;
&lt;P&gt;@@loop:&lt;/P&gt;
&lt;P&gt;testmac1000&lt;/P&gt;
&lt;P&gt;testmac1000&lt;/P&gt;
&lt;P&gt;testmac1000&lt;/P&gt;
&lt;P&gt;dec ecx&lt;/P&gt;
&lt;P&gt;jz @@exit&lt;/P&gt;
&lt;P&gt;jmp @@loop&lt;/P&gt;
&lt;P&gt;@@exit:&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;On AMD, branch prediction works the following way by default: the processor assumes that a branch will not be taken. Once the branch has been taken, even a single time, the processor tries to fetch and execute the code located at @@loop and the code located at @@exit at the same time. So I always reorder my loops to use a conditional jump for the exit path (rarely taken) and an unconditional jump for looping. I don't know whether Intel processors use the same technique, but it's easy to test.&lt;/P&gt;
&lt;P&gt;Best regards&lt;/P&gt;</description>
    <pubDate>Sat, 25 Oct 2008 14:51:10 GMT</pubDate>
    <dc:creator>fb251</dc:creator>
    <dc:date>2008-10-25T14:51:10Z</dc:date>
    <item>
      <title>Question regarding parallel execution of instructions</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Question-regarding-parallel-execution-of-instructions/m-p/860018#M2260</link>
      <description>&lt;P&gt;&lt;SPAN style="font-family: verdana; font-size: 13px;"&gt;The processor I'm using is an Intel Core&lt;SUP&gt;TM&lt;/SUP&gt; Duo for Centrino, running at 1.67 GHz.&lt;BR /&gt;It has 3 ALUs in one core, so I'm running a test to see how well it executes instructions in parallel.&lt;BR /&gt;&lt;BR /&gt;Below, the macro testmac is the set of instructions to be executed in parallel, and testmac1000 in the middle of the code is another macro containing 1000 copies of testmac.&lt;BR /&gt;&lt;BR /&gt;When I execute the following code (in WinXP, giving the process realtime priority), the time it takes to finish the loop varies between 0.203 and 0.219 seconds (minimum 1.13 cycles per testmac).&lt;BR /&gt;The strange thing is, when I remove instruction 3, the time varies between 0.187 and 0.204 seconds (minimum 1.04 cycles per testmac),&lt;BR /&gt;and when I remove instruction 2 as well, the time varies between 0.172 and 0.188 seconds (minimum 0.957 cycles per testmac).&lt;BR /&gt;&lt;BR /&gt;Why should there be a difference?&lt;BR /&gt;The core has 4 decoders, so decoding shouldn't be the factor causing the extra latency, right?&lt;BR /&gt;And it's capable of retiring up to 4 instructions per cycle, while my code only requires it to retire 3 instructions per cycle, so retirement can't be the problem either, right?&lt;BR /&gt;So what is causing the extra latency? Is it due to some instruction/decoding caching mechanism?&lt;BR /&gt;&lt;BR /&gt;testmac macro&lt;BR /&gt;add eax, 1  ;instruction 1&lt;BR /&gt;add ebx, 1  ;instruction 2&lt;BR /&gt;add edx, 1  ;instruction 3&lt;BR /&gt;endm&lt;BR /&gt;mov ecx, 100000&lt;BR /&gt;mov eax, 0&lt;BR /&gt;mov ebx, 0&lt;BR /&gt;mov edx, 0&lt;BR /&gt;align 16&lt;BR /&gt;@@loop:&lt;BR /&gt;testmac1000 ; this is just a macro containing 1000 testmac&lt;BR /&gt;testmac1000&lt;BR /&gt;testmac1000&lt;BR /&gt;sub ecx, 1&lt;BR /&gt;jnz @@loop&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 25 Oct 2008 13:30:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Question-regarding-parallel-execution-of-instructions/m-p/860018#M2260</guid>
      <dc:creator>houyunqing</dc:creator>
      <dc:date>2008-10-25T13:30:34Z</dc:date>
    </item>
    <item>
      <title>Re: Question regarding parallel execution of instructions</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Question-regarding-parallel-execution-of-instructions/m-p/860019#M2261</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/406443"&gt;houyunqing&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt;
&lt;P&gt;&lt;SPAN style="font-family: verdana; font-size: 13px;"&gt;The processor I'm using is an Intel Core&lt;SUP&gt;TM&lt;/SUP&gt; Duo for Centrino, running at 1.67 GHz.&lt;BR /&gt;It has 3 ALUs in one core, so I'm running a test to see how well it executes instructions in parallel.&lt;BR /&gt;[...]&lt;BR /&gt;So what is causing the extra latency? Is it due to some instruction/decoding caching mechanism?&lt;BR /&gt;&lt;BR /&gt;testmac macro&lt;BR /&gt; add eax, 1  ;instruction 1&lt;BR /&gt; add ebx, 1  ;instruction 2&lt;BR /&gt; add edx, 1  ;instruction 3&lt;BR /&gt;endm&lt;BR /&gt; mov ecx, 100000&lt;BR /&gt; mov eax, 0&lt;BR /&gt; mov ebx, 0&lt;BR /&gt; mov edx, 0&lt;BR /&gt;align 16&lt;BR /&gt; @@loop:&lt;BR /&gt; testmac1000 ; this is just a macro containing 1000 testmac&lt;BR /&gt; testmac1000&lt;BR /&gt; testmac1000&lt;BR /&gt; sub ecx, 1&lt;BR /&gt; jnz @@loop&lt;/SPAN&gt;&lt;/P&gt;
&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;Maybe it's branch prediction? Can you test the following rewrite?&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;align 16&lt;/P&gt;
&lt;P&gt;@@loop:&lt;/P&gt;
&lt;P&gt;testmac1000&lt;/P&gt;
&lt;P&gt;testmac1000&lt;/P&gt;
&lt;P&gt;testmac1000&lt;/P&gt;
&lt;P&gt;dec ecx&lt;/P&gt;
&lt;P&gt;jz @@exit&lt;/P&gt;
&lt;P&gt;jmp @@loop&lt;/P&gt;
&lt;P&gt;@@exit:&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;On AMD, branch prediction works the following way by default: the processor assumes that a branch will not be taken. Once the branch has been taken, even a single time, the processor tries to fetch and execute the code located at @@loop and the code located at @@exit at the same time. So I always reorder my loops to use a conditional jump for the exit path (rarely taken) and an unconditional jump for looping. I don't know whether Intel processors use the same technique, but it's easy to test.&lt;/P&gt;
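&lt;P&gt;To compare the two loop forms on equal terms, it may help to convert each measured time into cycles per testmac expansion. What follows is only a minimal Python sketch and not part of the original test: the names cycles_per_testmac and adds_per_cycle are made up for illustration, and it assumes the core really runs at the nominal 1.67 GHz for the whole run, with 100000 loop iterations of 3000 testmac expansions each, as in the quoted code.&lt;/P&gt;

```python
# Illustrative sketch: convert a measured loop time into cycles per testmac.
# Assumptions (taken from the quoted post, not verified on hardware):
#   - the clock stays at the nominal 1.67 GHz for the whole run
#   - the loop runs 100000 times, each pass expanding 3 * 1000 testmac copies
#   - the sub/jnz pair (2 instructions per 3000 testmac copies) is ignored

NOMINAL_CLOCK_HZ = 1.67e9
LOOP_COUNT = 100_000
TESTMACS_PER_ITERATION = 3 * 1000


def cycles_per_testmac(seconds, clock_hz=NOMINAL_CLOCK_HZ):
    """Return the average number of clock cycles spent per testmac expansion."""
    total_cycles = seconds * clock_hz
    total_testmacs = LOOP_COUNT * TESTMACS_PER_ITERATION
    return total_cycles / total_testmacs


def adds_per_cycle(seconds, adds_in_testmac, clock_hz=NOMINAL_CLOCK_HZ):
    """Return the average number of add instructions completed per cycle."""
    return adds_in_testmac / cycles_per_testmac(seconds, clock_hz)


if __name__ == "__main__":
    # Minimum times from the question, keyed by the number of adds in testmac.
    minimum_times = {3: 0.203, 2: 0.187, 1: 0.172}
    for adds, seconds in minimum_times.items():
        print(
            f"{adds} add(s): {cycles_per_testmac(seconds):.3f} cycles per testmac, "
            f"{adds_per_cycle(seconds, adds):.2f} adds per cycle"
        )
```

&lt;P&gt;With the minimum times from the question, this gives roughly 1.13, 1.04 and 0.957 cycles per testmac, which matches the figures quoted there. Timing the reordered loop and feeding its result through the same conversion would show whether the branch layout makes any measurable difference.&lt;/P&gt;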
&lt;P&gt;Best regards&lt;/P&gt;</description>
      <pubDate>Sat, 25 Oct 2008 14:51:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Question-regarding-parallel-execution-of-instructions/m-p/860019#M2261</guid>
      <dc:creator>fb251</dc:creator>
      <dc:date>2008-10-25T14:51:10Z</dc:date>
    </item>
    <item>
      <title>Re: Question regarding parallel execution of instructions</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Question-regarding-parallel-execution-of-instructions/m-p/860020#M2262</link>
      <description>
&lt;P&gt;Branch prediction, of course, is an issue only for the first 2 or 3 passes and the last pass through the loop; with a loop count of 1000 and no change in path, the prediction rate should be about 99.5%. Both of you might want to read up on the Loop Stream Detector, which attempts to minimize the need for unrolling to maintain performance. According to the results presented here, it hasn't been fully successful. Of course, no one would optimize hardware for such a simple, useless loop.&lt;/P&gt;</description>
      <pubDate>Sat, 25 Oct 2008 15:32:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Question-regarding-parallel-execution-of-instructions/m-p/860020#M2262</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2008-10-25T15:32:43Z</dc:date>
    </item>
  </channel>
</rss>

