<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic loop alignment in Software Archive</title>
    <link>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859130#M7436</link>
    <description>Forum thread from the Intel Software Archive: why does inserting a NOP-like instruction inside a loop make it run more than 20% faster on a Pentium 4?</description>
    <pubDate>Fri, 01 Jun 2007 07:45:20 GMT</pubDate>
    <dc:creator>david_livshin1</dc:creator>
    <dc:date>2007-06-01T07:45:20Z</dc:date>
    <item>
      <title>loop alignment</title>
      <link>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859130#M7436</link>
      <description>&lt;P style="margin-bottom: 0in;"&gt;Hi,&lt;/P&gt;

&lt;P style="margin-bottom: 0in;"&gt;Inserting "leal 0(%eax),%eax"
( form of a NOP ) anywhere ( at least I tried in many places ) inside
the body of the loop&lt;/P&gt;




























&lt;P style="margin-bottom: 0in;"&gt;&lt;BR /&gt;___dcox86_wl_3_:&lt;BR /&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;movsd as1+32040(,%edx,8),%xmm7&lt;BR /&gt;addl $8,%edx&lt;BR /&gt;movsd as1+31984(,%edx,8),%xmm6&lt;BR /&gt;cmpl %ecx,%edx&lt;BR /&gt;movsd as1+31992(,%edx,8),%xmm5&lt;BR /&gt;movsd as1+32000(,%edx,8),%xmm4&lt;BR /&gt;movsd as1+32008(,%edx,8),%xmm3&lt;BR /&gt;movsd as1+32016(,%edx,8),%xmm2&lt;BR /&gt;movsd as1+32024(,%edx,8),%xmm1&lt;BR /&gt;movsd as1+32032(,%edx,8),%xmm0&lt;BR /&gt;subsd as1+31968(,%edx,8),%xmm7&lt;BR /&gt;subsd as1+31976(,%edx,8),%xmm6&lt;BR /&gt;subsd as1+31984(,%edx,8),%xmm5&lt;BR /&gt;subsd as1+31992(,%edx,8),%xmm4&lt;BR /&gt;subsd as1+32000(,%edx,8),%xmm3&lt;BR /&gt;subsd as1+32008(,%edx,8),%xmm2&lt;BR /&gt;movsd %xmm7,as1+23960(,%edx,8)&lt;BR /&gt;movsd %xmm6,as1+23968(,%edx,8)&lt;BR /&gt;subsd as1+32016(,%edx,8),%xmm1&lt;BR /&gt;movsd %xmm5,as1+23976(,%edx,8)&lt;BR /&gt;subsd as1+32024(,%edx,8),%xmm0&lt;BR /&gt;movsd %xmm4,as1+23984(,%edx,8)&lt;BR /&gt;movsd %xmm3,as1+23992(,%edx,8)&lt;BR /&gt;movsd %xmm2,as1+24000(,%edx,8)&lt;BR /&gt;movsd %xmm1,as1+24008(,%edx,8)&lt;BR /&gt;movsd %xmm0,as1+24016(,%edx,8)&lt;BR /&gt;&lt;/BLOCKQUOTE&gt;&lt;P style="margin-bottom: 0in;"&gt;jne ___dcox86_wl_3_&lt;/P&gt;


&lt;P style="margin-bottom: 0in;"&gt;make it run faster by more than 20% (
test is done on Pentium4 ).&lt;/P&gt;&lt;P style="margin-bottom: 0in;"&gt;Any idea why?&lt;/P&gt;

&lt;P style="margin-bottom: 0in;"&gt;Nothing happens if "leal
0(%eax),%eax" inserted right before the loop.&lt;/P&gt;

&lt;P style="margin-bottom: 0in;"&gt;Thank you in advance,&lt;/P&gt;

&lt;P style="margin-bottom: 0in;"&gt;David&lt;/P&gt;

&lt;P style="margin-bottom: 0in;"&gt;------&lt;/P&gt;

&lt;PRE&gt;David Livshin

&lt;FONT size="2"&gt;&lt;A href="http://www.dalsoft.com" target="_blank"&gt;http://www.dalsoft.com&lt;/A&gt;&lt;/FONT&gt;&lt;/PRE&gt;&lt;P style="margin-bottom: 0in;"&gt;
&lt;BR /&gt;
&lt;/P&gt;</description>
      <pubDate>Fri, 01 Jun 2007 07:45:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859130#M7436</guid>
      <dc:creator>david_livshin1</dc:creator>
      <dc:date>2007-06-01T07:45:20Z</dc:date>
    </item>
    <item>
      <title>Re: loop alignment</title>
      <link>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859131#M7437</link>
      <description>Such situations may be easier to analyze if measures are taken to align the top of the loop, as your title hints. One of the more effective methods is the conditional alignment directive emitted by gnu compilers and supported by gnu ld. This pads with NOP equivalents only when 7 bytes or less of NOPs are needed.&lt;BR /&gt;You might get some insight into remaining effects by running an event collecting profiler like Intel VTune. Needless to say, it would take a real expert to enumerate all likely causes of the effect you observed. &lt;BR /&gt;A possibility which comes to mind is that some stalls occasion a retry some fixed number of cycles later. Then, inserting a shorter delay, sufficient to avoid the stall, could increase performance. So you would look with an event profiler to see if you can identify a stall which becomes less frequent with the padding.&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Fri, 01 Jun 2007 14:10:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859131#M7437</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2007-06-01T14:10:04Z</dc:date>
    </item>
    <item>
      <title>Re: loop alignment</title>
      <link>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859132#M7438</link>
      <description>&lt;PRE&gt;___dcox86_wl_3_:&lt;BR /&gt; movapd xmm0, [rcx+rdx]&lt;BR /&gt; shufpd xmm2, xmm0, 1&lt;BR /&gt; movapd xmm1, xmm0&lt;BR /&gt; subpd xmm0, xmm2&lt;BR /&gt; movapd [rax+rdx], xmm0&lt;BR /&gt; movapd xmm0, [rcx+rdx+16]&lt;BR /&gt; shufpd xmm1, xmm0, 1&lt;BR /&gt; movapd xmm2, xmm0&lt;BR /&gt; subpd xmm0, xmm1&lt;BR /&gt; movapd [rax+rdx+16], xmm0&lt;BR /&gt; add rdx, 32&lt;BR /&gt;jns ___dcox86_wl_3_&lt;BR /&gt;&lt;/PRE&gt;</description>
      <pubDate>Sun, 03 Jun 2007 01:20:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859132#M7438</guid>
      <dc:creator>xorpd</dc:creator>
      <dc:date>2007-06-03T01:20:04Z</dc:date>
    </item>
    <item>
      <title>Re: loop alignment</title>
      <link>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859133#M7439</link>
      <description>Nice code (although 64-bit and not in GNU as syntax), but what about "movapd" alignment?&lt;BR /&gt;&lt;BR /&gt;If the alignment is not known (as in the case I posted, which is the gcc 4.2.0 generated &lt;BR /&gt;loop for kernel 12 of the Livermore loops benchmark:&lt;BR /&gt;
&lt;P style="margin-bottom: 0in;"&gt;   /*&lt;BR /&gt;*******************************************************************&lt;BR /&gt;*   Kernel 12 -- first difference&lt;BR /&gt;*******************************************************************&lt;BR /&gt;*/&lt;/P&gt;
&lt;P style="margin-bottom: 0in;"&gt;     
&lt;/P&gt;
&lt;P style="margin-bottom: 0in;"&gt;    parameters (12);&lt;/P&gt;






&lt;P style="margin-bottom: 0in;"&gt;    do&lt;BR /&gt;{&lt;BR /&gt; for ( k=0 ; k&lt;N&gt; {&lt;BR /&gt; x&lt;K&gt; = y[k+1] - y&lt;K&gt;;&lt;BR /&gt; }&lt;/K&gt;&lt;/K&gt;&lt;/N&gt;&lt;/P&gt;



&lt;P style="margin-bottom: 0in;"&gt; endloop (12);&lt;BR /&gt;}&lt;BR /&gt;while (count &amp;lt; loop);&lt;/P&gt;
&lt;BR /&gt;&lt;BR /&gt;) the code, when packing is attempted, is:&lt;BR /&gt;&lt;PRE&gt;
	
&lt;P style="margin-bottom: 0in;"&gt;___dcox86_wl_3_:&lt;BR /&gt;	movsd as1+32040(,%edx,8),%xmm2&lt;BR /&gt;	addl $8,%edx&lt;BR /&gt;	movhpd as1+31984(,%edx,8),%xmm2&lt;BR /&gt;	cmpl %ecx,%edx&lt;BR /&gt;	movsd as1+31968(,%edx,8),%xmm5&lt;BR /&gt;	movhpd as1+31976(,%edx,8),%xmm5&lt;BR /&gt;	movsd as1+31992(,%edx,8),%xmm3&lt;BR /&gt;	movhpd as1+32000(,%edx,8),%xmm3&lt;BR /&gt;	movsd as1+31984(,%edx,8),%xmm6&lt;BR /&gt;	movhpd as1+31992(,%edx,8),%xmm6&lt;BR /&gt;	movsd as1+32008(,%edx,8),%xmm4&lt;BR /&gt;	movhpd as1+32016(,%edx,8),%xmm4&lt;BR /&gt;	subpd %xmm5,%xmm2&lt;BR /&gt;	movsd as1+32000(,%edx,8),%xmm7&lt;BR /&gt;	movhpd as1+32008(,%edx,8),%xmm7&lt;BR /&gt;	movsd as1+32024(,%edx,8),%xmm1&lt;BR /&gt;	movsd %xmm2,as1+23960(,%edx,8)&lt;BR /&gt;	movhpd as1+32032(,%edx,8),%xmm1&lt;BR /&gt;	subpd %xmm6,%xmm3&lt;BR /&gt;	movsd as1+32016(,%edx,8),%xmm0&lt;BR /&gt;	movhpd %xmm2,as1+23968(,%edx,8)&lt;BR /&gt;	movhpd as1+32024(,%edx,8),%xmm0&lt;BR /&gt;	movsd %xmm3,as1+23976(,%edx,8)&lt;BR /&gt;	movhpd %xmm3,as1+23984(,%edx,8)&lt;BR /&gt;	subpd %xmm7,%xmm4&lt;BR /&gt;	movsd %xmm4,as1+23992(,%edx,8)&lt;BR /&gt;	movhpd %xmm4,as1+24000(,%edx,8)&lt;BR /&gt;	subpd %xmm0,%xmm1&lt;BR /&gt;	movsd %xmm1,as1+24008(,%edx,8)&lt;BR /&gt;	movhpd %xmm1,as1+24016(,%edx,8)&lt;BR /&gt;	jne ___dcox86_wl_3_&lt;/P&gt;
which is not nearly as efficient as the one you posted appears to be.&lt;BR /&gt;&lt;BR /&gt;----&lt;BR /&gt;David Livshin&lt;BR /&gt;&lt;BR /&gt;&lt;A href="http://www.dalsoft.com" target="_blank"&gt;http://www.dalsoft.com&lt;/A&gt;
&lt;BR /&gt;&lt;/PRE&gt;</description>
      <pubDate>Sun, 03 Jun 2007 08:54:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859133#M7439</guid>
      <dc:creator>david_livshin1</dc:creator>
      <dc:date>2007-06-03T08:54:55Z</dc:date>
    </item>
    <item>
      <title>Re: loop alignment</title>
      <link>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859134#M7440</link>
      <description>Compilers normally put in a single conditional scalar loop remainder iteration (before and after) to align the destination, so that vector code such as this (icc) can be used:&lt;BR /&gt;&lt;BR /&gt;..B1.133: # Preds ..B1.133 ..B1.132&lt;BR /&gt; movaps 32040+space1_(%rdi), %xmm1&lt;BR /&gt; movsd 32032+space1_(%rdi), %xmm0&lt;BR /&gt; movhpd 32040+space1_(%rdi), %xmm0&lt;BR /&gt; movaps 32056+space1_(%rdi), %xmm3&lt;BR /&gt; movsd 32048+space1_(%rdi), %xmm2&lt;BR /&gt; movhpd 32056+space1_(%rdi), %xmm2&lt;BR /&gt; movaps 32072+space1_(%rdi), %xmm5&lt;BR /&gt; movsd 32064+space1_(%rdi), %xmm4&lt;BR /&gt; movhpd 32072+space1_(%rdi), %xmm4&lt;BR /&gt; movaps 32088+space1_(%rdi), %xmm7&lt;BR /&gt; movsd 32080+space1_(%rdi), %xmm6&lt;BR /&gt; movhpd 32088+space1_(%rdi), %xmm6&lt;BR /&gt; subpd %xmm0, %xmm1&lt;BR /&gt; movaps %xmm1, 24024+space1_(%rdi)&lt;BR /&gt; subpd %xmm2, %xmm3&lt;BR /&gt; movaps %xmm3, 24040+space1_(%rdi)&lt;BR /&gt; subpd %xmm4, %xmm5&lt;BR /&gt; movaps %xmm5, 24056+space1_(%rdi)&lt;BR /&gt; subpd %xmm6, %xmm7&lt;BR /&gt; movaps %xmm7, 24072+space1_(%rdi)&lt;BR /&gt; addq $64, %rdi&lt;BR /&gt; cmpq %rsi, %rdi&lt;BR /&gt; jl ..B1.133 # Prob 99%&lt;BR /&gt;&lt;BR /&gt;Current CPUs do work better avoiding unaligned load by half register loads. The job certainly can be done with fewer instructions.&lt;BR /&gt;</description>
      <pubDate>Mon, 04 Jun 2007 04:51:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859134#M7440</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2007-06-04T04:51:26Z</dc:date>
    </item>
    <item>
      <title>Re: loop alignment</title>
      <link>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859135#M7441</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;DIV&gt;Nice code ( although 64-bit and not in gnu as ) - but what about "movapd" alignment?&lt;/DIV&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Of course a second loop must be coded to take into account the possibility that source and destination may be misaligned relative to one another:&lt;/P&gt;&lt;PRE&gt;___dcox86_wl_4:&lt;BR /&gt; movapd xmm0, [rcx+rdx]&lt;BR /&gt; shufpd xmm2, xmm0, 1&lt;BR /&gt; subpd xmm2, xmm1&lt;BR /&gt; movapd xmm1, xmm0&lt;BR /&gt; movapd [rax+rdx], xmm2&lt;BR /&gt; movapd xmm2, [rcx+rdx+16]&lt;BR /&gt; shufpd xmm0, xmm2, 1&lt;BR /&gt; subpd xmm0, xmm1&lt;BR /&gt; movapd xmm1, xmm2&lt;BR /&gt; movapd [rax+rdx+16], xmm0&lt;BR /&gt; add rdx, 32&lt;BR /&gt;js ___dcox86_wl_4 ; Should have been js in original code, too.&lt;BR /&gt;&lt;/PRE&gt;
&lt;P&gt;The prolog must be capable of detecting this relative misalignment and selecting the appropriate inner loop, as well as setting up registers for induction variable elimination and picking off the first destination element if the destination array is misaligned. The epilog must be ready to pick off a stray destination element as well if necessary.&lt;/P&gt;
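&lt;P&gt;As a rough illustration of the prolog decision described above, here is a minimal C sketch. It is not the poster's code: the helper names are hypothetical, addresses are passed as plain integers, and the elements are assumed to be 8-byte doubles that are at least 8-byte aligned.&lt;/P&gt;
&lt;PRE&gt;
```c
/* Hypothetical helpers; not from the thread. Addresses are plain integers. */

/* Offset of the source relative to the destination, modulo 16 bytes.
   0 means one loop can use aligned 16-byte loads and stores directly;
   8 means the loaded halves must be recombined (e.g. with shufpd). */
unsigned relative_misalignment(unsigned long src_addr, unsigned long dst_addr)
{
    /* Unsigned wraparound is harmless because 16 divides 2^N. */
    return (unsigned)((src_addr - dst_addr) % 16u);
}

/* Number of leading 8-byte elements the prolog handles one at a time
   before the destination reaches a 16-byte boundary (0 or 1 when the
   destination is at least 8-byte aligned). */
unsigned prolog_elements(unsigned long dst_addr)
{
    return (unsigned)(((16u - dst_addr % 16u) % 16u) / 8u);
}
```
&lt;/PRE&gt;
&lt;P&gt;For example, a source at 0x1008 and a destination at 0x1000 give a relative misalignment of 8, which would select the shuffling loop; a destination at 0x1008 needs one scalar element picked off before the aligned stores can start.&lt;/P&gt;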
&lt;P&gt;Multiple loops to handle the same task depending on alignment are to be expected when SIMD operations are used. In code to replace memcpy() with movapd instructions, &lt;A href="http://xorpd.home.comcast.net/memcpy.asm" target="_blank" title="http://xorpd.home.comcast.net/memcpy.asm"&gt;memcpy.asm&lt;/A&gt;, I needed 16 versions of the inner loop because palignr only takes an immediate shift count. &lt;A href="http://xorpd.home.comcast.net/memcpy.txt" target="_blank" title="http://xorpd.home.comcast.net/memcpy.txt"&gt;Results:&lt;/A&gt; the worst case was under 1200 clocks to copy 8000 bytes, and the easiest case (already aligned) was 670 clocks. I think I could have gotten the worst case under 1000 clocks by improving the prolog and epilog code and by running the loops in descending order instead of ascending. As it stands, that example compares favorably with rep movsq, which takes over 1000 clocks for the most favorable alignment, over 3000 clocks for slightly unfavorable alignment, and over 10000 clocks for hated alignment.&lt;/P&gt;</description>
      <pubDate>Mon, 04 Jun 2007 17:37:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859135#M7441</guid>
      <dc:creator>xorpd</dc:creator>
      <dc:date>2007-06-04T17:37:30Z</dc:date>
    </item>
    <item>
      <title>Re: loop alignment</title>
      <link>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859136#M7442</link>
      <description>In the case originally cited, there are 2 overlapping sources, one of which must be aligned relative to the destination. Context, not shown, but visible to the compiler, actually determines which is aligned with the destination, so 2 versions aren't needed.&lt;BR /&gt;</description>
      <pubDate>Mon, 04 Jun 2007 19:01:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859136#M7442</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2007-06-04T19:01:08Z</dc:date>
    </item>
    <item>
      <title>Re: loop alignment</title>
      <link>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859137#M7443</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;DIV&gt;Current CPUs do work better avoiding unaligned load by half register loads.&lt;/DIV&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I thought that I might take the time to stigmatize the above as total rubbish. In my memcpy() example cited above, I get from 640 to 790 clocks including prolog, epilog, and function call to copy 8000 bytes when source and destination have relative alignment 8 mod 16; if half loads were used the best that could have been obtained would have been 1000 clocks due to the 1000 instructions issued to port 2.&lt;/P&gt;
&lt;P&gt;In the present case as well, the icc code couldn't possibly retire 16 bytes of destination in fewer than 3 clocks because of the 3 loads required to do so, but there doesn't seem to be anything stopping a core microarchitecture processor from retiring that same 16 bytes in slightly under 2 clocks. Some improvement might also be seen on a P4, and I would like to see the original poster implement our differing opinions in his format so that he could determine which code is the faster.&lt;/P&gt;
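&lt;P&gt;The lower bounds quoted above follow from simple counting, if one assumes (as the figures above imply) that the load port issues one load per clock and that loads are the only bottleneck. A minimal C sketch of that arithmetic; the function name and parameters are hypothetical:&lt;/P&gt;
&lt;PRE&gt;
```c
/* Hypothetical helper; not from the thread. Assumes the load port can
   issue one load per clock and that loads are the only bottleneck. */
unsigned long load_bound_clocks(unsigned long bytes,
                                unsigned long bytes_per_group,
                                unsigned long loads_per_group)
{
    /* Each group of bytes_per_group output bytes costs loads_per_group
       load instructions, hence at least that many clocks. */
    return (bytes / bytes_per_group) * loads_per_group;
}
```
&lt;/PRE&gt;
&lt;P&gt;Copying 8000 bytes with 8-byte half loads gives load_bound_clocks(8000, 8, 1) = 1000 clocks, and three loads per 16 bytes of destination gives load_bound_clocks(16, 16, 3) = 3 clocks. Under the same assumption, a loop needing only one load per 16 bytes is not load-bound at that level.&lt;/P&gt;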
&lt;P&gt;On at least core microarchitecture machines one is normally better off with fewer loads, using dedicated data-swizzling operations on the fly instead, so as to avoid scalar operations or data reloads to the extent possible.&lt;/P&gt;</description>
      <pubDate>Tue, 05 Jun 2007 01:32:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859137#M7443</guid>
      <dc:creator>xorpd</dc:creator>
      <dc:date>2007-06-05T01:32:56Z</dc:date>
    </item>
    <item>
      <title>Re: loop alignment</title>
      <link>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859138#M7444</link>
      <description>&lt;FONT color="#008000"&gt;In the case originally cited, there are 2 overlapping sources, one of
which must be aligned relative to the destination. Context, not shown,
but visible to the compiler, actually determines which is aligned with
the destination, so 2 versions aren't needed.&lt;BR /&gt;&lt;/FONT&gt;&lt;BR /&gt;The source loop&lt;BR /&gt;&lt;BLOCKQUOTE&gt;&lt;B&gt;for ( k=0 ; k&amp;lt;n ; k++ )&lt;/B&gt;&lt;BR /&gt;&lt;B&gt; {&lt;/B&gt;&lt;BR /&gt;&lt;B&gt; x[k] = y[k+1] - y[k];&lt;/B&gt;&lt;BR /&gt;&lt;B&gt; }&lt;/B&gt;&lt;BR /&gt;&lt;/BLOCKQUOTE&gt;was unrolled by the compiler (gcc 4.2.0), which, to ensure that the loop count (&lt;B&gt;n&lt;/B&gt;) is a multiple of the unroll count of 4, prefaced it with conditional code that makes it impossible to determine the alignment of memory references. For example, in&lt;BR /&gt;&lt;BLOCKQUOTE&gt;&lt;B&gt;movsd as1+32040(,%edx,8),%xmm7&lt;/B&gt;&lt;BR /&gt;&lt;/BLOCKQUOTE&gt;at the entry to the loop &lt;B&gt;%ebx&lt;/B&gt; is in the range from 0 to 3, so the alignment of &lt;B&gt;as1+32040(,%edx,8)&lt;/B&gt; is not clear (even though the alignment of &lt;B&gt;as1&lt;/B&gt;, and therefore of &lt;B&gt;as1+32040&lt;/B&gt;, could be determined).&lt;BR /&gt;&lt;BR /&gt;Also, I don't understand why the source must &lt;FONT color="#008000"&gt;"be aligned relative to the destination"&lt;/FONT&gt;.&lt;BR /&gt;&lt;BR /&gt;&lt;DIV align="justify"&gt;&lt;U&gt;Back to my original posting&lt;/U&gt;.&lt;BR /&gt;&lt;/DIV&gt;I am writing an x86 assembly code optimizer (see &lt;A href="http://www.dalsoft.com"&gt;http://www.dalsoft.com&lt;/A&gt;) and need to understand the situation I described in order to find a solution and translate it to C++. Perhaps someone from &lt;SPAN style="font-size: 10pt; color: black; font-family: Arial;"&gt;&lt;FONT face="Arial" size="2"&gt;Intel Software Network Support&lt;/FONT&gt;&lt;/SPAN&gt; could provide me with documentation, an algorithm, or a heuristic that might help me solve my problem. Are there other forums where it would be appropriate to post my question?&lt;BR /&gt;&lt;BR /&gt;&lt;P style="margin-bottom: 0in;"&gt;------&lt;/P&gt;

&lt;PRE&gt;David Livshin&lt;BR /&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;&lt;A href="http://www.dalsoft.com" target="_blank"&gt;http://www.dalsoft.com&lt;/A&gt;&lt;/FONT&gt;&lt;/PRE&gt;
&lt;BR /&gt;&lt;FONT color="#008000"&gt;&lt;FONT color="#000000"&gt;&lt;BR /&gt;&lt;/FONT&gt;&lt;/FONT&gt;</description>
      <pubDate>Tue, 05 Jun 2007 06:37:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859138#M7444</guid>
      <dc:creator>david_livshin1</dc:creator>
      <dc:date>2007-06-05T06:37:01Z</dc:date>
    </item>
    <item>
      <title>Re: loop alignment</title>
      <link>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859139#M7445</link>
      <description>&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;David,&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;Posting your questions to this forum is fine, although you might also be interested in the Intel VTune Performance Analyzer for &lt;A href="https://community.intel.com/en-us/forums/"&gt;Windows&lt;/A&gt;* or &lt;A href="https://community.intel.com/en-us/forums/"&gt;Linux&lt;/A&gt;* forums ifyou'd like to try Tim18's suggestions, and the &lt;A href="https://community.intel.com/en-us/forums/"&gt;Intel C++ Compiler&lt;/A&gt; forummight contain some useful information for you on C++ in general.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;Of course,there are alsothe &lt;A href="http://developer.intel.com/products/processor/manuals/index.htm"&gt;Intel 64 and IA-32 Architectures Software Developer's Manuals&lt;/A&gt;, specifically Volumes 3A and 3B, System Programming Guide, and the Intel 64 and IA-32 Architectures Optimization Reference Manual.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;==&lt;/P&gt;
&lt;P class="MsoNormal" style="MARGIN: 0in 0in 0pt"&gt;&lt;SPAN style="FONT-SIZE: 10pt; COLOR: black; FONT-FAMILY: Arial"&gt;Lexi S.&lt;/SPAN&gt;&lt;SPAN style="FONT-SIZE: 11pt; COLOR: black; FONT-FAMILY: Arial"&gt;&lt;P&gt;&lt;/P&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="MsoNormal" style="MARGIN: 0in 0in 0pt"&gt;&lt;SPAN style="FONT-SIZE: 10pt; COLOR: black; FONT-FAMILY: Arial"&gt;IntelSoftware NetworkSupport&lt;/SPAN&gt;&lt;SPAN style="FONT-SIZE: 11pt; COLOR: black; FONT-FAMILY: Arial"&gt;&lt;P&gt;&lt;/P&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="MsoNormal" style="MARGIN: 0in 0in 0pt"&gt;&lt;SPAN style="FONT-SIZE: 10pt; COLOR: black; FONT-FAMILY: Arial"&gt;&lt;A href="http://www.intel.com/software"&gt;&lt;FONT color="#800080"&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;A href="http://www.intel.com/software" target="_blank"&gt;http://www.intel.com/software&lt;/A&gt; &lt;/SPAN&gt;&lt;SPAN style="FONT-SIZE: 11pt; COLOR: black; FONT-FAMILY: Arial"&gt;&lt;P&gt;&lt;/P&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="MsoNormal" style="MARGIN: 0in 0in 0pt"&gt;&lt;SPAN style="FONT-SIZE: 10pt; COLOR: black; FONT-FAMILY: Arial"&gt;&lt;A href="http://www.intel.com/cd/ids/developer/asmo-na/eng/58987.htm"&gt;Contact us&lt;/A&gt;&lt;/SPAN&gt;&lt;SPAN style="FONT-SIZE: 11pt; COLOR: black; FONT-FAMILY: Arial"&gt;&lt;P&gt;&lt;/P&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 05 Jun 2007 06:52:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859139#M7445</guid>
      <dc:creator>Intel_Software_Netw1</dc:creator>
      <dc:date>2007-06-05T06:52:13Z</dc:date>
    </item>
    <item>
      <title>Re: loop alignment</title>
      <link>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859140#M7446</link>
      <description>In order to use aligned parallel stores, and aligned parallel loads for the appropriate operand, the compiler needs to find out which operand is aligned relative to the destination. In the case in question, the arrays are defined in a struct, and the vectorizing compiler sees that y[k+1] and x[k] are relatively aligned. Compile time tests determine whether to execute just one loop iteration for alignment, as well as which source operand is aligned. If the alignments were not known at compile time, it would be done with run-time tests.&lt;BR /&gt;As xorpd noted, the aligned operand can be read by aligned loads; the unaligned operand, being the same except for the 8-byte offset, could be set up from the aligned one by mov and shufpd instructions. This still involves half-register operations, in my view. &lt;BR /&gt;I note that gnu compilers don't manage to vectorize this loop in my copy of Livermore Kernels, although gfortran 4.3 vectorizes a fair amount of LFK. I doubt that a preference for the scheme which gcc uses to adjust for non-vector unrolling is a reason for not vectorizing; gcc knows an appropriate method for remainder loops for vectorization, involving remainders both before and after the vectorized loop body. More likely, the problem is that this loop exhibits special requirements which come up relatively rarely.&lt;BR /&gt;Hardware vendors have recognized the desirability of minimizing the performance penalty of unaligned full-width loads; if that happened, gcc might be able to treat vector loops more like scalar ones.&lt;BR /&gt;</description>
      <pubDate>Tue, 05 Jun 2007 22:46:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859140#M7446</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2007-06-05T22:46:49Z</dc:date>
    </item>
    <item>
      <title>Re: loop alignment</title>
      <link>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859141#M7447</link>
      <description>&lt;P&gt;May I suggest this little experiment:&lt;/P&gt;
&lt;PRE&gt;
	jmp	loop_start
	align	16
loop_start:
	...
	jne	loop_start
&lt;/PRE&gt;
&lt;P&gt;If that also helps, then it is the alignment. If not, then most likely the added delay makes you encounter fewer replays, assuming you are running the above code on a NetBurst architecture CPU.&lt;/P&gt;
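&lt;P&gt;The "align 16" in the experiment above pads the code until the loop label sits on a 16-byte boundary. Below is a minimal C sketch of that arithmetic, together with the conditional variant mentioned earlier in the thread, which pads only when 7 bytes or fewer are needed (GNU as spells this ".p2align 4,,7"). The function names are hypothetical and the addresses are only illustrative.&lt;/P&gt;
&lt;PRE&gt;
```c
/* Hypothetical helpers; not from the thread. */

/* Bytes of padding needed to move addr up to the next multiple of
   alignment (alignment must be nonzero; 16 in the example above). */
unsigned padding_to_align(unsigned long addr, unsigned alignment)
{
    return (unsigned)((alignment - addr % alignment) % alignment);
}

/* Conditional variant: pad only if no more than max_skip bytes are
   needed; otherwise leave the address as it is (0 bytes of padding). */
unsigned conditional_padding(unsigned long addr, unsigned alignment,
                             unsigned max_skip)
{
    unsigned pad = padding_to_align(addr, alignment);
    return pad > max_skip ? 0u : pad;
}
```
&lt;/PRE&gt;
&lt;P&gt;A label that would otherwise land at 0x1003 needs 13 bytes of padding for 16-byte alignment, so the conditional form with a limit of 7 leaves it unaligned, while a label at 0x100A gets 6 bytes of padding either way.&lt;/P&gt;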
&lt;P&gt;Moreover, if you have more than 100 iterations this form of the loop would probably be faster:&lt;/P&gt;
&lt;PRE&gt;
	mov	edx, [count]
	jmp	loop_start
	align	16
loop_start:
	test	edx, edx
	jz	loop_exit
	...
	sub	edx, n		; n = number of elements
				; processed in one loop pass
	jmp	loop_start
loop_exit:
	...
&lt;/PRE&gt;
&lt;P&gt;Or if you want to be able to use edx as an array index:&lt;/P&gt;
&lt;PRE&gt;
	mov	ecx, [count]
	xor	edx, edx
	jmp	loop_start
	align	16
loop_start:
	cmp	edx, ecx
	jz	loop_exit
	...
	add	edx, n		; n = number of elements
				; processed in one loop pass
	jmp	loop_start
loop_exit:
	...
&lt;/PRE&gt;</description>
      <pubDate>Wed, 18 Jul 2007 20:24:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859141#M7447</guid>
      <dc:creator>levicki</dc:creator>
      <dc:date>2007-07-18T20:24:39Z</dc:date>
    </item>
    <item>
      <title>Re: loop alignment</title>
      <link>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859142#M7448</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;P&gt;&lt;B&gt;&lt;FONT face="Courier New"&gt;May I suggest this little experiment:&lt;/FONT&gt;&lt;/B&gt;&lt;/P&gt;&lt;PRE&gt;&lt;B&gt;&lt;FONT face="Courier New"&gt;	jmp	loop_start&lt;BR /&gt;	align	16&lt;BR /&gt;loop_start:&lt;BR /&gt;	...&lt;BR /&gt;	jne	loop_start&lt;BR /&gt;&lt;/FONT&gt;&lt;/B&gt;&lt;/PRE&gt;&lt;/BLOCKQUOTE&gt;

&lt;BR /&gt;No, your modification doesn't affect the execution time. I actually tried many modifications of the code before submitting my original message, and the only one that made a difference was inserting a NOP inside the loop; interestingly, &lt;U&gt;anywhere&lt;/U&gt; inside the loop.&lt;BR /&gt;&lt;BR /&gt;The exact execution times on a 2.8GHz Pentium4 under Linux:&lt;BR /&gt;&lt;BR /&gt;&lt;TABLE align="" cellpadding="1" cellspacing="1"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;Original loop&lt;BR /&gt;&lt;/TD&gt;&lt;TD&gt;5.52 seconds&lt;BR /&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;With NOP inside the loop&lt;BR /&gt;&lt;/TD&gt;&lt;TD&gt;4.4 seconds&lt;BR /&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;BR /&gt;&lt;PRE class="moz-signature"&gt;-- &lt;BR /&gt;David Livshin&lt;BR /&gt;&lt;BR /&gt;&lt;A href="http://www.dalsoft.com"&gt;http://www.dalsoft.com&lt;/A&gt;&lt;/PRE&gt;
&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Sun, 22 Jul 2007 07:40:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859142#M7448</guid>
      <dc:creator>david_livshin</dc:creator>
      <dc:date>2007-07-22T07:40:57Z</dc:date>
    </item>
    <item>
      <title>Re: loop alignment</title>
      <link>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859143#M7449</link>
      <description>&lt;P&gt;David,&lt;/P&gt;
&lt;P&gt;What may be happening, and this is only a guess on my part, is that the P4 does speculative loads and execution; i.e. the loop as coded without the NOP is situated such that the P4 executes the instruction(s), or part of them, following the jne during each iteration of the loop. This may be a case where the padding from the NOP inside the loop causes the look-ahead (speculative load and/or execution) to do less work (or to introduce less latency).&lt;/P&gt;
&lt;P&gt;What happens if you place a (or a series of) NOP following the loop jne? If this affects the run time at all, then it would be a strong indicator that the look ahead is causing the effect. You may need a few NOPs to fill the pipeline.&lt;/P&gt;
&lt;P&gt;An alternative test would be something like&lt;/P&gt;
&lt;P&gt;jne loop_start&lt;BR /&gt;jmp short further&lt;BR /&gt;NOP;NOP;NOP;NOP;NOP&lt;BR /&gt;further:&lt;BR /&gt;&lt;/P&gt;
&lt;P&gt;The idea is to make the speculative load/execution give up by seeing the jmp.&lt;/P&gt;
&lt;P&gt;Now if you find this is the case, you might also be able to determine a generalized rule to end your loops&lt;/P&gt;
&lt;P&gt;jne loop_start&lt;BR /&gt;NOP&lt;BR /&gt; align(4)&lt;/P&gt;
&lt;P&gt;jne loop_start&lt;BR /&gt; jmp over&lt;BR /&gt;NOP&lt;BR /&gt; align(16)&lt;BR /&gt;over:&lt;/P&gt;
&lt;P&gt;Or something like that.&lt;BR /&gt;&lt;/P&gt;
&lt;P&gt;Jim Dempsey&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 30 Aug 2007 22:42:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859143#M7449</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2007-08-30T22:42:12Z</dc:date>
    </item>
    <item>
      <title>Re: loop alignment</title>
      <link>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859144#M7450</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;P&gt;&lt;FONT color="#008000"&gt;What happens if you place a (or a series of) NOP following the loop
jne? If this affects the run time at all, then it would be a strong
indicator that the look ahead is causing the effect. You may need a few
NOPs to fill the pipeline.&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
This modification doesn't affect the execution time.&lt;BR /&gt;&lt;BR /&gt;My current "theory" about the behavior of this loop is based on the observation that every iteration of the loop introduces a cache miss. Perhaps the introduction of an instruction that doesn't use the cache (like the NOP I was using) somehow relieves the queue of instructions that cannot be issued because of the cache miss (if such a queue exists at all) and that must be reissued later (when the data becomes available in the cache); perhaps such a retry is expensive. Unfortunately, I couldn't find enough information to verify my assumptions.&lt;BR /&gt;&lt;BR /&gt;&lt;PRE class="moz-signature"&gt;-- &lt;BR /&gt;David Livshin&lt;BR /&gt;&lt;BR /&gt;&lt;A href="http://www.dalsoft.com/"&gt;http://www.dalsoft.com&lt;/A&gt;&lt;/PRE&gt;
&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Sun, 02 Sep 2007 13:32:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/loop-alignment/m-p/859144#M7450</guid>
      <dc:creator>david_livshin</dc:creator>
      <dc:date>2007-09-02T13:32:03Z</dc:date>
    </item>
  </channel>
</rss>

