<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Memory Order Machine Clear Issues in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Memory-Order-Machine-Clear-Issues/m-p/961846#M5317</link>
    <description>&lt;DIV&gt;&lt;/DIV&gt;
&lt;P&gt;SSE vectorization could produce an overlap between an aligned store and a following un-aligned load, with effects analogous to false sharing, but within a single thread. Future Intel compilers will look for this, and take action to remove the conflict from generated code:&lt;/P&gt;
&lt;P&gt;DO I = 5,N&lt;/P&gt;
&lt;P&gt; A[I] = ...&lt;/P&gt;
&lt;P&gt; B[I] = A[I-1] + B[I]&lt;/P&gt;
&lt;P&gt;END DO&lt;/P&gt;
&lt;P&gt;The compiler vectorizes the loop with parallel SSE operations. The un-aligned loads of segments of A[] have to pick up data stored in both the previous and the current iterations of the parallelized loop. The parallel code is slower than the serial code, because the memory order hardware has to delay each load until the preceding store has gone to memory. In this simple case, the problem is corrected by "distributing" (splitting) the loop, so that all of A[] goes to memory before it is read back, without incurring a memory order clear for each load.&lt;/P&gt;
&lt;DIV&gt;&lt;/DIV&gt;</description>
    <pubDate>Tue, 11 May 2004 07:14:49 GMT</pubDate>
    <dc:creator>TimP</dc:creator>
    <dc:date>2004-05-11T07:14:49Z</dc:date>
    <item>
      <title>Memory Order Machine Clear Issues</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Memory-Order-Machine-Clear-Issues/m-p/961839#M5310</link>
      <description>&lt;DIV&gt;Can anyone explain how (and why) the pipeline is cleared due to memory ordering issues? Any example?&lt;BR /&gt;&lt;BR /&gt;Thanks. &lt;/DIV&gt;</description>
      <pubDate>Tue, 27 Apr 2004 10:24:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Memory-Order-Machine-Clear-Issues/m-p/961839#M5310</guid>
      <dc:creator>jrzhou</dc:creator>
      <dc:date>2004-04-27T10:24:20Z</dc:date>
    </item>
    <item>
      <title>Re: Memory Order Machine Clear Issues</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Memory-Order-Machine-Clear-Issues/m-p/961840#M5311</link>
      <description>&lt;P&gt;jrzhou -&lt;/P&gt;
&lt;P&gt;What architecture are you asking about? Is this with regard to Pentium, Xeon, Itanium, or something else?&lt;/P&gt;
&lt;P&gt;-- clay&lt;/P&gt;
&lt;DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Fri, 30 Apr 2004 04:59:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Memory-Order-Machine-Clear-Issues/m-p/961840#M5311</guid>
      <dc:creator>ClayB</dc:creator>
      <dc:date>2004-04-30T04:59:57Z</dc:date>
    </item>
    <item>
      <title>Re: Memory Order Machine Clear Issues</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Memory-Order-Machine-Clear-Issues/m-p/961841#M5312</link>
      <description>I am talking about Hyper-Threading Pentium 4 CPUs. (I guess it also applies to Hyper-Threading Xeon CPUs.)</description>
      <pubDate>Fri, 30 Apr 2004 21:29:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Memory-Order-Machine-Clear-Issues/m-p/961841#M5312</guid>
      <dc:creator>jrzhou</dc:creator>
      <dc:date>2004-04-30T21:29:56Z</dc:date>
    </item>
    <item>
      <title>Re: Memory Order Machine Clear Issues</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Memory-Order-Machine-Clear-Issues/m-p/961842#M5313</link>
      <description>&lt;P&gt;jrzhou -&lt;/P&gt;
&lt;P&gt;I've consulted with some architecture experts within Intel and they've given me several reasons that might cause excessive pipeline clearing.&lt;/P&gt;
&lt;P&gt;The first is &lt;EM&gt;false sharing&lt;/EM&gt;. If two threads access different data that happen to sit in the same cache line, you have false sharing. When one thread modifies its variable, the whole cache line becomes "dirty", and every other physical or logical processor that holds the line in its local cache has to pick up the updated line, even though it never touches the modified variable.&lt;/P&gt;
&lt;P&gt;The second case is when the processor detects the chance of a memory order violation. Since such a violation would result in an incorrect program execution, the hardware needs to make sure the correct memory order is maintained. How this problem is handled is going to be specific to the hardware implementation, but every Intel processor is built to detect the possibility and takes steps to guarantee correct memory ordering. There weren't many details about this, but I'm assuming that out of order execution can lead to accessing memory in an incorrect order.&lt;/P&gt;
&lt;P&gt;The third possibility mentioned had to do with self-modifying code, but I'm hoping that we don't need to deal with that.&lt;/P&gt;
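&lt;P&gt;For the false sharing case, here is a rough sketch of the cache-line arithmetic involved. It is only an illustration: the 64-byte line size, the function names, and the assumption that the data starts on a line boundary are all mine, not something the architecture experts specified, so check the line size of your own processor.&lt;/P&gt;

```python
CACHE_LINE_BYTES = 64  # assumed line size; not stated anywhere in this thread


def cache_line_index(byte_offset, line_bytes=CACHE_LINE_BYTES):
    """Index of the cache line that holds the byte at byte_offset."""
    return byte_offset // line_bytes


def share_a_line(offset_a, offset_b, line_bytes=CACHE_LINE_BYTES):
    """True when both offsets fall in the same cache line."""
    return cache_line_index(offset_a, line_bytes) == cache_line_index(offset_b, line_bytes)


def padded_stride(item_bytes, line_bytes=CACHE_LINE_BYTES):
    """Smallest multiple of line_bytes that is at least item_bytes."""
    # Ceiling division written with floor division on negated values.
    return -(-item_bytes // line_bytes) * line_bytes
```

&lt;P&gt;Under these assumptions, two 4-byte per-thread counters at offsets 0 and 4 share a line, while spacing them `padded_stride(4)` bytes apart puts each counter in its own line.&lt;/P&gt;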
&lt;P&gt;Are any of the above helpful?&lt;/P&gt;
&lt;P&gt;-- clay&lt;/P&gt;</description>
      <pubDate>Tue, 11 May 2004 05:24:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Memory-Order-Machine-Clear-Issues/m-p/961842#M5313</guid>
      <dc:creator>ClayB</dc:creator>
      <dc:date>2004-05-11T05:24:23Z</dc:date>
    </item>
    <item>
      <title>Re: Memory Order Machine Clear Issues</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Memory-Order-Machine-Clear-Issues/m-p/961843#M5314</link>
      <description>&lt;DIV&gt;&lt;/DIV&gt;
&lt;DIV&gt;The "&lt;A href="http://www.intel.com/cd/ids/developer/asmo-na/eng/technologies/threading/hyperthreading/53797.htm" target="_blank"&gt;Developing Multithreaded Applications: A Platform Consistent Approach&lt;/A&gt;" document on Intel Developer Services has a section describing false sharing and how to detect it with VTune. If you're interested, see "Avoiding and Identifying False Sharing Among Threads with the VTune Performance Analyzer" in Chapter 2.&lt;/DIV&gt;
&lt;DIV&gt;&lt;/DIV&gt;
&lt;DIV&gt;Henry&lt;/DIV&gt;</description>
      <pubDate>Tue, 11 May 2004 06:26:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Memory-Order-Machine-Clear-Issues/m-p/961843#M5314</guid>
      <dc:creator>Henry_G_Intel</dc:creator>
      <dc:date>2004-05-11T06:26:54Z</dc:date>
    </item>
    <item>
      <title>Re: Memory Order Machine Clear Issues</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Memory-Order-Machine-Clear-Issues/m-p/961844#M5315</link>
      <description>&lt;P&gt;SSE vectorization could produce an overlap between an aligned store and a following un-aligned load, with effects analogous to false sharing, but within a single thread. Future Intel compilers will look for this, and take action to remove the conflict from generated code:&lt;/P&gt;
&lt;P&gt;FORALL(I=5:N)&lt;/P&gt;
&lt;P&gt;A[5:N] = ...&lt;/P&gt;
&lt;P&gt;B[1:N] = A[4:N] + B[1:N]&lt;/P&gt;
&lt;P&gt;The compiler vectorizes the code with parallel SSE operations. The un-aligned loads of segments of A[] have to pick up data stored in both the previous and the current iterations of the parallelized loop.&lt;/P&gt;
&lt;DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Tue, 11 May 2004 07:09:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Memory-Order-Machine-Clear-Issues/m-p/961844#M5315</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2004-05-11T07:09:04Z</dc:date>
    </item>
    <item>
      <title>Re: Memory Order Machine Clear Issues</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Memory-Order-Machine-Clear-Issues/m-p/961846#M5317</link>
      <description>&lt;DIV&gt;&lt;/DIV&gt;
&lt;P&gt;SSE vectorization could produce an overlap between an aligned store and a following un-aligned load, with effects analogous to false sharing, but within a single thread. Future Intel compilers will look for this, and take action to remove the conflict from generated code:&lt;/P&gt;
&lt;P&gt;DO I = 5,N&lt;/P&gt;
&lt;P&gt; A[I] = ...&lt;/P&gt;
&lt;P&gt; B[I] = A[I-1] + B[I]&lt;/P&gt;
&lt;P&gt;END DO&lt;/P&gt;
&lt;P&gt;The compiler vectorizes the loop with parallel SSE operations. The un-aligned loads of segments of A[] have to pick up data stored in both the previous and the current iterations of the parallelized loop. The parallel code is slower than the serial code, because the memory order hardware has to delay each load until the preceding store has gone to memory. In this simple case, the problem is corrected by "distributing" (splitting) the loop, so that all of A[] goes to memory before it is read back, without incurring a memory order clear for each load.&lt;/P&gt;
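&lt;P&gt;A minimal sketch of the loop distribution described above, written in Python rather than Fortran purely for illustration. The function names and the `a_new` input (standing in for the elided right-hand side of the first assignment) are hypothetical. Python only demonstrates that the two loop shapes compute the same values; it does not reproduce the SSE store-to-load timing.&lt;/P&gt;

```python
def fused(a, a_new, b, start):
    """Original loop shape: store A[i], then load A[i-1] in the same pass."""
    a = list(a)
    b = list(b)
    for i in range(start, len(b)):
        a[i] = a_new[i]         # stands in for "A[I] = ..."
        b[i] = a[i - 1] + b[i]  # reads the value stored one iteration earlier
    return a, b


def distributed(a, a_new, b, start):
    """Distributed shape: finish every store to A before any load from A."""
    a = list(a)
    b = list(b)
    for i in range(start, len(b)):
        a[i] = a_new[i]
    for i in range(start, len(b)):
        b[i] = a[i - 1] + b[i]
    return a, b
```

&lt;P&gt;With 0-based lists, `start=4` corresponds to `DO I = 5,N`. Splitting is legal here because each `A[I-1]` was already written by an earlier iteration (or before the loop), so moving all the stores ahead of the loads does not change what any load sees.&lt;/P&gt;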
&lt;DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Tue, 11 May 2004 07:14:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Memory-Order-Machine-Clear-Issues/m-p/961846#M5317</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2004-05-11T07:14:49Z</dc:date>
    </item>
  </channel>
</rss>

