<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: LOCK vs MFENCE in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898189#M4089</link>
    <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Sorry. I forgot to mention that - I did declare the synchonizing flag as a volatile. Beats me why this piece of code is still non -determisitic. To be doubly sure that the compiler itself hasn't reordered code, I made an non-optimized build - (with gcc using -O0).&lt;BR /&gt;&lt;BR /&gt;-Addy&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/347331"&gt;Dmitriy Vyukov&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt;Try to declare X as volatile.&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;BR /&gt;</description>
    <pubDate>Tue, 17 Nov 2009 02:12:23 GMT</pubDate>
    <dc:creator>addyvarma</dc:creator>
    <dc:date>2009-11-17T02:12:23Z</dc:date>
    <item>
      <title>LOCK vs MFENCE</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898179#M4079</link>
      <description>I've measured the latency of the 'lock cmpxchg' and 'mfence' instructions on a Pentium 4 processor and got the following results:&lt;BR /&gt;&lt;BR /&gt;lock cmpxchg - 100 cycles&lt;BR /&gt;mfence - 104 cycles&lt;BR /&gt;&lt;BR /&gt;So I conclude that they are nearly identical with respect to cycles consumed.&lt;BR /&gt;&lt;BR /&gt;But is there any difference between them with respect to overall system performance, especially on modern multicore processors (Core 2 Duo, Core 2 Quad)?&lt;BR /&gt;&lt;BR /&gt;Is the following assumption correct: the LOCK prefix involves bus/cache locking, so it affects total system performance, whereas mfence has only a local impact on the current core?&lt;BR /&gt;&lt;BR /&gt;Or, more practically: if I have two algorithms, one using the LOCK prefix and the other using mfence, and all other things are equal, which should I prefer?&lt;BR /&gt;&lt;BR /&gt;Thanks in advance&lt;BR /&gt;Dmitriy V'jukov&lt;BR /&gt;</description>
      <pubDate>Tue, 30 Oct 2007 12:20:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898179#M4079</guid>
      <dc:creator>Dmitry_Vyukov</dc:creator>
      <dc:date>2007-10-30T12:20:54Z</dc:date>
    </item>
    <item>
      <title>Re: LOCK vs MFENCE</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898180#M4080</link>
      <description>&lt;P&gt;Those two instructions do completely different things. You cannot use mfence instead of lock prefix.&lt;/P&gt;
&lt;P&gt;&lt;B&gt;Description:&lt;/B&gt;&lt;/P&gt;
&lt;P&gt;Performs a serializing operation on all load and store instructions that were issued prior to the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction is globally visible before any load or store instruction that follows the MFENCE instruction. The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any SFENCE and LFENCE instructions, and any serializing instructions (such as the CPUID instruction).&lt;/P&gt;
&lt;P&gt;Weakly ordered memory types can enable higher performance through such techniques as out-of-order issue, speculative reads, write-combining, and write-collapsing. The degree to which a consumer of data recognizes or knows that the data is weakly ordered varies among applications and may be unknown to the producer of this data. The MFENCE instruction provides a performance-efficient way of ensuring ordering between routines that produce weakly-ordered results and routines that consume this data.&lt;/P&gt;
&lt;P&gt;It should be noted that processors are free to speculatively fetch and cache data from system memory regions that are assigned a memory-type that permits speculative reads (that is, the WB, WC, and WT memory types). The PREFETCHh instruction is considered a hint to this speculative behavior. Because this speculative fetching can occur at any time and is not tied to instruction execution, the MFENCE instruction is not ordered with respect to PREFETCHh or any of the speculative fetching mechanisms (that is, data could be speculatively loaded into the cache just before, during, or after the execution of an MFENCE instruction).&lt;/P&gt;</description>
      <pubDate>Fri, 02 Nov 2007 23:10:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898180#M4080</guid>
      <dc:creator>levicki</dc:creator>
      <dc:date>2007-11-02T23:10:34Z</dc:date>
    </item>
    <item>
      <title>Re: LOCK vs MFENCE</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898181#M4081</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;DIV&gt;&lt;IMG src="https://community.intel.com/file/6745" /&gt; &lt;STRONG&gt;IgorLevicki:&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;P&gt;Those two instructions do completely different things. You cannot use mfence instead of lock prefix.&lt;/P&gt;&lt;/DIV&gt;&lt;/BLOCKQUOTE&gt;&lt;BR /&gt;&lt;BR /&gt;I know.&lt;BR /&gt;&lt;BR /&gt;This can be learned from basic documentation. What can't be learned from basic documentation is their impact on system performance.&lt;BR /&gt;&lt;BR /&gt;Dmitriy V'jukov&lt;BR /&gt;</description>
      <pubDate>Sun, 04 Nov 2007 15:41:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898181#M4081</guid>
      <dc:creator>Dmitry_Vyukov</dc:creator>
      <dc:date>2007-11-04T15:41:48Z</dc:date>
    </item>
    <item>
      <title>Re: LOCK vs MFENCE</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898182#M4082</link>
      <description>&lt;P&gt;I do not know what you are trying to do, but I can tell you this: the most I ever needed was the SFENCE instruction, when I was using non-temporal stores to copy data.&lt;/P&gt;
&lt;P&gt;That said, I haven't noticed any performance degradation from SFENCE. If there was any, it was offset by the faster transfer speed that came from using non-temporal stores. Bear in mind, though, that those stores are meant to be used only for data which won't be immediately reused, and that store buffers are a scarce resource in some CPUs.&lt;/P&gt;
&lt;P&gt;Finally, whatever their impact may be, fences are needed for correct memory ordering, so if your code needs them there is no question of whether you should use them.&lt;/P&gt;</description>
      <pubDate>Sun, 04 Nov 2007 18:24:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898182#M4082</guid>
      <dc:creator>levicki</dc:creator>
      <dc:date>2007-11-04T18:24:41Z</dc:date>
    </item>
    <item>
      <title>Re: LOCK vs MFENCE</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898183#M4083</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;DIV&gt;&lt;IMG src="https://community.intel.com/file/6745" /&gt; &lt;STRONG&gt;IgorLevicki:&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;P&gt;I do not know what you are trying to do...&lt;/P&gt;&lt;/DIV&gt;&lt;/BLOCKQUOTE&gt;&lt;BR /&gt;&lt;BR /&gt;Consider, for example, the following situation.&lt;BR /&gt;A program uses a large number of mutexes. Each mutex synchronizes only 2 threads.&lt;BR /&gt;I can implement the mutex with:&lt;BR /&gt;1. The "traditional" scheme, based on "lock xchg" in the acquire operation and a "naked store" in the release operation.&lt;BR /&gt;2. The Peterson algorithm, based on a #StoreLoad memory barrier (mfence) in the acquire operation and a "naked store" in the release operation.&lt;BR /&gt;&lt;BR /&gt;So the net difference is LOCK vs MFENCE.&lt;BR /&gt;&lt;BR /&gt;The question is: will there be any difference in system performance on a quad-core machine?&lt;BR /&gt;&lt;BR /&gt;Dmitriy V'jukov&lt;BR /&gt;</description>
      <pubDate>Mon, 05 Nov 2007 13:16:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898183#M4083</guid>
      <dc:creator>Dmitry_Vyukov</dc:creator>
      <dc:date>2007-11-05T13:16:35Z</dc:date>
    </item>
    <item>
      <title>Re: LOCK vs MFENCE</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898184#M4084</link>
      <description>&lt;P class="MsoNormal"&gt;lock has a similar effect to mfence, so from that respect they should have the same performance. &lt;P&gt;&lt;/P&gt;&lt;/P&gt;
&lt;P class="MsoNormal"&gt;The Peterson algorithm that I found on the internet has 3 synchronization variables (one for the loser and two for the interested parties) that the threads share. The traditional algorithm has only one synchronization variable. Therefore the Peterson algorithm has more potential for long latency modified data sharing. &lt;/P&gt;
&lt;P class="MsoNormal"&gt;&lt;P&gt;&lt;/P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 05 Nov 2007 20:03:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898184#M4084</guid>
      <dc:creator>Anat_S_Intel</dc:creator>
      <dc:date>2007-11-05T20:03:52Z</dc:date>
    </item>
    <item>
      <title>Re: LOCK vs MFENCE</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898185#M4085</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;DIV&gt;&lt;IMG src="https://community.intel.com/file/6745" /&gt; &lt;STRONG&gt;anshgm:&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;P class="MsoNormal"&gt;lock has a similar effect to mfence, so from that respect they should have the same performance.&lt;/P&gt;&lt;/DIV&gt;&lt;/BLOCKQUOTE&gt;&lt;BR /&gt;&lt;BR /&gt;SHOULD HAVE or HAVE?&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BLOCKQUOTE&gt;&lt;DIV&gt;&lt;IMG src="https://community.intel.com/file/6745" /&gt; &lt;STRONG&gt;anshgm:&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;P class="MsoNormal"&gt;The Peterson
algorithm that I found on the internet has 3 synchronization variables
(one for the loser and two for the interested parties) that the threads
share. The traditional algorithm has only one synchronization variable.
Therefore the Peterson algorithm has more potential for long latency
modified data sharing. &lt;/P&gt;
&lt;/DIV&gt;&lt;/BLOCKQUOTE&gt;&lt;BR /&gt;&lt;BR /&gt;IMHO the number of shared variables doesn't play a significant role. What matters is the number of shared cache lines, and the number of heavy operations (lock, mfence).&lt;BR /&gt;&lt;BR /&gt;Dmitriy V'jukov&lt;BR /&gt;</description>
      <pubDate>Tue, 06 Nov 2007 08:02:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898185#M4085</guid>
      <dc:creator>Dmitry_Vyukov</dc:creator>
      <dc:date>2007-11-06T08:02:32Z</dc:date>
    </item>
    <item>
      <title>Re: LOCK vs MFENCE</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898186#M4086</link>
      <description>&lt;P class="MsoNormal"&gt;&lt;FONT face="Courier New"&gt;Here is an implementation I did a while back which makes use of &lt;EM&gt;MFENCE&lt;/EM&gt;:&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;A href="http://groups.google.com/group/comp.programming.threads/browse_frm/thread/c49c0658e2607317"&gt;&lt;FONT face="Courier New"&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;A href="http://groups.google.com/group/comp.programming.threads/browse_frm/thread/c49c0658e2607317" target="_blank"&gt;http://groups.google.com/group/comp.programming.threads/browse_frm/thread/c49c0658e2607317&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&lt;FONT face="Courier New"&gt;The fact that is has 3 variables doesnt mean that much because, as Dmitriy points out, they all fit within a single L2-Cacheline. The only advantage I can see is that the algorithm contains no interlocked RMW instructions, which should be easier on the FSB.&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 19 Dec 2007 23:14:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898186#M4086</guid>
      <dc:creator>Chris_M__Thomasson</dc:creator>
      <dc:date>2007-12-19T23:14:52Z</dc:date>
    </item>
    <item>
      <title>Re: LOCK vs MFENCE</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898187#M4087</link>
      <description>Hi&lt;BR /&gt;&lt;BR /&gt;I was trying to implement a flavor of barrier wait to synchronize pthreads (the machine is a Xeon X5, running Linux 2.6 and gcc 4.1.1).&lt;BR /&gt;To ensure sequential consistency, I sprinkled my code with mfences. Unfortunately, I find the memory consistency being broken when I run this code. I was wondering if I've missed something?&lt;BR /&gt;&lt;BR /&gt;I'm using 3 booleans as the synchronization flags and have spread them across 3 cache lines to prevent false sharing. I am synchronizing 3 threads (taking 3 as a small example).&lt;BR /&gt;&lt;BR /&gt;uint8_t X[3 * cacheline_size]; (where cacheline_size is 64 in this case)&lt;BR /&gt;&lt;BR /&gt;my mem barrier is " __asm__ __volatile__ ("mfence" : : : "memory");"&lt;BR /&gt;(pardon the overkill - i've used mem barriers instead of lfences/sfences - just wanted to make sure this works, before i &lt;BR /&gt;optimize it)&lt;BR /&gt;&lt;SPAN style="color: #ff0000;"&gt;&lt;SPAN style="text-decoration: underline;"&gt;&lt;BR /&gt;ThreadA&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;while (1) {&lt;BR /&gt; mem barrier #1;&lt;BR /&gt; while (X[0 * cacheline_size] != 0) ; // spin; I tried inserting pause too &lt;STRONG&gt;&lt;SPAN style="color: #ff0000;"&gt;------------&amp;gt; POINT A&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;BR /&gt; mem barrier #2;&lt;BR /&gt; do some work;&lt;BR /&gt; &lt;BR /&gt; mem barrier #3;&lt;BR /&gt; X[0 * cacheline_size] = 1;&lt;BR /&gt; mem barrier #4;&lt;BR /&gt;&lt;BR /&gt; while (X[0 * cacheline_size] != 2) ; // spin &lt;BR /&gt;&lt;BR /&gt; mem barrier #5;&lt;BR /&gt; do something;&lt;BR /&gt;&lt;BR /&gt; mem barrier #6;&lt;BR /&gt; X[0 * cacheline_size] = 3; &lt;STRONG&gt;&lt;SPAN style="color: #ff0000;"&gt;------------&amp;gt; POINT C&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;BR /&gt; mem barrier #7;&lt;BR /&gt;}&lt;BR /&gt;............ 
&lt;BR /&gt; (similar code in the other threads)&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN style="color: #ff0000;"&gt;Main synchronizing thread&lt;/SPAN&gt;&lt;BR /&gt;------------------------------------&lt;BR /&gt;while (1) {&lt;BR /&gt; mem barrier #8;&lt;BR /&gt; while (all X[0...n] not 1) ; // =&amp;gt; spin until (X[0 * cacheline_size] == 1) || (X[1 * cacheline_size] == 1) etc.&lt;BR /&gt; mem barrier #9;&lt;BR /&gt;&lt;BR /&gt; X[0...n] = 2; // tried doing this with a CAS too - didn't help&lt;BR /&gt; mem barrier #10;&lt;BR /&gt;&lt;BR /&gt; while (X[0...n] != 3) ; // spin &lt;STRONG&gt;&lt;SPAN style="color: #ff0000;"&gt;------------&amp;gt; POINT B&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;BR /&gt; mem barrier #11;&lt;BR /&gt; &lt;BR /&gt; do something;&lt;BR /&gt; mem barrier #12;&lt;BR /&gt; &lt;BR /&gt; X[0...n] = 0;&lt;BR /&gt; mem barrier;&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;Problem/Issue&lt;BR /&gt;&lt;BR /&gt;I get into a deadlock at times: thread A is waiting on (X[0] != 0) (point A) and the main thread is waiting at point B with X[0] = 2. There's no way that could have happened unless thread A bypassed point C.&lt;BR /&gt;&lt;BR /&gt;Could someone tell me if I've screwed up someplace?&lt;BR /&gt;&lt;BR /&gt;Thanks&lt;BR /&gt;-Addy&lt;BR /&gt;</description>
      <pubDate>Mon, 16 Nov 2009 13:27:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898187#M4087</guid>
      <dc:creator>addyvarma</dc:creator>
      <dc:date>2009-11-16T13:27:48Z</dc:date>
    </item>
    <item>
      <title>Re: LOCK vs MFENCE</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898188#M4088</link>
      <description>Try to declare X as volatile.&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Mon, 16 Nov 2009 17:12:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898188#M4088</guid>
      <dc:creator>Dmitry_Vyukov</dc:creator>
      <dc:date>2009-11-16T17:12:39Z</dc:date>
    </item>
    <item>
      <title>Re: LOCK vs MFENCE</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898189#M4089</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Sorry. I forgot to mention that - I did declare the synchonizing flag as a volatile. Beats me why this piece of code is still non -determisitic. To be doubly sure that the compiler itself hasn't reordered code, I made an non-optimized build - (with gcc using -O0).&lt;BR /&gt;&lt;BR /&gt;-Addy&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/347331"&gt;Dmitriy Vyukov&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt;Try to declare X as volatile.&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;BR /&gt;</description>
      <pubDate>Tue, 17 Nov 2009 02:12:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898189#M4089</guid>
      <dc:creator>addyvarma</dc:creator>
      <dc:date>2009-11-17T02:12:23Z</dc:date>
    </item>
    <item>
      <title>Re: LOCK vs MFENCE</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898190#M4090</link>
      <description>Please post the full code in a new forum thread.&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Tue, 17 Nov 2009 18:47:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898190#M4090</guid>
      <dc:creator>Dmitry_Vyukov</dc:creator>
      <dc:date>2009-11-17T18:47:14Z</dc:date>
    </item>
    <item>
      <title>Re: LOCK vs MFENCE</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898191#M4091</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/452347"&gt;addyvarma&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt;my mem barrier is " __asm__ __volatile__ ("mfence" : : : "memory");"&lt;BR /&gt;(pardon the overkill - i've used mem barriers instead of lfences/sfences - just wanted to make sure this works, before i &lt;BR /&gt;optimize it)&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;BR /&gt;You probably misunderstand the semantics of x86 fences.&lt;BR /&gt;SFENCE is of use ONLY if you use non-temporal store instructions (e.g. MOVNTI).&lt;BR /&gt;And LFENCE is useless for ordinary memory accesses; it's basically a no-op.&lt;BR /&gt;&lt;BR /&gt;MFENCE is of use ONLY if you are trying to order a critical store-load sequence. As far as I can see, there are no critical store-load sequences in your example, so you need no hardware fences on x86 at all. Just declare the variables as volatile so that the compiler preserves program order.&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Tue, 17 Nov 2009 19:03:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898191#M4091</guid>
      <dc:creator>Dmitry_Vyukov</dc:creator>
      <dc:date>2009-11-17T19:03:45Z</dc:date>
    </item>
    <item>
      <title>Re: LOCK vs MFENCE</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898192#M4092</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Sure, I'll put it in as another post.&lt;BR /&gt;&lt;BR /&gt;But here's one question, before I move it to another post.&lt;BR /&gt;&lt;BR /&gt;I did use just volatiles before I used the fences - and that because just using volatiles didn't help, I had to resort to fences&lt;BR /&gt;(Adve does talk of the sequential consistency issues for busy wait sync.).&lt;BR /&gt;&lt;BR /&gt;So I guess we are missing something really basic here.&lt;BR /&gt;&lt;BR /&gt;Thanks.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/347331"&gt;Dmitriy Vyukov&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt;
&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/452347"&gt;addyvarma&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt;my mem barrier is  " __asm__ __volatile__ ("mfence" : : : "memory");"&lt;BR /&gt;(pardon the overkill - i've used mem barriers instead of lfences/sfences - just wanted to make sure this works, before i &lt;BR /&gt;optimize it)&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;BR /&gt;You probably misunderstand the semantics of x86 fences.&lt;BR /&gt;SFENCE is of use ONLY if you use non-temporal store instructions (e.g. MOVNTI).&lt;BR /&gt;And LFENCE is useless for ordinary memory accesses; it's basically a no-op.&lt;BR /&gt;&lt;BR /&gt;MFENCE is of use ONLY if you are trying to order a critical store-load sequence. As far as I can see, there are no critical store-load sequences in your example, so you need no hardware fences on x86 at all. Just declare the variables as volatile so that the compiler preserves program order.&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;BR /&gt;</description>
      <pubDate>Wed, 18 Nov 2009 17:29:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898192#M4092</guid>
      <dc:creator>addyvarma</dc:creator>
      <dc:date>2009-11-18T17:29:36Z</dc:date>
    </item>
    <item>
      <title>Re: LOCK vs MFENCE</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898193#M4093</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/452347"&gt;addyvarma&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt;
&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;/EM&gt;I did use just volatiles before I used the fences, and because just using volatiles didn't help, I had to resort to fences&lt;BR /&gt;(Adve does talk of the sequential consistency issues for busy-wait sync).&lt;BR /&gt;&lt;BR /&gt;So I guess we are missing something really basic here.&lt;BR /&gt;&lt;BR /&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;BR /&gt;Since your program does not work with fences either, the problem is probably somewhere else; it is difficult to say without seeing the code.&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Wed, 18 Nov 2009 17:46:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/LOCK-vs-MFENCE/m-p/898193#M4093</guid>
      <dc:creator>Dmitry_Vyukov</dc:creator>
      <dc:date>2009-11-18T17:46:54Z</dc:date>
    </item>
  </channel>
</rss>

