<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Lock-Free using cmpxchg8b... in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989698#M6123</link>
    <description>&amp;gt; cmp8xchg16 is documented in the Itanium architecture&lt;BR /&gt;&amp;gt; software developers manual.  It's basically a compare&lt;BR /&gt;&amp;gt; 8 bytes, swap 16 bytes.  &lt;BR /&gt;&lt;BR /&gt;Thanks, Joe.  This is what I thought it was from the name.  I guess this corresponds to what the french paper defines as CAS2.  I still can't see how this would work, though, as intended.  The idea about avoiding the ABA problem is to compare (and exchange) both the pointer involved and the count of operations.  If the pointer has ended up with the same value even after other threads have updated the structure, the count of operations will have changed to inform the original thread that it must recheck the pointer and counter values.&lt;BR /&gt;&lt;BR /&gt;Maged Michael's paper from PODC 2002 points out that DCAS (his name for CAS2) needs to make the comparison of both values before instigating the exchange.  This was as I thought, but did not seem to be laid out in the french paper.  Unless, I thought, the memory layout was working with the CAS2 operation.  That is, given a single memory address that was actually going to refer to a double word location.  So, you would need to place the pointer and the reference count into adjoining memory locations, but could use separate (8-byte?) values for comparison and exchange.  That made some sense, but I would think it would be better designed to have the same three arguments as the simple CAS, but double the size of the arguments which would require the user to place values in the hi and lo parts of the double area.  Still, since this was only an idea in the french paper to solve an algorithmic problem, anything would be feasible.&lt;BR /&gt;&lt;BR /&gt;Unfortunately, the cmp8xchg16b looks to only compare the first 8 bytes, but exchange 16 bytes based on that comparison.  This seems to be the same as my first interpretation on the CAS2 primitive.  If so, does this not suffer from the same drawbacks and not give the direct solution to the ABA problem that was given in the french paper?  Since Michael states that the DCAS is not a primitive implemented on any architecture (and I don't think the cmp8xchg16b does a proper job of it), does he have the more realizable ABA avoiding algorithm? &lt;BR /&gt;&lt;BR /&gt;I guess I too wonder why there was no cmpxchg16b implemented on Itanium.  This would seem to have solved all sorts of extra problems that the 8 byte compare can't handle easily.  I suppose we'll have to poll someone involved with Itanium development to find the answer.&lt;BR /&gt;&lt;BR /&gt;-- clay&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;</description>
    <pubDate>Mon, 02 Jun 2003 22:20:34 GMT</pubDate>
    <dc:creator>ClayB</dc:creator>
    <dc:date>2003-06-02T22:20:34Z</dc:date>
    <item>
      <title>Lock-Free using cmpxchg8b...</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989677#M6102</link>
      <description>Your threading tips are sound. The thread aware memory pools are nice. =)&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Here are some of my advanced threading tips: =)&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;You should spice up your threading site by adding a:&lt;BR /&gt;&lt;BR /&gt;" Rock-Solid Lock-Free Algo's /w cmpxchg8b &amp;amp; cmp8xchg16b Tutorials " additon to your site.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;You Intel guys really need to promote the cmpxchg8b and the cmp8xchg16b ( for Merced I think ) opcodes. When you code lock-free algo's using these, they make advanced threading a work of art. They are not just for creating locks you know. =)&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;I have implemented a 100% ABA proof lock-free IBM FreeList with your cmpxchg8b opcode:&lt;BR /&gt;&lt;BR /&gt;&lt;A href="http://appcore.home.attbi.com/src/FreeList.c"&gt;FreeList.c&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Joe Seigh and I have developed A 100% ABA proof " real " lock-free atomic pointer and I did a lock-free fifo queue with it:&lt;BR /&gt;&lt;BR /&gt;&lt;A href="http://appcore.home.attbi.com/zips/AtomicPtrABATest.zip"&gt;AtomicPtrABATest.zip&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;The AtomicPtr combines 100% ABA proof lock-free algos with the ability for a thread to reference a shared pointer without having to already own a marshaled copy!&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Also, look at my advanced fifo queue code:&lt;BR /&gt;&lt;BR /&gt;&lt;A href="http://appcore.home.attbi.com/src/Queue.c"&gt;Queue.c&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;This code is very " thread " friendly. Put this code up against a " normal " fifo shared queue, and it will blow it away.&lt;BR /&gt;&lt;BR /&gt;You should add some similar code examples on the Intel threading sites, or just add a link to my page. ;)&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;What do you threading masters think about it?&lt;BR /&gt;&lt;BR /&gt;;)&lt;BR /&gt;&lt;BR /&gt;Thank you.</description>
      <pubDate>Mon, 19 May 2003 07:30:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989677#M6102</guid>
      <dc:creator>Intel_C_Intel</dc:creator>
      <dc:date>2003-05-19T07:30:32Z</dc:date>
    </item>
    <item>
      <title>Re: Lock-Free using cmpxchg8b...</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989678#M6103</link>
      <description>Hmm,&lt;BR /&gt;&lt;BR /&gt;I've used locks (as in "lock xchg" and "lock inc", et al), but I'm not familiar with cmpxchg8b and the likes.  Can someone explain them to me in a little more detail, and how best to use them?&lt;BR /&gt;&lt;BR /&gt;Thanks!&lt;BR /&gt;-Dale&lt;BR /&gt;</description>
      <pubDate>Tue, 20 May 2003 01:19:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989678#M6103</guid>
      <dc:creator>netdevil</dc:creator>
      <dc:date>2003-05-20T01:19:57Z</dc:date>
    </item>
    <item>
      <title>Re: Lock-Free using cmpxchg8b...</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989679#M6104</link>
      <description>This paper describes some rock-solid lock-free collections ( LIFO &amp;amp; FIFO ) that can be used for very SMP friendly inter-thread &amp;amp; process communication.&lt;BR /&gt;&lt;BR /&gt;&lt;A href="http://perso.wanadoo.fr/gmem/evenements/jim2002/articles/L17_Fober.pdf"&gt;Lock-Free Collection PDF&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;After reading that, you should be on track.&lt;BR /&gt;&lt;BR /&gt;Now? How to implement those lock-free alog?s for Pentium?s, Xeon?s, and Itanium processors?&lt;BR /&gt;&lt;BR /&gt;That?s where the cmpxchg8b and cmp8xchg16b opcodes come in to play.&lt;BR /&gt;&lt;BR /&gt;They are CAS operations that swap memory sizes double that of the system pointer size. Very usefull indeed.&lt;BR /&gt;&lt;BR /&gt;You can use this to swap a pointer in a fashion that avoids the ABA problem that the paper talks about. You do this by atomically setting a read ABA id on the target shared memory, and comparing the read id on any CAS on the shared memory. It turns CAS into a LL/SC.&lt;BR /&gt;&lt;BR /&gt;In fact, the Windows API has SList's. They happen to use the exact same lifo algo the paper describes. So Microsoft does use lock-free stuff in there kernals, and even export one for the " general " public. =)&lt;BR /&gt;&lt;BR /&gt;Here are some 64-bit compare and swap's from my library you can use to implement the lock-free algos for 32-bit systems:&lt;BR /&gt;&lt;BR /&gt;&lt;A href="http://appcore.home.attbi.com/include/Atomic.h"&gt;Atomic.h&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;The #ifdefs should be self-explanitory.&lt;BR /&gt;&lt;BR /&gt;A 64-bit version ( Itanium ) would use the cmp8xchg16b opcode.&lt;BR /&gt;&lt;BR /&gt;Lock-Free are what SMP systems thrive on. ;)&lt;BR /&gt;&lt;BR /&gt;PowerPC's can use the algo's with LL(reserved)/SC, any processor with a LL(reserved)/SC can do lock-free collections.</description>
      <pubDate>Tue, 20 May 2003 08:29:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989679#M6104</guid>
      <dc:creator>Intel_C_Intel</dc:creator>
      <dc:date>2003-05-20T08:29:36Z</dc:date>
    </item>
    <item>
      <title>Re: Lock-Free using cmpxchg8b...</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989680#M6105</link>
      <description>lockfree -&lt;BR /&gt;&lt;BR /&gt;Thanks for posting a pointer to the paper.  It helped quite a bit in understanding what you have been proposing.  I had not heard of the "ABA problem" before, but this gave a good example.  Of course, gaining understanding in one aspect of a topic usually brings up new questions and I'd like to post those here for anyone to answer.&lt;BR /&gt;&lt;BR /&gt;But before I get to that, let me first get nitpicky.  The first sentence of the paper starts with the definition: "A shared data structure is lock-free if its operations do not require mutual exclusion".  I would amend that by saying "...do not require separate mutual exclusion objects to control access" since the shared queue and stack data structures clearly need mutual exclusion to get proper functionality.  However, the methods that are used in the paper don't need any additional synchronization object to enforce mutual exclusion.&lt;BR /&gt;&lt;BR /&gt;Back in the old days, when I was first starting my computer science education, we learned about the test and swap instruction.  The compare and swap (CAS) atomic operation described in the paper is a generalization of that with the ability to compare against any value and assign any value into the memory location, rather than testing for zero and setting a '1'.  I did wonder about the CAS2 operation described to assign two values atomically.  Should there not be a second memory location as one of the parameters?  Otherwise, where does the instruction know where it must assign the second value?  Or have I missed something in the description?&lt;BR /&gt;&lt;BR /&gt;My second question is whether these methods could be used in something other than a linear data structure like the LIFO and FIFO examples?  Could we use them to protect elements from an array of shared structs?  It seems to me that the technique is a double check algorithm that uses the fact that only a single element (a head or next pointer) needs to be modified in the existing structure.  If I needed to modify several parts of an object (e.g., x-, y-, and z-coordinates, speed, plus other physical properties of something moving through space), could a lock-free mechanism be created to protect access to this struct?  &lt;BR /&gt;&lt;BR /&gt;I haven't really devoted any time trying to come up with an answer.  I was hoping that someone more familiar with the ideas of lock-free techniques would have already done something like this and can share the more general method(s) with all of the Threading Forum participants.&lt;BR /&gt;&lt;BR /&gt;-- clay</description>
      <pubDate>Wed, 21 May 2003 22:00:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989680#M6105</guid>
      <dc:creator>ClayB</dc:creator>
      <dc:date>2003-05-21T22:00:40Z</dc:date>
    </item>
    <item>
      <title>Re: Lock-Free using cmpxchg8b...</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989681#M6106</link>
      <description>Lock-free algo's do make threads contend, but the overhead of lock-free contention is not even close to the likes of a critical section.  ;)&lt;BR /&gt;&lt;BR /&gt;I consider a shared object lock-free if:&lt;BR /&gt;&lt;BR /&gt;1. It doesn't use any OS provided critical sections.&lt;BR /&gt;&lt;BR /&gt;2. It doesn't make contending threads wait in a line.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;The cmpxchg8b instruction swaps into a 64-bit destination. So, if you pass cmpxchg8b two 32-bit values to swap into the destination. One will be the low, and one will be the high part of the 64-bit destination.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;It is useful in swapping pointers... If it was an array of pointers then you could CAS the cells to new ( updated ) objects. I haven?t worked with lock-free algo?s that implement a dynamic-array yet.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Here is a paper that I find really useful for accessing large collections ( arrays, whatever ) concurrently:&lt;BR /&gt;&lt;BR /&gt;&lt;A href="http://msdn.microsoft.com/msdnmag/issues/01/08/Concur/default.aspx"&gt;MSDN Magazine, Concurrent Access&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;I use the papers rock-solid method to protect data when I can?t think up a lock-free solution. =)&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;The lock-free collections that I presented are very good at inter-thread comm.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;I don?t work really work with Gaming Servers, but?&lt;BR /&gt;&lt;BR /&gt;Say you had a simple SMP Geometry Server, which had two thread pools. A Geometry and a Socket thread pool.&lt;BR /&gt;&lt;BR /&gt;The Socket Pool received incoming data about client character moves,and other various stuff. The Socket Pool parses the messages, and sends them to the Geometry Pool via. lock-free FIFO queue for some math.&lt;BR /&gt;&lt;BR /&gt;The Geometry Pool does some collision detection, updates world objects around the player moves, ect? The sends back world information to the Socket Pool via. lock-free collection.&lt;BR /&gt;&lt;BR /&gt;The Socket Pool then distributes the updated world data to it?s connected tiers.&lt;BR /&gt;&lt;BR /&gt;Well, that would beat the pants off ANY other SMP system that used ? normal, blocking ? FIFO queues for updating world objects under load.&lt;BR /&gt;&lt;BR /&gt;=)&lt;BR /&gt;&lt;BR /&gt;Does that sound useful for a game server at all?</description>
      <pubDate>Thu, 22 May 2003 07:52:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989681#M6106</guid>
      <dc:creator>Intel_C_Intel</dc:creator>
      <dc:date>2003-05-22T07:52:20Z</dc:date>
    </item>
    <item>
      <title>Re: Lock-Free using cmpxchg8b...</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989682#M6107</link>
      <description>Mr. lockfree,&lt;BR /&gt;I read your article, but looks like your code examples are not thread-safe. Here is the lifo-pop:&lt;BR /&gt;&lt;BR /&gt;lifo-pop (lf: pointer to lifo): pointer to cell&lt;BR /&gt;SC1: loop&lt;BR /&gt;SC2: head = lf-&amp;gt;top # get the top cell of the lifo&lt;BR /&gt;SC2: oc = lf-&amp;gt;ocount # get the pop operations count&lt;BR /&gt;SC3: if head == NULL&lt;BR /&gt;SC4: return NULL # LIFO is empty&lt;BR /&gt;SC5: endif&lt;BR /&gt;SC6: next = head-&amp;gt;next # get the next cell of cell&lt;BR /&gt;SC7: if CAS2 (&amp;amp;lf-&amp;gt;top, head, oc, next, oc + 1) # try to change both the top of the lifo and pop count&lt;BR /&gt;SC8: break&lt;BR /&gt;SC9: endif&lt;BR /&gt;SC10: endloop&lt;BR /&gt;SC11: return head&lt;BR /&gt;Figure 9: lifo-pop catching the ABA problem&lt;BR /&gt;&lt;BR /&gt;It looks like the line SC6 can generate exception in multithreaded environment, if some other thread will assine NULL to the head between SC3 and SC6. If you will fix the example by putting the line SC6 inside try..catch construct, it will work for sure without loosing the performance. &lt;BR /&gt;&lt;BR /&gt;The same problem exactly I found in your fifo-pop example.&lt;BR /&gt;</description>
      <pubDate>Sat, 24 May 2003 03:32:11 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989682#M6107</guid>
      <dc:creator>Intel_C_Intel</dc:creator>
      <dc:date>2003-05-24T03:32:11Z</dc:date>
    </item>
    <item>
      <title>Re: Lock-Free using cmpxchg8b...</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989683#M6108</link>
      <description>One more note for Mr. lockfree:&lt;BR /&gt;Proposed algorithm can work only on SMP systems with shared data cache. I think this also should be noted in your work.&lt;BR /&gt;</description>
      <pubDate>Sat, 24 May 2003 03:43:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989683#M6108</guid>
      <dc:creator>Intel_C_Intel</dc:creator>
      <dc:date>2003-05-24T03:43:51Z</dc:date>
    </item>
    <item>
      <title>Re: Lock-Free using cmpxchg8b...</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989684#M6109</link>
      <description>You have no idea about what you are talking about.&lt;BR /&gt;&lt;BR /&gt;You need to learn about lock-free algo's.&lt;BR /&gt;&lt;BR /&gt;You need to learn about how lock-free algo use memory.&lt;BR /&gt;&lt;BR /&gt;You need to learn about how to use lock-free algo's correctly.&lt;BR /&gt;&lt;BR /&gt;You need to learn why you can't free nodes in lock-free algo's without atomic_ptr or SMR.&lt;BR /&gt;&lt;BR /&gt;&lt;A href="http://www.research.ibm.com/people/m/michael/podc-2002.pdf"&gt;IBM SMR Algo&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;A href="http://appcore.home.attbi.com/src/AtomicPtr.c"&gt;AtomicPtr.c&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;The algos ARE THREAD SAFE period, Microsoft API exposes a lock-free LIFO stack, the SList API:&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Here are both the algos in the paper i cited in working code:&lt;BR /&gt;&lt;BR /&gt;&lt;A href="http://appcore.home.attbi.com/src/FreeList.c"&gt;FreeList.c&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;A href="http://appcore.home.attbi.com/src/FreeQueue.c"&gt;FreeQueue.c&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;You don't need a SMP box to runs these, what in the world made you think that bulls%I^?&lt;BR /&gt;&lt;BR /&gt;Before you trash a set of algo's that have been working for many years, you must do some homework. Or you risk looking like a fool.</description>
      <pubDate>Sat, 24 May 2003 08:10:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989684#M6109</guid>
      <dc:creator>Intel_C_Intel</dc:creator>
      <dc:date>2003-05-24T08:10:07Z</dc:date>
    </item>
    <item>
      <title>Re: Lock-Free using cmpxchg8b...</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989685#M6110</link>
      <description>Whoops, I forgot to add the Microsoft SList link...&lt;BR /&gt;&lt;BR /&gt;&lt;A href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/interlockedpushentryslist.asp"&gt;Interlocked SList&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;Disassemble the SList API on an IA-32 processor, and you will see that it is lock-free. It also uses the exact algo in the paper I cited.&lt;BR /&gt;&lt;BR /&gt;I am sorry if I came across as an ass. I think I did.&lt;BR /&gt;&lt;BR /&gt;I thought that you said there was a bug for sure, instead you were actually asking me if there might me a bug.&lt;BR /&gt;&lt;BR /&gt;Sorry for misreading your post.</description>
      <pubDate>Sat, 24 May 2003 08:58:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989685#M6110</guid>
      <dc:creator>Intel_C_Intel</dc:creator>
      <dc:date>2003-05-24T08:58:24Z</dc:date>
    </item>
    <item>
      <title>Re: Lock-Free using cmpxchg8b...</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989686#M6111</link>
      <description>To Mr. lockfree&lt;BR /&gt;&lt;BR /&gt;As one that already coded industry-level multithreaded algorithms for many-many years&lt;BR /&gt;&lt;BR /&gt;:-) :-) :-)&lt;BR /&gt;</description>
      <pubDate>Sat, 24 May 2003 20:23:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989686#M6111</guid>
      <dc:creator>Intel_C_Intel</dc:creator>
      <dc:date>2003-05-24T20:23:30Z</dc:date>
    </item>
    <item>
      <title>Re: Lock-Free using cmpxchg8b...</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989687#M6112</link>
      <description>By the way, did you check your algorithms with Intel Thread Checker ?</description>
      <pubDate>Sat, 24 May 2003 20:29:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989687#M6112</guid>
      <dc:creator>Intel_C_Intel</dc:creator>
      <dc:date>2003-05-24T20:29:53Z</dc:date>
    </item>
    <item>
      <title>Re: Lock-Free using cmpxchg8b...</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989688#M6113</link>
      <description>&amp;gt; What do you threading masters think about it?&lt;BR /&gt;&lt;BR /&gt;Take a C++ class, Sender. ;-)&lt;BR /&gt;&lt;BR /&gt;regards,&lt;BR /&gt;alexander.&lt;BR /&gt;</description>
      <pubDate>Sun, 25 May 2003 04:06:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989688#M6113</guid>
      <dc:creator>Intel_C_Intel</dc:creator>
      <dc:date>2003-05-25T04:06:52Z</dc:date>
    </item>
    <item>
      <title>Re: Lock-Free using cmpxchg8b...</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989689#M6114</link>
      <description>How ya doin Alex.  =)&lt;BR /&gt;&lt;BR /&gt;You should post a link to your portable reference counting standards here.&lt;BR /&gt;&lt;BR /&gt;I?ve been using C for ever ( I even use it for light-weight COM from time to time ), and my C++ is a bit odd I think. However I do use C++ to quickly get a project concept up and running from time to time. Then I re-write the production code with C.&lt;BR /&gt;&lt;BR /&gt;;)&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;To kdmitry:&lt;BR /&gt;&lt;BR /&gt;? thread checkers ? and lock-free algos don't go to well together.  ;)&lt;BR /&gt;&lt;BR /&gt;For one, they will all probably detect the error you think you found. When in fact, its completely normal behavior for a lock-free algo. Read the first couple of pages on the SMR algo, and it will try to explain how lock-free algos use memory.&lt;BR /&gt;&lt;BR /&gt;Basically? If one thread deletes a node, there may be other threads using that node inside their CAS sections. So a thread checker would not like that. =)&lt;BR /&gt;&lt;BR /&gt;You can use SMR or AtomicPtr to overcome this side effect of most lock-free algo?s.&lt;BR /&gt;&lt;BR /&gt;Again, I am sorry for my earlier post.</description>
      <pubDate>Sun, 25 May 2003 07:10:11 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989689#M6114</guid>
      <dc:creator>Intel_C_Intel</dc:creator>
      <dc:date>2003-05-25T07:10:11Z</dc:date>
    </item>
    <item>
      <title>Re: Lock-Free using cmpxchg8b...</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989690#M6115</link>
      <description>I've rapidly checked your examples, it appears that in order to serialize the access, threads wait in a spin-loop until the extra "anti ABA" count data member is OK (I'll call this very bevahior a *lock* but that's another story).&lt;BR /&gt;&lt;BR /&gt;With many threads competing for the same data structure these spin-loops are likely hotspots at runtime and probably using the PAUSE instruction will lead to speedups when runing on HT capable P4s. For forthcoming Prescott targets have a look at MONITOR/MWAIT in PNIs they should allow you to optimize your code without wasting all these cycles waiting in loops to aquire the lock (or "counter" if you prefer this name).&lt;BR /&gt;&lt;BR /&gt;btw you use a mfence instruction in the SSE compilation path, it has potentially major side effects on other applications that make heavy use of streaming stores or write to WC/non-cachable memory like a gfx card framebuffer, do you really need these memory barriers ? if so, it looks like a major drawback of your code&lt;BR /&gt;</description>
      <pubDate>Sun, 25 May 2003 16:41:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989690#M6115</guid>
      <dc:creator>bronx</dc:creator>
      <dc:date>2003-05-25T16:41:53Z</dc:date>
    </item>
    <item>
      <title>Re: Lock-Free using cmpxchg8b...</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989691#M6116</link>
      <description>The defintion of lock-free is that a thread can make forward progress&lt;BR /&gt;without requiring any forward progress by other threads.  If a thread&lt;BR /&gt;is not waiting for a specific value to be set by another thread then&lt;BR /&gt;it is not blocking.  There may some looping due to contention but&lt;BR /&gt;usually the hardware synchronization guarantees forward progress&lt;BR /&gt;in the presence of contention up to a certain level, and you can&lt;BR /&gt;control the level of contention by back offs if that becomes an issue.&lt;BR /&gt;&lt;BR /&gt;What is the MONITOR/MWAIT stuff?  Pause until a memory location&lt;BR /&gt;is changed?  I've always thought it would be useful to have alpha&lt;BR /&gt;like addition to its LL/SC, a pause condtional that would wait until&lt;BR /&gt;the reserve was broken by a store by another processor.  It would&lt;BR /&gt;be interruptible of course and should optionally raise an interrupt&lt;BR /&gt;in problem state so that a time slicing supervisor could intercept on&lt;BR /&gt;it and dispatch another thread.&lt;BR /&gt;&lt;BR /&gt;Joe Seigh&lt;BR /&gt;</description>
      <pubDate>Mon, 26 May 2003 03:13:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989691#M6116</guid>
      <dc:creator>jseigh2</dc:creator>
      <dc:date>2003-05-26T03:13:35Z</dc:date>
    </item>
    <item>
      <title>Re: Lock-Free using cmpxchg8b...</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989692#M6117</link>
      <description>&lt;DIV&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;/DIV&gt;&amp;gt;If a thread&lt;BR /&gt;&amp;gt; is not waiting for a specific value to be set by&lt;BR /&gt;&amp;gt; another thread then&lt;BR /&gt;&amp;gt; it is not blocking. &lt;BR /&gt;&lt;BR /&gt;sure, in the french paper posted by lockfree they fix what they call the "ABA problem" by adding new data members ("ocount", "icount") tentatively set using atomic compare&amp;amp;swap along with the pointers stuff. Until count and pointer are updated successfully, threads wait in a loop. The major problem here (particularly on SMP systems) is that at each iteration of the loop a snoop request is issued in order to check if the value has been changed by the other CPU. It leads to 128-byte L2 cache lines exchanged accross the system bus like mad. Inserting a PAUSE instruction in such cases is advised and should provide some speedups.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&amp;gt; What is the MONITOR/MWAIT stuff? Pause until a&lt;BR /&gt;&amp;gt; memory location&lt;BR /&gt;&amp;gt; is changed? &lt;BR /&gt;&lt;BR /&gt;yes, exactly. Information is still scarce, &lt;BR /&gt;&lt;BR /&gt;(see a ref. here :) &lt;A href="http://www.intel.com/cd/ids/developer/asmo-na/eng/microprocessors/ia32/pentium4/optimization/43988.htm" target="_blank"&gt;PNIs&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;but we can guess that instead of looping wasting execution slots and bus bandwith, the waiting threads will sleep until some criterion is met. They will be woken up immediately, just in time without the penalty of exiting a loop (quite high on P4s with a 20-stage pipeline)&lt;BR /&gt;&lt;BR /&gt;
&lt;P&gt;&lt;SPAN class="time_text"&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Message Edited by intel.software.network.support on &lt;SPAN class="date_text"&gt;12-09-2005&lt;/SPAN&gt; &lt;SPAN class="time_text"&gt;02:09 PM&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 26 May 2003 15:41:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989692#M6117</guid>
      <dc:creator>bronx</dc:creator>
      <dc:date>2003-05-26T15:41:16Z</dc:date>
    </item>
    <item>
      <title>Re: Lock-Free using cmpxchg8b...</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989693#M6118</link>
      <description>&amp;gt; sure, in the french paper posted by lockfree they fix&lt;BR /&gt;&amp;gt; what they call the "ABA problem" by adding new data&lt;BR /&gt;&amp;gt; members ("ocount", "icount") tentatively set using&lt;BR /&gt;&amp;gt; atomic compare&amp;amp;swap along with the pointers stuff.&lt;BR /&gt;&amp;gt; Until count and pointer are updated successfully,&lt;BR /&gt;&amp;gt; threads wait in a loop. The major problem here&lt;BR /&gt;&amp;gt; (particularly on SMP systems) is that at each&lt;BR /&gt;&amp;gt; iteration of the loop a snoop request is issued in&lt;BR /&gt;&amp;gt; order to check if the value has been changed by the&lt;BR /&gt;&amp;gt; other CPU. It leads to 128-byte L2 cache lines&lt;BR /&gt;&amp;gt; exchanged accross the system bus like mad. Inserting&lt;BR /&gt;&amp;gt; a PAUSE instruction in such cases is advised and&lt;BR /&gt;&amp;gt; should provide some speedups.&lt;BR /&gt;&lt;BR /&gt;When I first read the french paper, I too was concerned about spin-waits affecting performance and thought to recommend using PAUSE for HT implementations.  Of course, there may be some unwritten requirement that the data structures don't have much contention in the first place.  That is, even with hundreds or thousands of threads, there may only be two or three at any time trying to update the data structures.  &lt;BR /&gt;&lt;BR /&gt;In any event, one should hope that the lock-free objects would have low contention to justify the few spin-waits that would be needed while not paying the heavy cost of using some synchronization object and attendant API.  With most everything in life there would be some tradeoff point where use of a sync object would yield better throughput than the lock-free implementation.  I'm guessing that this would be some function of number of threads, size of objects, contention ratio, etc.&lt;BR /&gt;&lt;BR /&gt;-- clay&lt;BR /&gt;&lt;BR /&gt;P.S.  Thanks for all remaining (mostly) calm.  These lock-free objects are new ideas for most people and those actively espousing their virtues will need a bit of patience until the rest of us catch up with their level of understanding. :-)</description>
      <pubDate>Wed, 28 May 2003 06:07:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989693#M6118</guid>
      <dc:creator>ClayB</dc:creator>
      <dc:date>2003-05-28T06:07:41Z</dc:date>
    </item>
    <item>
      <title>Re: Lock-Free using cmpxchg8b...</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989694#M6119</link>
      <description>I'm embarrassed to admit it (esp. since I work for Intel), but I'm having a bit of trouble locating any documentation on the cmp8xchg16b instruction as proposed by lockfree.  I've seen some contradictions in some of the papers that have been pointed to in this thread and was hoping to see what the details were on this instruction that might confirm some assumptions that I've made on this topic.&lt;BR /&gt;&lt;BR /&gt;I haven't dealt with assembly language in any depth since I was programming 6502-based Apple IIs way, way back.  ("In my days we didn't have any fancy text editors or hoity-toity Windows.  To correct a program back then you had to punch new holes in Hollerith cards with a dull pocket knife...and we liked it!" :-)  So, I'm not up on the latest documentation for assembly language instructions.&lt;BR /&gt;&lt;BR /&gt;Thanks in advance.&lt;BR /&gt;&lt;BR /&gt;-- clay&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Wed, 28 May 2003 06:17:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989694#M6119</guid>
      <dc:creator>ClayB</dc:creator>
      <dc:date>2003-05-28T06:17:55Z</dc:date>
    </item>
    <item>
      <title>Re: Lock-Free using cmpxchg8b...</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989695#M6120</link>
      <description>cmp8xchg16 is documented in the Itanium architecture software developers manual.  It's basically a compare 8 bytes, swap 16 bytes.  There's some question as to why cmpxchg8b was not ported to Itanium as compare 16 bytes, swap 16 bytes.  atomic_ptr uses a doubleword compare and swap.   While I can port atomic_ptr to Itanium, there's two or three things I can do, it's giving lockfree(aka sender) fits as he's trying to put into atomic_ptr some ABA free stuff.  While I don't think there's any benefit to having the ABA stuff in atomic_ptr, I don't want to implement atomic_ptr in a way that makes it impossible for him to do that if that is what he wants to do.</description>
      <pubDate>Wed, 28 May 2003 08:16:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989695#M6120</guid>
      <dc:creator>jseigh2</dc:creator>
      <dc:date>2003-05-28T08:16:26Z</dc:date>
    </item>
    <item>
      <title>Re: Lock-Free using cmpxchg8b...</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989696#M6121</link>
      <description>&amp;gt; objects would have low contention to justify the few&lt;BR /&gt;&amp;gt; spin-waits that would be needed while not paying the&lt;BR /&gt;&amp;gt; heavy cost of using some synchronization object and&lt;BR /&gt;&amp;gt; attendant API.  With most everything in life there&lt;BR /&gt;&lt;BR /&gt;ok, but how about a general purpose lightweight lock, like one based on inlined code without call to the OS ?&lt;BR /&gt;a "cmpxchg8b" based loop will do the trick, for the stacks and queues examples it will require one CAS loop on entry and another CAS loop on exit (+ ultra simple code in between, *easily maintainable*), instead of a single convoluted CAS2 loop, I will be very interested to compare actual timings (btw how can we implement CAS2 for IA-32 targets ?)&lt;BR /&gt;</description>
      <pubDate>Wed, 28 May 2003 16:01:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Lock-Free-using-cmpxchg8b/m-p/989696#M6121</guid>
      <dc:creator>bronx</dc:creator>
      <dc:date>2003-05-28T16:01:26Z</dc:date>
    </item>
  </channel>
</rss>

