Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Lock-Free using cmpxchg8b...

Intel_C_Intel
Employee
3,324 Views
Your threading tips are sound. The thread aware memory pools are nice. =)


Here are some of my advanced threading tips: =)


You should spice up your threading site by adding a:

" Rock-Solid Lock-Free Algo's /w cmpxchg8b & cmp8xchg16b Tutorials " additon to your site.


You Intel guys really need to promote the cmpxchg8b and the cmp8xchg16b ( for Merced I think ) opcodes. When you code lock-free algo's using these, they make advanced threading a work of art. They are not just for creating locks you know. =)


I have implemented a 100% ABA proof lock-free IBM FreeList with your cmpxchg8b opcode:

FreeList.c


Joe Seigh and I have developed A 100% ABA proof " real " lock-free atomic pointer and I did a lock-free fifo queue with it:

AtomicPtrABATest.zip

The AtomicPtr combines 100% ABA proof lock-free algos with the ability for a thread to reference a shared pointer without having to already own a marshaled copy!


Also, look at my advanced fifo queue code:

Queue.c

This code is very " thread " friendly. Put this code up against a " normal " fifo shared queue, and it will blow it away.

You should add some similar code examples on the Intel threading sites, or just add a link to my page. ;)


What do you threading masters think about it?

;)

Thank you.
0 Kudos
30 Replies
netdevil
Beginner
2,142 Views
Hmm,

I've used locks (as in "lock xchg" and "lock inc", et al), but I'm not familiar with cmpxchg8b and the likes. Can someone explain them to me in a little more detail, and how best to use them?

Thanks!
-Dale
0 Kudos
Intel_C_Intel
Employee
2,142 Views
This paper describes some rock-solid lock-free collections ( LIFO & FIFO ) that can be used for very SMP friendly inter-thread & process communication.

Lock-Free Collection PDF

After reading that, you should be on track.

Now? How to implement those lock-free alog?s for Pentium?s, Xeon?s, and Itanium processors?

That?s where the cmpxchg8b and cmp8xchg16b opcodes come in to play.

They are CAS operations that swap memory sizes double that of the system pointer size. Very usefull indeed.

You can use this to swap a pointer in a fashion that avoids the ABA problem that the paper talks about. You do this by atomically setting a read ABA id on the target shared memory, and comparing the read id on any CAS on the shared memory. It turns CAS into a LL/SC.

In fact, the Windows API has SList's. They happen to use the exact same lifo algo the paper describes. So Microsoft does use lock-free stuff in there kernals, and even export one for the " general " public. =)

Here are some 64-bit compare and swap's from my library you can use to implement the lock-free algos for 32-bit systems:

Atomic.h

The #ifdefs should be self-explanitory.

A 64-bit version ( Itanium ) would use the cmp8xchg16b opcode.

Lock-Free are what SMP systems thrive on. ;)

PowerPC's can use the algo's with LL(reserved)/SC, any processor with a LL(reserved)/SC can do lock-free collections.
0 Kudos
ClayB
New Contributor I
2,142 Views
lockfree -

Thanks for posting a pointer to the paper. It helped quite a bit in understanding what you have been proposing. I had not heard of the "ABA problem" before, but this gave a good example. Of course, gaining understanding in one aspect of a topic usually brings up new questions and I'd like to post those here for anyone to answer.

But before I get to that, let me first get nitpicky. The first sentence of the paper starts with the definition: "A shared data structure is lock-free if its operations do not require mutual exclusion". I would amend that by saying "...do not require separate mutual exclusion objects to control access" since the shared queue and stack data structures clearly need mutual exclusion to get proper functionality. However, the methods that are used in the paper don't need any additional synchronization object to enforce mutual exclusion.

Back in the old days, when I was first starting my computer science education, we learned about the test and swap instruction. The compare and swap (CAS) atomic operation described in the paper is a generalization of that with the ability to compare against any value and assign any value into the memory location, rather than testing for zero and setting a '1'. I did wonder about the CAS2 operation described to assign two values atomically. Should there not be a second memory location as one of the parameters? Otherwise, where does the instruction know where it must assign the second value? Or have I missed something in the description?

My second question is whether these methods could be used in something other than a linear data structure like the LIFO and FIFO examples? Could we use them to protect elements from an array of shared structs? It seems to me that the technique is a double check algorithm that uses the fact that only a single element (a head or next pointer) needs to be modified in the existing structure. If I needed to modify several parts of an object (e.g., x-, y-, and z-coordinates, speed, plus other physical properties of something moving through space), could a lock-free mechanism be created to protect access to this struct?

I haven't really devoted any time trying to come up with an answer. I was hoping that someone more familiar with the ideas of lock-free techniques would have already done something like this and can share the more general method(s) with all of the Threading Forum participants.

-- clay
0 Kudos
Intel_C_Intel
Employee
2,143 Views
Lock-free algo's do make threads contend, but the overhead of lock-free contention is not even close to the likes of a critical section. ;)

I consider a shared object lock-free if:

1. It doesn't use any OS provided critical sections.

2. It doesn't make contending threads wait in a line.


The cmpxchg8b instruction swaps into a 64-bit destination. So, if you pass cmpxchg8b two 32-bit values to swap into the destination. One will be the low, and one will be the high part of the 64-bit destination.


It is useful in swapping pointers... If it was an array of pointers then you could CAS the cells to new ( updated ) objects. I haven?t worked with lock-free algo?s that implement a dynamic-array yet.


Here is a paper that I find really useful for accessing large collections ( arrays, whatever ) concurrently:

MSDN Magazine, Concurrent Access

I use the papers rock-solid method to protect data when I can?t think up a lock-free solution. =)


The lock-free collections that I presented are very good at inter-thread comm.


I don?t work really work with Gaming Servers, but?

Say you had a simple SMP Geometry Server, which had two thread pools. A Geometry and a Socket thread pool.

The Socket Pool received incoming data about client character moves,and other various stuff. The Socket Pool parses the messages, and sends them to the Geometry Pool via. lock-free FIFO queue for some math.

The Geometry Pool does some collision detection, updates world objects around the player moves, ect? The sends back world information to the Socket Pool via. lock-free collection.

The Socket Pool then distributes the updated world data to it?s connected tiers.

Well, that would beat the pants off ANY other SMP system that used ? normal, blocking ? FIFO queues for updating world objects under load.

=)

Does that sound useful for a game server at all?
0 Kudos
Intel_C_Intel
Employee
2,143 Views
Mr. lockfree,
I read your article, but looks like your code examples are not thread-safe. Here is the lifo-pop:

lifo-pop (lf: pointer to lifo): pointer to cell
SC1: loop
SC2: head = lf->top # get the top cell of the lifo
SC2: oc = lf->ocount # get the pop operations count
SC3: if head == NULL
SC4: return NULL # LIFO is empty
SC5: endif
SC6: next = head->next # get the next cell of cell
SC7: if CAS2 (&lf->top, head, oc, next, oc + 1) # try to change both the top of the lifo and pop count
SC8: break
SC9: endif
SC10: endloop
SC11: return head
Figure 9: lifo-pop catching the ABA problem

It looks like the line SC6 can generate exception in multithreaded environment, if some other thread will assine NULL to the head between SC3 and SC6. If you will fix the example by putting the line SC6 inside try..catch construct, it will work for sure without loosing the performance.

The same problem exactly I found in your fifo-pop example.
0 Kudos
Intel_C_Intel
Employee
2,143 Views
One more note for Mr. lockfree:
Proposed algorithm can work only on SMP systems with shared data cache. I think this also should be noted in your work.
0 Kudos
Intel_C_Intel
Employee
2,143 Views
You have no idea about what you are talking about.

You need to learn about lock-free algo's.

You need to learn about how lock-free algo use memory.

You need to learn about how to use lock-free algo's correctly.

You need to learn why you can't free nodes in lock-free algo's without atomic_ptr or SMR.

IBM SMR Algo

AtomicPtr.c


The algos ARE THREAD SAFE period, Microsoft API exposes a lock-free LIFO stack, the SList API:



Here are both the algos in the paper i cited in working code:

FreeList.c

FreeQueue.c

You don't need a SMP box to runs these, what in the world made you think that bulls%I^?

Before you trash a set of algo's that have been working for many years, you must do some homework. Or you risk looking like a fool.
0 Kudos
Intel_C_Intel
Employee
2,143 Views
Whoops, I forgot to add the Microsoft SList link...

Interlocked SList

Disassemble the SList API on an IA-32 processor, and you will see that it is lock-free. It also uses the exact algo in the paper I cited.

I am sorry if I came across as an ass. I think I did.

I thought that you said there was a bug for sure, instead you were actually asking me if there might me a bug.

Sorry for misreading your post.
0 Kudos
Intel_C_Intel
Employee
2,143 Views
To Mr. lockfree

As one that already coded industry-level multithreaded algorithms for many-many years

:-) :-) :-)
0 Kudos
Intel_C_Intel
Employee
2,143 Views
By the way, did you check your algorithms with Intel Thread Checker ?
0 Kudos
Intel_C_Intel
Employee
2,143 Views
> What do you threading masters think about it?

Take a C++ class, Sender. ;-)

regards,
alexander.
0 Kudos
Intel_C_Intel
Employee
2,143 Views
How ya doin Alex. =)

You should post a link to your portable reference counting standards here.

I?ve been using C for ever ( I even use it for light-weight COM from time to time ), and my C++ is a bit odd I think. However I do use C++ to quickly get a project concept up and running from time to time. Then I re-write the production code with C.

;)


To kdmitry:

? thread checkers ? and lock-free algos don't go to well together. ;)

For one, they will all probably detect the error you think you found. When in fact, its completely normal behavior for a lock-free algo. Read the first couple of pages on the SMR algo, and it will try to explain how lock-free algos use memory.

Basically? If one thread deletes a node, there may be other threads using that node inside their CAS sections. So a thread checker would not like that. =)

You can use SMR or AtomicPtr to overcome this side effect of most lock-free algo?s.

Again, I am sorry for my earlier post.
0 Kudos
bronx
Beginner
2,143 Views
I've rapidly checked your examples, it appears that in order to serialize the access, threads wait in a spin-loop until the extra "anti ABA" count data member is OK (I'll call this very bevahior a *lock* but that's another story).

With many threads competing for the same data structure these spin-loops are likely hotspots at runtime and probably using the PAUSE instruction will lead to speedups when runing on HT capable P4s. For forthcoming Prescott targets have a look at MONITOR/MWAIT in PNIs they should allow you to optimize your code without wasting all these cycles waiting in loops to aquire the lock (or "counter" if you prefer this name).

btw you use a mfence instruction in the SSE compilation path, it has potentially major side effects on other applications that make heavy use of streaming stores or write to WC/non-cachable memory like a gfx card framebuffer, do you really need these memory barriers ? if so, it looks like a major drawback of your code
0 Kudos
jseigh2
Beginner
2,143 Views
The defintion of lock-free is that a thread can make forward progress
without requiring any forward progress by other threads. If a thread
is not waiting for a specific value to be set by another thread then
it is not blocking. There may some looping due to contention but
usually the hardware synchronization guarantees forward progress
in the presence of contention up to a certain level, and you can
control the level of contention by back offs if that becomes an issue.

What is the MONITOR/MWAIT stuff? Pause until a memory location
is changed? I've always thought it would be useful to have alpha
like addition to its LL/SC, a pause condtional that would wait until
the reserve was broken by a store by another processor. It would
be interruptible of course and should optionally raise an interrupt
in problem state so that a time slicing supervisor could intercept on
it and dispatch another thread.

Joe Seigh
0 Kudos
bronx
Beginner
2,143 Views
>If a thread
> is not waiting for a specific value to be set by
> another thread then
> it is not blocking.

sure, in the french paper posted by lockfree they fix what they call the "ABA problem" by adding new data members ("ocount", "icount") tentatively set using atomic compare&swap along with the pointers stuff. Until count and pointer are updated successfully, threads wait in a loop. The major problem here (particularly on SMP systems) is that at each iteration of the loop a snoop request is issued in order to check if the value has been changed by the other CPU. It leads to 128-byte L2 cache lines exchanged accross the system bus like mad. Inserting a PAUSE instruction in such cases is advised and should provide some speedups.


> What is the MONITOR/MWAIT stuff? Pause until a
> memory location
> is changed?

yes, exactly. Information is still scarce,

(see a ref. here :) PNIs

but we can guess that instead of looping wasting execution slots and bus bandwith, the waiting threads will sleep until some criterion is met. They will be woken up immediately, just in time without the penalty of exiting a loop (quite high on P4s with a 20-stage pipeline)

Message Edited by intel.software.network.support on 12-09-2005 02:09 PM

0 Kudos
ClayB
New Contributor I
2,143 Views
> sure, in the french paper posted by lockfree they fix
> what they call the "ABA problem" by adding new data
> members ("ocount", "icount") tentatively set using
> atomic compare&swap along with the pointers stuff.
> Until count and pointer are updated successfully,
> threads wait in a loop. The major problem here
> (particularly on SMP systems) is that at each
> iteration of the loop a snoop request is issued in
> order to check if the value has been changed by the
> other CPU. It leads to 128-byte L2 cache lines
> exchanged accross the system bus like mad. Inserting
> a PAUSE instruction in such cases is advised and
> should provide some speedups.

When I first read the french paper, I too was concerned about spin-waits affecting performance and thought to recommend using PAUSE for HT implementations. Of course, there may be some unwritten requirement that the data structures don't have much contention in the first place. That is, even with hundreds or thousands of threads, there may only be two or three at any time trying to update the data structures.

In any event, one should hope that the lock-free objects would have low contention to justify the few spin-waits that would be needed while not paying the heavy cost of using some synchronization object and attendant API. With most everything in life there would be some tradeoff point where use of a sync object would yield better throughput than the lock-free implementation. I'm guessing that this would be some function of number of threads, size of objects, contention ratio, etc.

-- clay

P.S. Thanks for all remaining (mostly) calm. These lock-free objects are new ideas for most people and those actively espousing their virtues will need a bit of patience until the rest of us catch up with their level of understanding. :-)
0 Kudos
ClayB
New Contributor I
2,143 Views
I'm embarrassed to admit it (esp. since I work for Intel), but I'm having a bit of trouble locating any documentation on the cmp8xchg16b instruction as proposed by lockfree. I've seen some contradictions in some of the papers that have been pointed to in this thread and was hoping to see what the details were on this instruction that might confirm some assumptions that I've made on this topic.

I haven't dealt with assembly language in any depth since I was programming 6502-based Apple IIs way, way back. ("In my days we didn't have any fancy text editors or hoity-toity Windows. To correct a program back then you had to punch new holes in Hollerith cards with a dull pocket knife...and we liked it!" :-) So, I'm not up on the latest documentation for assembly language instructions.

Thanks in advance.

-- clay

0 Kudos
jseigh2
Beginner
2,143 Views
cmp8xchg16 is documented in the Itanium architecture software developers manual. It's basically a compare 8 bytes, swap 16 bytes. There's some question as to why cmpxchg8b was not ported to Itanium as compare 16 bytes, swap 16 bytes. atomic_ptr uses a doubleword compare and swap. While I can port atomic_ptr to Itanium, there's two or three things I can do, it's giving lockfree(aka sender) fits as he's trying to put into atomic_ptr some ABA free stuff. While I don't think there's any benefit to having the ABA stuff in atomic_ptr, I don't want to implement atomic_ptr in a way that makes it impossible for him to do that if that is what he wants to do.
0 Kudos
bronx
Beginner
2,143 Views
> objects would have low contention to justify the few
> spin-waits that would be needed while not paying the
> heavy cost of using some synchronization object and
> attendant API. With most everything in life there

ok, but how about a general purpose lightweight lock, like one based on inlined code without call to the OS ?
a "cmpxchg8b" based loop will do the trick, for the stacks and queues examples it will require one CAS loop on entry and another CAS loop on exit (+ ultra simple code in between, *easily maintainable*), instead of a single convoluted CAS2 loop, I will be very interested to compare actual timings (btw how can we implement CAS2 for IA-32 targets ?)
0 Kudos
Intel_C_Intel
Employee
2,048 Views
cmpxch8b implements the french papers CAS2.

Here is an older FIFO queue algo that I have working on my site:

Lock-Free FIFO Algo

It?s the well-known fifo mike and scott algo.



The lock-free version will beat a spin-lock protected version.

You don?t want to have threads wait in a line to access the queue.

You want all the threads in at the same time, each trying to push or pop nodes on each CAS2.



I have implemented my new lock-free system that only requires a normal CAS ( InterlockedCompareExchange ) to work, which makes it portable to any processor that has a normal CAS, or a normal LL/SC.

LockFree.c

FreeQueue.c

FreeList.c

Its proving to be perfect for balanced consumer / producer communication.

A quick description for my new algo is at the root of this thread here:

New Lock-Free Algo



Also,

You don?t need that mfence opcode after the cas?s I have in Atomic.h.

One needs a release barrier on success.

The other needs a release on success, and an acquire after failure.
0 Kudos
Reply