Your threading tips are sound. The thread-aware memory pools are nice. =)
Here are some of my advanced threading tips: =)
You should spice up your threading site by adding a "Rock-Solid Lock-Free Algos w/ cmpxchg8b & cmp8xchg16b Tutorials" section.
You Intel guys really need to promote the cmpxchg8b and cmp8xchg16b (for Merced, I think) opcodes. When you code lock-free algos using these, they make advanced threading a work of art. They are not just for creating locks, you know. =)
I have implemented a 100% ABA-proof lock-free IBM FreeList with your cmpxchg8b opcode:
Now, how do you implement those lock-free algos for Pentiums, Xeons, and Itanium processors?
That's where the cmpxchg8b and cmp8xchg16b opcodes come into play.
They are CAS operations that swap memory twice the size of the system pointer. Very useful indeed.
You can use this to swap a pointer in a fashion that avoids the ABA problem the paper talks about. You do this by atomically attaching an ABA id to the target shared memory and comparing that id on every CAS against the shared memory. It effectively turns CAS into LL/SC.
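To make that counter idea concrete, here is a minimal sketch using C11 atomics instead of raw cmpxchg8b. It assumes nodes live in a preallocated pool addressed by 32-bit indices, so an index plus a 32-bit pop counter fit in one 64-bit CAS (exactly what cmpxchg8b provides on IA-32); the names `node_pool`, `pack`, `push`, and `pop` are illustrative, not from any posted library.

```c
/* ABA-tagged lock-free stack sketch (assumption: 32-bit node
 * indices into a fixed pool, so {counter:32 | index:32} fits
 * in one 64-bit CAS word). */
#include <stdatomic.h>
#include <stdint.h>

#define NIL 0xFFFFFFFFu

struct node { uint32_t next; int value; };
static struct node node_pool[8];              /* preallocated nodes */
static _Atomic uint64_t top = (uint64_t)NIL;  /* counter 0, index NIL */

static uint64_t pack(uint32_t idx, uint32_t count) {
    return ((uint64_t)count << 32) | idx;
}

static void push(uint32_t idx) {
    uint64_t old = atomic_load(&top);
    do {
        node_pool[idx].next = (uint32_t)old;   /* link to current top */
    } while (!atomic_compare_exchange_weak(&top, &old,
                 pack(idx, (uint32_t)(old >> 32))));
}

static uint32_t pop(void) {
    uint64_t old = atomic_load(&top);
    for (;;) {
        uint32_t idx = (uint32_t)old;
        if (idx == NIL) return NIL;            /* stack is empty */
        uint32_t next = node_pool[idx].next;
        /* Bump the counter so a concurrent pop/push of the same
         * index ("ABA") makes this CAS fail and we retry. */
        if (atomic_compare_exchange_weak(&top, &old,
                pack(next, (uint32_t)(old >> 32) + 1)))
            return idx;
    }
}
```

The counter in the high half is what turns the plain CAS into the LL/SC-like check described above: even if the same index comes back, the counter will have moved.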
In fact, the Windows API has SLists. They happen to use the exact same LIFO algo the paper describes. So Microsoft does use lock-free stuff in their kernels, and even exports one for the "general" public. =)
Here are some 64-bit compare-and-swaps from my library that you can use to implement the lock-free algos on 32-bit systems:
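The poster's library code is not reproduced in the thread, but a portable stand-in can be sketched with C11 atomics; on IA-32, compilers lower a 64-bit compare-exchange to LOCK CMPXCHG8B. The `cas64` name is mine, not the library's.

```c
/* Portable stand-in for a cmpxchg8b-style 64-bit CAS (illustrative,
 * not the poster's actual library code). Returns nonzero on success;
 * on failure, writes the value actually observed into *expected. */
#include <stdatomic.h>
#include <stdint.h>

static int cas64(_Atomic uint64_t *dest, uint64_t *expected,
                 uint64_t desired) {
    return atomic_compare_exchange_strong(dest, expected, desired);
}
```

The failure path reporting the observed value is what makes retry loops cheap: the next attempt can reuse it without an extra load.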
Thanks for posting a pointer to the paper. It helped quite a bit in understanding what you have been proposing. I had not heard of the "ABA problem" before, but this gave a good example. Of course, gaining understanding in one aspect of a topic usually brings up new questions and I'd like to post those here for anyone to answer.
But before I get to that, let me first get nitpicky. The first sentence of the paper starts with the definition: "A shared data structure is lock-free if its operations do not require mutual exclusion". I would amend that by saying "...do not require separate mutual exclusion objects to control access" since the shared queue and stack data structures clearly need mutual exclusion to get proper functionality. However, the methods that are used in the paper don't need any additional synchronization object to enforce mutual exclusion.
Back in the old days, when I was first starting my computer science education, we learned about the test-and-set instruction. The compare-and-swap (CAS) atomic operation described in the paper is a generalization of that, with the ability to compare against any value and assign any value into the memory location, rather than testing for zero and setting a '1'. I did wonder about the CAS2 operation described to assign two values atomically. Should there not be a second memory location as one of the parameters? Otherwise, how does the instruction know where it must assign the second value? Or have I missed something in the description?
My second question is whether these methods could be used in something other than a linear data structure like the LIFO and FIFO examples? Could we use them to protect elements from an array of shared structs? It seems to me that the technique is a double check algorithm that uses the fact that only a single element (a head or next pointer) needs to be modified in the existing structure. If I needed to modify several parts of an object (e.g., x-, y-, and z-coordinates, speed, plus other physical properties of something moving through space), could a lock-free mechanism be created to protect access to this struct?
I haven't really devoted any time trying to come up with an answer. I was hoping that someone more familiar with the ideas of lock-free techniques would have already done something like this and can share the more general method(s) with all of the Threading Forum participants.
Lock-free algos do make threads contend, but the overhead of lock-free contention is not even close to that of a critical section. ;)
I consider a shared object lock-free if:
1. It doesn't use any OS provided critical sections.
2. It doesn't make contending threads wait in a line.
The cmpxchg8b instruction swaps into a 64-bit destination, so you pass cmpxchg8b two 32-bit values to swap into the destination: one becomes the low part, and the other the high part of the 64-bit destination.
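The low/high split can be modeled portably like this; in real cmpxchg8b the new low half travels in EBX and the new high half in ECX, but here both are plain parameters. The helper names `make_pair`, `lo`, `hi`, and `cas_pair` are illustrative.

```c
/* Modeling cmpxchg8b's two 32-bit halves portably (helper names
 * are illustrative): both halves are compared and replaced in one
 * atomic 64-bit operation. */
#include <stdatomic.h>
#include <stdint.h>

static uint64_t make_pair(uint32_t low, uint32_t high) {
    return ((uint64_t)high << 32) | low;
}
static uint32_t lo(uint64_t v) { return (uint32_t)v; }
static uint32_t hi(uint64_t v) { return (uint32_t)(v >> 32); }

/* Atomically replace both halves only if both still match. */
static int cas_pair(_Atomic uint64_t *dest,
                    uint32_t old_lo, uint32_t old_hi,
                    uint32_t new_lo, uint32_t new_hi) {
    uint64_t expected = make_pair(old_lo, old_hi);
    return atomic_compare_exchange_strong(dest, &expected,
                                          make_pair(new_lo, new_hi));
}
```

On a 32-bit system the low half can hold a pointer and the high half a version counter, which is the CAS2 usage the paper's lifo-pop relies on.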
It is useful for swapping pointers... If it were an array of pointers, then you could CAS the cells to new (updated) objects. I haven't worked with lock-free algos that implement a dynamic array yet.
Here is a paper that I find really useful for accessing large collections (arrays, whatever) concurrently:
I use the paper's rock-solid method to protect data when I can't think up a lock-free solution. =)
The lock-free collections that I presented are very good at inter-thread comm.
I don't really work with gaming servers, but...
Say you had a simple SMP geometry server with two thread pools: a Geometry pool and a Socket pool.
The Socket pool receives incoming data about client character moves and other various stuff. The Socket pool parses the messages and sends them to the Geometry pool via a lock-free FIFO queue for some math.
The Geometry pool does some collision detection, updates world objects around the player moves, etc., then sends world information back to the Socket pool via a lock-free collection.
The Socket pool then distributes the updated world data to its connected tiers.
Well, that would beat the pants off ANY other SMP system that used "normal, blocking" FIFO queues for updating world objects under load.
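The Socket-to-Geometry hand-off described above can be sketched with a bounded single-producer/single-consumer ring buffer. This is a simplification (the thread pools in the post would need a multi-producer/multi-consumer queue like the paper's FIFO), and `move_msg`, `spsc_queue`, `q_push`, and `q_pop` are invented names for illustration.

```c
/* Minimal bounded SPSC queue, like one a Socket thread could use to
 * hand parsed moves to a Geometry thread. Simplifying assumption:
 * exactly one producer and one consumer; names are illustrative. */
#include <stdatomic.h>
#include <stddef.h>

#define QCAP 256  /* capacity; power of two */

struct move_msg { int player; float x, y; };

struct spsc_queue {
    struct move_msg slots[QCAP];
    _Atomic size_t head;   /* next slot to pop  (consumer-owned) */
    _Atomic size_t tail;   /* next slot to push (producer-owned) */
};

static int q_push(struct spsc_queue *q, struct move_msg m) {
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    if (t - atomic_load_explicit(&q->head, memory_order_acquire) == QCAP)
        return 0;                       /* full: caller retries later */
    q->slots[t % QCAP] = m;
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return 1;
}

static int q_pop(struct spsc_queue *q, struct move_msg *out) {
    size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    if (h == atomic_load_explicit(&q->tail, memory_order_acquire))
        return 0;                       /* empty */
    *out = q->slots[h % QCAP];
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return 1;
}
```

Neither side ever blocks the other: a full or empty queue is reported immediately instead of making the caller wait in line, which is the property the post is selling.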
Mr. lockfree, I read your article, but it looks like your code examples are not thread-safe. Here is the lifo-pop:
lifo-pop (lf: pointer to lifo): pointer to cell
SC1:  loop
SC2:    head = lf->top       # get the top cell of the lifo
SC2:    oc = lf->ocount      # get the pop operations count
SC3:    if head == NULL
SC4:      return NULL        # LIFO is empty
SC5:    endif
SC6:    next = head->next    # get the next cell of cell
SC7:    if CAS2 (&lf->top, head, oc, next, oc + 1)  # try to change both the top of the lifo and the pop count
SC8:      break
SC9:    endif
SC10: endloop
SC11: return head

Figure 9: lifo-pop catching the ABA problem
It looks like line SC6 can generate an exception in a multithreaded environment if some other thread assigns NULL to the head between SC3 and SC6. If you fix the example by putting line SC6 inside a try..catch construct, it will work for sure without losing performance.
I found exactly the same problem in your fifo-pop example.
You should post a link to your portable reference counting standards here.
I've been using C forever (I even use it for lightweight COM from time to time), and my C++ is a bit odd, I think. However, I do use C++ to quickly get a project concept up and running from time to time. Then I rewrite the production code in C.
"Thread checkers" and lock-free algos don't go too well together. ;)
For one, they will all probably detect the error you think you found, when in fact it's completely normal behavior for a lock-free algo. Read the first couple of pages on the SMR algo; it tries to explain how lock-free algos use memory.
Basically, if one thread deletes a node, other threads may still be using that node inside their CAS sections. A thread checker would not like that. =)
You can use SMR or AtomicPtr to overcome this side effect of most lock-free algos.
I've quickly checked your examples; it appears that in order to serialize access, threads wait in a spin loop until the extra "anti-ABA" count data member is OK (I'll call this very behavior a *lock*, but that's another story).
With many threads competing for the same data structure, these spin loops are likely hotspots at runtime, and using the PAUSE instruction will probably lead to speedups when running on HT-capable P4s. For forthcoming Prescott targets, have a look at MONITOR/MWAIT in the PNIs; they should allow you to optimize your code without wasting all these cycles waiting in loops to acquire the lock (or "counter" if you prefer that name).
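The PAUSE suggestion can be sketched as a CAS retry loop that hints the core between failed attempts. The `cpu_relax` wrapper is a hypothetical name: on x86 it emits the PAUSE hint, elsewhere it compiles to nothing so the sketch stays portable.

```c
/* CAS retry loop with PAUSE in the contended path. cpu_relax and
 * add_with_pause are illustrative names, not from any posted code. */
#include <stdatomic.h>
#include <stdint.h>

static inline void cpu_relax(void) {
#if defined(__i386__) || defined(__x86_64__)
    __builtin_ia32_pause();   /* PAUSE: calms the spin on HT P4s */
#endif                        /* non-x86: plain busy retry */
}

/* Atomically add to a 64-bit counter, pausing between failed CASes. */
static uint64_t add_with_pause(_Atomic uint64_t *p, uint64_t delta) {
    uint64_t old = atomic_load(p);
    while (!atomic_compare_exchange_weak(p, &old, old + delta))
        cpu_relax();          /* contended: hint the core, then retry */
    return old + delta;
}
```

PAUSE does not change the algorithm's correctness; it only reduces the bus and pipeline cost of the retry loop while some other thread wins the CAS.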
By the way, you use an mfence instruction in the SSE compilation path. It has potentially major side effects on applications that make heavy use of streaming stores or that write to WC/non-cacheable memory like a gfx card framebuffer. Do you really need these memory barriers? If so, it looks like a major drawback of your code.
The definition of lock-free is that a thread can make forward progress without requiring any forward progress by other threads. If a thread is not waiting for a specific value to be set by another thread, then it is not blocking. There may be some looping due to contention, but usually the hardware synchronization guarantees forward progress in the presence of contention up to a certain level, and you can control the level of contention with backoffs if that becomes an issue.
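The backoff knob mentioned here can be sketched as a retry loop whose idle spin doubles after each failed CAS, up to a cap. The function name and the spin limits are arbitrary illustrative choices, not a tuned implementation.

```c
/* Bounded exponential backoff on CAS failure (illustrative sketch;
 * the delay limits are arbitrary, not tuned values). */
#include <stdatomic.h>
#include <stdint.h>

static uint64_t add_with_backoff(_Atomic uint64_t *p, uint64_t delta) {
    unsigned delay = 1;
    uint64_t old = atomic_load(p);
    while (!atomic_compare_exchange_weak(p, &old, old + delta)) {
        /* Lost the race: idle briefly, doubling the wait each time
         * so heavy contention spreads the retries apart. */
        for (volatile unsigned i = 0; i < delay; i++) { }
        if (delay < 1024) delay <<= 1;   /* double, capped */
    }
    return old + delta;
}
```

Unlike a lock, a thread here never waits for another thread to do anything specific; it only slows its own retry rate, which is why backoff preserves the lock-free property.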
What is the MONITOR/MWAIT stuff? Pause until a memory location is changed? I've always thought it would be useful to have an Alpha-like addition to its LL/SC: a pause-conditional that would wait until the reservation was broken by a store from another processor. It would be interruptible, of course, and should optionally raise an interrupt in problem state so that a time-slicing supervisor could intercept it and dispatch another thread.
> If a thread is not waiting for a specific value to be set by another thread then it is not blocking.
Sure, in the French paper posted by lockfree they fix what they call the "ABA problem" by adding new data members ("ocount", "icount") tentatively set using atomic compare&swap along with the pointers. Until count and pointer are updated successfully, threads wait in a loop. The major problem here (particularly on SMP systems) is that at each iteration of the loop a snoop request is issued to check whether the value has been changed by the other CPU. It leads to 128-byte L2 cache lines being exchanged across the system bus like mad. Inserting a PAUSE instruction in such cases is advised and should provide some speedups.
> What is the MONITOR/MWAIT stuff? Pause until a memory location is changed?
But we can guess that instead of looping, wasting execution slots and bus bandwidth, the waiting threads will sleep until some criterion is met. They will be woken up immediately, just in time, without the penalty of exiting a loop (quite high on P4s with a 20-stage pipeline).
Message Edited by intel.software.network.support on 12-09-2005 02:09 PM
> sure, in the French paper posted by lockfree they fix what they call the "ABA problem" by adding new data members ("ocount", "icount") tentatively set using atomic compare&swap along with the pointers. Until count and pointer are updated successfully, threads wait in a loop. The major problem here (particularly on SMP systems) is that at each iteration of the loop a snoop request is issued to check whether the value has been changed by the other CPU. It leads to 128-byte L2 cache lines exchanged across the system bus like mad. Inserting a PAUSE instruction in such cases is advised and should provide some speedups.
When I first read the French paper, I too was concerned about spin-waits affecting performance and thought to recommend using PAUSE for HT implementations. Of course, there may be some unwritten requirement that the data structures don't have much contention in the first place. That is, even with hundreds or thousands of threads, there may only be two or three at any time trying to update the data structures.
In any event, one should hope that the lock-free objects would have low contention to justify the few spin-waits that would be needed while not paying the heavy cost of using some synchronization object and attendant API. With most everything in life there would be some tradeoff point where use of a sync object would yield better throughput than the lock-free implementation. I'm guessing that this would be some function of number of threads, size of objects, contention ratio, etc.
P.S. Thanks for all remaining (mostly) calm. These lock-free objects are new ideas for most people and those actively espousing their virtues will need a bit of patience until the rest of us catch up with their level of understanding. :-)
I'm embarrassed to admit it (esp. since I work for Intel), but I'm having a bit of trouble locating any documentation on the cmp8xchg16b instruction as proposed by lockfree. I've seen some contradictions in some of the papers that have been pointed to in this thread and was hoping to see what the details were on this instruction that might confirm some assumptions that I've made on this topic.
I haven't dealt with assembly language in any depth since I was programming 6502-based Apple IIs way, way back. ("In my days we didn't have any fancy text editors or hoity-toity Windows. To correct a program back then you had to punch new holes in Hollerith cards with a dull pocket knife...and we liked it!" :-) So, I'm not up on the latest documentation for assembly language instructions.
cmp8xchg16 is documented in the Itanium architecture software developer's manual. It's basically a compare-8-bytes, swap-16-bytes operation. There's some question as to why cmpxchg8b was not ported to Itanium as a compare 16 bytes, swap 16 bytes. atomic_ptr uses a double-word compare-and-swap. While I can port atomic_ptr to Itanium, there are two or three ways I could do it, and it's giving lockfree (aka sender) fits as he's trying to put some ABA-free stuff into atomic_ptr. While I don't think there's any benefit to having the ABA stuff in atomic_ptr, I don't want to implement atomic_ptr in a way that makes it impossible for him to do that, if that is what he wants to do.
> objects would have low contention to justify the few spin-waits that would be needed while not paying the heavy cost of using some synchronization object and attendant API. With most everything in life there
OK, but how about a general-purpose lightweight lock, like one based on inlined code without a call to the OS? A cmpxchg8b-based loop will do the trick. For the stack and queue examples it will require one CAS loop on entry and another CAS loop on exit (plus ultra-simple, *easily maintainable* code in between), instead of a single convoluted CAS2 loop. I would be very interested to compare actual timings (btw, how can we implement CAS2 for IA-32 targets?)