I'm not sure if it's the right forum to ask this question, but it seems so; sorry if I'm wrong. I've been studying lock-free algorithms extensively for some time, read many papers, group discussions, etc. My question is: is it in general possible to create lock-free dynamic data structures without garbage collection?
Let us consider the following simple example: a linked list. struct elem { //... some data elem *nxt; //... constructors, etc. };
Now, any function, traversing through the list has to access an element having the hook to the previous one. So, we have a pointer elem *tmp that points to an element of the list and we want to access tmp->nxt. So, we check if (tmp->nxt != NULL) and... we still know nothing, because there is no way to assure other thread did not delete the element between the checking and another operation we do (whatever it is). Any form of reference counting, etc. keeps this race condition problem,
Do I overlook anything? So? Is garbage collection inevitable?
Thanks in advance for any clarifications or comments. Best regards
Garbage collection is not required due to you trashing the list by ways of unprotected CAS/DCAS/other techniques.
Garbage collection is required _because_ although you can remove a node from the list (safely or unsafely) you won't know if the node is in use (without adding reference counters etc...).
Since you currently do not know if the node is in use, your recourse is to ONLY remove the node from the list, then at some time later determine the node is no longer in use (then delete - garbage collect)e.g. at position in code when you know the list is not in use or at a time when you know no thread could possibly contain a pointer to that node.
If you use a reference counter .AND. place the reference counter in the node, you cannot find the node and bump the reference count instantaneously.
You can use coding practices to circumvent this issue. (at a cost). Example:
Place reference counter in node. Place "intending to bump reference counter" counter in list header.
Then
Bump "intending to bump reference counter" counter in list header (return value tells you delete may be in progress so dec, back off and retry)
Search list for node
Bump referencecounter in node (return value tells you if node is reserved for exclusive use, if so, dec, back off, retry)
Un-bump "intending to bump reference counter" counter in list header
This requires 2 interlocked inc and 1 interlocked dec to reference the node
Releasing the node requires 1 interlocked dec
This is 4 expensive interlocked operations simply because you want any and every thread to concurrently insert/delete nodes.
Now that you know how much this costs, you might be willing to accept a less expensive round about means for adding and/or deleting nodes. Possibly one without interlocks (or with one interlock) for n * insert/deletes.
The answer depends on the rules you set up for accessing and managing the linked list.
You can set the rules up such that deletes only occure at a point in the code when no other threads are accessing the list. This would be the lowest overhead and least intrusion into your single threaded list design.
Adds to either end of the list could be done to the list using simple CAS _provide_ deletes are not occuring. By the rule above, if any thread is in a position to use the list then deletes cannot occure. Therefore CAS to either end of list need not have ABA protection and thusdo not change the node structure from single threaded model.
Concurrent inserts into the middle, deletes from the list, will require ABA protection (typically done using DCAS with pointer+counter method). This also generally requires reference counters as well. The overhead in managing a list with reference counters and ABA protection is significantly higher than with simple list. Review your requirements closely to see if you can restriction the points in the code where you are permited to delete or insert into middle.
Yes, I know that using some forms of room synchronization migh help; also the case with no deletions at all is rather simple, What I'm interested about is the case when insertion and deletion may be concurrent. You suggested DCAS. Yes, I guess it could help to atomically check if the pointer is assinged and increase the reference count... But how commonly encountered is DCAS on today architectures? What about high-level libraries? I guess it's not implemented in TBB, is it?
So, I guess, if we use only the good old CAS, we are doomed to use a garbage collector, aren't we?
Garbage collection is not required due to you trashing the list by ways of unprotected CAS/DCAS/other techniques.
Garbage collection is required _because_ although you can remove a node from the list (safely or unsafely) you won't know if the node is in use (without adding reference counters etc...).
Since you currently do not know if the node is in use, your recourse is to ONLY remove the node from the list, then at some time later determine the node is no longer in use (then delete - garbage collect)e.g. at position in code when you know the list is not in use or at a time when you know no thread could possibly contain a pointer to that node.
If you use a reference counter .AND. place the reference counter in the node, you cannot find the node and bump the reference count instantaneously.
You can use coding practices to circumvent this issue. (at a cost). Example:
Place reference counter in node. Place "intending to bump reference counter" counter in list header.
Then
Bump "intending to bump reference counter" counter in list header (return value tells you delete may be in progress so dec, back off and retry)
Search list for node
Bump referencecounter in node (return value tells you if node is reserved for exclusive use, if so, dec, back off, retry)
Un-bump "intending to bump reference counter" counter in list header
This requires 2 interlocked inc and 1 interlocked dec to reference the node
Releasing the node requires 1 interlocked dec
This is 4 expensive interlocked operations simply because you want any and every thread to concurrently insert/delete nodes.
Now that you know how much this costs, you might be willing to accept a less expensive round about means for adding and/or deleting nodes. Possibly one without interlocks (or with one interlock) for n * insert/deletes.
>>But how commonly encountered is DCAS on today architectures?
Current Intel and AMD processors have CMPXCHG8B and CMPXCHG16B instructions and CPUID means to test to see if instructions are available.
FWIW
I hold a U.S. Patent application, Publication number 20080228784, titled
Double word compare and swap implemented by using triple single word compare and swap
The technique uses CAS on three words to emulate the functional equivilent of DCAS on a machine without DCAS capability. i.e. a Wait-Free technique that for all intents and purposes appears as atomic DCAS.
When (should) thread beginning emulated DCAS get interrupted (context switched) any of the other threads can complete the emulated DCAS. Should the system have CMPXCHG8B or CMPXCHG16B instructions (whichever is required) then a hardware DCAS is used.
This would be a good candidate for a high-end library to implement a generic function for DCAS that works on systems with/without CMPXCHG8B or CMPXCHG16B instructions but uses them when present.
"Garbage collection is required _because_ although you can remove a node from the list (safely or unsafely) you won't know if the node is in use (without adding reference counters etc...)." Yes, that's what I meant. If I delete the node, it might "breakthe ground" for anotherthread. But nothing bad will happen if I only disconnect the node from nodes that link to it. Inwhich case I have to either: i) accept a memorty leak, ii) keep pointers to these disconnected nodes manually (aww...) to delete them physically later, iii) use a garbage collector.
Only the third possibility seems acceptable.
"You can use coding practices to circumvent this issue. (at a cost). Example..."
Yes, it should work that way. And yes, it sounds complicated (fascinating, but complicated ;-) ). Also, I have to admit, I don't think the papers I read did address this issue in its whole complexity.
ii and iii are the same thing. If you choose to use a provided garbage collector keep in mind that a general purpose garbage collecter has no knowledge of your code and may inadvertantly delete an in-use object. Examine their code (if possible) to assure the correct behavior.
You can overload delete OR write a template to take a pointer to arbitrary object, insert into queue (most likely FIFO) since you cannot be assured the object has a next pointer or if it is safe for you to reuse the next pointer if it does. However, your generic template may not be sufficient either.
The better route, IMHO, is to write or finda concurrent list class that specificly addresses these issues in an efficient manner.
"ii and iii are the same thing. If you choose to use a provided garbage collector keep in mind that a general purpose garbage collecter has no knowledge of your code and may inadvertantly delete an in-use object. Examine their code (if possible) to assure the correct behavior."
Yes, thay are equivalent in the sense that (ii) means implementing a garbage collector "manually" and (iii) means using the one provided by the environment. I'm not sure, how likely GCs are to be buggy or not MT-safe? I believe Java's GC can be given some credit; the problem is I'm sticking to C++...
Anyway, is my impression correct, or you're not very enthusiastic about using the non-blocking staff? Or maybe just implementing it from scratch?
A proper GC system will tend to have smart pointers (i.e. a wrapper) which include reference counters and other synchronization capabilities (exclusive use, delete pending, ...). These all come with a cost in execution time.
When you favor ease in programming over performance then go with standardized (general) GC classes. When performance is a concern, you must weigh the costs.
If you do not understand the object persistence requirements of your program it is easy to improperly use GC and this may cause problems. When you do understand the object persistence requirements it is relatively easy to ascertain if deletions can be done in times of exclusive use of a list. Or at epoch intervals of threads. e.g. each thread has an epoch interval counter. Delete can snapshot these intervals. Then when all epoch intervals have changed (after node removed from list) then no thread can possibly reference node - delete permitted. Thread epoch intervals can be maintained without CAS.
There are 2 general techniques for handling memory in lock-free algorithms.
First is GC. However GC is not a very good name for it, because (1) full-fledged GC semantics may be too much for safe-memory reclamation in the context of lock-free algorithms (too costly and unnecessary), (2) there many techniques for it and they are not limited to GC (as most people understand the term now). I (and some other people) prefer to call it PDR (Partial copy-on-write Deferred Reclamation).
Second is TSM (Type Stable Memory), with private case of not reclaiming any memory at all. Yes, it requires prevention of ABA problem.
PDR is much nicer for lock-free algorithms, because it solves not only safe operation of lock-free algorithm, but also safe memory reclamation and ABA prevention. With TSM you sometimes still have to decide how you will reclaim memory and how you will handle ABA.
Re: Any form of reference counting, etc. keeps this race condition problem
As I said, there are many PDR techniques. Plain old reference counting with basic thread-safety (which allows one to acquire another reference IFF he already has one) (boost::shared_ptr<>) is unable to solve the problem. However there is reference counting with strong thread-safety (which allows one to acquire another reference even if he does not have one yet) and it indeed solves the problem. For example see Joe Seigh's excellent atomic-ptr-plus:
Regarding PDR and GC. Specialized PDR algorithms may be so efficient so that some people do use PDR in environments with built-in GC (Java, .NET).
Regarding DWCAS. If you are targeting only x86 then IMO you may rely on the presence of DWCAS today. However you have to code some back path for the situation if DWCAS is not present. It's not of much significance how exactly you will emulate DWCAS, because it MUST present on most todays processors. Personally I would prefer emulation which uses global hashed array of spin-mutexes and just lock the mutex for the duration of DWCAS, this will allow you to pay only 1 single-word atomic RMW for DWCAS in general case.
And if you target other architectures then DWCAS is not necessary present. In this situation it's better to use technique which does not rely on DWCAS at all (some form of PDR). Paying 3 atomic RMWs for wait-free emulation on fast-path is IMO too costly, and approach based on hashed array of spin-mutexes may cause costly spinning in general case (not lock-free).
Note that there are other techniques to emulate DWCAS, like pointer packing (used by MS Interlocked SList API), usage of offsets instead of pointers, usage of 4GB memory pools (which allows you to have 32-bit pointers in 64-bit mode).
The important thing is that there are important lock-free algorithms which do not need any form of safe memory reclamation at all. For example you may take a look at the following multi-producer/single-consumer wait-free queue:
I've especially built the algorithm around the problem of memory reclamation and ABA.
There are also some tricks that are applicable in private cases. Jim described example with separate helper counter in persistent memory (which is actually no more than manually coded spin-mutex). There is more efficient implementation of that (atomic reference counting with strong thread-safety), which is wait-free and uses only 1 atomic RMW to acquire a reference (requires DWCAS or some sort of emulation though):
This algorithm may be used in another way. Such that you will get basically zero overhead iteration over lock-free linked-list (and similar data structures). I.e. access to the next node will require only 1 plain load instruction (not counting Alpha architecture which will require memory fence before dereferencing next node). The trick is that you must not acquire and release reference to each individual node but instead freeze and unfreeze memory reclamation.
The other example is MS Interlocked SLIst API. The trick is that with lock-free stack the only type of memory access which may go to already freed memory is a read (no writes to freed memory). So access to reused memory is handled by failed CAS, and access to actually unmapped from process's address space memory is handled by especially setup exception handler which catches memory access violations and just retries the operation.
Sorry no references here, they do not document the technique. If you are interested in details investigate with disassembler, there are subtle things like handling real access violations (provided user coding bug).
The interesting thing is that regardless of the fact that the technique looks a bit snoopy, it provided basically optimal performance and characteristics.
Proper GC will not tend to use smart pointer nor comes with overheads. Smart pointers is a performance anti-pattern in the multi-core era. Proper GC provides virtually zero overhead object life-time management and eliminates overheads related to other techniques. For example you may see my concurrent hash map which uses a bunch of advanced synchronization techniques, on key of which is a PDR (GC) system:
It has perfect linear scaling on read-mostly work-load, on quad-core processor it beats TBB concurrent hash map by a factor of 50. On bigger systems I expect it to have even higher advantage (several orders of magnitude).
Here is a sketch of another distributed scalable low-overhead PDR algorithm I developed:
- Reference counting with basic-thread safety - Reference counting with strong-thread safety - Low overhead PDR - PDR with long-term cross-thread references
As-is has some limitations, though.
Other PDR ("GC") techniques include RCU, SMR, VZOOM. Note that most stuff is patented here. My algorithms are not patented and hopefullyare in public domain (does not mean that will not be patented in future by other "inventors").
Good post, good references. I would like to comment on:
>> Jim described example with separate helper counter in persistent memory (which is actually no more than manually coded spin-mutex).
While the counter is in persistent memory (one counter per thread per list/container) it is not used as a spinlock.
Recall that one of the options for handling the linked list problem is to isolate the activity pattern (generally present in most applicaitons) whereby there are periods of use of list and periods of unuse of the list and then perform the deletions during the periods of unuse. This technique will work well in most applications _except_ when the list is use completely asychronously. IOW there is no assuredperiod when all threads will not be using the list. It should be noted though that there will be a period, perhaps very small, where each thread will have a specific list in a condition of unuse. The counter in persistent memory (one per thread using the list)is to be incremented each time the thread passes throughits period of unuse.Using this rule, delete is performed with
1) unlink node (thread safely) 2) snapshot counters for list 3) pack node pointer, shapshot, and pointer back to active counters into struct and enqueue to "to be deleted" list 4) return to other work
Periodically a GC routine runs, examines "to be deleted" list when/if entry in list contains snapshot where all counters differ with active counters then delete node (formerly in list) and reclaim "to be deleted" list node.
Note, 3) may also require the address of a functor to perform the dtor on the object should that be necessary.
Thanks, indeed, for this detailed and insightful post. Now I have much to read and to think about ;-) (which is on which I counted).
A few things I can say now: i) It seems to be much confusion about these topics. The Wikipedia article about garbage collection does not seem to be aware of any possible misinterpretations. And when I tired to google for "deferred reclamation", the post you just wrote was one of the first 10 pages to appear! ii) Constructing non-blocking algorithms is still an art. Well, I knew it theoretically, but appearently there are even fewer general, widely applicable techniques than I thought. iii) Oh and just - thanks for your great algorithms. I hope they won't be patented. Fortunatley, Europe seems to be free from software patents (yet!), but...
Good post, good references. I would like to comment on:
>> Jim described example with separate helper counter in persistent memory (which is actually no more than manually coded spin-mutex).
While the counter is in persistent memory (one counter per thread per list/container) it is not used as a spinlock.
Recall that one of the options for handling the linked list problem is to isolate the activity pattern (generally present in most applicaitons) whereby there are periods of use of list and periods of unuse of the list and then perform the deletions during the periods of unuse. This technique will work well in most applications _except_ when the list is use completely asychronously. IOW there is no assuredperiod when all threads will not be using the list. It should be noted though that there will be a period, perhaps very small, where each thread will have a specific list in a condition of unuse. The counter in persistent memory (one per thread using the list)is to be incremented each time the thread passes throughits period of unuse.Using this rule, delete is performed with
1) unlink node (thread safely) 2) snapshot counters for list 3) pack node pointer, shapshot, and pointer back to active counters into struct and enqueue to "to be deleted" list 4) return to other work
Periodically a GC routine runs, examines "to be deleted" list when/if entry in list contains snapshot where all counters differ with active counters then delete node (formerly in list) and reclaim "to be deleted" list node.
Note, 3) may also require the address of a functor to perform the dtor on the object should that be necessary.
Jim Dempsey
That sounds like counter-based RCU to me; there may be patent issues...One other point. Your going to probably have to guard mutations to the per-thread counters with an expensive #StoreLoad memory barrier. You would not want a mutation to the counter to be hoisted up above the previous list access. You can get around this, but the techniques are extremely OS specific.
Also, you don't necessarily have to keep a snapshot of the counters on a per-object basis. You can instead enqueue the object onto the defer list and a dedicated polling thread can periodically flush the entire list into a private list (e.g., pending), perform a snapshot, and when that snapshot differs from a subsequent snapshot it transfers objects to another list (e.g., generation). Objects that were in the generation list prior to the transfer are now in a persistent quiescent state and can be safely reclaimed. The logic can look something like the following high-level pseudo-code:
Good post, good references. I would like to comment on:
>> Jim described example with separate helper counter in persistent memory (which is actually no more than manually coded spin-mutex).
While the counter is in persistent memory (one counter per thread per list/container) it is not used as a spinlock.
Recall that one of the options for handling the linked list problem is to isolate the activity pattern (generally present in most applicaitons) whereby there are periods of use of list and periods of unuse of the list and then perform the deletions during the periods of unuse. This technique will work well in most applications _except_ when the list is use completely asychronously. IOW there is no assuredperiod when all threads will not be using the list. It should be noted though that there will be a period, perhaps very small, where each thread will have a specific list in a condition of unuse. The counter in persistent memory (one per thread using the list)is to be incremented each time the thread passes throughits period of unuse.Using this rule, delete is performed with
1) unlink node (thread safely) 2) snapshot counters for list 3) pack node pointer, shapshot, and pointer back to active counters into struct and enqueue to "to be deleted" list 4) return to other work
Periodically a GC routine runs, examines "to be deleted" list when/if entry in list contains snapshot where all counters differ with active counters then delete node (formerly in list) and reclaim "to be deleted" list node.
Note, 3) may also require the address of a functor to perform the dtor on the object should that be necessary.
Jim Dempsey
I forgot to mention that this technique is not really sufficient enough to prevent memory leaks unless threads are always going to be bumping that counter at periodic/episodic intervals. Think of a situation in which you delete a node, take a snapshot, and one of the threads decides not to access the list anymore thus its counter remains the same. Well, that node will never be deleted. You generally want to have a counter and a boolean per-thread and do something like:
Then invoke `thread_enter()' before you access the list, and `thread_leave()' when your finished. This gives critical information to the polling logic which can be used to ensure that no objects will be leaked...
The same amount of overhead occures in making a snapshot of the counters regardless as to if the polling thread performs the snapshot or the thread that enqueues the object for deletion makes the snapshot. Moving the snapshot code to the polling thread makes sense when the polling thread has lower priority (which it will have when it sleeps).
The counters need not be associated with the list. As an example, a counter might be incrimented in the outer most loop (before or after any objects referenced in the list and/or several lists). This reduces the polling frequency (at expense of increased latency to deletion of object). And reduces the number of counters as well as eliminates the thread_enter/thread_leave.
Good post, good references. I would like to comment on:
>> Jim described example with separate helper counter in persistent memory (which is actually no more than manually coded spin-mutex).
While the counter is in persistent memory (one counter per thread per list/container) it is not used as a spinlock.
Oh, I meant different your example:
------------------------------------------------------------------- Place reference counter in node. Place reference counter in node. Place "intending to bump reference counter" counter in list header.
Then
Bump "intending to bump reference counter" counter in list header (return value tells you delete may be in progress so dec, back off and retry)
Search list for node
Bump reference counter in node (return value tells you if node is reserved for exclusive use, if so, dec, back off, retry)
Un-bump "intending to bump reference counter" counter in list header -------------------------------------------------------------------
Recall that one of the options for handling the linked list problem is to isolate the activity pattern (generally present in most applicaitons) whereby there are periods of use of list and periods of unuse of the list and then perform the deletions during the periods of unuse. This technique will work well in most applications _except_ when the list is use completely asychronously. IOW there is no assuredperiod when all threads will not be using the list. It should be noted though that there will be a period, perhaps very small, where each thread will have a specific list in a condition of unuse. The counter in persistent memory (one per thread using the list)is to be incremented each time the thread passes throughits period of unuse.Using this rule, delete is performed with
1) unlink node (thread safely) 2) snapshot counters for list 3) pack node pointer, shapshot, and pointer back to active counters into struct and enqueue to "to be deleted" list 4) return to other work
Periodically a GC routine runs, examines "to be deleted" list when/if entry in list contains snapshot where all counters differ with active counters then delete node (formerly in list) and reclaim "to be deleted" list node.
Note, 3) may also require the address of a functor to perform the dtor on the object should that be necessary.
RCU with per PE counters must be patented by IBM ;)
The same amount of overhead occures in making a snapshot of the counters regardless as to if the polling thread performs the snapshot or the thread that enqueues the object for deletion makes the snapshot. Moving the snapshot code to the polling thread makes sense when the polling thread has lower priority (which it will have when it sleeps).
Chris meant that polling thread makes a snapshot for a batch of objects, so this will have lower overhead than shapshot for each object during enqueue. Batching tells positively on such things, for example polling thread may sort batch of objects by originated thread id and then send sub-batches back to originated threads with single CAS.
The counters need not be associated with the list. As an example, a counter might be incrimented in the outer most loop (before or after any objects referenced in the list and/or several lists). This reduces the polling frequency (at expense of increased latency to deletion of object). And reduces the number of counters as well as eliminates the thread_enter/thread_leave.
I guess Jim have in mind some number crunching with regular access patterns, while Chris have in mind some network server where worker thread may block for a long time waiting for a new request to arrive. I suggest you to agree on the context first ;)
Though this confirms important moment regarding such things, their design is extremely context-dependent.
There is also the concern that by placing the snapshot of the counter in the deletion thread, then the snapshot might occur _after_ thread termination. IOW, you will never see change (requires persistent counters). This though could be fixed with thread exiting placing "final count" in counter (0 or 0xFFFFFFFF, ...) then use ((Count != snapshotCount)||(Count==FinalCount))
The batch snapshot reduces overhead (once per batch) but may increase latency of time object held to time object is deleted. This may or may not be an issue.
There is also the concern that by placing the snapshot of the counter in the deletion thread, then the snapshot might occur _after_ thread termination. IOW, you will never see change (requires persistent counters). This though could be fixed with thread exiting placing "final count" in counter (0 or 0xFFFFFFFF, ...) then use ((Count != snapshotCount)||(Count==FinalCount))
The batch snapshot reduces overhead (once per batch) but may increase latency of time object held to time object is deleted. This may or may not be an issue.
Jim
There are other ways to handle the thread termination issue that do not require persistent counters, but I don't really want to get into that right now. The latency issue is definitely a point of concern. However, there are ways around that as well. You can indeed have bounded garbage count in a deferred reclamation scheme (e.g., SMR aside for a moment). The simplest technique is to use a fast-path semaphore. Here is simple pseudo-code:
[cpp]#define GARBAGE_BOUND 4096
static thread_safe_list g_defer;
static fast_semaphore g_sem(GARBAGE_BOUND);
void defer(object* o) {
g_sem.wait();
g_defer.push(o);
}
void reclaim(non_thread_safe_list g) {
for_each(g) as o {
delete o; // or functor/whatever...
}
g_sem.post(g.count());
g.flush();
}
void polling_thread() {
snapshot s1, s2;
non_thread_safe_list pending, generation;
s1.sample();
for (;;) {
wait_for_polling_interval();
s2.sample();
if (s1 != s2) {
reclaim(generation);
// generation is now empty
generation.transfer(pending);
// pending is now empty
s1.sample();
}
pending.transfer(g_defer);
// g_defer is now empty
}
}[/cpp]
This is a very crude soultion, but it "works"! ;^)
Once more, thanks to all who contribute/contributed to this interesting discussion. A few more questions to Dmitriy, if you'd be so kind:
i) Could you provde me any resources about PDR as you understand it? Googling finds mostly your discussions on these forums. ;-) ii) You use the volatile keyword in your algorithms. AFAI understand you use it in the "MS Visual sense". On other environments can it be simply replaced by tbb::atomic<>? Thanks!
Once more, thanks to all who contribute/contributed to this interesting discussion. A few more questions to Dmitriy, if you'd be so kind:
i) Could you provde me any resources about PDR as you understand it? Googling finds mostly your discussions on these forums. ;-) ii) You use the volatile keyword in your algorithms. AFAI understand you use it in the "MS Visual sense". On other environments can it be simply replaced by tbb::atomic<>? Thanks!