Re: Concurrent stack?

alexandre_hamez · ‎08-26-2008

Hello everybody,

I was wondering if there is any plan to add a concurrent stack to tbb? Indeed, a concurrent_queue is almost always the right way to pass some data from one thread to another one, but sometimes we would like to fetch the most recently added data.

Thanks!

ARCH_R_Intel · ‎08-26-2008

This is the first time anyone has requested a concurrent_stack.

If you have a multiple producers and a single consumer, it's simple to build a concurrent stack using a linked list and atomic operations on the root. Alas if there are multiple consumers, the ABA problem comes into play.

In the general multiple-producer multiple-consumer case, it may be hard to beat a plain STL stack with a spin lock around it, unlesssupport for avoiding ABA is available. (Alas, early Opteron andIntel x64 are missing the critical double-width CAS that avoids ABA).Now that I've said it is hard, someone can contribute a concurrent_stack that proves me wrong :-)

RafSchietekat · ‎08-26-2008

That's not even a challenge... :-)

(Added) No takers? To clarify, I meant that implementing such a Building Block is more of a chore than a challenge if, like for atomics, it is allowed to default to locking where specific support is not available (I would not know how to do this in general without such hardware support). Support can be had in the form of atomic operations beyond the size of a pointer as already indicated (how many of those early 64-bit x86 processors that don't provide that still exist, assuming that they're running 64-bit-pointer programs?), or primordial transactions from Alpha/ARM/MIPS/POWER/PowerPC (this is something they can easily handle), or ESA/390-z/Architecture PLO I suppose (I have not looked into it yet). Note that assembler-level programming is required in most cases, it probably cannot be efficiently done in general in terms of C++-level atomics.

Dmitry_Vyukov · ‎08-29-2008

Raf_Schietekat:
That's not even a challenge... :-)
Support can be had in the form of atomic operations beyond the size of a pointer as already indicated (how many of those early 64-bit x86 processors that don't provide that still exist, assuming that they're running 64-bit-pointer programs?)

I was thinking about this some time ago. I conclude that:
(1) There is a relatively small amount of such lagging processors. And this number constantly decreases over time.
(2) Those processors were not multicore, so it's not very disastrous to use lock on those.
So I decide that it's beneficial to use algorithms which relying on double word CAS anyway.

Dmitry_Vyukov · ‎08-29-2008

Raf_Schietekat:

or ESA/390-z/Architecture PLO I suppose (I have not looked into it yet).

z/Arch PLO is capable of performing 6 types of operations:
- compare and load
- compare and swap
- double compare and swap
- compare and swap and store,
- compare and swap and double store
- or compare and swap and triple store
on 32, 64 and 128 (!) bit operands.
So maximum what you can get is 4 compares and 12 stores!

For stack you can use compare on 'version' field and stores to 'version' field and 'next' field.

ARCH_R_Intel · ‎08-29-2008

I agree that the double-width compare-and-swap should be used if available; however, there must be a backup route for older hardware. On x86 this requires a run-time check of the hardware capabilities. Perhaps an abstract atomic (pointer+tag) can be provided, and on hardware without double-width CAS, the tag can be one bit less than a full word, and the missing bit used internally for a spin lock.

One other consideration. If the container really is just a double-width compare-and-swap on moder hardware, then memory allocation/freeing overheads will dominate the overhead. Thus an intrusive container interface might be more appropriate. Here's what the interface might look like (adapted from old KAI C++ internals):

typedef implementation-defined concurrent_intrusive_stack_link;

template concurrent_intrusive_stack;

Intrusive queues and stacks is something I've wanted in TBB for a while, but we've never had time to implement/test/document them. Furthermore, having gone that far away from classic STL, we probably want to make them non-blocking too, at least if hardware permits.

- Arch

Dmitry_Vyukov · ‎08-29-2008

MADadrobiso:
One other consideration. If the container really is just a double-width compare-and-swap on moder hardware, then memory allocation/freeing overheads will dominate the overhead. Thus an intrusive container interface might be more appropriate. Here's what the interface might look like (adapted from old KAI C++ internals):

typedef implementation-defined concurrent_intrusive_stack_link;

template concurrent_intrusive_stack;

Intrusive queues and stacks is something I've wanted in TBB for a while, but we've never had time to implement/test/document them. Furthermore, having gone that far away from classic STL, we probably want to make them non-blocking too, at least if hardware permits.

I completely agree that (1) intrusive containers are the choice for high-performance software/libraries, and (2) "single-threaded interfaces" are not suitable for high-performance concurrent collections. Ok, some interfaces are suitable, but not in general case.

Btw, there are specialized memory allocators from some types of concurrent collections. For example, this one is very effective allocator for fifo-queues. Allocation and deallocation don't involve atomic RMW nor heavy memory fences.
http://groups.google.com/group/lock-free/browse_frm/thread/2c6f6250762385e1

RafSchietekat · ‎08-29-2008

Why a tag/lock combination, even if combined in one word?

Maybe it's just me, but I'd rather eat a lemon than use an intrusive type (they're so... intrusive!). How about instead internally refurbishing the temporary storage: how much additional performance would then be left untapped, and would it be relevant considering Amdahl's law?

For completeness: Itanium has a somewhat curious lopsided instruction cmp8xchg16 (it's even a misnomer, because it's not really exchanging 16 bytes, it's exchanging 8 and storing an additional 8) that may also do the trick (I have not yet checked whether a simple 8-byte tag write will properly synchronise with cmp8xchg16, but I presume it will, for this instruction to be meaningful), so the only ones left out from the architectures currently targeted by the "Additions to atomic" patch are (pre-Rock) SPARC (for which something with indexed storage or so can still be devised, I suppose) and PA-RISC (which will always need a lock to do anything).

Dmitry_Vyukov · ‎08-29-2008

MADadrobiso:
If you have a multiple producers and a single consumer, it's simple to build a concurrent stack using a linked list and atomic operations on the root. Alas if there are multiple consumers, the ABA problem comes into play.

ABA is not serious problem here. Real problem in concurrent stack is safe memory reclamation.
Even if double-word CAS is used to solve ABA problem, access violation (SIGSEGV) is possible in pop() function.

Dmitry_Vyukov · ‎08-29-2008

Raf_Schietekat:
That's not even a challenge... :-)

How are you going to solve safe memory reclamation problem in stack?
Even if double-word CAS is used to solve ABA problem, access violation (SIGSEGV) is possible in pop() function.

RafSchietekat · ‎08-29-2008

Do elaborate why a SIGSEGV would be so inexorable...

ARCH_R_Intel · ‎08-29-2008

Safe memory reclamation was the issue that we ran into when I had a summer intern, Prahabjan Kambadur, looking into lock-free algorithms for TBB back in 2005. Hazard pointers seemed like the way to go, though that would require that each thread using the container would have to register itself with a hazard-pointer overseer. The summer ended before we could get around to implementing hazard pointers. Also, at the time I thought that requiring registration was too much of a nuisance, but I've since changed my mind since no better alternative has come up.

For the ABA issue, we ended up using a tagged pointer and the following primitive, inspired by Itanium's cmp8xchg16:

serial_number __TBB_TaggedCompareAndSwap( location* address, ptrdiff_t new_sn, void* new_ptr, ptrdiff_t old_sn, void* old_ptr) {
serial_number result;
atomic {
result = old_sn;
if( address->sn==old_sn ) {
assert( address->ptr==old_ptr );
address->ptr = new_ptr;
address->sn = new_sn;
}
}
return result;
}

Machines with double-width compare-and-swap can easily implement this instruction, as can recent Itanium processors with cmp8xchg16. The irony was, to our chagrin, that Itaniums of the time did not implement cmp8xchg16, even though it was in the architecture manuals. A separate section of the manual had the fine print saying the instruction might not be available :-(

For machines without an obvious instruction, the special tag -1 was used internally to denote "locked". That let us implement the primitive using a spin lock for the "atomic" part.

- Arch

RafSchietekat · ‎08-29-2008

So no tag/lock combination after all?

What interface would it be that causes a safe memory reclamation problem?

Dmitry_Vyukov · ‎08-29-2008

Raf_Schietekat:
Do elaborate why a SIGSEGV would be so inexorable...

Consider following stack implementation:

T stack::pop()

 node_t* head = m_head;

 for (;;)

 if (0 == head)

 return T();

 node_t* next = head->next; (***)

 if (CAS(&m_head, &head, next))

 T val = head->value;

 delete head;

 return val;

ABA aside for a moment. In line (***) SIGSEGV is possible, if concurrent popper will pop node and delete it.

Alexey-Kukanov · ‎08-29-2008

Raf_Schietekat:
Do elaborate why a SIGSEGV would be so inexorable...

In pop(), for the stack implemented as single-linked list, the first operation would be to read the head and the second operation would be to read the head->next. There is a timing gap between two; if the thread execution is preempted at this exactly point, after the thread is back it might happen that the node pointed by head was popped by another thread, and destroyed. This is howyou get SIGSEGV, I believe.

I have a case that is much less restrictive than the usual semantics of the stack. In my case, it is guaranteed that objects (nodes) are alive all the time, and moreover the exact LIFO order is not important. Even in this "relaxed" problem, there seems no way to overcome ABA without DCAS.

Transactional memory would come handy,but I'm afraid it will be uselessly slow for a highly contended case like this one where all operations are done with the head.

Dmitry_Vyukov · ‎08-29-2008

MADadrobiso:
Safe memory reclamation was the issue that we ran into when I had a summer intern, Prahabjan Kambadur, looking into lock-free algorithms for TBB back in 2005. Hazard pointers seemed like the way to go, though that would require that each thread using the container would have to register itself with a hazard-pointer overseer. The summer ended before we could get around to implementing hazard pointers. Also, at the time I thought that requiring registration was too much of a nuisance, but I've since changed my mind since no better alternative has come up.

Personally I consider SMR (Hazard Pointers) as worst GC technique for lock-free algorithms. SMR has very strong advantage, it has upper limit on number of pending (unreclaimed) objects. But it has very high price, store-load style fence for every acquisition. And it can't acquire more than one object at a time. So if you need to travese list, you have to acquire/release every node in the list (i.e. N store-load memory fences).

Oh! And don't forget that SMR is patented by Sun!

There is a couple of other GC techniques.
RCU. Has very low overheads. Some implementations are patented by IBM.
vZOOM. Has very low overheads. Also patented.
Proxy-collector algorithm. Really very good for lock-free collections. It seems that some implementations are patented. Here is one of my implementations of proxy-collector:
http://groups.google.com/group/lock-free/browse_frm/thread/18a9f43b93a6b3f0
And there is also Scalable Distributed Reference Counting. I hope that there are no patents on it:
http://groups.google.com/group/lock-free/browse_frm/thread/46d00a0f543a758e

RafSchietekat · ‎08-29-2008

My reasons for refurbishing storage: not intruding on the user's type, probably avoiding some (de)allocation overhead, avoiding reading from memory that has disappeared. Is it really imperative to come back from a high-water mark before the end of the concurrent_stack's lifetime?

(Removed related question.)

Dmitry_Vyukov · ‎08-29-2008

MADadrobiso:
Safe memory reclamation was the issue that we ran into when I had a summer intern, Prahabjan Kambadur, looking into lock-free algorithms for TBB back in 2005. Hazard pointers seemed like the way to go, though that would require that each thread using the container would have to register itself with a hazard-pointer overseer. The summer ended before we could get around to implementing hazard pointers. Also, at the time I thought that requiring registration was too much of a nuisance, but I've since changed my mind since no better alternative has come up.

As for registration. Yes, most GC techniques require some ugly things like thread registration/deregistration, or periodical activity from all registered threads, or some extremely OS-dependent code... or quite expensive.

Btw, oiginal non-amortized proxy-collector algorithm:
http://groups.google.com/group/comp.programming.threads/msg/d911c93ffa8db264
doesn't require any registration/deregistration, but is quite expensive - 2 atomic RMW operations to acquire/release a batch of nodes.
My amortized proxy-collector algorithm is extremely efficient, but do require periodic activity from all threads:
http://groups.google.com/group/lock-free/browse_frm/thread/18a9f43b93a6b3f0

Dmitry_Vyukov · ‎08-29-2008

MADadrobiso:
For the ABA issue, we ended up using a tagged pointer and the following primitive, inspired by Itanium's cmp8xchg16:

If you use some form of safe memory reclamation, then ABA problem usually just doesn't exist.
If thread holds 'A', than another thread can change structure's anchor to 'B', but CANNOT change back to 'A', because first thread holds it, so 'A' can't be reclaimed.
And some form of safe memory reclamation is usually needed anyway, so in practice in most cases ABA is not a problem, and you do not need double-word CAS.

Dmitry_Vyukov · ‎08-29-2008

Raf_Schietekat:

But I would be interested in a safe way to read arbitrary memory locations where no such guarantee exists, because I've been struggling with that question this week...

There is a number of techniques available. SMR, ROP, RCU, VZOOM, PC, DRC.
Almost all are patented. But one can use slightly modified versions of patented algorithms.
If you are interested in some technique I can provide links.

RafSchietekat · ‎08-29-2008

Hmm... Thread 1 and 2 see A, thread 1 takes it, thread 3 takes B, thread 1 pushes A, thread 2 wreaks havoc by not allowing for a pretty straightforward ABA? This can even happen with memory reclamation, it would just be less likely that some thread would recycle the memory and push it on the stack just at the wrong time, but not impossible. What am I missing?