Before I repeat the TBB team's past work, I'm looking at wait-free data structures and the penalty of using atomic operations to implement them. Any general suggestions or observations on the performance of wait-free data structures?
In particular, I'm looking at implementing data structures from the book "The Art of Multiprocessor Programming". Before I spend a lot of time doing this, any words of wisdom?
Thanks!
AJ
So-called "lock free" parallel data structures rely on the conceit that data structure updates can be handled as atomic operations. You could argue that a mutex is itself a "lock free" parallel data structure, in that it uses atomic operations to update its data without locking in order to implement... well, a lock. This limitation of handling data structure updates only as atomic operations does restrict the variety of data structures that can be implemented in a lock-free manner.
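To illustrate that point, a bare spin mutex reduced to its essentials might look like this (a minimal sketch using C++0x-style atomics, not production code; a real implementation would add backoff):

#include <atomic>

// A mutex built from lock-free primitives: a single atomic flag
// updated without any lock.
class spin_mutex
{
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock()
    {
        // spin until this thread is the one that flips the flag
        while (flag.test_and_set(std::memory_order_acquire))
            ;
    }
    void unlock()
    {
        flag.clear(std::memory_order_release);
    }
};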
I don't know what quote of Arch's you may be referring to, but I do know that there are data structures with so many internal dependencies that implementations that try to use internal locks/atomics are not likely to generate better performance than alternatives that just wrap the data structure accesses in a global lock.
I have been fortunate that the book "The Art of Multiprocessor Programming" has a collection of wait-free algorithms which I can implement directly using TBB in C++. Unfortunately, they all have different consistency guarantees, but proper documentation can address this.
My previous post on atomic operation performance on Intel hardware gave me a general idea as to the performance to expect on highly contended data structures. What I didn't post, although perhaps I should, is that if each thread does a loop like for (int var = 0, i = 0; i < 500; ++i) var++; the operations scale just fine... I take this to just mean that the cache lines have had a chance to update.
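To illustrate what I mean (a sketch using C++0x-style std::thread; the thread and iteration counts are arbitrary): give each thread its own counter and the increments scale almost perfectly, because no cache line is shared between threads.

#include <thread>
#include <vector>

// Each thread increments a private counter: no shared cache line,
// so throughput scales with the number of threads.
void private_counting(int nthreads)
{
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t)
        threads.push_back(std::thread([] {
            volatile int var = 0;            // volatile keeps the loop alive
            for (int i = 0; i < 500; ++i)
                ++var;
        }));
    for (size_t t = 0; t < threads.size(); ++t)
        threads[t].join();
}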
I see two ways to implement scalable wait-free data structures:
1) Use atomic operations only.
2) Use thread-local storage to perform updates on thread-local parts of a data structure, combined with a sprinkling of atomic operations that are seldom called. For instance, an unordered_vector could work by giving each thread a contiguous segment of memory to fill with elements when it calls insert(); when that segment is filled, the thread gets another one, which results in an atomic increment of the number of segments available (a sketch follows below).
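Here is a minimal sketch of approach 2 (the names unordered_vector, Segment, and SEGMENT_SIZE are hypothetical, and segment bookkeeping/reclamation is omitted):

#include <atomic>

const int SEGMENT_SIZE = 1024;

// A contiguous chunk that one thread fills privately.
struct Segment
{
    int data[SEGMENT_SIZE];
    int used;
    Segment() : used(0) {}
};

class unordered_vector
{
    std::atomic<int> segment_count;          // the only shared, atomic state
public:
    unordered_vector() : segment_count(0) {}

    Segment* acquire_segment()
    {
        segment_count.fetch_add(1);          // the seldom-called atomic op
        return new Segment;                  // private to the calling thread
    }

    // 'local' lives in thread-local storage; the fast path touches
    // no shared memory at all.
    void insert(Segment*& local, int value)
    {
        if (local == 0 || local->used == SEGMENT_SIZE)
            local = acquire_segment();
        local->data[local->used++] = value;
    }
};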
After a great deal of thought, neither approach seems perfect by itself. Thus I have decided to implement internal structures to support both approaches. I should stress that my intention is to never let the user see these low-level details; they will simply be given the interface to the data structure, and it will just work as advertised.
Thoughts?
It would be interesting to hear about this straight from the horse's mouth, even if only in the form of links to Arch's previous writings on the subject: have results that discredit the relative efficiency of lock-free/wait-free algorithms been obtained on all sorts of hardware?
Wait-free, lock-free... it seems easy to get lost in semantics here. Raf suggests, and I would agree more generally, that in the case of two HW threads trying to modify the same data structure, there's no such thing as a wait-free algorithm: the best anyone can expect is to manage access so that the contentious delay of a second modifying thread is no worse than the time it takes to flush an atomic variable from the first thread and establish ownership of it by the second thread. But it's still a wait...
I'm afraid I did not follow the argument regarding the scaling of the cited for-loop. That loop could easily be optimized by a compiler to touch no cache lines (var and i could both be assigned to registers), or a smart compiler could eliminate the loop entirely. How does that prove scaling?
I built a function:
void instructionDelay(int cycles)
{
for (int i = 0; i < cycles; ++i); // empty loop body: just burn iterations
}
I then compiled this function in a separate object file with -O0 to disable optimizations. That produces the following (using objdump -dS on the .o file):
0000000000000000 <_Z16instructionDelayi>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 48 83 ec 10 sub $0x10,%rsp
8: 89 7d f8 mov %edi,0xfffffffffffffff8(%rbp)
b: c7 45 f0 00 00 00 00 movl $0x0,0xfffffffffffffff0(%rbp)
12: 8b 45 f0 mov 0xfffffffffffffff0(%rbp),%eax
15: 8b 55 f8 mov 0xfffffffffffffff8(%rbp),%edx
18: 3b c2 cmp %edx,%eax
1a: 7d 0e jge 2a <_Z16instructionDelayi+0x2a>
1c: 83 45 f0 01 addl $0x1,0xfffffffffffffff0(%rbp)
20: 8b 45 f0 mov 0xfffffffffffffff0(%rbp),%eax
23: 8b 55 f8 mov 0xfffffffffffffff8(%rbp),%edx
26: 3b c2 cmp %edx,%eax
28: 7c f2 jl 1c <_Z16instructionDelayi+0x1c>
2a: c9 leaveq
2b: c3 retq
So I call instructionDelay() with some integer argument, *then* update the atomic value. I wanted to see just how much work had to be done to break even on the cache-line transfer time on my system.
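In outline, the measurement loop looks something like this (a sketch; the iteration count and the use of fetch_add are my assumptions about the experiment):

#include <atomic>

void instructionDelay(int cycles);   // the -O0 delay function above

std::atomic<int> shared(0);

// Interleave local work with contended atomic updates to find the
// break-even point against the cache-line transfer cost.
void worker(int delayCycles)
{
    for (int i = 0; i < 1000000; ++i)
    {
        instructionDelay(delayCycles);   // uncontended local work
        shared.fetch_add(1);             // forces a cache-line transfer
    }
}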
I agree that accessing the same region of data will create a contested segment of memory; the cache will have to deal with that, and you get bad performance. That may or may not be avoidable in all algorithms.
Still, it might be that if the wait is at a sufficiently low level (below the programming language) on a suitable processor, a "wait-free" algorithm might outperform one that isn't? I'd prefer to let that adversarial process run its course.
I thought that "aj.guillon" made "var" an int by accident and really meant an atomic. But that's obviously only a necessary condition for good scalability.
For terminology of wait-free versus lock-free versus obstruction-free, the Wikipedia entry looks good. It references "On the Inherent Weakness of Conditional Synchronization Primitives", which shows that wait-free algorithms are inherently inefficient if compare-and-swap or load-linked/conditional-store are the only synchronization primitives available, and that other operations like fetch-and-add are necessary. Alas, no commercial architecture that I'm aware of has a scalable fetch-and-add (it can be done scalably via combining networks).
My opinion is that non-blocking algorithms are important in situations where lock preemption (arising from oversubscription) is an issue. But when oversubscription is not an issue, classic locked algorithms, combined with designing to avoid lock contention, will generally outperform non-blocking algorithms on current Intel hardware. I don't know about other hardware, but observe that:
- Non-blocking algorithms generally use more atomic operations than it takes to acquire a lock.
- Atomic operations in non-blocking algorithms generally need a memory fence.
- Memory fences incur an inherent penalty on out-of-order processors or processors with caches.
So on an in-order processor without any cache (e.g. Tera MTA, system-on-a-chip), non-blocking algorithms might be competitive even in the absence of lock preemption.
The reality is that high performance processors are looking more and more like distributed memory machines; i.e., with performance dependent on very careful attention to cache locality.
Intel-specific note: For processors with more than one hardware thread per core (e.g. Pentium 4, Nehalem), a delay loop should have a pause instruction, otherwise it may unnecessarily starve the other hardware thread. TBB has this functionality in its internal routine __TBB_Pause( int32_t n ). On x86 machines, it executes n pause instructions.
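For reference, an equivalent delay loop outside TBB might look like this (a sketch; _mm_pause is the compiler intrinsic for the pause instruction on x86):

#include <immintrin.h>   // _mm_pause

// Spin-wait delay that yields pipeline resources to the sibling
// hardware thread on each iteration.
static inline void pause_delay(int n)
{
    for (int i = 0; i < n; ++i)
        _mm_pause();
}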
- Arch
Oh yes, did anyone at TBB ever get around to timing my latest concurrent_hash_map submission, after the application of an attempted cure for the obvious (performance) problem with the naive approach to erasing items? I think that even comparable performance would make it a preferable implementation, because it is easier to reason about (won't block except when so requested for exclusive element access, minimised contention, concurrent segment reallocation, no zombie elements, ...), but I wouldn't be altogether surprised if now it were also faster.
(Added) The relevance of that here would be that it might also give better relative performance on something like POWER/PowerPC with relaxed atomic operations (and explicit fences where needed), so the ultimate test would be there and on top of my atomics submission.
(Added) And please make sure that the performance comparison is fair, and not just a micro-benchmark on zombie element creation (one thread thinking it's using an element in the map, while another thread can simultaneously remove it and a third even insert another element with the same key!), which was a negative goal in my implementation (preserving TBB's previous behaviour).
For my 2 cents' worth on Lock-Free / Wait-Free terminology:
Lock-Free: Code that can use CAS or DCAS, but does not use CAS or DCAS to set a mutex (software lock) for the purpose of holding a resource (e.g. modifying a structure). CAS and DCAS are interlocked (atomic) operations, and unfortunately Intel chose LOCK as the name of the prefix that performs the atomic operation. This gives some the impression that a software lock is being performed when it is not. One potential problem with using a software lock (mutex) is that the thread holding the mutex can be interrupted by the O/S and suspended for an extended period of time. Although the work intended to be performed while holding the software lock is short, the O/S intervention causes the time to perform the operation to become excessively long.
Wait-Free: Code that uses CAS and DCAS without software locks to transition the state of a structure, generally requiring several instructions, and expressly where the change sequence is made visible during the change operation, so that other threads can advance the changes initiated by the first thread to begin the state change. Even for simple operations, bug-free wait-free code is extremely difficult to write.
I consider myself an experienced programmer (40 years), and I have written a Lock-Free / Wait-Free simulation of a DCAS on a system that supports only CAS instructions (e.g. Opteron 270 in 64-bit mode to simulate 128-bit DCAS). I then used this to produce a Lock-Free / Wait-Free FIFO using singly linked nodes with Head and Tail pointers. Getting this relatively simple process (a FIFO) to work Error-Free / Lock-Free / Wait-Free, and to pass a stress-test verification program using 1-16 threads on 4 available cores on both the 2x Opteron 270 (without 128-bit atomic compare-and-exchange) and the Intel Q6600 (with 128-bit compare-and-exchange), took a good chunk of time (several months).
Extending Lock-Free / Wait-Free to something more complicated than this relatively "simple" FIFO would be daunting.
The easiest way to avoid having to write Lock-Free / Wait-Free code would be to eliminate the requirement for having to write this type of code. The suggestion I have for this is to re-introduce a quirk (feature) that was available on the PDP8 minicomputer. A user-mode application could temporarily inhibit interrupts (and O/S context switches) without using a privileged Interrupt Off instruction. Although the particular instruction was designed for a different purpose, realtime programmers made good use of it to perform atomic operations on structures (e.g. updating a FIFO list and its Head/Tail).
Jim Dempsey
First, to Raf... I prefer AJ to "aj.guillon"... seeing my name in quotes delimited by a period just seems strange to me :-). I thought I had changed a setting for that in the forums... I'll have to see what I can do to fix how it displays.
As far as being the champion of wait-free or lock-free algorithms goes, I am very interested in implementing a few data structures to measure their performance against, say, STL containers with a lock. The reason I haven't just jumped right into coding is that I'm still in the reading phase.
Arch's comment on multi-core systems really requiring distributed data structures is very relevant... and distributed data structures are the next reading topic for me personally. This is partially where my comment related to thread-handles comes in... the idea is to have a special interface that threads use to obtain a handle to a data structure. This is hidden from the user and ideally "just works" without the user seeing the details. The thread then proceeds to work on its piece of the data structure, or whatever it is, without contention... atomic operations are used for synchronization as little as possible. This approach would require special data structures, of course...
What would be really nice is if hardware transactional memory just sprang into being and we didn't have to worry about any of this. Although I'm not sure what the real performance of such hardware would be.
"A user mode application could temporarily inhibit interrupts": If there is no enforced upper time limit, I suppose that that could be used to compromise the system (DoS or something), otherwise I'm curious what the error handling provisions would be.
(Corrected) What was I thinking?
Disabling interrupts creates atomicity only on uniprocessors.
Transactional memory in hardware is an intriguing approach for writing nonblocking algorithms, but proposed ways to implement it in hardware do not make it an easy way to write nonblocking algorithms, because of the lack of guarantees. For example, one common idea for hardware TM is to use the cache to hold a transaction. But that limits transactions, in the worst case, to the cache's associativity. Because architectures (so far!) do not promise a minimum cache associativity, code on such a machine can never rely solely on hardware transactions, but instead must always have a backup plan based on architectural guarantees (e.g. atomic operations).
See Section 2.1 of http://www.unine.ch/transact08/papers/Moir-Adaptive.pdf for a sketch of Rock's TM capabilities.
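As a concrete (if anachronistic) illustration of that backup-plan pattern, here is a sketch using Intel's later RTM intrinsics as a stand-in for whatever hardware TM interface a given machine exposes (assumes a compiler with RTM support; the lock protocol is simplified):

#include <immintrin.h>   // _xbegin / _xend / _xabort (RTM)
#include <atomic>

std::atomic<int> fallback_lock(0);   // 0 = free, 1 = held

// Try the hardware transaction first; if it cannot start or commit,
// fall back to a plain CAS lock, which is architecturally guaranteed.
void increment_with_fallback(int& x)
{
    if (_xbegin() == _XBEGIN_STARTED)
    {
        // Subscribe to the fallback lock: if a fallback holder is
        // active (or becomes active), this transaction aborts.
        if (fallback_lock.load(std::memory_order_relaxed) != 0)
            _xabort(0xff);
        ++x;                             // transactional path
        _xend();
    }
    else
    {
        int expected = 0;                // backup plan: CAS spin lock
        while (!fallback_lock.compare_exchange_strong(expected, 1))
            expected = 0;
        ++x;
        fallback_lock.store(0);
    }
}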
Adrobiso,
You missed the point on the temporarily disabling of interrupts.
My suggestion is a lock with temporary disabling of maskable interrupts. What this provides is the ability for a user application to perform a software lock with the assurance that the next n instruction fetches will be executed without the possibility of the operating system context-switching the thread out and inducing a long-duration ownership of the software lock.
Sure, other threads (on other cores) can contend for the data structure but the data structure is protected by a software lock. And it is known by the other threads that the wait duration will be very short.
It would be the user's responsibility to ensure the code to be performed would do so without fault (e.g. divide by 0 or page fault). Page faults would be a nuisance but manageable.
The driving reason to code Wait-Free is to provide a workaround for the Operating System interrupting a relatively short piece of code and then cascading long waits by other threads attempting to get the software lock. The temporary interrupt inhibit is an easy means to avoid inopportune operating system interruption of a short code section.
The instruction could be defined something like
IINCOBT imm8
Inhibit Interrupts N instruction fetch Cycles Or until Branch Taken (N=imm8)
This would generally be followed by a LOCK CMPXCHG to attempt to get the mutex. Failure to obtain the mutex (in a well-coded routine) results in a branch to a retry location, and thus re-enables the interrupts. If the mutex is obtained, the code falls through, continuing the uninterruptible zone.
The fall-through code would issue VERR and VERW to assure the code still has access to resident virtual memory and, if not, perform a conditional move to release the mutex, then branch to code to touch the memory and retry. If the VERR/VERW succeeds, the conditional move fails to reset the mutex and the code falls through the branch-to-retry to the next VERR/VERW as needed; then, when all addresses are known to be addressable, you perform the change to the structure (e.g. adding a node to a linked list). I will try to work up an example.
Although this code is harder to write than "let them wait", it is orders of magnitude simpler to write than Lock-Free/Wait-Free programming. You would have only two coding issues to resolve: 1) write the changes to the data structures in straight-line code (no branching), and 2) verify you have access to the addresses you intend to manipulate.
(By the way, CAS and DCAS are pseudo-code; LOCK CMPXCHG would be used on an Intel platform to perform CAS.)
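For readers following in C++ rather than assembly, a CAS on a mutex word might be sketched like this (using C++0x-style atomics; on Intel hardware the compare-exchange compiles down to LOCK CMPXCHG):

#include <atomic>

// Try to acquire a mutex word via CAS: succeeds only if it was 0.
bool try_acquire(std::atomic<int>& mutex_word)
{
    int expected = 0;                    // 0 = free, 1 = held
    return mutex_word.compare_exchange_strong(expected, 1);
}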
Jim Dempsey
I wonder how that capability will be reflected in high-level programming languages like C++, which is still trying to cope with multithreading...
P.S.: Please disregard part of an earlier post of mine, as corrected above.
If you want to play with transactions in C++, try out the Intel C++ STM Compiler. It's described at http://software.intel.com/en-us/articles/intel-c-stm-compiler-prototype-edition-20
>>Now I see what you mean. Yes, being able to protect against interrupts over tiny critical sections could get the practical advantage of a non-blocking algorithm without the hassle.<<
I would call it a short-term blocking algorithm.
The technique I proposed was for the critical section to terminate on the earliest of:
branch taken
number of instruction fetches
fault
It might be advisable to add to the fault list any attempt to re-issue the pause-interrupts instruction while already within one (DoS avoidance).
It would have to be determined whether it would be advisable to add REP, REPE, and REPNE, as well as other instructions (or prefixes) that may be used in a nefarious manner. With SSE3/SSE4.1 you can copy large blocks of memory using in-line coding and a relatively low number of instructions, so REPxx might not be needed.
Jim Dempsey
mad reed: Wait-free, lock-free... it seems easy to get lost in semantics here. Raf suggests, and I would agree more generally, that in the case of two HW threads trying to modify the same data structure, there's no such thing as a wait-free algorithm: the best anyone can expect is to manage access so that the contentious delay of a second modifying thread is no worse than the time it takes to flush an atomic variable from the first thread and establish ownership of it by the second thread. But it's still a wait...
Wait-free doesn't mean that there are no "waits". Wait-free means that every operation will be executed in a finite, bounded amount of time (regardless of the behavior of other threads).
For example, the following operation is wait-free:
struct rc_object { std::atomic<int> rc; }; // assumed definition

void rc_acquire(rc_object* o)
{
o->rc.fetch_add(1, std::memory_order_acquire); // single atomic RMW: completes in bounded time
}
And the following is only lock-free (not wait-free):
struct node { std::atomic<node*> next; }; // assumed definition

void lifo_push(std::atomic<node*>& head, node* n)
{
node* cmp = head.load(std::memory_order_relaxed);
for (;;)
{
n->next.store(cmp, std::memory_order_relaxed);
// compare_swap (compare_exchange_weak in the final C++ standard)
// reloads cmp with the current head on failure, so this loop can
// retry indefinitely under contention
if (head.compare_swap(cmp, n, std::memory_order_release))
break;
}
}
Informally, a "CAS loop" means lock-free, and "FetchAndAdd" or "AtomicExchange" means wait-free.
Dmitriy V'jukov