Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Data Structures with Atomic Operations

AJ13
New Contributor I
1,084 Views
I remember Arch saying something along the lines of: implementing a data structure with atomic operations can sometimes perform no better than simply wrapping a sequential data structure with a mutex.

Before I repeat the TBB team's past work, I'm looking at wait-free data structures and the penalty of using atomic operations to implement them. Any general suggestions or observations on the performance of wait-free data structures?

In particular, I'm looking at implementing data structures from the book "The Art of Multiprocessor Programming". Before I spend a lot of time doing this, any words of wisdom?

Thanks!

AJ
0 Kudos
23 Replies
Dmitry_Vyukov
Valued Contributor I
153 Views
JimDempseyAtTheCove:

The suggestion I have for this is to re-introduce a quirk (feature) that was available on the PDP-8 minicomputer. A user-mode application could temporarily inhibit interrupts (and O/S context switches) without using a privileged interrupts-off instruction. Although the particular instruction was designed for a different purpose, real-time programmers made good use of it to perform atomic operations on structures (e.g., updating a FIFO list and its head/tail).



IBM z/Architecture currently has an interesting instruction: PLO (PERFORM LOCKED OPERATION).
See IBM z/Architecture: Principles of Operation:
http://publibz.boulder.ibm.com/epubs/pdf/dz9zr002.pdf
Appendix A -> Multiprogramming and Multiprocessing Examples -> PERFORM LOCKED OPERATION

PERFORM LOCKED OPERATION (PLO)
The PERFORM LOCKED OPERATION instruction can be used in a multiprogramming or multiprocessing environment to perform compare, load, compare-and-swap, and store operations on two or more discontiguous locations that can be words or doublewords. The operations are performed as an atomic set of operations under the control of a lock that is held only for the duration of the execution of a single PERFORM LOCKED OPERATION instruction, as opposed to across the execution of multiple instructions. Since lock contention is resolved by the CPU and is very brief, the program need not include a method for dealing with the case when the lock to be used is held by a program being executed by another CPU. Also, there need be no concern that the program may be interrupted while it holds a lock, since PERFORM LOCKED OPERATION will complete its operation and release its lock before an interruption can occur.


Dmitriy V'jukov
0 Kudos
Dmitry_Vyukov
Valued Contributor I
153 Views
aj.guillon@gmail.com:

2) Use thread-local storage to perform updates on thread-local parts of a data structure, combined with a sprinkling of atomic operations that are seldom called. For instance, an unordered_vector could work by giving each thread a contiguous segment of memory to fill with elements when it calls insert(); when that segment is filled, the thread gets another one, which results in an atomic increment of the number of segments available.


This will work faster than a lock-based implementation (provided the implementation is careful).
With big private chunks and careful data placement you can get up to a 100x speed-up.

Forget about atomic operations and locks. They play only a secondary role in performance/scalability. The primary thing is data sharing. Basically, count the number of accesses to different shared cache lines per operation, not the number of atomic operations. (But don't count accesses to read-only shared cache lines.)

Yes, atomic operations have some fixed price (by the way, on the latest Intel processors atomics have become much cheaper), and thus affect performance. But data sharing is much more expensive, and it affects not only the performance but also the scalability of the system. Note that atomic operations can be (and usually are) combined with data sharing, which is the worst-case scenario.

Dmitriy V'jukov
0 Kudos
Dmitry_Vyukov
Valued Contributor I
153 Views
aj.guillon@gmail.com:


I have been fortunate, that the book "The Art of Multiprocessor Programming" has a collection of wait-free algorithms which I can implement directly using TBB in C++. Unfortunately they all have different consistency guarantees, but proper documentation can address this.


I was thinking that in this book the authors rely entirely on a garbage collector, because I don't see even a mention of the safe memory reclamation problem in the book's table of contents. So I was thinking that the algorithms are not directly implementable in C++.
Do they provide a solution for memory reclamation in lock/wait-free algorithms? Or are you going to use some existing memory reclamation scheme (SMR, RCU, proxy-collector, etc.)?

Dmitriy V'jukov
0 Kudos