I am looking at a published MPSC wait-free producer queue implementation that makes use of an atomic exchange. If I understand correctly, on an Intel platform this would be backed by a "lock xchg".
I am fairly fresh to this subject and am trying to understand the implications. By claiming the producer is "wait-free" we are saying it has a bounded time frame in which it will complete. Is this possible with a "lock xchg"?
My understanding of a "lock xchg" is that it locks the bus (or cache line) somehow while it does the xchg (this is what I read somewhere). Does this mean that the bound on time in this case depends on the number of cores on the hardware?
I.e. If all cores request to lock to do the xchg, wouldn't they then need to be queued to operate in some exclusive order?
Sorry if this is a silly question, I am fairly new to this and am just trying to understand implementations I have seen around.
The bane of a software lock (e.g. mutex) is not the time to perform the lock, the work, and the unlock.
Rather it is the time for: the lock, part of the work, preemption of the thread, runtime of other threads, resumption, completion of the work, the unlock.
A technique built on "lock; xchg" has no opportunity for preemption inside the atomic operation. Therefore the bounded time is on the order of ns or us, but never ms or s. Well, I should never say never: I suppose a system with 100,000 hardware threads simultaneously beginning an xchg to the same cache line may have some threads hitting bound times in the ms range.
A software lock of a mutex may be implemented using "lock; xchg", or simply "xchg", since xchg implicitly includes lock.
>> If all cores request to lock to do the xchg, wouldn't they then need to be queued to operate in some exclusive order?
In effect yes, though the actual implementation may queue and/or block until the lock can be obtained.
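To make "a software lock implemented using xchg" concrete, here is a minimal sketch (the class name SpinLock is mine, not from the thread): std::atomic_flag::test_and_set compiles to a locked read-modify-write on x86, much like "lock xchg" or "lock bts".

```cpp
#include <atomic>

// Minimal spinlock sketch. test_and_set is an atomic exchange of 'true'
// into the flag; on x86 it is a LOCK'd read-modify-write.
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // Spin until the previous value was clear, i.e. we won the exchange.
        while (flag.test_and_set(std::memory_order_acquire)) { /* busy wait */ }
    }
    void unlock() {
        // Release is a plain store; no locked instruction is needed.
        flag.clear(std::memory_order_release);
    }
};
```

Note the spin loop is exactly the "block until lock can be obtained" behavior described above: the xchg itself is bounded, the loop around it is not.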
XCHG and LOCK-prefixed instructions are significantly slower than "equivalent" non-LOCK'd instructions. That was not the point of the discussion, though. The point relates to the MPSC "wait-free" algorithm. A "wait-free" algorithm is one where no thread entering the algorithm is (substantially) impeded in progressing through the algorithm.
This typically means that there is no software lock or spin-wait section that depends upon a different thread releasing the lock or, in the case of a spin-wait, completing the section of statements that satisfies the spin-wait condition. Wait-free algorithms may have one path or alternate paths dependent upon the progress of other threads, but in no event are any of these paths looped in a manner that results in non-progress.
The circumstance to be avoided is preemption of the thread holding the software lock (or having not yet satisfied the spin-wait condition) before completion of the unlock (or of the section the spin-wait watches). Should a preemption occur to a thread holding a software lock, then all other threads waiting for the lock, or for completion of the section subject to spin-wait, are impeded from progressing. Wait-free algorithms use no software locks (mutex, etc.) and no spin-waits (loop until something changes). Wait-free algorithms do, however, typically include LOCK-prefixed instructions and/or instructions with an implicit lock (XCHG), but written in a manner such that no thread is inhibited from progressing through the algorithm by the preemption of any other thread at any point within the algorithm.
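A sketch of the wait-free producer the original question describes (names and memory orders are my assumptions, loosely following the intrusive MPSC design often attributed to Dmitry Vyukov; the consumer side is omitted): the push is one unconditional exchange, with no loop and no software lock, so a preempted producer never blocks the others.

```cpp
#include <atomic>

// Intrusive node: the queue links caller-owned nodes, it does not allocate.
struct Node {
    std::atomic<Node*> next{nullptr};
};

class MpscQueue {
    std::atomic<Node*> head_{nullptr};  // most recently pushed node
public:
    void push(Node* n) {
        n->next.store(nullptr, std::memory_order_relaxed);
        // Single unconditional exchange: bounded work regardless of what
        // the other producers are doing. On x86 this is "xchg" (implicit LOCK).
        Node* prev = head_.exchange(n, std::memory_order_acq_rel);
        // Link the previous node to us; a full queue would also publish
        // the first node to the consumer here.
        if (prev) prev->next.store(n, std::memory_order_release);
    }
    Node* head() const { return head_.load(std::memory_order_acquire); }
};
```

The key property is visible in the shape of push(): straight-line code, no retry loop and no wait on another thread's progress.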
In a single-threaded environment it is much more efficient to not use XCHG, as well as to not use the LOCK prefix. The reference in the manual is not telling you to not use these instructions; rather, it is instructing you to use them only when circumstances require such use.
The LOCK prefix (implicit with XCHG) assures that the Read/Modify/Write performed by the protected instruction is atomic.
    add memLoc, 1          versus    lock; add memLoc, 1

    mov edx, eax           versus    xchg eax, memLoc
    mov eax, memLoc
    mov memLoc, edx

The version without lock has an issue when two or more threads are simultaneously performing the add to memLoc (or manipulating any other location in the same cache line as memLoc).
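The same point in portable C++ (my sketch, not the thread's code): std::atomic::fetch_add emits a LOCK'd read-modify-write ("lock xadd" on x86), so concurrent increments are never lost, whereas a plain ++ on a shared int would be the racy mov/add/mov sequence.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each thread performs 'iters' atomic increments; no update is ever lost,
// because each fetch_add is a single LOCK'd read-modify-write.
long locked_sum(int threads, int iters) {
    std::atomic<long> sum{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i)
                sum.fetch_add(1, std::memory_order_relaxed);  // lock xadd
        });
    for (auto& th : pool) th.join();
    return sum.load();
}
```

With a plain non-atomic long the result would be unpredictable under contention; with the locked form it is exactly threads * iters.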
The LOCK prefix only protects the immediately following instruction (provided it is in the list of instructions that support the prefix).
The prefix does incur a large delay. But large in this sense is relative.
{
    Lock lock(aLock);                  // ctor acquires the lock, e.g. via XCHG or CMPXCHG
    floatReducedSum += floatLocalSum;  // do something here
}                                      // dtor of Lock may simply do a write to aLock
The problem is preemption while the software lock is held. All other threads attempting to obtain the lock on aLock will hang for the duration of the preemption of the thread holding the lock.
In the above case, it is much better to call
float AtomicAdd(float* pf, float f)
{
    while(true) {
        float old_f = *pf;
        float new_float = old_f + f;
        int32_t new_val = *(int32_t*)&new_float, cmpRet = *(int32_t*)&old_f;
        if(CAS(reinterpret_cast<int32_t*>(pf), new_val, &cmpRet))
            return new_float;
    } // while(true)
}
The above code cannot be preempted while holding a lock, meaning it is wait-free. The overhead of the CAS (CMPXCHG : __sync_val_compare_and_swap) is on the order of that of obtaining the mutex in the Lock.
In the case of an integer there are various alternatives such as XADD (exchange-add) or LOCK; add, etc...
For double you would use
double AtomicAdd(double* pd, double d)
{
    while(true) {
        double old_d = *pd;
        double new_double = old_d + d;
        int64_t new_val = *(int64_t*)&new_double, cmpRet = *(int64_t*)&old_d;
        if(CAS(reinterpret_cast<int64_t*>(pd), new_val, &cmpRet))
            return new_double;
    } // while(true)
}
The above assumes you have properly written CAS overloads for int32_t and int64_t.
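For reference, a modern-C++ rendering of the same CAS-loop idea (my sketch, using std::atomic<T>::compare_exchange_weak rather than a hand-written CAS over reinterpreted integer bits; the instruction underneath is the same CMPXCHG):

```cpp
#include <atomic>

// Add v to *p atomically, returning the new value. Works for float,
// double, or any arithmetic T that std::atomic supports.
template <typename T>
T AtomicAdd(std::atomic<T>* p, T v) {
    T old_val = p->load(std::memory_order_relaxed);
    T new_val;
    do {
        new_val = old_val + v;
        // On failure, compare_exchange_weak reloads old_val for us,
        // so the loop retries with the freshly observed value.
    } while (!p->compare_exchange_weak(old_val, new_val));
    return new_val;
}
```

This avoids the hand-rolled bit reinterpretation and the separate int32_t/int64_t overloads, since std::atomic handles the comparison at the representation level.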
Thanks for the information and help.
Jim, you are correct on the preemption. My understanding of "wait-free", as it may be used in realtime systems, is that all threads are always able to make progress, and a wait-free op is guaranteed to complete within a bounded time frame. The preemption while holding a lock is what prevents locked code from being safe here: if one thread is descheduled while holding the lock, then it can prevent all other threads from making progress.
I do expect that wait-free algorithms perform slower than the single-threaded case, but they have the additional property that they can be used in realtime processing. From what you have said, I gather there is a bound on the time to process a lock xchg that depends on the hardware we are running on (the number of cores possibly calling lock xchg at the same time).
Thanks for verifying that the lock xchg can't be preempted. In addition to this, depending on how the hardware implements the "lock", it is possible that the result is a "lock-free" algorithm, not a "wait-free" one. I.e. if the hardware performs the lock xchg's in some FIFO-like order, then it is guaranteed that, say on an 8-core system, the worst-case lock time is bounded by the size of this FIFO. If, however, it is a priority queue, then maybe it is possible for a higher priority to prevent a lower-priority xchg from running in any bounded time, if more higher-priority xchg's keep getting generated.
I guess in practice using a lock xchg can be considered bounded, and it seems people assume this in the development of their algorithms. I am just trying to understand it.
The algorithm you provided for AtomicAdd is lock-free, not wait-free (my definition of lock-free is that at least one thread can make progress; the time to perform a task is not bounded). By using CAS in a loop, if another thread modifies the data before you commit, then it re-reads and tries the op again, and can continue to be starved by another thread continually changing the data before it commits (unlikely). However, at least one thread in the system, if scheduled, is making progress, which is not guaranteed with locks.
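The distinction drawn above can be shown side by side (a sketch; the function names are mine): a single hardware read-modify-write such as fetch_add does a bounded amount of work per call and is wait-free, while the CAS loop can in principle retry indefinitely under contention and is only lock-free.

```cpp
#include <atomic>

// Wait-free: one "lock xadd", no retry loop, bounded steps per call.
long wait_free_inc(std::atomic<long>& c) {
    return c.fetch_add(1, std::memory_order_relaxed) + 1;
}

// Lock-free: if other threads keep changing c between our load and our
// CAS, this loop retries; some thread always succeeds, but this one may
// be starved. Same shape as the AtomicAdd above.
long lock_free_inc(std::atomic<long>& c) {
    long old_v = c.load(std::memory_order_relaxed);
    while (!c.compare_exchange_weak(old_v, old_v + 1)) { /* retry */ }
    return old_v + 1;
}
```

For plain integer addition the wait-free form is available directly; the CAS loop is what you fall back to for operations (like the float/double sums above) that the hardware has no single RMW instruction for.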
The lock (mem inst), when executed concurrently against the same cache line, will have different bound times depending on the pecking order attained in lock arbitration. I am not informed as to how the pecking order is determined; I assume in a NUMA system the closer cores will win. The processor memory system design will determine fairness (or unfairness).
lock; xchg rax,[foo]
add rbx, rax    ; stall until rax returned

lock; xchg rax,[foo]
imul rcx, rdx   ; no stall here (does not use rax)
add rbx, rax    ; stall until rax returned

*** caution: whether there is no stall will depend on the implementation of the processor pipeline and memory system
*** the stall to your instruction stream generally occurs a) when you use the result of the fetch, or b) when the write buffering capability is exhausted.
The AtomicAdd illustrated is lock-free, and wait-free from the perspective that at least one thread will pass through without looping *** provided that all writes to the target address's cache line go through AtomicAdd or another atomic operation that does not disjoint the value of the target from the intentions of the design *** (iow you do not have one thread zeroing out the sum or using the sum for some other purpose at the same time).
You are missing the point here. The mul in the example is not a padding instruction; instead it is a useful instruction that you insert into the instruction stream which does not reference the return value of the operation (in this case rax).
For example, if you were to implement a formal lock to atomize a floating-point reduction sum:
lock; xchg rax,[yourLockFlag]
movss xmm1, [memLoc]    ; speculative load: does not depend on rax
addss xmm0, xmm1        ; speculative sum, usable once the lock test passes
In the above, you can speculatively execute the summation prior to testing the status of the return value of your attempt at gaining the lock
Note, a similar technique can be done in the wait-free method using the CAS; however, it is a bit more difficult since you have likely written a function for doing this. When the function is out-of-line it may be hard to insert instructions. When the function is inlined (and the CAS inlined), then you may get lucky with compiler optimizations that move instructions before the branch on CAS failure. This is unlikely, though, because most CAS implementations immediately take the ZF and write it to the output register (al, ax, eax, rax as the case may be). Really time-critical code could be written in assembler, and as long as the condition flags are not altered, code may continue to execute.
Attached is a simple test program illustrating the relative cost and benefit of lock instructions.
*** Please note that the test loops are essentially "do nothing" (other than possibly checking for consistency in result of operation).
*** The representative trip counts are NOT to be interpreted as the ratio of a loop without lock versus with lock.
*** Rather, the difference in trip counts IS to be interpreted as additional clock ticks per use of interlock (of type) per trip.