lock xchg performance

Brendon_C_ · ‎06-07-2013

I am looking at a published MPSC waitfree producer queue implementation that makes use of an atomic exchange. If I understand correctly on an intel platform this would be backed by a "lock xchg".

I am fairly fresh to this subject and trying to understand the implications. By claiming the producer is "wait-free" we are saying it has a bounded time frame in which it will complete. Is this possible with a "lock xchg"?

My understanding of a "lock xchg" is that it locks the bus somehow while it does the xchg (this is what I read somewhere). Does this mean then that the bound on time in this case is based on the number of cores on the hardware?

I.e. If all cores request to lock to do the xchg, wouldn't they then need to be queued to operate in some exclusive order?

Sorry if this is a silly question, I am fairly new to this and am just trying to understand implementations I have seen around.

Thanks,

Brendon.

jimdempseyatthecove · ‎06-08-2013

The bane of a software lock (e.g. mutex) is not the time to perform the lock, the work, the unlock.

Rather it is the time for: the lock, part of the the work, pre-emption of thread, runtime of other threads, resumption, completion of work, the unlock.

A technique build on "lock; xchg", has no opportunity for preemption. Therefore the bound time is on the order of ns or us, but never ms or s. Well I should never say never. I suppose a system with 100,000 hardware threads simultaneously beginning xchg to same cache line may have some threads hitting bound times in the ms range.

A software lock of a mutex may be implimented using "lock; xchg", or simply "xchg" since xchg implicitly includes lock.

>> If all cores request to lock to do the xchg, wouldn't they then need to be queued to operate in some exclusive order?

In effect yes, though the actual implimentation may be queue and/or block until lock can be obtained.

Jim Dempsey

SergeyKostrov · ‎06-08-2013

Here is a short follow up regarding Jim's statement: >>... Therefore the bound time is on the order of ns or us, but never ms or s... In Intel® 64 and IA-32 Architectures Optimization Reference Manual ( April 2012 ) on a Page 735 there is a statement: ... minimize the use of xchg instructions on memory locations ... Also, Latency and Throughput ( for different CPUs / in clock cycles ) are as follows: ... XCHG 1.5 1.5 1 1 ... XCHG reg, reg 1.5 2.5 1 1 1 ... I have not found any comments in these sections regarding a LOCK prefix for XCHG instruction.

SergeyKostrov · ‎06-08-2013

Here is a set of statements about the LOCK prefix in the same Manual: 1. ... 2.3.7 Enhancements for System Software In addition to microarchitectural enhancements that can benefit both application level and system-level software, Intel microarchitecture code name Nehalem enhances several operations that primarily benefit system software. Lock primitives: Synchronization primitives using the Lock prefix (e.g. XCHG, CMPXCHG8B) executes with significantly reduced latency than previous microarchitectures. ... 2. On a page 211: ... It is important to avoid operations that work against locality-enhancing techniques. Using the lock prefix heavily can incur large delays when accessing memory, regardless of whether the data is in the cache or in system memory. ... 3. ... B.3.4.3 Lock Contention Analysis The amount of contention on locks is critical in scalability analysis of multi-threaded applications. A typical ring3 lock almost always results in the execution of an atomic instruction. An atomic instruction is either an XCHG instruction involving a memory address or one of the following instructions with memory destination and lock prefix: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCH8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR or XADD. Precise events enable you to get an idea of the contention on any lock. Many locking APIs start by an atomic instruction in ring3 and back off a contended lock by jumping into ring0. This means many locking APIs can be very costly in low contention scenarios. To estimate the amount of contention on a locked instruction, you can measure the number of times the cache line containing the memory destination of an atomic instruction is found modified in another core. ...

jimdempseyatthecove · ‎06-09-2013

Sergey,

XCHG and LOCK prefixed instructions are significantly longer than "equivalent" non-LOCK'd instructions. This was not the point of the discussion. The point was relating to MPSC "wait-free"algorithm. A "wait-free" algorithm is one where no thread entering the algorithm is (substantially) impeaded in progressing through the algorithm.

This typically means that there is no software lock or spin-wait section that is dependent upon a different thread releasing the lock or in the event of spin-wait, completing the section of statements that satisfies the spin-wait condition. wait-free algorithms may have one path or alternate paths dependent upon progress activity of other threads, but in no event are any of these paths looped in a manner that results in non-progress.

The circumstance to be avoided is preemption of the thread holding the software lock or having not completed satisfaction of spin-wait condition prior to completion of unlock or satisfaction of spin-wait condition. Should a preemption occur to thread holding a software lock (or prior to completion of section subject to spin-wait), then all other threads waiting for lock or completion of section subject to spin-wait, are impeded from progression. Wait-Free algorithms use no software locks (mutex, etc...) and use no spin-waits (loop until something changes). The wait-free algorithms do however typically include use of LOCK prefixed instructions and/or instructions with implicit lock (XCHG) but written in a manner such that no thread is inhibited from progression through the algorithm by an event of preemption of any other thread at any point within the algorithm.

In a single threaded environment it is much more efficient to not use XCHG, as well as to not use the LOCK prefix. The reference in the manual is not to tell you to not use these instructions, rather it instructing you to use them only when circumstances require such use.

The LOCK prefixed instruction (implicit with XHG) assures that the Read/Modify/Write is consistent with atomicity of the instruction protected by the lock.

add memLoc, 1 verses LOCK;add memLoc, 1

or

mov edx, eax verses XCHG eax, memLoc
mov eax, memLoc
mov memLoc, edx

without lock has an issue when two or more threads are simulteaneously performing the add to memLoc (or minipulating any other location in the same cache line as memLoc).

The LOCK only protects the immediate instruction following the LOCK (provided it is in the list of supported protected instructions).

Jim Dempsey

SergeyKostrov · ‎06-09-2013

Jim, There was a simple question: >>... Is this possible with a "lock xchg"? and there is a simple answer in Intels docs: ... Using the lock prefix heavily can incur large delays when accessing memory, regardless of whether the data is in the cache or in system memory. ... I'm Not going to prove it, or disprove it, and I really trust what Intel software engineers say.

jimdempseyatthecove · ‎06-11-2013

Sergey,

The prefix does incur a large delay. But large in this sense is relative.

{
Lock lock(aLock); // which includes XCHG or CMPXCHG
floatReducedSum += floatLocalSum; // do something here
} // dtor of Lock lock may simply do a write to aLock

The problem is preemption while the software lock is held. All other threads attempting to obtain lock on aLock will hang for the duration of the preemption of the thread holding the lock.

In the above case, it is much better to call

float AtomicAdd(float* pf, float f)
{
union {
  float old_f;
  int32_t cmpRet;
};
old_f = *pf;
while(true)
{
  union {
   float new_float;
   int32_t new_val;
  };
  new_float = old_f + f;
  if(CAS(reinterpret_cast< int32_t *>(pf), new_val, &cmpRet))
   return old_f;
} // whild(true()
}

The above code cannot be preempted. Meaning it is wait-free. The overhead of the CAS (CMPXCHG : __sync_val_compare_and_swap) is on the order of that of obtaining the mutex in the Lock.

In the case of an integer there are various XCHGADD or LOCK;add, etc...
For double you would use

double AtomicAdd(double* pd, double d)
{
union {
  double old_d;
  int64_t cmpRet;
};
old_d = *pd;
while(true)
{
  union {
   double new_double;
   int64_t new_val;
  };
  new_double = old_d + d;
  if(CAS(reinterpret_cast< int64_t *>(pd), new_val, &cmpRet))
   return old_d;
} // while
}

The above assumes you have properly written CAS for overloads with int32_t and int64_t

Jim Dempsey

Brendon_C_ · ‎06-11-2013

Thanks for the information and help.

Jim you are correct on the pre-emption. My understanding of "wait-free" as it may be used in realtime systems is that all threads are always able to make progress, and a wait free op is guaranteed to complete within a bounded time frame. The premption while holding a lock is what prevents locked code from being safe here as if one thread is descheduled while holding the lock then it can prevent all other threads from making progress.

I do expect that waitfree algos perform slower than the single threaded case but have the additional property that they can be used in realtime processing. From what you have said I gather there is a bound to the time to process a lock xchg that depends on the hardware we are running on (number of cores possibly calling lock xchg at the same time).

Thanks for verifying that the lock xchg cant be pre-empted, but in addition to this, depending on how the hardware implements the "lock" it is posible that the result is a "lock-free" algo not a "wait-free" algo. I.e. If the hardware performs the lock xchg's in some fifo like order, then it is guaranteed that say on an 8 core system the worst case lock time is based on the size of this fifo. If however it is a priority queue, then maybe it is possible for a higher priority to prevent the lower priority xchg from running in any bounded time if more higher priority xchgs' get generated.

I guess in practice using a lock xchg can be considered bounded and it seems people assume this in the development of their algos. I am just trying to understand it.

The above algo you provided for AtomicAdd is lock-free not wait-free (my definition of lock free is that at least one thread can make progress, time to perform a task is not bounded). By using CAS in a loop, if another thread modifes the data before you commit, then it re-reads and tries the op again and can continue to be starved by another thread continually changing the data before it commits (unlikely). However at least one thread in the system if scheduled is making progress which is not guaranteed with locks.

SergeyKostrov · ‎06-11-2013

>>...The premption while holding a lock is what prevents locked code from being safe here as if one thread is descheduled >>while holding the lock then it can prevent all other threads from making progress... A much worse case is possible when a priority for a thread which doesn't hold the lock is boosted ( for example, due to some hardware interrupt linked to that thread ) well above priorities of all the rest threads including the thread that holds the lock.

jimdempseyatthecove · ‎06-12-2013

The lock (mem inst) when executed concurrently to the same cache line will have different bound times depending on pecking order attained in lock arbitration. I am not informed as to how the pecking order is determined, I assume in a NUMA system the closer cores will win. The processor memory system design will determine fairness (unfairness).

bound times

lock; xchg rax,[foo]
add rbx, rax ; stall until rax returned

lock; xchg rax,[foo]
mul rcx, rdx ; no stall here
add rbx, rax ; stall until rax returned

*** caution no stall will depend on the implimentation of the processor pipeline and memory system
*** the stall to your instruction stream generally occurs a) when you use the result of the fetch, or b) when the write buffering capability is exhausted.

The AtomicAdd illustrated is lock-free and wait-free from the perspective that at least one thread will pass through without looping *** provided that all the writes to the target address's cache line are in flight through the AtomicAdd or other atomic operation that does not dis-joint the value of the target from the intentions of the design *** (iow you do not one one thread zeroing out the sum or using the sum for other purpose at the same time).

Jim Dempsey

SergeyKostrov · ‎06-12-2013

>>bound times >> >>lock; xchg rax,[foo] >>add rbx, rax ; stall until rax returned >> >>lock; xchg rax,[foo] >>mul rcx, rdx ; no stall here >>add rbx, rax ; stall until rax returned This is a very good example. Thanks, Jim. And, how could somebody call the following piece of code as a "Waitfree"? ... lock; xchg rax, [foo] mul rcx, rdx mul rcx, rdx mul rcx, rdx mul rcx, rdx mul rcx, rdx mul rcx, rdx mul rcx, rdx mul rcx, rdx mul rcx, rdx mul rcx, rdx mul rcx, rdx mul rcx, rdx mul rcx, rdx mul rcx, rdx mul rcx, rdx mul rcx, rdx add rbx, rax ...

jimdempseyatthecove · ‎06-12-2013

Sergey,

You are missing the point here. The mul in the example is not a padd instruction, instead it is a useful instruction that you insert into the instruction stream that does not reference the return value of the operation (in this case the rax).

For example, if you were to impliment a formal lock to atomicize a floating point reduction sum

loop:
mov rax,1
lock; xchg rax,[yourLockFlag]
movss xmm1, [memLoc]
addss xmm0, xmm1
tst rax
jne loop
movss [memLoc],xmm0
mov [yourLockFlag],rax

In the above, you can speculatively execute the summation prior to testing the status of the return value of your attempt at gaining the lock

Jim Dempsey

jimdempseyatthecove · ‎06-12-2013

Note, a similar technique can be done in the wait-free method using the CAS, however it is a bit more difficult since you likely have written a function for doing this. When the function is out-of-line it may be hard to insert instructions. When the function is inlined (and CAS inlined) then you may get lucky with compiler optimizations that move instructions before the branch on CAS fail. This is unlikely though because most CAS implimentations immediately take the ZF and writes it to the output register (al, ax, eax, rax as the case may be). Realy time critical code could be written in assembler and as long as the condition flags are not altered, code may continue to execute.

Jim Dempsey

jimdempseyatthecove · ‎06-15-2013

Attached is a simple test program illustrating the relative cost and benefit of lock instructions.

*** Please note that the test loops are essentially "do nothing" (other than possibly checking for consistency in result of operation).
*** The representative trip counts are NOT to be interpreted as a ratio of a loop without lock verses with lock.
*** Rather the difference in trip counts ARE to be interpreted as additional clock ticks per use of interlock (of type) per trip.

Jim Dempsey