Showing results for 
Search instead for 
Did you mean: 

Cost of lock instruction for Xeon

I have been running into some performance issues with a 2.8Ghz SMP Xeon system. After some profiling I have found that the "lock" instruction (required for SMP) seems very expensive. Using the following function:
static inline int _osiCondSet32Locked(volatile unsigned *ptr, unsigned old,
unsigned replace)
int ok;
__asm __volatile("mov %2, %%eax;"
"movl $1, %0;" /* ok=1 */
"cmpxchgl %3, %1;" /* if(%eax==*ptr) *ptr=replace */
"jz 0f;" /* jump if exchanged */
"movl $0, %0;" /* ok=0 */
: "=&mr"(ok), "+m"(*ptr)
: "mr"(old), "r"(replace)
: "eax", "memory" );
return ok;
I found that it takes 12 cycles to run without the "lock" instruction, and 132 cycles with "lock". Are there any other instructions available that are more efficient on this architecture?
Gerrit Nagelhout
Tags (1)
0 Kudos
3 Replies
Black Belt

Gerrit -

The reason for the added time is due to the pipeline needing to be clear before (and after) the lock'ed instruction is executed. (See previous discussion at

If the lock is required for correct execution of the code, then you're kinda stuck with it. This is the overhead of parallel computing that everyone keeps harping about. You need to be sure that the performance improvement gained by running with multiple threads can overcome the performance hit of using things like the lock.

At first, I thought to suggest using one of the Win32 "Interlocked" functions, but, I suspect that you've probably implemented the algorithm or one better than what is used.

Anyone else have any suggestions for faster methods to do a compare and exchange operation? Is the lock really necessary for correct parallel execution of the cmpxchgl instruction?

-- clay


You need the LOCK prefix for correct execution per the previous discussion. You also need it for the pipeline flushes which seems to be the way memory barriers are implemented on IA-32. You need memory barriers. If you don't think you need them, you are most likely doing something wrong.

There are alternatives but it depends on what you are doing. There are what is known as distributed algorithms (not to be confused with distributed programming). Stuff like Dekker's algoritm, distributed counters, etc... Also lock-free reader/writer solutions which reduce or eliminate the need for synchronization for read access. RCU (Read, Copy, Update) which can only be used in the Linux kernel (for now) is an example. And of course using thread design patterns that reduce the need for interlocked access helps also.

as I am working on another lock-free algorithm as an alternative to locking. Unfortunately for now we have to use what I would consider an expensive C runtime function. This is more than offset by not incurring the high execution cost of interlocked instructions plus the usual benefits of lock-free.

If this works out, we could maybe get come changes to C runtime support and this new technique would really rock.