Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Complexity rank of cache locking



I know CPU cycles needed by locking vary, but I need some general picture about how heavy is cache locking. Particularly, for P6+ chip, what rank of the number of cycles consumed by LOCK BTS / INC / DEC would be, if the operand is already cashed memory? By rank I mean, would it be like 10 or rather 100?

0 Kudos
3 Replies
Black Belt

I would suggest you run a test that approximates your intended use and load conditions. 

LOCK BTS / INC / DEC / XCHG / CMPXCHG / ... execution times are not deterministic. Should two threads compete for the same cache line, one executes first, the other second. The second takes longer. On a non-contended the amount of time will vary too depending on the number of processors (sockets) and the location (same/other memory node) and other cache state nuances.

If possible, use a reduction variable. This is a local copy of the results (on stack of thread) that require the LOCKing operation only upon exit of the loop.

Jim Dempsey


Thank you for answer. The problem is, I want the code to be used on any modern PC machine, but o course I do not have possibility to check each model of processor. I believe, Intel should do some estimations about that. In other case, why even mention about a difference between bus lock and cache lock? Without some performance data, you can't take it seriously in your advantage.

Unfortunately I can't use local copies as I care mostly about incrementing and decrementing semaphores. -- I will have terribly many of those, and I can't do much about it. A proper semaphore must be down, when almost any method in the final application is called. So I need to do two such locked operations per each method call. And can't do it smartly, only in proper places, as methods are provided from outside and there are thousands of them, thus it's out of my possibilities to analyze all flows.

The most important case for me is, when there is no competition for the same cache line. There are around 1000 byte spaces between semaphores, so there is only one per cache line. And thus simultaneous accesses will be extremely rear. Maybe this determinize a little bit the expected time?

Black Belt

>> I want the code to be used on any modern PC machine

So, you are talking single processor machine? (IOW single socket)

Write up a little stress test program using something built on:


volatile int yourSemaphore = -1; // no owner
// owners have ID's 0:n, n > 0
do {
  while(yourSemaphore >= 0)
  yourSemaphore = myID_ge_0; // tentatively acquire, other thread may do the same
  _mm_sfence(); // may not be required on single socket
  _mm_pause(); // short delay before verify test
} while(yourSemaphore != myID_ge_0);
... we own code
ASSERT(yourSemaphore == myID_ge_0); // _DEBUG tests assertion of ownership
yourSemaphore = -1; // release

Rework the above code into a stress test application (all theads competing for same, then oversubscribe by 16x threads).
Note, stress test requires some time in "... we own code".

Then benchmark same stress test using LOCKing method.

 Jim Dempsey