fb251
Beginner

Linux / 64 bits / Q6600 / very bad performance with mutex/spinlock

Hi all,

I'm a systems developer and I ran into some unexpected results, compared to my main computer (which is not Intel based), when testing on a Q6600 (quad core) under Ubuntu 8.04 64 bits.

I have a basic multi-threaded program (compiled with gcc 4.2.3) like this:

volatile int counter = 0 ;

void * worker (void * arg)
{
    register int i ;

    (void) arg ;

    for (i = 0 ; i < 10000000 ; i ++)
    {
        (* method_lock) () ;
        ++ counter ;
        (* method_unlock) () ;
    }

    return NULL ;
}

Where method_lock/unlock can be: pthread_mutex, pthread_spinlock, my_spinlock(*).

I created 16 threads using sched_setaffinity() to ensure each core will run 4 threads.

Results are:

pthread_mutex: 10.5s
pthread_spinlock: 384s (!)
my_spinlock: 9.8s

On my main computer (a dual core from the competitor @ 2.2GHz, but running Ubuntu 8.04 32 bits) under the same conditions (16 threads too), the results are:

pthread_mutex: 25s
pthread_spinlock: 91s
my_spinlock: 5.4s

These values are averages; this test has been run many times without significant variation. My mutex/spinlock was not aligned on a cache line boundary, but as it was the only user process running on the computer, I don't believe that explains these numbers.

I will use spinlocks for very short code (a few cycles) in server software.
Can anybody give me some hints or tests to do in order to improve thread synchronization for the Q6600 (I was expecting more performance from a quad core)?

(*) Uses a classical loop with "lock; cmpxchg %2,%1" and "pause"; see below:

int try_lock (atomic_t * atom)
{
    atomic_t old ;

    __asm__ __volatile__ (
        "lock; cmpxchg %2, %1"
        : "=a" (old)
        : "m" (* atom), "r" (-1), "a" (0)
        : "memory", "cc"
    ) ;

    return old ;
}

and:

void spin_lock (atomic_t * atom)
{
    register int i ;

    while (1)
    {
        if (! try_lock (atom))
            return ;

        for (i = 0; i < SPIN_COUNT; i++)
        {
            __asm__ __volatile__ ("pause") ;
        }

        if (! try_lock (atom))
            return ;

        sched_yield () ;
    }
}

54 Replies
jimdempseyatthecove
Black Belt

One problem is that your for loop should enclose the 2nd try_lock.

Just as your spin_lock has a SPIN_COUNT and a call to sched_yield is made if spinning too long, the pthread_spinlock will use a similar technique. In the 384s case I would guess the SPIN_COUNT is too low, thus causing excessive yields.

As for my_spinlock showing 9.8s on the quad core 2.4GHz system as opposed to 5.4s on the 2.2GHz dual core system, this might be explained by two factors:

1) the try_lock works faster on the 2.2GHz dual core system
2) you have twice the number of losers (collisions) on the 4-core system as you do on the 2-core system, thus try_lock fails more on the 4-core system.

You could instrument your code with something like this:

long long SpinCountSum = 0; // was MSVC __int64; long long under gcc
int SchedYieldSum = 0;

void spin_lock (atomic_t * atom)
{
    register int i = 0;

    while (1)
    {
        if (! try_lock (atom))
        {
            SpinCountSum += i; // lock on atom protecting SpinCountSum
            return ;
        }

        for (; i < SPIN_COUNT; i++)
        {
            __asm__ __volatile__ ("pause") ;
            if (! try_lock (atom))
            {
                SpinCountSum += i; // lock on atom protecting SpinCountSum
                return ;
            }
        }

        __sync_add_and_fetch (&SchedYieldSum, 1); // gcc equivalent of _InterlockedIncrement
        sched_yield () ;
    }
}

Jim Dempsey

fb251
Beginner

Thank you Jim for answering,

JimDempseyAtTheCove:
One problem is your for loop should enclose the 2nd try_lock.

Just as your spin_lock has a SPIN_COUNT and a call to sched_yield is made if spinning too long, the pthread_spinlock will use a similar technique. In the 384s case I would guess the SPIN_COUNT is too low, thus causing excessive yields.

Well the "384s case" is when using pthread_spinlock(), I haven't read the source code of pthread and so I don't know what could be the value of SPIN_COUNT! I believe my function "my_spinlock()" is efficient but I don't understand why the standard pthread Linux synchronization functions using spinlocks are completely out of the game and don't scale if I compare with my hardware.

For collisions, yes, it is a good idea, I will try to do more tests by using different values (actually it's just 100); but I've almost twice better performances using a dual core (with 16 threads) than a quad core using "my_spinlock" and except pthread_mutex it doesn't scale.

Still in the dark!

jimdempseyatthecove
Black Belt

On Windows there is a way to specify what the spin count is. On Linux there must be a similar facility. You can always step into the spinlock initialization to see what is happening, or rtfm (read the fine minutia).

The throughput of this test will not be typical of the throughput of your application. After you get the lock, do your short work and release the lock, then insert some work, then loop. The locked portion of your application should be small. If not, rework the code so it is small.

Jim Dempsey

Dmitry_Vyukov
Valued Contributor I

JimDempseyAtTheCove:

Just as your spin_lock has a SPIN_COUNT and a call to sched_yield is made if spinning too long, the pthread_spinlock will use a similar technique. In the 384s case I would guess the SPIN_COUNT is too low, thus causing excessive yields.




I think the situation is the opposite.
Scheduler yields are good in this situation, as is very long active spinning:

for (i = 0; i < SPIN_COUNT; i++)
{
    __asm__ __volatile__ ("pause") ;
}
if (! try_lock (atom))
    return ;

I think pthread_spinlock tries to acquire the lock much more frequently. Something like:

for (i = 0; i < SPIN_COUNT; i++)
{
    __asm__ __volatile__ ("pause") ;
    if (! try_lock (atom))
        return ;
}

This basically kills performance!
Under such a heavy workload, contention on the cache line is incredible. The more a thread yields or spins locally on the 'pause' instruction, the more it relieves contention on the cache line, the more useful forward progress is possible, and the less useless cache-coherence traffic there is.


fb251
Beginner

randomizer:

I think the situation is the opposite.
Scheduler yields are good in this situation, as is very long active spinning:

for (i = 0; i < SPIN_COUNT; i++)
{
    __asm__ __volatile__ ("pause") ;
}
if (! try_lock (atom))
    return ;


Thank you Dmitriy for your input,

From my point of view, it makes no sense to put a lock prefix on the bus during the spinlock loop. The logic is:

1. spinlocks are used to protect very short code (less than 10 instructions),
2. if I can't get the lock within a few hundred cycles, then it's better to relinquish control to other threads/processes,
3. if the number of threads/processes exceeds the number of cores, the probability of doing wasteful busy waiting increases.

So I believe my_spinlock is a good choice, as it scales well even when the number of threads/processes is larger than the number of cores.

Jim,

Your idea of counting the mean time before getting the lock in the spinlock loop is great! But sadly it's like quantum physics: I can't use a "lock; xadd" to measure the impact without perturbing cache coherency itself.

My conclusion: these timings are normal; synchronization cost increases linearly with the number of cores in the best case. I don't think there's a magical solution to optimize this kind of situation further (if there is, please tell me!).

Best regards
jimdempseyatthecove
Black Belt

The SPIN_COUNT is used as a time-out on the attempt to gain the lock (traditionally you yield to the scheduler on time-out). SPIN_COUNT is not used for reduction in processor overhead; it is used for reduction in latency for highly contested short sections of code. If reduction in processor overhead/interaction is of concern (as it impacts latency as well), then the proper procedure is to issue multiple pauses (either inline or as a loop) between each lock attempt, but the number of iterations is not SPIN_COUNT, as that represents the latency timeout and not the system interaction overhead.

// try_lock (atom); returns prior value of atom
// i.e. 0 if it was 0 and now we have the lock
// or 1 if it was 1 and someone else has the lock

if (! try_lock (atom))
    return ;
while (1)
{
    for (i = 0; i < SPIN_COUNT; i++)
    {
#if (PAUSE_COUNT > 1)
        for (j = 0; j < PAUSE_COUNT; j++)
        {
            __asm__ __volatile__ ("pause") ;
            if (! (* atom))
                if (! try_lock (atom))
                    return ;
        }
#else
        __asm__ __volatile__ ("pause") ;
        if (! (* atom))
            if (! try_lock (atom))
                return ;
#endif
    }
}

Each architecture may have differing pause requirements, and the number of pauses should be proportional to the expected time to run through the locked section.

Jim Dempsey

Dmitry_Vyukov
Valued Contributor I

fb251:

From my point of view, it makes no sense to put a lock prefix on the bus during the spinlock loop. The logic is:

1. spinlocks are used to protect very short code (less than 10 instructions),
2. if I can't get the lock within a few hundred cycles, then it's better to relinquish control to other threads/processes,
3. if the number of threads/processes exceeds the number of cores, the probability of doing wasteful busy waiting increases.

So I believe my_spinlock is a good choice, as it scales well even when the number of threads/processes is larger than the number of cores.


Your reasoning definitely makes sense!


fb251:

Your idea of counting the mean time before getting the lock in the spinlock loop is great! But sadly it's like quantum physics: I can't use a "lock; xadd" to measure the impact without perturbing cache coherency itself.


If you need to collect statistics about a synchronization primitive, then you can use thread-local partial counters and aggregate them at the end of the test. Something like:

int const thread_count = 10;

struct counter_t
{
    int value;
    char cache_line_pad [64];
};

counter_t counters [thread_count];


fb251:

My conclusion: these timings are normal; synchronization cost increases linearly with the number of cores in the best case,


In highly contended cases (micro-benchmarks) I usually observe super-linear performance degradation.
Something like this:

Scaling
1 processor: 1 (base case)
2 cores: 0.6
2 processors: 0.4
4 cores: 0.1

fb251:

I don't think there's a magical solution to optimize this kind of situation further (if there is, please tell me!).


This can be optimized if you have a substantial amount of read-only transactions.

Dmitry_Vyukov
Valued Contributor I

JimDempseyAtTheCove:
#if (PAUSE_COUNT > 1)
for (j = 0; j < PAUSE_COUNT; j++)
{
    __asm__ __volatile__ ("pause") ;
    if (! (* atom))
/////


Here you are assuming that read access to the cache line is costless. That's wrong.
If it's not the case that reads heavily dominate writes, then a read of shared memory has basically the same cost as a write to shared memory, or an atomic RMW on shared memory: ~200-300 cycles on modern Intel x86 multicore processors.
If reads heavily dominate writes, i.e. the cache line is already in all caches in S (shared) status, then a read of shared memory is costless.
So, in my opinion, fb251's original design makes some sense in the heavily contended case.

fb251
Beginner

Jim,

I completely agree with the notion of yield on time-out (may I write surrender? :-D), but when you're using a spinlock you know the wait will be very short (except if the L1 or L2 cache needs to be reloaded), which is why the logic is:

1. with luck, I get the lock on the first try,
2. leave some time for other threads to unlock without putting stress on the bus (and caches),
3. try again after a short delay (the number of cycles needed to reload the cache after a write),
4. if that fails, give another thread/process a chance to do something useful, as I can't do anything better than a NOP operation.

So far it seems to work. Even though the value of SPIN_COUNT is linked to the architecture, I've tried 100..350 for SPIN_COUNT without a noticeable difference in timings.
I've also tried a very aggressive loop using "lock; xadd" directly, which is optimal for incrementing a variable. On my computer (dual core, 16 threads, 160000000 iterations) the results are:

"lock; xadd": 4.4s
"my_spinlock": 4.7s

Considering the overhead of calling multiple functions in the my_spinlock case (i.e. lock/unlock) compared to an inline "lock; xadd", I believe it will be very hard to optimize further...

Dmitriy,

I've no problem with read access... but when it comes to write access :-) ...

Best regards
jimdempseyatthecove
Black Belt

Read access to your cache line is costly to your core when your cache line has been invalidated (not present due to eviction or modification by another thread in the coherency system).

Read access to your cache line is costless to other cores (or I should say less costly, depending on cache architecture) when your cache line has been invalidated (not present due to eviction or modification by another thread in the coherency system). The system architecture may require you to reach all the way back to memory, in which case you introduce contention for the memory bus. Or newer architectures reach across to another cache if that cache has current data, in which case it is less costly. And the reaching technique will vary in other ways (core to core in shared L2, core to core in the same die, chip to chip, NUMA...).

The question for you to answer is who pays, and how much.

The if(!atom) only incurs a penalty whenever atom had changed between prior test and current test. The LOCK; CMPXCHG introduces a higher burden on the system when it fails (i.e. is non-productive), when it succeeds the burden is acceptable to bear.

If atom had not changed then the loop would be less costly on the memory bus, less costly on the core examining atom, and less costly on the other cores interested in atom.

If atom has changed, and assuming the lock will succeed, there is a little more overhead to perform a read of atom then lock and cmpxchg of atom (although atom would not need to be re-read if it had not been invalidated by other cache).

If atom has changed, and assuming the lock will fail, then there is significant overhead in performing the read of atom and the failed lock; cmpxchg. But then you will only reach this situation when the read of atom indicates available but the lock;cmpxchg fails (small window). In the case when the atom changed and read indicates locked then you bypass the more costly lock;cmpxchg.

Only when contention is very high (usually only observable under a stress test) might you see atom change and still fail to obtain the lock.

Do you really need to protect the case of the stress test?

In the rare cases where the answer is yes, then on a case by case basis (or with instrumentation) you would determine the average worst-case time through the critical section (the atom-protected section, exclusive of thread stall), then determine the probability of where the owner's execution might be when you fail to obtain the lock (maybe half way, maybe not). Then issue the number of pauses that just exceeds this value (pause time varies from system to system). Note that this introduces a latency for your thread to obtain the lock. Now you have a situation of overhead saved versus latency introduced.

Nothing comes for free.

I would recommend against coding for a stress test condition and recommend coding for your real application.

Yea, if this is for a competitive benchmark, you code for the benchmark without regard to impact on the application or stress test.

Jim Dempsey

fb251
Beginner

Jim,

I've tried your suggestion, i.e.:

while (1)
{
    if (! (* atom))
        if (! try_lock (atom))
            return ;

    pause_loop () ;

    if (! (* atom))
        if (! try_lock (atom))
            return ;

    sched_yield () ;
}

The first if (! (* atom)) seems to be very costly (50% overhead compared with my first version); if I drop the first if (! (* atom)) there's no significant gain or overhead.

So I've made some statistics as Dmitriy suggested: each time I enter the pause_loop I increment a counter local to the thread, and at the end of the 160000000 iterations I compute the probability per iteration of entering pause_loop. The results are interesting:

If I use a SPIN_COUNT of 10, there's a 70% chance of entering pause_loop, which seems normal since it's too short a time for the cache to propagate a write; this test takes 24s.
With a SPIN_COUNT of 50, the probability drops to 17% and 9.6s.
With 100, probability is 5.7% and 6.56s.
With 200, probability is 4.3% and 6s.
With 350, probability is 3.9% and 5.7s.
With 500, probability is 2.9% and 4.8s.
With 1000, probability is 1.7% and 5.6s.

So the "good" value for SPIN_COUNT seems to be in the range 200-500, which is coherent with cache delays (at least for this test and my computer) and does not add too much latency.

I will try to do some tests on the Q6600 to see if results are in the same range.

I agree that this is an extreme test, but it is also the worst case a real-life application can encounter, which is why this discussion is not useless.

Best regards
Dmitry_Vyukov
Valued Contributor I

JimDempseyAtTheCove:

The question for you to answer is who pays, and how much.

The core which needs to modify the state of a cache line in another core's cache, or needs to fetch the cache line into its own cache. About 200-300 cycles on modern Intel multicore processors.


JimDempseyAtTheCove:

The if(!atom) only incurs a penalty whenever atom had changed between prior test and current test. The LOCK; CMPXCHG introduces a higher burden on the system when it fails (i.e. is non-productive), when it succeeds the burden is acceptable to bear.

If atom had not changed then the loop would be less costly on the memory bus, less costly on the core examining atom, and less costly on the other cores interested in atom.

If atom has changed, and assuming the lock will succeed, there is a little more overhead to perform a read of atom then lock and cmpxchg of atom (although atom would not need to be re-read if it had not been invalidated by other cache).

If atom has changed, and assuming the lock will fail, then there is significant overhead in performing the read of atom and the failed lock; cmpxchg. But then you will only reach this situation when the read of atom indicates available but the lock;cmpxchg fails (small window). In the case when the atom changed and read indicates locked then you bypass the more costly lock;cmpxchg.

Only when contention is very high (usually only observable under a stress test) might you see atom change and still fail to obtain the lock.

I do not agree here.

There is *always* a substantial cost associated with the additional read, no matter whether the lock succeeds or not.

Why? Because the core has to execute one cache-coherence transaction (200-300 cycles) to fetch the cache line in S status, and then execute another cache-coherence transaction (200-300 cycles) to promote the cache line to E status. This is true even if the lock succeeds.

And in the current situation it's unlikely that the cache line will stay in S status for long, because successful lock, unlock, and unsuccessful try_lock all promote the cache line to E/M status.

You can try the following test.

1. A bunch of threads constantly execute only the XCHG instruction on a shared memory location.

2. A bunch of threads constantly execute a mix of XCHG and plain loads (50/50) on a shared memory location.

You will see that the second case has substantially lower scalability, because it basically doubles the cache-coherence traffic.


JimDempseyAtTheCove:

Do you really need to protect the case of the stress test?

In the rare cases where the answer is yes, then on a case by case basis (or with instrumentation) you would determine the average worst-case time through the critical section (the atom-protected section, exclusive of thread stall), then determine the probability of where the owner's execution might be when you fail to obtain the lock (maybe half way, maybe not). Then issue the number of pauses that just exceeds this value (pause time varies from system to system). Note that this introduces a latency for your thread to obtain the lock. Now you have a situation of overhead saved versus latency introduced.

Nothing comes for free.

I would recommend against coding for a stress test condition and recommend coding for your real application.

Yea, if this is for a competitive benchmark, you code for the benchmark without regard to impact on the application or stress test.



The additional reads do have substantial overhead, no matter whether it's a benchmark or a real application. Well, yes, if the local work done by threads is around, for example, 1 second, then the mentioned overhead will be completely masked. But this doesn't mean that such a low-level and basic primitive as a mutex is allowed to incur avoidable overheads.

Dmitry_Vyukov
Valued Contributor I

fb251:

The first if (! (* atom)) seems to be very costly (50% overhead compared with my first version)


This agrees with what I wrote here:
http://softwarecommunity.intel.com/isn/Community/en-US/forums/permalink/30262467/30262467/ShowThread...

fb251:

So I've made some statistics as Dmitriy suggested: each time I enter the pause_loop I increment a counter local to the thread, and at the end of the 160000000 iterations I compute the probability per iteration of entering pause_loop. The results are interesting:

If I use a SPIN_COUNT of 10, there's a 70% chance of entering pause_loop, which seems normal since it's too short a time for the cache to propagate a write; this test takes 24s.
With a SPIN_COUNT of 50, the probability drops to 17% and 9.6s.
With 100, probability is 5.7% and 6.56s.
With 200, probability is 4.3% and 6s.
With 350, probability is 3.9% and 5.7s.
With 500, probability is 2.9% and 4.8s.
With 1000, probability is 1.7% and 5.6s.


Do I get it right that the more local spinning threads do without any access to shared data, the more scalable the algorithm?


fb251:

I agree that this is an extreme test, but it is also the worst case a real life application can encounter, it's why this discussion is not useless.


Indeed.

fb251
Beginner

randomizer:
Do I get it right that the more local spinning threads do without any access to shared data, the more scalable the algorithm?


From my tests on my AMD64x2 and the Intel Q6600, yes (I'm also surprised!). On the Q6600 the timings (around 7s) are almost equal with a SPIN_COUNT in the range 1000-10000. So I believe a good strategy would be to assign SPIN_COUNT a value like 300 x n, where n is the number of cores/processors.

I don't have other multicore systems, so I can't tell if it works right on Xeon for example, but I'm very interested in reading benchmarks on other processors using this technique.

Best regards
Dmitry_Vyukov
Valued Contributor I

fb251:
randomizer:
Do I get it right that the more local spinning threads do without any access to shared data, the more scalable the algorithm?


From my tests on my AMD64x2 and the Intel Q6600, yes


OK, then things work as expected.
Here I try to explain why this happens:
http://softwarecommunity.intel.com/isn/Community/en-US/forums/permalink/30262467/30262467/ShowThread...

fb251:

I don't have other multicore systems, so I can't tell if it works right on Xeon for example, but I'm very interested in reading benchmarks on other processors using this technique.


I think similar results will hold on most current SMP/multicore machines. Substantial differences may appear on heavily hardware-threaded processors like Sun Niagara/Niagara2: they have 4/8 cores, each with 8 hardware threads (and Sun Rock will have 16 hardware threads per core). Communication between hardware threads on the same core is extremely cheap.


Dmitry_Vyukov
Valued Contributor I

fb251:

I've no problem with read access... but when it comes to write access smiley [:-)]...


There is no way to make a heavy centralized write workload cheap or scalable, no matter what kind of mutex or lock-free techniques you use. If you have such a workload, then you had better consider ways to decentralize the work.
The easiest way is to partition the data. Choose some 'primary key' in the data elements, calculate a hash of that key, and assign data elements to threads/cores according to that hash. There are some variations. You can assign data to threads, so that only one thread can modify the data (so no mutexes; other threads can possibly read); or you can assign data to cores and bind a number of threads to each core, so that a number of threads can modify the data (here you still have to use mutexes, but the solution will be scalable).


fb251
Beginner

randomizer:
OK, then things work as expected.
Here I try to explain why this happens:
http://softwarecommunity.intel.com/isn/Community/en-US/forums/permalink/30262467/30262467/ShowThread...


Yes, you're right. In this kind of situation "over-optimization" can fool you; the best approach is still "KISS" (which is my speciality ;-) )

randomizer:
fb251:

I don't have other multicore systems, so I can't tell if it works right on Xeon for example, but I'm very interested in reading benchmarks on other processors using this technique.


I think similar results will hold on most current SMP/multicore machines. Substantial differences may appear on heavily hardware-threaded processors like Sun Niagara/Niagara2: they have 4/8 cores, each with 8 hardware threads (and Sun Rock will have 16 hardware threads per core). Communication between hardware threads on the same core is extremely cheap.


Interesting. But sadly I don't have enough hardware to run tests. I think my_spinlock is a good candidate to replace pthread_spinlock at least, but I'm not too sure it can fit all SMP/multicore architectures.
fb251
Beginner

randomizer:
[...] You can assign data to threads, so that only one thread can modify the data (so no mutexes; other threads can possibly read); or you can assign data to cores and bind a number of threads to each core, so that a number of threads can modify the data (here you still have to use mutexes, but the solution will be scalable).


Well, I use a different approach. I don't like massive threading, so I use event-based processes (it's server software). Each request has its own pool of memory, which doesn't need a mutex, and scheduling is done on state changes rather than time slices; it's efficient and fits well with multicore architectures as there's little synchronization work, but it's a lot of programming work compared to the "worker thread model". So, for me, a core == a process, and I take care of the state scheduling inside the process. But I still need to access shared memory, of course, which is why I'm doing some work on spinlocks right now ;-)
Dmitry_Vyukov
Valued Contributor I

fb251:
randomizer:
[...] You can assign data to threads, so that only one thread can modify the data (so no mutexes; other threads can possibly read); or you can assign data to cores and bind a number of threads to each core, so that a number of threads can modify the data (here you still have to use mutexes, but the solution will be scalable).


Well, I use a different approach. I don't like massive threading, so I use event-based processes (it's server software). Each request has its own pool of memory, which doesn't need a mutex, and scheduling is done on state changes rather than time slices; it's efficient and fits well with multicore architectures as there's little synchronization work, but it's a lot of programming work compared to the "worker thread model". So, for me, a core == a process, and I take care of the state scheduling inside the process. But I still need to access shared memory, of course, which is why I'm doing some work on spinlocks right now ;-)


I was talking exactly about this 'shared memory'. You can try to partition it too, or replace it with efficient message passing. If your 'shared memory' is, for example, statistics, then every thread can maintain private statistics which are periodically aggregated.


jimdempseyatthecove
Black Belt

fb251,

Your expression of my suggestion was not complete.
A better version, try:

if (! try_lock (atom))
    return ;

while (1)
{
    if (! (* atom))
        if (! try_lock (atom))
            return ;

    pause_loop () ;

    if (! (* atom))
        if (! try_lock (atom))
            return ;

    sched_yield () ;
}

This avoids the if (! (* atom)) test at the beginning of your attempt to obtain the lock.

Now for my complete example, re-expressed in your programming style, try:

while (try_lock (atom))
{
    for (i = 0; i < SPIN_COUNT; i++)
    {
        if (! (* atom))
            if (! try_lock (atom))
                return ;

        pause_loop () ; // not SPIN_COUNT number of iterations
    }

    if (! (* atom))
        if (! try_lock (atom))
            return ;

    sched_yield () ;
}
return ;
Jim Dempsey