Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Linux / 64-bit / Q6600 / very bad performance with mutex/spinlock

fb251
Beginner
4,536 Views
Hi all,

I'm a system developer and ran into some unexpected results, compared to my main computer (which is not Intel based), when testing on a Q6600 (quad core) using Ubuntu 8.04 64-bit.

I have a basic multi-threaded program (compiled with gcc 4.2.3) like this:

volatile int counter = 0 ;

void * worker (void * arg)
{
    register int i ;

    (void) arg ;

    for (i = 0 ; i < 10000000 ; i ++)
    {
        (* method_lock) () ;
        ++ counter ;
        (* method_unlock) () ;
    }

    return NULL ;
}

Where method_lock/unlock can be: pthread_mutex, pthread_spinlock, my_spinlock(*).

I created 16 threads using sched_setaffinity() to ensure each core will run 4 threads.

Results are:

pthread_mutex: 10.5s
pthread_spinlock: 384s (!)
my_spinlock: 9.8s

On my main computer (dual core from the competitor @ 2.2 GHz, but running Ubuntu 8.04 32-bit) under the same conditions (16 threads too), results are:

pthread_mutex: 25s
pthread_spinlock: 91s
my_spinlock: 5.4s

These values are averages; the test has been run many times without significant variation. My mutex/spinlock was not aligned on a cache line boundary, but since it was the only user process running on the computer I don't believe that alone explains these numbers.

I will use spinlocks for very short code paths (a few cycles) in server software.
Can anybody give me some hints or tests to run in order to improve thread synchronization on the Q6600 (I was expecting more performance from a quad core)?

(*) Uses a classical loop with "lock; cmpxchg %2,%1" and "pause"; see below:

int try_lock (atomic_t * atom)
{
    atomic_t old ;

    __asm__ __volatile__ (
        "lock; cmpxchg %2, %1"
        : "=a" (old)
        : "m" (* atom), "r" (-1), "a" (0)
        : "memory", "cc"
    ) ;

    return old ;
}

and:

void spin_lock (atomic_t * atom)
{
    register int i ;

    while (1)
    {
        if (! try_lock (atom))
            return ;

        for (i = 0 ; i < SPIN_COUNT ; i ++)
        {
            __asm__ __volatile__ ("pause") ;
        }

        if (! try_lock (atom))
            return ;

        sched_yield () ;
    }
}

0 Kudos
54 Replies
jimdempseyatthecove
Honored Contributor III
1,449 Views

fb251,

Can you post your test app (either the source or a .zip of the source) so I can run it on my Q6600 and incorporate my suggestions? This will help me post actual data against your data, as opposed to hypothetical data.

Jim Dempsey

fb251
Beginner
1,449 Views
Of course Jim, thank you for your help. You will find the code below in one "file" (as attachments are not allowed on this forum); please note that the test program (i.e. worker() and main()) is not production code but simply scratch.

To compile under Linux, use:

gcc -O3 -o atomic2 atomic2.c -lpthread

Usage: ./atomic2 1|2|3|4

where 1: pthread_mutex, 2: pthread_spinlock, 3: my_spinlock, 4: simple xadd

Best regards

x-x-x-x-x-x-x-x-x-x
/*
Copyright 2005-2008 François Battail

This file is governed by the CeCILL license version 2 under French
law and abiding by the rules of distribution of free software. You can
use, modify and/or redistribute the software under the terms of the
CeCILL version 2 license as circulated by CEA, CNRS and INRIA at the
following URL "http://www.cecill.info/".

As a counterpart to the access to the source code and rights to copy,
modify and redistribute granted by the license, users are provided only
with a limited warranty and the software's author, the holder of the
economic rights, and the successive licensors have only limited
liability.

In this respect, the user's attention is drawn to the risks associated
with loading, using, modifying and/or developing or reproducing the
software by the user in light of its specific status of free software,
that may mean that it is complicated to manipulate, and that also
therefore means that it is reserved for developers and experienced
professionals having in-depth computer knowledge. Users are therefore
encouraged to load and test the software's suitability as regards their
requirements in conditions enabling the security of their systems and/or
data to be ensured and, more generally, to use and operate it in the
same conditions as regards security.

The fact that you are presently reading this means that you have had
knowledge of the CeCILL license version 2 and that you accept its terms.
*/

# define _GNU_SOURCE

# include <stdio.h>
# include <stdlib.h>
# include <stdint.h>
# include <string.h>
# include <errno.h>
# include <unistd.h>
# include <pthread.h>
# include <sched.h>
# include <sys/types.h>
# include <sys/syscall.h>


/* -------------------------------------------------------------------------------------------------
Atomic operations
*/

# define SPIN_COUNT (350)
# define ATOMIC_INITIALIZER (0)

# define NBR_THREADS 16

typedef volatile uintptr_t atomic_t ;

/*
If (* atom) is free (0) then we write -1 in one atomic operation.
Returns the previous value, if it's zero we have the lock.
*/
extern atomic_t my_try_lock (atomic_t * atom) ;


/*
Simply free the lock by writing 0. Of course, atom needs to be locked prior
calling this function. Returns a non zero value if lock was freed.
*/
extern atomic_t my_unlock (atomic_t * atom) ;


/*
Busy wait to acquire the lock. The "pause" instruction is
here to exploit hyperthreading. If we cannot get the lock
after the spin loop we ask the kernel to reschedule our process.
From our tests, this function scales very well even if the
number of running threads / processes exceeds the number of
CPU compared to pthread_mutex or pthread_spinlock.
*/
extern void my_spin_lock (atomic_t * atom, int loc) ;


/*
Increment a counter.
*/
extern void my_atomic_inc (atomic_t * atom) ;


/*
Add a value to a counter.
*/
extern void my_atomic_add (atomic_t * atom, unsigned int add) ;



/*
If (* atom) is free (0) then we write -1 in one atomic operation.
Returns the previous value, if it's zero we have the lock.
*/
atomic_t my_try_lock (atomic_t * atom)
{
atomic_t old ;

__asm__ __volatile__
(
" lock; cmpxchg %2, %1"
: "=a" (old)
: "m"(* atom), "r"(-1), "a" (0)
: "memory", "cc"
) ;

return old ;
}


/*
Simply free the lock by writing 0. Of course, atom needs to be locked prior
calling this function. Returns a non zero value if lock was freed.
*/
atomic_t my_unlock (atomic_t * atom)
{
atomic_t old ;

__asm__ __volatile__
(
" lock; cmpxchg %2, %1"
: "=a" (old)
: "m"(* atom), "r"(0), "a" (-1)
: "memory", "cc"
) ;

return old ;
}

/*
Try to grab statistics on entering the pause loop
*/
static unsigned long long spin_count [NBR_THREADS] = { 0 } ;


/*
Busy wait to acquire the lock. The "pause" instruction is
here to exploit hyperthreading. If we cannot get the lock
after the spin loop we ask the kernel to reschedule our process.
From our tests, this function scales very well even if the
number of running threads / processes exceeds the number of
CPU compared to pthread_mutex or pthread_spinlock.
*/
void my_spin_lock (atomic_t * atom, int loc)
{
register int i ;

while (1)
{
if (! my_try_lock (atom))
return ;

for (i = 0; i < SPIN_COUNT; i++)
{
__asm__ __volatile__
(
" pause"
) ;
}

spin_count [loc] ++ ;

if (! my_try_lock (atom))
return ;

sched_yield () ;
}
}


/*
Increment a counter.
*/
void my_atomic_inc (atomic_t * atom)
{
__asm__ __volatile__
(
" lock; xadd %0, %1;"
: /* No output */
: "r" (1), "m" (* atom)
: "memory", "cc"
) ;
}


/*
Add a value to a counter.
*/
void my_atomic_add (atomic_t * atom, unsigned int add)
{
__asm__ __volatile__
(
" lock; xadd %0, %1;"
: /* No output */
: "a" (add), "m" (* atom)
: "memory", "cc"
) ;
}


/*
Benchmark for comparing the different methods pthread vs/ our method
*/

atomic_t test = ATOMIC_INITIALIZER ;
pthread_mutex_t mux = PTHREAD_MUTEX_INITIALIZER ;
pthread_spinlock_t slock ;

int counter = 0 ;

/*
Code protected by our spinlock
*/
int dumb (void)
{
counter ++ ;
}

int instance = 0 ;

/* XXX WTF! sched.h is a complete mess, need to use internal macro in bits/sched.h */

void * worker (void * arg)
{
cpu_set_t cpus ;
int n_cpu ;
register int i ;
char m = (char)(uintptr_t) arg ;
int loc = instance ;

if (sched_getaffinity (syscall (SYS_gettid), sizeof (cpu_set_t), & cpus))
perror ("sched_getaffinity") ;

// n_cpu = __CPU_COUNT (sizeof (cpu_set_t), & cpus) ;
n_cpu = __sched_cpucount (sizeof (cpu_set_t), & cpus) ;


__CPU_ZERO_S (sizeof (cpus), & cpus) ;

__CPU_SET_S (instance % n_cpu, sizeof (cpus), & cpus) ;
if (sched_setaffinity (syscall (SYS_gettid), sizeof (cpu_set_t), & cpus))
perror ("sched_setaffinity") ;

instance ++ ;

printf ("CPU used: %d/%d\n", sched_getcpu (), n_cpu) ;

if (m == '3')
{
for (i = 0 ; i < 10000000 ; i ++)
{
my_spin_lock (& test,loc) ;
dumb () ;
my_unlock (& test) ;
}
}
else if (m == '2')
{
for (i = 0 ; i < 10000000 ; i ++)
{
pthread_spin_lock (& slock) ;
dumb () ;
pthread_spin_unlock (& slock) ;
}
}
else if (m == '1')
{
for (i = 0 ; i < 10000000 ; i ++)
{
pthread_mutex_lock (& mux) ;
dumb () ;
pthread_mutex_unlock (& mux) ;
}
}
else
{
for (i = 0 ; i < 10000000 ; i ++)
my_atomic_inc (& counter) ;
}

return NULL ;
}


int main (int argC, char * * argV)
{
pthread_t id [NBR_THREADS] ;
int i ;

if (argC != 2)
{
wrong:
printf ("Usage: atomic 1|2|3|4\n") ;
exit (1) ;
}

switch (argV [1][0])
{
case '1':
printf ("Atomic operations, using pthread_mutex\n") ;
break ;

case '2':
printf ("Atomic operations, using pthread_spinlock\n") ;
break ;

case '3':
printf ("Atomic operations, using my_spin_lock\n") ;
break ;

case '4':
printf ("Atomic operations, using my_atomic_inc\n") ;
break ;

default:
goto wrong ;
}

pthread_spin_init (& slock, 0) ;

for (i=0; i < NBR_THREADS ; i ++)
pthread_create (& id [i], NULL, worker, (void *)(uintptr_t) argV [1][0]) ;

for (i=0; i < NBR_THREADS ; i ++)
pthread_join (id [i], NULL) ;

printf ("The counter is: %d\n", counter) ;

if (argV [1][0] == '3')
{
double g = 0.0 ;

printf ("SPIN_LOCK statistics:\n") ;
for (i = 0 ; i < NBR_THREADS ; i ++)
{
g += spin_count [i] ;
}

g /= 160000000.0 ;
printf ("Average: %f\n", g) ;
}
}



fb251
Beginner
1,449 Views
Some analysis with another data:

while (1)
{
    // cost: ~ 200 cycles with lock on bus
    if (! try_lock (atom))
        return ;

    // cost: SPIN_COUNT cycles without any impact on other cores or bus (NOP)
    for (i = 0 ; i < SPIN_COUNT ; i ++)
        pause () ;

    // cost: ~ 200 cycles with lock on bus
    if (! try_lock (atom))
        return ;

    // cost: ?
    sched_yield () ;
}


With 16 threads on my dual core there is:

  • a probability to call the pause loop of: 3.4%
  • a probability to call sched_yield of: 1.2%

That means that 96.6% of the time it just costs ~200 cycles, like a lock xadd. If the first try fails, it will cost ~200 cycles + SPIN_COUNT cycles, then another ~200 cycles to retry the lock. If it still fails (1.2%) we will retry later, after rescheduling.

If you comment out the sched_yield() line you will see a 5x factor on execution time, because in that case we are busy-waiting on a lock held by a descheduled thread until the scheduler does its job, so giving up the CPU is the better thing to do at that point.

If you comment out the pause loop and the second try_lock (but not sched_yield()) you will see an improvement (10-15%), because 3.4% of the time we avoid a second lock attempt that fails 33% of the time.

So from my tests the most efficient strategy is a spin lock without a spin! That means we drop SPIN_COUNT, which is architecture dependent; here's the code:

void my_spin_lock (atomic_t * atom)
{
    while (1)
    {
        if (try_lock (atom))
            sched_yield () ;
        else
            break ;
    }
}

It was tested with 16 threads and 2 threads (the latter being the number of cores of my computer) and it scales.

I'm awaiting your comments and remarks from your tests!

Best regards
Dmitry_Vyukov
Valued Contributor I
1,449 Views
fb251:

So from my tests the most efficient strategy is a spin lock without a spin! That means that we drop SPIN_COUNT which is architecture dependant, so here's the code:
I'm awaiting your comments and remarks from your tests!



I tested a number of mutex and rw_mutex implementations on a Q6600, and concluded that a basic spin mutex with a passive spin (sched_yield()/SwitchToThread()) is the winner among all traditional mutex and rw_mutex implementations, provided critical sections are small.
I am not sure, but it seems that all those active spins were born in Jurassic times on quite obsolete hardware in the HPC community. At least I haven't seen any micro-benchmark where an active spin mutex was the winner. I see only one scenario where they can be appropriate: if you create strictly one thread per core, and bind every thread to its core.

jimdempseyatthecove
Honored Contributor III
1,449 Views

fb,

Working on converting your sample program for use on Windows. I have some work to do converting the thread/process calls to their Windows flavors.

One comment comes to mind though.

In a compute-only application, depending on flow, it may not be beneficial to use more threads than available cores. Therefore a stress test using 16 threads on a 4-core system might not be a realistic performance test. Such a stress test should be run to check for deadlocks.

When using 16 threads on a 4-core system, and using your sample program, there is a significant probability of a context switch occurring in the mutex and spin lock tests while the mutex/lock is held. Whereas on a 4-thread, 4-core system the probability of a context switch while holding the mutex/lock is comparatively low.

In the 16-on-4 environment, when a thread holding the mutex/lock is switched out, using yield will reduce the latency until the thread holding the mutex/lock runs again and completes the critical section.

In the 4-on-4 environment a non-yielding technique may prove best.

Therefore, although your test illustrated better performance using one technique over the other, you must be cognizant that the test results are only valid for the circumstances under which the tests were run and NOT necessarily valid under the circumstances in which your application is run.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III
1,449 Views

fb251,

I got a version of your test program running on Windows XP Professional x64. It required some changes to use Windows threads in place of POSIX threads. I did not set the affinity (too lazy at this time).

Used a Windows process mutex for pthread_mutex
Used a Windows CriticalSection for pthread_spinlock
Ran with the same count

Made two sets of runs, one with 4 threads, and one with 16 threads

4 threads

mutex         154.97
spincount       3.8
my_spincount    3.72
my_inc          2.15

16 threads

mutex         606.69
spincount      13.29
my_spincount   14.22
my_inc          8.50

When my system initializes the SpinCount of the CriticalSection it is initialized to 0. I also overrode that count and set it to your SPIN_COUNT (350); the CriticalSection time went from 13.29 to 91.32.

Note, my code waits for all threads to complete before marking time (as opposed to waiting for the master thread to make it through). Therefore the 16-thread test is pushing 4x the data through as the 4-thread test.

Jim Dempsey

fb251
Beginner
1,449 Views
Jim,

Thank you for doing these tests. My program was not intended to run on Windows or even to be portable to another operating system, so I believe the porting was hard work.
Your results are interesting, but I have some remarks.

1. Affinity can be an important issue (I was also too lazy at first to make these system calls!).
2. From your tests, in this case, mutexes seem best avoided under Windows, just as from my tests pthread_spin_lock should be avoided under Linux. It's a good thing to know.

About the test case: I'm coding server software, not an application; that means I'm after short completion times and low latency. 16 threads sharing common data under a critical section on only two cores is an extreme situation, I agree, but what's interesting is that we can know the minimum time needed simply by using "lock xadd", since in that case only the hardware is at work.

I was very surprised by the probability of getting the lock on the first try: around 96.4% on my dual core. That means that even in such an extreme situation the hardware performs very well; but when it fails, it fails badly too! Without the sched_yield() call, execution time increases by a factor of 5 even though the probability of this case is only 1.2%! Of course with a quad core it's worse, since there's more contention for the bus.

I haven't programmed for Windows in a long time and I'm not aware of its internals; in Linux we now have the "fair scheduler" and it seems to be very efficient in this particular case. Maybe the scheduler strategy of Windows XP Professional x64 differs from that of a Windows server edition?

I'm not too sure I understand your last sentence (as you can tell, I'm not a native English reader/writer): in my program I use pthread_join(), so in any case I wait for the last thread to complete, and only then does the main thread take a timestamp.

Best regards
fb251
Beginner
1,449 Views
JimDempseyAtTheCove:
When my system initializes the SpinCount of the CriticalSection it is initialized to 0. I also overrode that count and set it to your SPIN_COUNT (350); the CriticalSection time went from 13.29 to 91.32.


If the documentation on MSDN is still accurate:

MSDN:
[...] On multiprocessor systems, if the critical section is unavailable, the calling thread spins dwSpinCount times before performing a wait operation on a semaphore associated with the critical section. If the critical section becomes free during the spin operation, the calling thread avoids the wait operation.


(emphasis is mine)

So from your numbers and MSDN it is likely that EnterCriticalSection() looks to something like this:

if (! try_lock ())
    return ;

for (i = 0 ; i < dwSpinCount ; i ++)
{
    pause () ;
    if (! try_lock ())
        return ;
}

sched_yield () ; // or whatever

Which is very costly if we enter the pause/try_lock loop; but since by default the spin count is zero, this function's behaviour should be almost the same as "my_spinlock".

I've done some searching about Windows mutexes and it appears that they are known to be CPU hogs because they're implemented in kernel space.

Thank you for your tests; even if the subject of this thread was Linux..., Windows developers now also have some valuable information when it comes to synchronization timings on multi-core processors.

Best regards
Dmitry_Vyukov
Valued Contributor I
1,449 Views
fb251:

I've done some search about Windows mutexes and it appears that they are known to be a CPU hog because they're implemented in kernel space.


Windows's Mutexes, Semaphores, Events and Waitable Timers are named kernel objects, and intended for inter-process communication. They are NOT intended for intra-process communication.

Windows's synchronization objects for intra-process communication are:
CRITICAL_SECTION
SRWLock (slim reader-writer mutex)
ConditionVariable (can be used with CRITICAL_SECTION or SRWLock)
InitOnce
InterlockedSList (efficient lock-free lifo stack)
And the various Interlocked functions


fb251
Beginner
1,449 Views
randomizer:
Windows's Mutexes, Semaphores, Events and Waitable Timers are named kernel objects, and intended for inter-process communication. They are NOT intended for intra-process communication.


I'm not a Windows specialist, but using shared memory it should be possible to call user-space synchronization functions for inter-process communication. That's what I'm doing on Linux to avoid costly kernel calls.
jimdempseyatthecove
Honored Contributor III
1,449 Views

fb251,

I am quite familiar with affinity, as I have written a threading toolkit that supports object-level thread affinity. For this sample program, setting affinity would not have affected results much, seeing that all threads are running the same code section and referencing the same data.

Mutex is a process-to-process exclusion technique, whereas critical section (a grander version of a spinlock) is meant for intra-process use (within one process, i.e. thread to thread). Select the tool designed for the purpose.

The underlying reason sched_yield() improves throughput in your test is that the test has an oversubscription of worker threads to cores. On a server, one is typically not in control of the number of processes running, whereas each process may be in control of the number of threads it spawns. Therefore, the server environment (when run as a server) will typically see more running threads than available cores. In this case a yield is warranted. On an application system (non-server) other processes run (e.g. browser, mp3 player, ...) but oversubscription is intermittent. In this environment it is hard to say whether yields produce better or worse times.

The principal problem with the spinlock, work, unlock technique is that regardless of how short the work is, there is a finite probability of the thread owning the lock being context-switched out.

Several weeks ago, I made a suggestion to someone at Intel to consider adding an instruction prefix that, when used with a lock;cmpxchg, and if the cmpxchg succeeded, would have the processor disable interrupts for a short duration (to be determined). The temporary interrupt inhibit would stay in place until a) a short time interval elapsed, b) a cmpxchg is issued (at the unlock), or c) to avoid abuse, a subsequent temporary-interrupt-off prefix is issued, in which case it could behave as if a pause were inserted in front of the second temporary-interrupt-off prefix.

The purpose of the temporary interrupt-off prefix is that you can traverse the lock, work, unlock sequence with virtually no probability of suspending the thread holding the lock, and thus avoid unnecessary context switches.

Jim Dempsey

Dmitry_Vyukov
Valued Contributor I
1,449 Views
fb251:
randomizer:
Windows's Mutexes, Semaphores, Events and Waitable Timers are named kernel objects, and intended for inter-process communication. They are NOT intended for intra-process communication.


I'm not a Windows specialist but using shared memory it should be possible to call user space synchronization functions for inter-process communications. It's what I'm doing on Linux to avoid doing costly kernel calls.


Yes, this is a possible solution for cooperating processes. But Windows kernel synchronization objects are named, so they can be used in non-cooperative environments. For example, a commonly used program (WinAMP) can create a named mutex "WinAMP_is_running", and 3rd-party programs can determine whether WinAMP is running by querying the state of that mutex.
Also, their usage is a bit simpler than the shared-memory approach: two processes just have to create a mutex with the same name. But, yes, they are significantly slower.


fb251
Beginner
1,449 Views
randomizer:
For example, a commonly used program (WinAMP) can create a named mutex "WinAMP_is_running", and 3rd-party programs can determine whether WinAMP is running by querying the state of that mutex.


Well, WinAMP_is_running is a very strange example of mutex use!

JimDempseyAtTheCove:
The underlying reason sched_yield() improves throughput in your test is that the test has an oversubscription of worker threads to cores. On a server, one is typically not in control of the number of processes running, whereas each process may be in control of the number of threads it spawns.


Agree, but there's a difference between Windows and Linux, quoting MSDN:

"The yield of execution is limited to the processor of the calling thread. The operating system will not switch execution to another processor, even if that processor is idle or is running a thread of lower priority."

(emphasis is mine)

That means under Windows you have limited control when using SwitchToThread() (or maybe there is another system call); it's not the case with sched_yield(), which seems to be process/thread/core agnostic (it is implemented as a dequeue/requeue on the global task tree).

For parallel, long-running numerical simulations, the cost of synchronization is not an issue if the algorithm is good; it's not exactly the case when you are developing a server: you try to save as many µs and ms as possible in every possible place.
In fact sometimes it's like coding an operating system in user space. That's why we bypass the operating system to keep code in user space and still use assembly language ;-). So I don't trust any system call or function before benchmarking it in the worst case!

Jim, if I understand your suggestion for Intel, it would be a specific instruction prefix which inhibits hardware interrupts on a core for a limited time as soon as an atomic operation (cmpxchg) succeeds? Looks like a very good idea if such an instruction could be used without ring 0 privilege. I hope Intel will take this on and give some feedback.

Best regards
jimdempseyatthecove
Honored Contributor III
1,449 Views

>>"The yield of execution is limited to the processor of the calling thread. The operating system will not switch execution to another processor, even if that processor is idle or is running a thread of lower priority."

The above holds for SwitchToThread() but not for Sleep(0). It appears that SwitchToThread() places the running thread at the tail end of the current processor's run queue, whereas Sleep(0) immediately reschedules the thread. So Sleep(0) could potentially switch processors (provided affinity doesn't restrict the switch).

The temporary interrupt-off would eliminate the requirement to write wait-free algorithms. If used properly, a lock will not normally be held across a thread context switch. I forgot to mention that the termination of the temporary interrupt-off would also be triggered by other faults, such as divide by 0, overflow, etc. And you may also want to terminate on XCHG as well as CMPXCHG. But normal interrupts, such as clock or device interrupts, would be deferred for a small slice of time.

        mov     edx, 1
loop:
        xor     eax, eax
        LOCKnoInt
        CMPXCHG dword ptr [atom], edx
        JNZ     yield
        or      eax, dword ptr [head]
        cmovnz  edx, dword ptr [eax]
        cmovnz  dword ptr [head], edx
        xor     edx, edx
        XCHG    dword ptr [head], edx
        ret
yield:
        call    do_yield
        jmp     loop

The above (untested) code could be used to unlink from a singly linked list.

The new LOCKnoInt would inhibit interrupts in the above code until the first occurrence of:

a) the CMPXCHG immediately following it failing,
b) an XCHG,
c) a CMPXCHG of any type if used in place of XCHG,
d) [head] being paged out,
e) [head] being at an invalid address,
f) the node pointed to by the former head [eax] being paged out,
g) the node pointed to by the former head [eax] being invalid.

For the paged-out conditions, it doesn't matter if the locked section is interrupted, as every contending thread will wait in the yield loop. For invalid addresses, all threads are trashed anyway since the list is blown.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III
1,449 Views

A few notes on the above code

LOCK could be extended to perform the LOCKnoInt behaviour (potentially enabled/disabled with a bit in the processor status word).

The CMPXCHG instruction following the LOCKnoInt (LOCK) is atomic with respect to other processors, but the instructions following it are not (as these are presumably protected by the lock on atom). The only difference is that, on a successful CMPXCHG, interrupts are temporarily disabled on the core running the thread.

The length of the interrupt-disable interval could be fixed or programmable (i.e. set when you enable the new instruction).

If implemented as an extension to LOCK (as opposed to a new instruction), all existing code would run with an interrupt reprieve and system performance should improve. This should be something Intel could emulate using one of their emulation programs. It would be interesting to see the effect on the stress test programs.

Jim Dempsey

Dmitry_Vyukov
Valued Contributor I
1,449 Views
I've also made some benchmarks of spin mutexes. Platform is Intel Core 2 Quad (Q6600), Windows XP 32 bit.

The winner is following algorithm:
for (;;) {
    if (0 == XCHG(lock, 1)) return;
    for (int i = 0; i != 300; ++i) pause;
    if (0 == XCHG(lock, 1)) return;
    SwitchToThread();
}
4 threads: acquisition time is 87 cycles
16 threads: acquisition time is 84 cycles

Second place:
for (;;) {
    if (0 == XCHG(lock, 1)) return;
    SwitchToThread();
}
4 threads: acquisition time is 96 cycles
16 threads: acquisition time is 100 cycles

Third place:
int iter = 0;
for (;;) {
    if (0 == lock)
        if (0 == CAS(lock, 0, 1)) return;
    if (iter++ < 3)
        pause;
    else
        SwitchToThread();
}
4 threads: acquisition time is 206 cycles
16 threads: acquisition time is 186 cycles

Fourth place:
for (;;) {
    if (0 == XCHG(lock, 1)) return;
    for (int i = 0; i != 30; ++i) {
        for (int j = 0; j != 10; ++j)
            pause;
        if (0 == XCHG(lock, 1)) return;
    }
    SwitchToThread();
}
4 threads: acquisition time is 659 cycles
16 threads: acquisition time is 571 cycles

So, the conclusion I can draw is: the less the mutex touches shared data while spinning, the better.
Also, for the HPC segment (one thread per processor), an active spin-lock is the choice. For the server segment (more threads than processors), a passive spin-lock is the choice.

Also I've benchmarked the TAS protocol vs TATAS. The results confirmed my supposition that the TATAS protocol is basically outdated.
TAS:
for (;;) {
    if (0 == XCHG(lock, 1)) return;
    SwitchToThread();
}

TATAS:
for (;;) {
    if (0 == lock)
        if (0 == XCHG(lock, 1)) return;
    SwitchToThread();
}

TAS:
4 threads: acquisition time is 96 cycles

TATAS:
4 threads: acquisition time is 116 cycles


jimdempseyatthecove
Honored Contributor III
1,449 Views

Dmitriy,

Good work on running the permutations.

Jim Dempsey

fb251
Beginner
1,449 Views
Thank you very much Dmitriy for doing all these tests.

The use of SwitchToThread() may be more efficient since we schedule another thread on the same core when a lock attempt fails, rather than doing a global reschedule as sched_yield() does; less bus/cache stress, I believe. I should run some tests, but I don't think there's a function like this callable from user space under Linux.
My tests on the Q6600 may not be completely accurate, since that computer is running Ubuntu Studio 8.04 64-bit and the scheduler used is RT (real-time), not the fair scheduler I normally use.

Very kind thanks to you and Jim for investing time in this topic and sharing your knowledge.

Best regards

jimdempseyatthecove
Honored Contributor III
1,449 Views

Dmitriy,

Can you try the following:

static int lockLock = 0;
for (;;) {
if (0 == XCHG(lock, 1)) return;
if (0 == XCHG(lockLock, 1))
{
for (int i = 0; i != 300; ++i) pause;
lockLock = 0; // or XCHG(lockLock, 0)
if (0 == XCHG(lock, 1)) return;
}
SwitchToThread();
}

The first thread failing to get the lock gets into the pause loop; the others immediately fall into SwitchToThread().

Jim

jimdempseyatthecove
Honored Contributor III
1,411 Views

Dmitriy,

I forgot to mention that you may need to reduce the iteration count of the pause loop to account for the two operations on lockLock.

Also, I am certain you will be careful enough not to place lock and lockLock in the same cache line, as that would interfere with the other threads accessing the cache line containing lock.

It would be interesting, for both your former best case and the above revision, to determine whether the technique is fair to all competing threads.

Jim Dempsey

Dmitry_Vyukov
Valued Contributor I
1,411 Views
JimDempseyAtTheCove:

I forgot to mention that you may need to reduce the iteration count of the pause loop to account for the two operations on lockLock.

Also, I am certain you will be careful enough not to place lock and lockLock in the same cache line, as that would interfere with the other threads accessing the cache line containing lock.

It would be interesting, for both your former best case and the above revision, to determine whether the technique is fair to all competing threads.




Jim, your variant is very interesting. I'm going to benchmark it against the other implementations; fortunately that will be relatively easy since the benchmarking framework is already established.

Yes, I must try to tweak the parameters of the algorithms (iteration counts etc.) in order to get fairer results. I just haven't had enough time.

As for fairness, well, I think they are all completely unfair, because they are *spin* mutexes. And spin mutexes are for "best effort", not for "quality of service".

Btw, your last algorithm recalls the 'futile wakeup throttling' trick for blocking mutexes:
http://www.usenix.org/events/jvm01/full_papers/dice/dice.pdf
And David Dice reported substantial improvement after applying 'futile wakeup throttling' in the heavily contended case.
