Linux / 64 bits / Q6600 / very bad performances whith mutex/spinlock - Page 3

fb251 · ‎08-27-2008

Hi all,

I'm a system developper and went into some unexpected results comparing to my main computer (which is not Intel based) when testing on a Q6600 (quad core) using Ubuntu 8.04 64 bits.

I've a basic multi-threaded program (compiled with gcc 4.2.3) like this:

volatile int counter = 0 ;

void * worker (void * arg)
{
register int i ;

(void) arg ;

for (i = O ; i < 10000000 : i ++)
{
(* method_lock) () ;
++ counter ;
(* method_unlock) () ;
}

return NULL ;
}

Where method_lock/unlock can be: pthread_mutex, pthread_spinlock, my_spinlock(*).

I created 16 threads using sched_setaffinity() to ensure each core will run 4 threads.

Results are:

pthread_mutex: 10.5s
pthread_spinlock: 384s (!)
my_spinlock: 9.8s

On my main computer (dual core from the competitor @ 2,2GHz but running with Ubuntu 8.04 32 bits) under same conditions (16 threads too), results are:

pthread_mutex: 25s
pthread_spinlock: 91s
my_spinlock: 5.4s

These values are average, this test has been done many times without significant variation. My mutex/spinlock was not aligned on a cache line boundary but as it was the only user process running on the computer I believe it's not an answer to explain these numbers.

I will use spinlock for very short code (some cycles) on a server software.
Is there anybody to give me some hints or tests to do in order to improve threads synchronization functions for the Q6600 (I was expecting more performance from a Quadcore) ?

(*) Use a classical loop with "lock; cmpxchg %2,%1" and "pause;" see below:

int try_lock (atomic_t * atom)
{
atomic_t old ;

__asm__ __volatile__ (
"lock; cmpxchg %2, %1"
: "=a" (old)
: "m"(* atom), "r"(-1), "a" (0)
: "memory", "cc"
) ;

return old ;
}

and:

void spin_lock (atomic_t * atom)
{
register int i ;

while (1)
{
if (! try_lock (atom))
return ;

for (i = 0; i < SPIN_COUNT; i++)
{
__asm__ __volatile__
(
"pause"
) ;
}

if (! try_lock (atom))
return ;

sched_yield () ;
}
}

jimdempseyatthecove · ‎09-05-2008

Dmitriy,

The paper is a nice reference, will have to bookmark it for re-reading.

My only problem with the technique in the paper is it seems more suited to mutexes (inter-process) as opposed to spinlocks (intra-process). This is principally due to the call to malloc/new to obtain a monRec object. Granted, alloca or stack based structure could be used as well as a thread local storage structure could be used to conserve some processing time.

What I would prefer for my purposes is a lighter weight method that is used by multiple threads within a process and which produces a relatively fair access to the resource.

Will try to work-up an example.

Jim Dempsey

jimdempseyatthecove · ‎09-06-2008

Dmitriy,

The following is a light-weight ordered spinlock. The code was not fully tested and it is written to compile on my Windows system therefore you will need to Linux-ize the code (preferrably with conditionalized #defines). See if you can incorporate this into your test suite. Note MS uses non-intuitive ordering of arguments (arguments not in order of how they are specified in name). You may have to play with SLEEP_COUNT to minimize the overhead.

// LockQueue.cpp
// Copyright(C) - Jim Dempsey 2008
// The following code represents a queuing spinlock.
// This code, and derrivative thereof, may be freely used
// provided it includes this Copyright Notice.
//
// Jim Dempsey
// Oshkosh, WI
// USA
// jim_dempsey@ameritech.net
//#include "stdafx.h"#include #include #include #define CACH_LINE_SIZE 64#define SLEEP_COUNT 300__declspec( thread ) long MyLockId = 0;struct Lock
{
  __declspec( align(CACH_LINE_SIZE)) volatile long LastLock;
  __declspec( align(CACH_LINE_SIZE)) volatile long TransferLock;
  Lock() { LastLock = TransferLock = NULL; };
  ~Lock(){ _ASSERT(LastLock==NULL); };
  void Acquire();
  void Release();
  long UniqueId()
  {
    union {
      void* v;
      long id;
    };
    v = &v; // a void* that is assured to be unique    return id;
  };
};void Lock::Acquire()
{
  // Exchange whatever is in LastLock with MyLockID (Initialize if required)  long myLastLock = _InterlockedExchange(&LastLock, (MyLockId?MyLockId:(MyLockId=UniqueId())));
  // If the exchang prior value is NULL then I own the lock  if(!myLastLock) return;
  // Here if someone else owns the lock  while(true)
  {
    // perform short series of pause instructions    for(int i=0; i
      _mm_pause();
    // return if ownership transfered to my thread    if(MyLockId == TransferLock) return;
    // perform longer fast wait by going to end of line    SwitchToThread();
    // return if ownership transfered to my thread    if(MyLockId == TransferLock) return;
  }
}
void Lock::Release()
{
  // if LastLock containd MyLockId replace it with NULL  // else place MyLockId into TransferLock
  // MS uses: old = _InterlockedCompareExchange(p, x, c) 
  if(_InterlockedCompareExchange(&LastLock, NULL, MyLockId) != MyLockId)
    TransferLock = MyLockId;
  else
    TransferLock = 0;
}
Lock someLock;
int _tmain(int argc, _TCHAR* argv[])
{
  someLock.Acquire();
  someLock.Release();
  return 0;
}

Comments:

If the lock is not contested for then the expense is one XCHG and one CAS.

If the lock is contested for then the first thread experiences one XCHG, one CAS, and onewrite to shared variable.

If the lock is dualycontested but fast path thenfor then thesecond thread experiences one XCHG, pause loop, read of shared variable, one CAS, and one write to shared variable.

If the lock has more contesting or longer paththen SwitchToThread takes effect.

Jim Dempsey

jimdempseyatthecove · ‎09-06-2008

Dmitriy,

The prior posted code is suitable for 32-bit platform.

For 64-bit platform change the long to __int64 types (and use the appropriate _Interlocked...64(...) routines).

Jim Dempsey

Dmitry_Vyukov · ‎09-07-2008

I've made some more benchmarks.

First of all, I try to variate spin-count in this implementation:
for (;;) {
if (0 == XCHG(lock, 1)) return;
for (int i = 0; i != 300; ++i) pause;
if (0 == XCHG(lock, 1)) return;
SwitchToThread();
}
It turns out that number 300 is good choice for Q6600 quad-code. When spin-count is 200-400 the performance is nearly equal and optimal. When spin-count >400 or <200 performance degrades.

Then I've made following test. I remove all local work from critical section (all other tests was conducted with local work for about ~30 cycles in critical section ). And set number of threads to 16. In this setup following implementation:
for (;;) {
if (0 == XCHG(lock, 1)) return;
SwitchToThread();
}
substantially degrades to 800-3000 cycles. While following:
for (;;) {
if (0 == XCHG(lock, 1)) return;
for (int i = 0; i != 300; ++i) pause;
if (0 == XCHG(lock, 1)) return;
SwitchToThread();
}
doesn't degrade at all.

This results suggest that mutex implementation must be chosen and tuned to particular situation. Even small deviations in thread count, hardware concurrency and amount of local processing inside and outside of critical section can greatly affect performance of mutex.

Then I re-run benchmarks for following implementations, varying number of threads and amount of local processing. Local processing is ~30 cycles inside critical section + ~30 cycles outside critical section. Every test runs for 5 seconds, and I choose the best result from 3 runs.

static int lockLock = 0;

for (;;) {
 if (0 == XCHG(lock, 1)) return;
 if (0 == XCHG(lockLock, 1))
 {
 for (int i = 0; i != 300; ++i) pause;
 lockLock = 0; // or XCHG(lockLock, 0)
 if (0 == XCHG(lock, 1)) return;
 }
 SwitchToThread();
}

4  threads, w/o  local work: 89 cycles
16 threads, w/o  local work: 79 cycles
4  threads, with local work: 110 cycles
16 threads, with local work: 96 cycles

---------------------------------------
for (;;) {
 if (0 == XCHG(lock, 1)) return;
 for (int i = 0; i != 300; ++i) pause;
 if (0 == XCHG(lock, 1)) return;
 SwitchToThread();
}

4  threads, w/o  local work: 62 cycles
16 threads, w/o  local work: 62 cycles
4  threads, with local work: 80 cycles
16 threads, with local work: 80 cycles

---------------------------------------

for (;;) {

 if (0 == XCHG(lock, 1)) return;

 SwitchToThread();

4  threads, w/o  local work: 69 cycles
16 threads, w/o  local work: 1968 cycles (!!!)
4  threads, with local work: 111 cycles
16 threads, with local work: 2375 cycles (!!!)

As for your algorithm described here:
http://softwarecommunity.intel.com/isn/Community/en-US/forums/permalink/30263045/30263044/ShowThread.aspx#30263044
I get the idea of direct ownership transfer. But the algorithm is seriously broken. I have tried to fix it, but it turns out not very easy...

jimdempseyatthecove · ‎09-08-2008

Dmitriy,

Sorry about the broken code. Below is a working stress test for Windows. Made a minor but significant change.

Jim Dempsey

// LockQueue.cpp

// Copyright(C) - Jim Dempsey 2008
// The following code is pseudo code in the style of C++
// and represents a queuing spinlock.
// This code, and derrivative thereof, may be freely used
// provided it includes this Copyright Notice.
// Jim Dempsey
// jim_dempsey@ameritech.net
//

#include "stdafx.h"
#include 
#include 
#include 
#include

#define CACH_LINE_SIZE 64
#define SLEEP_COUNT 300
#define MAX_THREADS 16

__declspec( thread ) long MyLockId = 0;
 // Function to generate unique thread ID
 // Do not use omp_thread_num() as this is the team member number
 // and not a process-wide unique thread number. Each team has
 // team member numbers of 0:Number of threads in team-1
 // therefore if nested levels omp_thread_num() is not unique.
long UniqueIdMembership = 0;
long GetUniqueId()
{
_ASSERT(UniqueIdMembership return _InterlockedIncrement(&UniqueIdMembership);
};
long MyNewUniqueId(long id)
{
return id+MAX_THREADS;
}
struct Lock
{
 __declspec( align(CACH_LINE_SIZE)) volatile long LastLock;
 __declspec( align(CACH_LINE_SIZE)) volatile long TransferLock;

 Lock() { LastLock = TransferLock = NULL; };
 ~Lock(){ _ASSERT(LastLock==NULL); };
 void Acquire();
 void Release();
};

void Lock::Acquire()
{
 // Exchange whatever is in LastLock with MyLockID (Initialize if required)
 long myLastLock = _InterlockedExchange(&LastLock, MyLockId);

 // If the exchange prior value is NULL then I own the lock
 if(myLastLock)
 {
 // Here if someone else owns the lock
 while(true)
 {
   // perform short series of pause instructions
   for(int i=0; i   // return if ownership transfered to my thread
   if(myLastLock == TransferLock) break;
   // perform longer fast wait by going to end of line
   SwitchToThread();
   // return if ownership transfered to my thread
   if(myLastLock == TransferLock) break;
 }
 }
}

void Lock::Release()
{
 // if LastLock containd MyLockId replace it with NULL
// else place MyLockId into TransferLock
 if(_InterlockedCompareExchange(&LastLock, NULL, MyLockId) != MyLockId)
 {
  // Other lock(s) pending
  // Inform waiting thread my work is done
  _InterlockedExchange(&TransferLock, MyLockId);
 }
 // get a new unique Id
 MyLockId = MyNewUniqueId(MyLockId);
}

Lock someLock;
struct Node
{
intSharedCounter;
Node() {SharedCounter=0;};
~Node() {;};
};

Node* Head = NULL;

int _tmain(int argc, _TCHAR* argv[])
{
Node aNode;
Head = &aNode;// simulate a Node at head of var
ying list
int NumberOfThreads = MAX_THREADS;
omp_set_num_threads(NumberOfThreads);
int IntendedEachThreadCount = 123456;
intIntendedEndCount = NumberOfThreads*IntendedEachThreadCount;
#pragma omp parallel num_threads(NumberOfThreads)
{
// once only initialization of MyLockId
if(!MyLockId) MyLockId = GetUniqueId(); // 1:MAX_THREADS
int i;
for(i=0;i{
someLock.Acquire();
{
// simulate code section to increase
// delay between a read/modify/write
int localCopyOfCOunt = Head->SharedCounter;
_mm_pause();
Head->SharedCounter = localCopyOfCOunt+1;
}
someLock.Release();
}
}
if(Head->SharedCounter == IntendedEndCount)
printf("Success
");
else
printf("Fail
");
return 0;
}

jimdempseyatthecove · ‎09-08-2008

Dmitriy,

The last posted code works provided you do not perform nested locks

lock A, lock B, unlock B, unlock A

As the rotating ID number would get bumped at unlock of B.

This can be corrected under typical use whereby the lock control is encapsulated into a container in order for a dtor to be present in the event of exit of function (e.g. exception). The ctor of this funciton would obtain and maintain the initial unique ID. In the revised code use Lock::Release(long oldID)

With revising the code for nested locks I think the code will be complete.

The prior listed code (lock and unlock section) is suitable for use in your performance test program.

Jim Dempsey

jimdempseyatthecove · ‎09-08-2008

struct LockControl
{
 Lock& aLock;
 int id;
 LockControl( Lock& l) {aLock = l; id = aLock.Acquire();};
 ~LockControl() { aLock.Release(id); };
};

Something along the line of above where Acquire returns the current id of the lock.

Jim Dempsey

Dmitry_Vyukov · ‎09-24-2008

JimDempseyAtTheCove:
LOCK could be extended to perform the LOCKnoInt (potentially enabled/disabled with bit in processor status word)

The CMPXCHG instruction following the LOCKnoInt (LOCK) is atomic with respect to other processors, but instructions following it are not (as these are presumably protected by the lock on atom. The only difference is that is if successful CMPXCHG the interrupts are temporarily disabled on thecore running the thread.

The problem can be solved by OS on current hardware. See schedctl(2) call (cmd=SETHINTS):
http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=0630&db=man&fname=/usr/share/catman/p_man/cat2/ftn/schedctl.z

Briefly: thread can notify OS that it is currently in critical section, so OS will try to not preempt thread in this region. Notification is not syscall, it's just plain store to variable. OS scheduler will examine that variable on time slice end, and prolong time slice if needed.

Dmitry_Vyukov · ‎10-12-2008

Quoting - jimdempseyatthecove

// LockQueue.cpp

#define CACH_LINE_SIZE 64
#define SLEEP_COUNT 300
#define MAX_THREADS 16

__declspec( thread ) long MyLockId = 0;
 // Function to generate unique thread ID
 // Do not use omp_thread_num() as this is the team member number
 // and not a process-wide unique thread number. Each team has
 // team member numbers of 0:Number of threads in team-1
 // therefore if nested levels omp_thread_num() is not unique.
long UniqueIdMembership = 0;
long GetUniqueId()
{
_ASSERT(UniqueIdMembership return _InterlockedIncrement(&UniqueIdMembership);
};

[skipped]

I finally get some time to test this implementation:

single core: 1145 cycles per lock/unlock

cores 1+2: 29635 cycles per lock/unlock (scaling 0.038)

cores 1+3: 34038 cycles per lock/unlock (scaling 0.033)

cores 1+2+3+4: 21092 cycles per lock/unlock (scaling 0.053717)

Hmmm... Maybe I made something wrong... Although I don't see anything suspicious.

But note that I test in EXTREMELY SYTHETIC benchmark with EXTREMELY HIGH WORKLOAD.

This mutex implementation can be very suitable as "higher-level" mutex, so to say. For example in web-server/service, when one need reasonable amount of fairness.

Dmitry_Vyukov · ‎10-12-2008

Quoting - Dmitriy V'jukov

I finally get some time to test this implementation:

single core: 1145 cycles per lock/unlock

cores 1+2: 29635 cycles per lock/unlock (scaling 0.038)

cores 1+3: 34038 cycles per lock/unlock (scaling 0.033)

cores 1+2+3+4: 21092 cycles per lock/unlock (scaling 0.053717)

Hmmm... Maybe I made something wrong... Although I don't see anything suspicious.

But note that I test in EXTREMELY SYTHETIC benchmark with EXTREMELY HIGH WORKLOAD.

This mutex implementation can be very suitable as "higher-level" mutex, so to say. For example in web-server/service, when one need reasonable amount of fairness.

I think I see where is the problem. It's called 'hand-off ownership problem'.

Thread 1 releases the mutex when there is already at least one waiter on mutex (thread 2). And there is also new coming thread 3 which tries to acquire the mutex. Iff thread 1 and thread 3 are unable to acquire the mutex ahead of thread 2, then this is called 'hand-off ownership'.

'Hand-off ownership' is required for fairness. But at the same 'hand-off ownership' conflicts with high ranks in synthetic benchmarks :)

Btw, most general-purpose mutex implementations (Win32 CRITICAL_SECTION, pthread_mutex, boost mutex) especially avoid 'hand-off ownership', i.e. thread HAVE to be able to reacquire just released mutex.

Chris_M__Thomasson · ‎10-14-2008

Quoting - Dmitriy V'jukov

I think I see where is the problem. It's called 'hand-off ownership problem'.

Thread 1 releases the mutex when there is already at least one waiter on mutex (thread 2). And there is also new coming thread 3 which tries to acquire the mutex. Iff thread 1 and thread 3 are unable to acquire the mutex ahead of thread 2, then this is called 'hand-off ownership'.

'Hand-off ownership' is required for fairness. But at the same 'hand-off ownership' conflicts with high ranks in synthetic benchmarks :)

Btw, most general-purpose mutex implementations (Win32 CRITICAL_SECTION, pthread_mutex, boost mutex) especially avoid 'hand-off ownership', i.e. thread HAVE to be able to reacquire just released mutex.

CRITICAL_SECTION did indeed used to hand-off ownership. A guy by the name of
Neill Clift who works on the Windows Kernel told me this. I think he posted
the information on `comp.programming.threads'. I need to search for the
post. CRITICAL_SECTION still uses an odd mutual exclusion because the
following algorithm seems to beat it:
__________________________________________________________________
class mutex {
enum constant {
UNLOCKED = 0,
LOCKED = 1,
CONTENTION = 2
};

atomicword m_state; // == UNLOCKED
event m_wset; // initial state == false

public:
void lock() {
if (XCHG(&m_state, LOCKED)) {
while (XCHG(&m_state, CONTENTION)) {
m_wset.wait();
}
}
MEMBAR #StoreLoad | #StoreStore;
}

void unlock() {
MEMBAR #LoadStore | #StoreStore;
if (XCHG(&m_state, UNLOCKED) == CONTENTION) {
m_wset.signal();
}
}
};
__________________________________________________________________

Humm...

Lingfeng_C_Intel · ‎10-22-2008

Hi,

I am learning now, I tried to compile your file in my environment, but fail like below, could you provide more information?

Thanks,
Wise

[wchen18@spd20 learning]$ gcc -O3 -o atomic2 atomic2.c -lpthread
atomic2.c: In function `my_atomic_inc':
atomic2.c:198: syntax error before '&' token
atomic2.c: At top level:
atomic2.c:225: syntax error before "slock"
atomic2.c:225: warning: data definition has no type or storage class
atomic2.c: In function `worker':
atomic2.c:246: warning: cast from pointer to integer of different size
atomic2.c:296: warning: passing arg 1 of `my_atomic_inc' from incompatible pointer type
atomic2.c: In function `main':
atomic2.c:340: warning: cast to pointer from integer of different size
[wchen18@spd20 learning]$

fb251 · ‎10-23-2008

Quoting - Lingfeng Chen (Intel)

Hi,

I am learning now, I tried to compile your file in my environment, but fail like below, could you provide more information?

Thanks,
Wise

[wchen18@spd20 learning]$ gcc -O3 -o atomic2 atomic2.c -lpthread
atomic2.c: In function `my_atomic_inc':
atomic2.c:198: syntax error before '&' token
atomic2.c: At top level:
atomic2.c:225: syntax error before "slock"
atomic2.c:225: warning: data definition has no type or storage class
atomic2.c: In function `worker':
atomic2.c:246: warning: cast from pointer to integer of different size
atomic2.c:296: warning: passing arg 1 of `my_atomic_inc' from incompatible pointer type
atomic2.c: In function `main':
atomic2.c:340: warning: cast to pointer from integer of different size
[wchen18@spd20 learning]$

This program has been compiled on 32 bits and 64 bits Linux without error (but with some warnings). I don't understand error for line #198, for line #225 it may be a missing pthread type declaration (old pthread.h?).

Can you give us some informations about your environment:

- version of kernel and linux distribution (uname -a)

- version of gcc (gcc -v)

BTW, this code is old and has been modified to compile with -Wall -Werror, but now there's a lot of things not related to synchronization. Let me know if you have still problem and I will try to make a more clean code package.

Best regards