Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

pthread spin_lock is 28% faster than tbb::spin_mutex, Is this normal?

ahfuqq_com
Beginner
864 Views
Hi, great TBB engineers,
I have always thought the TBB library was the best tool for multithreaded programming.
But today I found that pthread spin_lock is 28% faster than tbb::spin_mutex, and I am confused.
I compared the source code of pthread spin_lock and tbb::spin_mutex:
// the following is pthread spin_lock
// glibc-2.8\nptl\sysdeps\i386\pthread_spin_lock.c
int
pthread_spin_lock (lock)
     pthread_spinlock_t *lock;
{
  asm ("\n"
       "1:\t" LOCK_PREFIX "decl %0\n\t"
       "jne 2f\n\t"
       ".subsection 1\n\t"
       ".align 16\n"
       "2:\trep; nop\n\t"
       "cmpl $0, %0\n\t"
       "jg 1b\n\t"
       "jmp 2b\n\t"
       ".previous"
       : "=m" (*lock)
       : "m" (*lock));

  return 0;
}

// and this is the tbb::spin_mutex lock code
void lock() {
    __TBB_LockByte(flag);
}

inline uintptr_t __TBB_LockByte( unsigned char& flag ) {
    if ( !__TBB_TryLockByte(flag) ) {
        tbb::internal::atomic_backoff b;
        do {
            b.pause();
        } while ( !__TBB_TryLockByte(flag) );
    }
    return 0;
}

I don't know assembly, but it seems that pthread spin_lock is better.
Am I right? Thanks.
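For anyone who wants to reproduce the comparison, a minimal timing harness along these lines measures the uncontended cost of each lock. The tas_lock below is my stand-in for the exchange-based algorithm of __TBB_LockByte, not TBB's actual code, and the helper names are mine; timing the real tbb::spin_mutex only requires swapping it in.

```cpp
#include <atomic>
#include <chrono>
#include <pthread.h>

// Exchange-based byte lock, roughly the algorithm __TBB_LockByte uses on
// x86 (minus TBB's backoff object) -- a stand-in, not the TBB source.
struct tas_lock {
    std::atomic<unsigned char> flag{0};
    void lock()   { while (flag.exchange(1, std::memory_order_acquire)) {} }
    void unlock() { flag.store(0, std::memory_order_release); }
};

// Nanoseconds per uncontended lock/unlock pair for the TAS lock.
inline double time_tas(int n) {
    tas_lock l;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) { l.lock(); l.unlock(); }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / n;
}

// Same measurement for pthread_spin_lock.
inline double time_pthread(int n) {
    pthread_spinlock_t l;
    pthread_spin_init(&l, PTHREAD_PROCESS_PRIVATE);
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) { pthread_spin_lock(&l); pthread_spin_unlock(&l); }
    auto t1 = std::chrono::steady_clock::now();
    pthread_spin_destroy(&l);
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / n;
}
```

Note that an uncontended microbenchmark like this says nothing about behavior under contention, which is where the two implementations differ most.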
7 Replies
RafSchietekat
Valued Contributor III
What exactly did you measure?

I very much disagree with TBB's implementation myself (see thread "Spinning"), although there seems to be no interest in my alternative proposal (implemented in my patch in thread "Additions to atomic").

RafSchietekat
Valued Contributor III

"there seems to be no interest in my alternative proposal"
No protest? Then it must be true...

Chris_M__Thomasson
New Contributor I
Quoting - ahfuqq.com

FWIW, here is a sample implementation of a spinlock I use, in MSVC:

[cpp]typedef __int32 atomicword_ia32;
typedef atomicword_ia32 spinlock_ia32;


#define SPINLOCK_IA32_INIT 0


#define spinlock_ia32_init(mp_self) ( \
  *(mp_self) = 0 \
)


__declspec(naked) void 
spinlock_ia32_acquire( 
 spinlock_ia32* const self 
) { 
  _asm {
    MOV ECX, [ESP + 4]
    MOV EAX, 1


spinlock_ia32_acquire_retry:
    XCHG [ECX], EAX
    TEST EAX, EAX
    JNZ spinlock_ia32_acquire_failed
    RET 


spinlock_ia32_acquire_failed:
    PAUSE
    JMP spinlock_ia32_acquire_retry
  }
} 


__declspec(naked) void
spinlock_ia32_release(
 spinlock_ia32* const self
) {
  _asm {
    MOV EAX, [ESP + 4]
    MOV DWORD PTR [EAX], 0
    RET
  }
}

[/cpp]
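For readers without MSVC: the same XCHG-plus-PAUSE algorithm can be sketched portably with C++11 atomics. This is my translation of the asm above, not code from the post; the names are mine.

```cpp
#include <atomic>
#include <immintrin.h>  // _mm_pause (x86)

struct spinlock {
    std::atomic<int> word{0};  // corresponds to SPINLOCK_IA32_INIT 0

    void acquire() {
        // XCHG [ECX], EAX ; TEST EAX, EAX ; JNZ failed
        while (word.exchange(1, std::memory_order_acquire) != 0)
            _mm_pause();   // PAUSE in the retry path
    }
    void release() {
        // MOV DWORD PTR [EAX], 0 -- a plain release store frees the lock
        word.store(0, std::memory_order_release);
    }
};
```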


softarts
Beginner
Quoting - Raf Schietekat
I am also curious about the performance comparison. Can anybody post their benchmark numbers (CPU cycles per round)?

spin_mutex seems to be implemented with CAS (is that the reason it is not scalable?).

How does it compare with a bakery-style lock?
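A ticket lock (the simplest bakery-style variant) can be sketched like this; it hands out the lock in FIFO order, and waiters spin on a plain load rather than hammering a CAS. The names are mine, for illustration only.

```cpp
#include <atomic>

struct ticket_lock {
    std::atomic<unsigned> next{0};     // next ticket to hand out
    std::atomic<unsigned> serving{0};  // ticket currently being served

    void lock() {
        // Take a ticket; fetch_add orders the waiters.
        unsigned my = next.fetch_add(1, std::memory_order_relaxed);
        // Spin on a read-only load: no cache-line invalidations from failed RMWs.
        while (serving.load(std::memory_order_acquire) != my)
            ;
    }
    void unlock() {
        // Only the holder writes serving, so a load+store suffices.
        serving.store(serving.load(std::memory_order_relaxed) + 1,
                      std::memory_order_release);
    }
};
```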
ARCH_R_Intel
Employee
Using DEC vs. CAS should make no difference in scalability. In either case, only one processor acquires the lock. In principle, a failed CAS that did not force the cache line into an exclusive state would be more scalable than the DEC approach. But as I understand current hardware, even a failed CAS puts the line in an exclusive state.

We do pay a small price for exception safety and backwards binary compatibility. Specifically, the "my_unlock_value" is always 0 in current implementations of TBB. So we pay a cycle to load a 0. But there are some old TBB implementations where my_unlock_value has backoff information encoded in it, and for some of our customers the binary compatibility is critical.

I'd be curious to hear what test shows a 28% spread. The glibc code sequence is certainly optimal for an uncontended lock. But if the machine is oversubscribed and the lock is contended, it may cause severe performance hiccups because it never yields. Our experience has been that in such a situation it is important that the lock yield to the operating system; otherwise the hardware thread wastes, on average, half a time slice doing nothing instead of advancing the software thread that holds the lock.
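The spin-then-yield behavior described above might look like the following sketch. It is written in the spirit of tbb::internal::atomic_backoff, but it is my paraphrase, not the actual TBB source; the threshold and names are assumptions.

```cpp
#include <sched.h>      // sched_yield (POSIX)
#include <immintrin.h>  // _mm_pause (x86)

// Exponential spin-then-yield backoff: pause with a doubling count while
// the count is small, then hand the time slice back to the OS.
class backoff {
    static constexpr int LOOPS_BEFORE_YIELD = 16;  // assumed threshold
    int count = 1;
public:
    void pause() {
        if (count <= LOOPS_BEFORE_YIELD) {
            for (int i = 0; i < count; ++i) _mm_pause();
            count *= 2;        // exponential growth while spinning
        } else {
            sched_yield();     // let the lock holder make progress
        }
    }
    int current_count() const { return count; }
};
```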
jimdempseyatthecove
Honored Contributor III

Chris,

Your code, as written, assumes the only valid values for the lock are 0 or 1. If code external to your lock/release ever uses locked values other than 1, your code will run into problems. To correct for this, consider moving the spinlock_ia32_acquire_retry: label back one line, to where EAX is loaded with 1 (your lock flag). For example, assume some code calls your spinlock routine and acquires the lock. In the normal case the lock is held for a short time. In unusual cases, however, the thread holding the lock may write a value into the lock flag that is neither 0 nor 1, to indicate that it is in an unusual state (e.g. some sort of error recovery). The code external to your spinlock can first test for the unusual state, then attempt the lock if not in that state (with a small window for a snafu).

Depending on other factors, you might also want to consider inserting a PAUSE in the loop, or after n tries, or when EAX comes back greater than 1 (though by then you have already set the lock to 1).

Jim
jimdempseyatthecove
Honored Contributor III

I forgot to mention: on a highly contended lock, what is the effect of your code sequence on the thread holding the lock, and on the threads not competing for the lock? If you look at the TBB code you will see that a fast path is tried first, followed by a slower path (which may be less intrusive to the other threads).

Jim
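The fast-path/slow-path structure Jim describes can be sketched as a test-and-test-and-set lock: one cheap exchange first, then a read-only spin that retries the exchange only when the lock looks free. This is an illustration with my own names, not the TBB implementation.

```cpp
#include <atomic>
#include <immintrin.h>  // _mm_pause (x86)

struct ttas_lock {
    std::atomic<unsigned char> flag{0};

    bool try_lock() {               // fast path: a single atomic exchange
        return flag.exchange(1, std::memory_order_acquire) == 0;
    }
    void lock() {
        if (try_lock()) return;     // uncontended case never enters the loop
        for (;;) {                  // slow path
            while (flag.load(std::memory_order_relaxed))
                _mm_pause();        // read-only spin: cheap for other threads
            if (try_lock()) return; // retry the exchange only when it looks free
        }
    }
    void unlock() { flag.store(0, std::memory_order_release); }
};
```

The read-only spin keeps the cache line shared among waiters, so the lock holder's release is not slowed down by a stream of competing atomic writes.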