Hi, great TBB engineers,
I have always thought the TBB library was the best tool for multi-threaded programming.
But today I found that pthread spin_lock is 28% faster than tbb::spin_mutex, and I am confused.
I compared the source code of pthread spin_lock and tbb::spin_mutex:
// the following is pthread spin_lock
// glibc-2.8/nptl/sysdeps/i386/pthread_spin_lock.c
int
pthread_spin_lock (lock)
     pthread_spinlock_t *lock;
{
  asm ("\n"
       "1:\t" LOCK_PREFIX "decl %0\n\t"
       "jne 2f\n\t"
       ".subsection 1\n\t"
       ".align 16\n"
       "2:\trep; nop\n\t"
       "cmpl $0, %0\n\t"
       "jg 1b\n\t"
       "jmp 2b\n\t"
       ".previous"
       : "=m" (*lock)
       : "m" (*lock));
  return 0;
}
// and this is tbb::spin_mutex code
void lock() {
    __TBB_LockByte(flag);
}

inline uintptr_t __TBB_LockByte( unsigned char& flag ) {
    if ( !__TBB_TryLockByte(flag) ) {
        tbb::internal::atomic_backoff b;
        do {
            b.pause();
        } while ( !__TBB_TryLockByte(flag) );
    }
    return 0;
}
I don't know asm, but it seems that pthread spin_lock is better.
Am I right? Thanks.
7 Replies
What exactly did you measure?
I very much disagree with TBB's implementation myself (see thread "Spinning"), although there seems to be no interest in my alternative proposal (implemented in my patch in thread "Additions to atomic ").
"there seems to be no interest in my alternative proposal"
No protest? Then it must be true...
Quoting - ahfuqq.com
But today I found that pthread spin_lock is 28% faster than tbb::spin_mutex, and I am confused.
FWIW, here is a sample implementation of a spinlock I use, in MSVC:
[cpp]
typedef __int32 atomicword_ia32;
typedef atomicword_ia32 spinlock_ia32;

#define SPINLOCK_IA32_INIT 0
#define spinlock_ia32_init(mp_self) ( *(mp_self) = 0 )

__declspec(naked) void
spinlock_ia32_acquire(
    spinlock_ia32* const self
) {
    _asm {
        MOV ECX, [ESP + 4]
        MOV EAX, 1
    spinlock_ia32_acquire_retry:
        XCHG [ECX], EAX
        TEST EAX, EAX
        JNZ spinlock_ia32_acquire_failed
        RET
    spinlock_ia32_acquire_failed:
        PAUSE
        JMP spinlock_ia32_acquire_retry
    }
}

__declspec(naked) void
spinlock_ia32_release(
    spinlock_ia32* const self
) {
    _asm {
        MOV EAX, [ESP + 4]
        MOV DWORD PTR [EAX], 0
        RET
    }
}
[/cpp]
Quoting - Raf Schietekat
What exactly did you measure?
I very much disagree with TBB's implementation myself (see thread "Spinning"), although there seems to be no interest in my alternative proposal (implemented in my patch in thread "Additions to atomic ").
I'm also curious about the performance comparison. Can anybody post their benchmark (CPU cycles per round)?
spin_mutex seems to be implemented with CAS (is that the reason why it is not scalable?).
How does it compare with a bakery-style lock?
Using DEC vs. CAS should make no difference on scalability. In either case, only one processor acquires the lock. In principle, a failed CAS that did not force a cache line into an exclusive state would be more scalable than the DEC approach. But as I understand current hardware, even a failed CAS puts the line in an exclusive state.
We do pay a small price for exception safety and backwards binary compatibility. Specifically, the "my_unlock_value" is always 0 in current implementations of TBB. So we pay a cycle to load a 0. But there are some old TBB implementations where my_unlock_value has backoff information encoded in it, and for some of our customers the binary compatibility is critical.
I'd be curious to hear what test shows a 28% spread. The GLibc code sequence is certainly optimal for an uncontended lock. But if the machine is oversubscribed and the lock is contended, it may cause severe performance hiccups because it never yields. Our experience has been that in such a situation it is important that the lock yield to the operating system; otherwise the hardware thread wastes on average half a time slice doing nothing instead of advancing the software thread that is holding the lock.
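The yield-under-contention policy described above amounts to an exponential backoff that escalates from short pause loops to an OS yield. A sketch of that policy, modeled on the spirit of TBB's atomic_backoff rather than its actual source:

```cpp
#include <thread>

// Exponential backoff: spin with a short (doubling) pause loop at
// first; past a threshold, yield the rest of the time slice so an
// oversubscribed machine can run the thread holding the lock.
class BackoffSketch {
    int count = 1;
    static const int kYieldThreshold = 16;
public:
    void pause() {
        if (count <= kYieldThreshold) {
            for (volatile int i = 0; i < count; ++i) { /* ~PAUSE */ }
            count *= 2;                 // back off exponentially
        } else {
            std::this_thread::yield();  // let the lock holder run
        }
    }
    int current_delay() const { return count; }
};
```

A lock's slow path calls pause() between acquisition attempts; uncontended acquisitions never touch it, so the yield logic costs nothing on the fast path.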
Chris,
Your code, as written, assumes the only valid values for the lock are 0 or 1. Should code external to your lock/release use locked values other than 1, your code will run into problems. To correct for this, consider moving the label spinlock_ia32_acquire_retry: back one line, to where EAX is loaded with 1 (your lock flag). Example: assume code issues a call to your spinlock routine and acquires the lock. Under the normal case the lock is held for a short time. Under unusual cases, however, the thread holding the lock may write a value other than 0 or 1 into the lock flag to indicate that it is in an unusual state (e.g. some sort of error recovery). Code external to your spinlock can first test for the unusual state, then attempt the lock if not in that state (with a small window for a snafu).
Depending on other factors, you might want to consider inserting a PAUSE in the loop, or in the code after n tries, or in the code when EAX returns > 1 (though now you set the lock to 1).
Jim
I forgot to mention: on a highly contended lock, what is the effect of your code sequence on the thread holding the lock, and on the threads not competing for the lock? If you look at the TBB code, you will see that a fast path is tried first, followed by a slower path (which may be less intrusive to the other threads).
Jim
