Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

test_mutex.exe

RafSchietekat
Valued Contributor III
511 Views
Who would (be interested to find out and) explain why "test_mutex.exe 1" succeeds but "test_mutex.exe 2" and "test_mutex.exe 4" just keep spinning after having printed "Spin RW Mutex readers & writers time =" unless spin_rw_mutex is allowed to (time out and) __TBB_Yield(), or, e.g., calls printf() as part of the spin loop (probably also because of an implied yield)?

The answer I'm hoping for is that the test is requiring more than allowed from a real application (parallelism strictly optional), which would therefore not incur the performance penalty associated with the need for a timeout-induced __TBB_Yield() and/or be able to work with non-yielding spin locks.

Note that except on x86 and Itanium a yield is done for every spin iteration, so now I'm a bit confused about its cost relative to a pause (and of both relative to the atomic operation of course), the size of the timeout relative to a quantum, etc. My purpose is doing the same thing as in the 2008-08-03 "Additions to atomic" patch, but it now seems that I need some kind of fail-safe mechanism.

0 Kudos
15 Replies
Dmitry_Vyukov
Valued Contributor I
511 Views
Raf_Schietekat:
Who would (be interested to find out and) explain why "test_mutex.exe 1" succeeds but "test_mutex.exe 2" and "test_mutex.exe 4" just keep spinning after having printed "Spin RW Mutex readers & writers time =" unless spin_rw_mutex is allowed to (time out and) __TBB_Yield(), or, e.g., calls printf() as part of the spin loop (probably also because of an implied yield)?



I've run test_mutex from latest stable release (2.1) with parameters "1", "2" and "4", and they all print "done" and finished. Do I need to make some special setup?


0 Kudos
Wooyoung_K_Intel
Employee
511 Views

Hi,

I commented out Yield() in internal::ExponentialBackoff and ran test_mutex repeatedly, and did not see what you described. If you describe your set-up in more detail, I would go back and try it again and see if I can reproduce it.

As for Yield(), it is not needed in theory. A well-written application where accesses to a shared regionare spread out in time would not call it at all. I did some experiments with it and without it, and at least for the synthetic micro-benchmark that I used, I did not see much difference in execution times.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
511 Views
MADwkim3:

As for Yield(), it is not needed in theory. A well-written application where accesses to a shared regionare spread out in time would not call it at all. I did some experiments with it and without it, and at least for the synthetic micro-benchmark that I used, I did not see much difference in execution times.


Btw, here is a concurrent discussion about performance of different implementations of spin-mutexes (also with some benchmark results):
http://softwarecommunity.intel.com/isn/Community/en-US/forums/2/30263091/ShowThread.aspx


0 Kudos
RafSchietekat
Valued Contributor III
511 Views
I was trying to apply the change in tbb_atomic.h (see "Additions to atomic" patch) also to spin_rw_mutex, as follows (also to the other acquisition methods, of course):

//! Acquire write lock on the given mutex.
bool spin_rw_mutex_v3::internal_acquire_writer()
{
 ITT_NOTIFY(sync_prepare, this);
 for( internal::ExponentialBackoff backoff; ; backoff.pause() ) {
 state_t s = const_cast(state); // ensure reloading
 if( !(s & BUSY) ) { // no readers, no writers
 if( CAS(state, WRITER, s)==s )
 break; // successfully stored writer flag
 #if !__TBB_USE_LOCKING_NEWSTYLE
 backoff.reset(); // we could be very close to complete op.
 #endif
 } else if( !(s & WRITER_PENDING) ) { // no pending writers
 __TBB_AtomicOR(&state, WRITER_PENDING);
 }
 #if __TBB_USE_LOCKING_NEWSTYLE
 int pauses = 0;
 do {
 if( pauses<__TBB_PAUSES_BETWEEN_YIELDS ) {
 ++pauses;
 __TBB_Pause(1); // be nice to other hyperthreads, save power
 }
 else {
 pauses = 0;
 __TBB_Yield();
 }
 s = const_cast(state); // ensure reloading
 } while(!( !(s & BUSY) || !(s & WRITER_PENDING) ));
 #endif
 }
 ITT_NOTIFY(sync_acquired, this);
 return false;
}

Without the workaround __TBB_PAUSES_BETWEEN_YIELDS, which is the same as setting it to a very large value, I get the observed problem, but I'm not sure whether this works yet (I also get strange variations in total run time, and I need to do more thinking/testing). So I was presuming that maybe there is some oversubscription going on in the test and each lock took a whole quantum or something...

0 Kudos
RafSchietekat
Valued Contributor III
511 Views
Maybe somebody with more time for multithreading could test my algorithm? Note that currently pause() waits for a sort-of random amount of time chosen from an exponentially growing interval that currently maxes out where TBB originally maxed out, but I would like to limit it to something related to default_num_threads().

The idea is to pretend that a lock is like Ethernet: wait till the channel is free, then wait some more (to avoid jumping in there all threads at once), and if it's still free go for it (grab the cache line for modification). The important change is that of course there is still a bias toward newcomers, but earlier arrivers are not starved anymore, and the backoff interval does not grow to the maximum value while a thread is merely waiting (only on each failed acquisition attempt). The next version could record the delay associated with the last successful acquisition, so that the next thread could start with a fraction of that, both releaving the bias toward new threads and the contention level.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
511 Views
Raf_Schietekat:
I was trying to apply the change in tbb_atomic.h (see "Additions to atomic" patch) also to spin_rw_mutex, as follows (also to the other acquisition methods, of course):

//! Acquire write lock on the given mutex.
bool spin_rw_mutex_v3::internal_acquire_writer()
{
 ITT_NOTIFY(sync_prepare, this);
 for( internal::ExponentialBackoff backoff; ; backoff.pause() ) {
 state_t s = const_cast(state); // ensure reloading
 if( !(s & BUSY) ) { // no readers, no writers
 if( CAS(state, WRITER, s)==s )
 break; // successfully stored writer flag
 #if !__TBB_USE_LOCKING_NEWSTYLE
 backoff.reset(); // we could be very close to complete op.
 #endif
 } else if( !(s & WRITER_PENDING) ) { // no pending writers
 __TBB_AtomicOR(&state, WRITER_PENDING);
 }
 #if __TBB_USE_LOCKING_NEWSTYLE
 int pauses = 0;
 do {
 if( pauses<__TBB_PAUSES_BETWEEN_YIELDS ) {
 ++pauses;
 __TBB_Pause(1); // be nice to other hyperthreads, save power
 }
 else {
 pauses = 0;
 __TBB_Yield();
 }
 s = const_cast(state); // ensure reloading
 } while(!( !(s & BUSY) || !(s & WRITER_PENDING) ));
 #endif
 }
 ITT_NOTIFY(sync_acquired, this);
 return false;
}


Please post all the code - acquire_reader, release_writer and release_reader. Or better whole mutex code and test suite. Otherwise it's difficult to say something.

Btw, you can use Relacy Race Detector for verification. If there is some kind of deadlock, livelock, logical race etc then Relacy will find it in a second.


0 Kudos
Dmitry_Vyukov
Valued Contributor I
511 Views
Raf_Schietekat:

I also get strange variations in total run time, and I need to do more thinking/testing


Yeah, I was observing the same thing in some spin mutex implementations. The effect shows only when oversubscription has place, and only in some spin-mutex implementations, and other spin-mutex implementations doesn't inclined to the effect. See here for details:
http://softwarecommunity.intel.com/isn/Community/en-US/forums/permalink/30263091/30263061/ShowThread.aspx#30263061


0 Kudos
Dmitry_Vyukov
Valued Contributor I
511 Views
Raf_Schietekat:

The idea is to pretend that a lock is like Ethernet: wait till the channel is free, then wait some more (to avoid jumping in there all threads at once), and if it's still free go for it (grab the cache line for modification). The important change is that of course there is still a bias toward newcomers, but earlier arrivers are not starved anymore, and the backoff interval does not grow to the maximum value while a thread is merely waiting (only on each failed acquisition attempt). The next version could record the delay associated with the last successful acquisition, so that the next thread could start with a fraction of that, both releaving the bias toward new threads and the contention level.


I am not sure whether such 'complicated' strategies make sense for spin-mutex. Spin-mutex is a kind of lower-level than Ethernet. And for spin-mutex very simple algorithms work surprisingly well. For example following implementation shows good results:
for (;;) {
if (0 == XCHG(lock, 1)) return;
for (int i = 0; i != 300; ++i) pause;
if (0 == XCHG(lock, 1)) return;
SwitchToThread();
}
But honestly I was testing only on very simple synthetic benchmarks and only on Core 2 Quad. So maybe more exhaustive 'real-world' testing will show superiority of more elaborated techniques.


0 Kudos
Dmitry_Vyukov
Valued Contributor I
511 Views
Raf_Schietekat:

The idea is to pretend that a lock is like Ethernet: wait till the channel is free, then wait some more (to avoid jumping in there all threads at once), and if it's still free go for it (grab the cache line for modification). The important change is that of course there is still a bias toward newcomers, but earlier arrivers are not starved anymore, and the backoff interval does not grow to the maximum value while a thread is merely waiting (only on each failed acquisition attempt). The next version could record the delay associated with the last successful acquisition, so that the next thread could start with a fraction of that, both releaving the bias toward new threads and the contention level.


Also you can consider trick posted by Jim Dempsey here:
http://softwarecommunity.intel.com/isn/Community/en-US/forums/permalink/30263091/30262913/ShowThread.aspx#30262913

It recalls 'futile wakeup throttling' technique describer by David Dice here:
http://www.usenix.org/events/jvm01/full_papers/dice/dice.pdf
And David was reporting substantial improvement after applying 'futile wakeup throttling' in heavy contended case.

0 Kudos
jimdempseyatthecove
Honored Contributor III
511 Views

Raf and Dmitriy

The technique used in the referenced post is an ordered SpinLock technique to be used only for short duration locks. The working example posted should not be used for nested SpinLocks without incorporating the features of the follow-on post (constructor/destructor of lock reference object which remembers thread's locking ID). This technique may need to be revised if the running environment has an over subscription of threads (more threads competing for lock than cores available to run). In this case,you might want to occasionally call Sleep(0) after "n" SwitchToThread calls fail to gain lock.

Dmitriy's suggested lock technique should be faster but at the expense of loss of First Come - First Serve characteristic of the method of my post. The requirements of the application should indicate which technique is most suitable. Example:

Assume the application observes the system has 8 cores during initialization and divides the workload into 8 domains of equal work. In this situation you would not want some of the threads to get better access to the lock than others.

Jim Dempsey

0 Kudos
Wooyoung_K_Intel
Employee
511 Views

Hi, Raf,

I did a small experiment and indeed it confirmed your observation -- the test_mutex.exe spins infinitely in spin_rw_mutex::internal_acquire_writer().To simplify the code for examination, I replaced your extra delay code with a simple long pause (i.e., pause(128) ) Since only 2 and 4 threads are involved, I ruled out the possibility of over-subscription. I examined the logic of internal_acquire_writer() hard again, and found no problem.

Then, I tried it with icc 10.1 and voila, test_mutex.exe worked. it worked with gcc 3.4.4as well. I suspected gcc 4.1.2 generated incorrect code. So, Igenerated theassembly code for the routine using gcc 4.1.2 as well as icc and compared them. Attached is the generated code.Note, in thegcc version, after 'pause', no reload of'state' is present; that seems to be the cause of the infinite loop...Whatcompiledid you use for your experiment? Maybe gcc 4.1.2 is inerror, or we haveto go back and review the use of volatiles in TBB more carefully...

best regards,

0 Kudos
RafSchietekat
Valued Contributor III
511 Views
Wooyoung, thanks for the thorough investigation, I'm impressed!

"The observable behavior of the abstract machine is its sequence of reads and writes to volatile data and calls to library I/O functions." g++ seems to be acting against this and the following in [intro.execution] of the C++ standard, so I would invite you to submit the evidence you found. Is it just with const_cast, or also for a direct volatile state declaration? And is it just my little trick, or did this already hamper performance of spin_rw_mutex?

A likely workaround is to use atomics, I suppose, or just calling a function that cannot be inlined.

(Added) I used g++ 4.0.3.
0 Kudos
RafSchietekat
Valued Contributor III
511 Views
Dmitriy, I'm just trying to make an educated guess, but TBB's Andrey Marochko has already demonstrated that there may be something to it (see "Spinning"; best to continue this there, I suppose).

Jim, you're welcome also to grill "my" spin lock and comment on it.

(Edited to specify evidence.)


0 Kudos
RafSchietekat
Valued Contributor III
511 Views
As I suspected, if "state" is qualified "volatile" (where it is declared, not just in a const_cast where it is used), the problem disappears, so the course of action seems clear. I've always felt uneasy about casting to volatile, so I have personal reasons as well for recommending this change to the code. smiley [:-)] I have not found any indication to think that g++'s behaviour might not be in error, though (Wooyoung Kim, will you submit an error report based on your research together with this workaround?).

Of course, a volatile "doesn't work", i.e., doesn't do the work of an atomic like a Java volatile does, it's essentially just a state holder for use with atomic operations (for TBB's purposes, anyway). To show off some new tbb::atomic operations scheduled to appear in the next "Additions to atomic" patch version, I have selectively performed such encapsulation.

0 Kudos
Wooyoung_K_Intel
Employee
511 Views

Thank you very much for finding a clean work-around.

I willfile a bug report against gcc with your work-around.

best regards

0 Kudos
Reply