Can I use OpenMP locking primitives with TBB? I haven't seen any problem using TBB with OpenMP locking primitives when I run only a single (multithreaded) TBB process, but I am seeing a strange problem when I launch multiple MPI processes (each multithreaded using TBB) on a single shared-memory machine (mainly for testing before moving to a cluster).
The problem is that my program finishes execution (it executes the last std::cout statement before the return statement of the main function) but segfaults on clean-up. The following is the gdb output (this does not happen every time, but it happens most frequently on my system when I launch 16 MPI processes with 4 threads each; my test system has 64 hardware threads).
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x4081a940 (LWP 25208)]
0x0000000000bcbbe2 in __kmp_unregister_root_current_thread(int) ()
(gdb) where
#0 0x0000000000bcbbe2 in __kmp_unregister_root_current_thread(int) ()
#1 0x0000000000bcb795 in __kmp_internal_end_dest ()
#2 0x0000003afac05ad9 in __nptl_deallocate_tsd () from /lib64/libpthread.so.0
#3 0x0000003afac0674b in start_thread ()
from /lib64/libpthread.so.0
#4 0x0000003afa0d44bd in clone ()
from /lib64/libc.so.6
(gdb)
I tried to reproduce the behavior with simple code, and in the following code
[bash]
int main( int argc, char* ap_args[] ) {
    task_scheduler_init init( 4 );
    long sum = 0;
#if 1
    omp_lock_t lock;
    omp_init_lock( &lock );
#else
    mutex myMutex;
#endif
    parallel_for( blocked_range<long>( 0, 100000000L ),
        [&]( const blocked_range<long>& r ) {
            for( long i = r.begin() ; i < r.end() ; i++ ) {
                {
#if 1
                    omp_set_lock( &lock );
#else
                    mutex::scoped_lock myLock( myMutex );
#endif
                    sum += i;
#if 1
                    omp_unset_lock( &lock );
#endif
                }
            }
        } );
#if 1
    omp_destroy_lock( &lock );
#endif
    cout << sum << endl;
    return 0;
}
[/bash]
the segfault happens if I use the OpenMP lock but does not happen if I use a TBB mutex (#if 1 -> #if 0). So should I avoid OpenMP locking primitives with TBB, or does anyone have an idea what is going on?
Also, is there any documentation about the TBB mutex's memory consistency model (e.g. explaining when variables in registers are flushed and so on)? My memory is fading, but as I recall, TBB's mutex did not work for code I had written assuming OpenMP's locking mechanism (and its memory model), so I decided to use OpenMP locking primitives because I am more familiar with their memory consistency model.
In short,
1) Is there any known issue with using OpenMP locking primitives with TBB?
2) Is there any documentation about the TBB mutex's memory consistency model?
Thank you!!!
What specifically would you want to know about the memory model? When you lock, state is acquired through the lock, and when you unlock, it is released for use by the next owner.
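To make that concrete, here is a minimal sketch (illustrative names, not code from this thread) of that guarantee: whatever one thread writes while holding a TBB mutex is visible to the next thread that acquires the same mutex.
[cpp]
#include "tbb/mutex.h"

tbb::mutex m;
long shared_sum = 0;

void add( long v ) {
    tbb::mutex::scoped_lock lock( m );  // acquire: see the writes of previous owners
    shared_sum += v;                    // protected update
}                                       // scoped_lock destructor releases the lock,
                                        // publishing the write to the next owner
[/cpp]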
Your code should work. If you are using Intel's C++ compiler, I suggest you file a bug report, or at least post this on the Intel C++ forum.
You have a simple reproducer, which is helpful for someone else to test the code.
The only problem I can think of is:
1) the main() thread acquires the OpenMP lock
2) another thread attempts the lock and is blocked
3) upon failure, that thread instantiates the OpenMP thread pool
4) locks work properly until the loop ends
5) on exit of scope, the main thread shuts down the TBB thread pool, destroying the thread context that is running the OpenMP thread pool
Try this as a hypothesis:
In main(), in front of the task_scheduler_init, insert:
long sum = 0;
#pragma omp parallel
{
sum = omp_get_thread_num(); // dummy code
}
sum = 0;
task_scheduler_init init( 4 );
...
The purpose of the dummy code is such that the compiler optimizer doesn't remove your code.
What the above does is establish the OpenMP thread team from the main() thread (as opposed to from within one of the TBB threads).
Jim Dempsey
>>statement of the main function) but segfaults on clean-up...
That is really strange, and it means that ALL threads, for example OpenMP, TBB, etc., are destroyed and the application's primary process is destroyed as well! This is clearly related to a problem with synchronization.
I would add an additional class to see and debug what is going on, something like:
class CInitApp
{
public:
    CInitApp()
    {
        printf( "Application Started\n" );
    }
    virtual ~CInitApp()
    {
        printf( "Application Finished\n" );
    }
};
...
CInitApp ia; // It has to be declared outside of the main() function!
...
int main( ... )
{
    // Your Test-Code...
}
Also, I would try to set a handler with the 'atexit' CRT function in order to see when it is called. Please look at MSDN for how it works.
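For example, a minimal sketch of the atexit() idea (the handler name is made up):
[cpp]
#include <cstdlib>
#include <cstdio>

static void OnExit()
{
    std::printf( "atexit handler called\n" );
}

int main()
{
    std::atexit( OnExit );  // runs during normal process termination
    // ... your test code ...
    return 0;
}
[/cpp]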
Best regards,
Sergey
So did you try that (note that I didn't)?
[cpp]
#define USE_OMP_LOCK_T 1

int main( int argc, char* ap_args[] ) {
    long sum = 0;
#if USE_OMP_LOCK_T
    omp_lock_t lock;
    omp_init_lock( &lock );
#else
    mutex myMutex;
#endif
    /* start and end TBB in this block */
    {
        task_scheduler_init init;
        parallel_for( blocked_range<long>( 0, 100000000L ),
            [&]( const blocked_range<long>& r ) {
                for( long i = r.begin() ; i < r.end() ; i++ ) {
#if USE_OMP_LOCK_T
                    omp_set_lock( &lock );
#else
                    mutex::scoped_lock myLock( myMutex );
#endif
                    sum += i;
#if USE_OMP_LOCK_T
                    omp_unset_lock( &lock );
#endif
                }
            } );
    }
#if USE_OMP_LOCK_T
    omp_destroy_lock( &lock );
#endif
    cout << sum << endl;
    return 0;
}
[/cpp]
[bash]
int main( int argc, char* ap_args[] ) {
    omp_lock_t lock;
    omp_init_lock( &lock );
    {
        task_scheduler_init init( 4 );
        long sum = 0;
        parallel_for( blocked_range<long>( 0, 100000000L ),
            [&]( const blocked_range<long>& r ) {
                for( long i = r.begin() ; i < r.end() ; i++ ) {
                    omp_set_lock( &lock );
                    sum += i;
                    omp_unset_lock( &lock );
                }
            } );
        cout << sum << endl;
    }
    omp_destroy_lock( &lock );
    return 0;
}
[/bash]
This works OK (no segfault) if I parallelize using #pragma omp parallel for (and the OpenMP lock) instead of tbb parallel_for.
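For reference, the OpenMP-only variant would look roughly like this (reconstructed from the description above, not the poster's exact code):
[cpp]
#include <omp.h>
#include <iostream>

int main() {
    omp_lock_t lock;
    omp_init_lock( &lock );
    long sum = 0;
    // Same reduction guarded by the same OpenMP lock, but parallelized with
    // #pragma omp parallel for instead of tbb::parallel_for.
    #pragma omp parallel for num_threads( 4 )
    for( long i = 0 ; i < 100000000L ; i++ ) {
        omp_set_lock( &lock );
        sum += i;
        omp_unset_lock( &lock );
    }
    omp_destroy_lock( &lock );
    std::cout << sum << std::endl;
    return 0;
}
[/cpp]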
The test program still segfaults with the dummy #pragma omp parallel routines, though it seems like this lowers the frequency of the bug.
1) TBB parallel_for works OK with TBB locking primitives (TBB mutex and atomic).
2) Seems like this has nothing to do with using mpiexec. I got segfaults in both
mpiexec -l -n 16 ./test
and
./test & ./test & .... ./test & (total 16 processes)
3) In my original application, this happens even when omp locking primitives are not executed.
For example,
if( condition that is never true ) {
#pragma omp critical
{
...
}
}
segfaults,
but if I insert assert( 0 )
e.g.
if( condition that is never true ) {
assert( 0 );
#pragma omp critical
{
...
}
}
my application does not segfault.
Guess something weird is happening between the TBB scheduler and the compiler (Intel icpc version 12.0.4).
The TBB reference manual (page 241) says that mutex and recursive_mutex block on a long wait, and spin_mutex yields on a long wait.
So here, does "block" mean that the task trying to grab the lock blocks and returns its thread to the thread pool so another task can use it, or does the blocked task hold the (hardware) thread while waiting for the lock to be released?
And what does "yield" mean here? Is it something similar to the sched_yield() function?
And regarding the memory model, I was asking whether there is a document like http://www.nic.uoregon.edu/iwomp2005/Talks/hoeflinger.pdf for the TBB mutex (especially pages 16-18).
"Guess something wierd is happening between TBB scheduler and the compier (Intel icpc version 12.0.4)."
Looks more like something weird is going wrong with the OpenMP part, because TBB is conceived as compiler-independent, using only standard C++ features, and I don't suppose the compiler has started to second-guess TBB yet and do smart things that then went wrong. Still, there might be some runtime interaction... but you could dispel those thoughts by trying a different compiler, and you probably have the GNU one at your disposal. And even then it might not be related to TBB directly.
"And one additional question about tbb mutex.
tbb reference manual page 241 says mutex and recursive mutex block on long wait and spin_mutex yields on long wait.
So here, block means the task trying to grab a lock blocks and returns the thread to the thread pool so other task can use the thread or the blocking task will hold the (hardware) thread while waiting for the lock to be released.
And what does "yield" mean here? is this something similar to shced_yield() function?"
Block does not mean that the thread is given away (and that's not the way threads and tasks find each other, either), it means that the thread gets descheduled until the kernel knows that it is fit to run again, as opposed to yielding where the thread is always given more opportunities because the kernel doesn't know anything about what is going on, it just sees the yielding (and yes, sched_yield() may be the implementation, depending on platform). It's a trade-off based on where the overhead lies, with the former more suitable for long waits and the latter more for short waits that may not even get to the yield part.
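As a rough rule of thumb (an illustrative sketch, not taken from the reference manual), pick the flavor by the expected length of the wait:
[cpp]
#include "tbb/mutex.h"       // blocks (OS-level wait) on long waits
#include "tbb/spin_mutex.h"  // spins, eventually yielding its time slice

tbb::spin_mutex counter_lock;   // critical section is a handful of instructions
tbb::mutex      resource_lock;  // critical section may take a long time

void increment( long& counter ) {
    tbb::spin_mutex::scoped_lock l( counter_lock );  // cheap to acquire and release
    ++counter;
}
[/cpp]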
"And regarding the memory model, I was asking whether there is a document like http://www.nic.uoregon.edu/iwomp2005/Talks/hoeflinger.pdf for TBB mutex. (especially page 16-18)"
There's no such formal document that I know of. Just use what the normal threading API would provide, and add intuitive notions like implicit happens-before between the region before a parallel_for and its tasks, and the same between the tasks and the region after the parallel_for, but not between the tasks in the parallel_for. It gets a little trickier with raw tasks, continuations, etc., where more faith is required. :-) What would you want to know?
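For instance, a small sketch of those implicit happens-before edges (illustrative names):
[cpp]
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include <vector>

void run() {
    std::vector<long> out( 1000 );
    long base = 42;  // written before the parallel_for, so visible inside its tasks
    tbb::parallel_for( tbb::blocked_range<size_t>( 0, out.size() ),
        [&]( const tbb::blocked_range<size_t>& r ) {
            for( size_t i = r.begin() ; i != r.end() ; ++i )
                out[i] = base + (long)i;  // each task writes only its own slots
        } );
    long first = out[0];  // the tasks happen-before the code after parallel_for,
    (void)first;          // so this read sees the tasks' writes
    // There is no ordering between the tasks themselves, though; sharing data
    // across tasks still needs a mutex or an atomic.
}
[/cpp]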
So first try GNU compiler for some additional info, then try with plain threads instead of TBB to help put the blame where it belongs, unless you just stick with TBB's mutexes to avoid the problem altogether.
Intel may want to spend a bit more time on this, as it may (or may not) affect TBB-OpenMP interoperability.
Heck, this is a duplicate, please delete this :-)
If you don't compare with g++ and/or plain threads instead of TBB, Intel is also not going to have much reason or facts to do anything.
Then again, with the Intel compiler the threads are supposed to be shared through the RML... so another wild guess is that things might change if you prime them for OpenMP use by first running an OpenMP loop.
But I'm just guessing based on what I do know, sorry. It might be more satisfying if somebody knew how OpenMP mutexes are implemented and/or what you are supposed to be able to do with them relating to non-OpenMP threads.
Thanks!!!
That's quite surprising, and (therefore) also quite interesting. Maybe some code in governor::create_rml_server can be disabled to force a private server even with an Intel compiler? Maybe some code should verify that TBB's worker threads really are destroyed when the task_scheduler_init goes away? Maybe there's public source code for OpenMP mutexes? But I now defer to any insider opinion.