Intel® oneAPI Threading Building Blocks

Using OpenMP locking primitives with TBB (and MPI)

Seunghwa_Kang
Beginner
Hello,

Can I use OpenMP locking primitives with TBB? I haven't seen any problem using OpenMP locking primitives with TBB when I execute only a single (multithreaded) TBB process, but I am seeing a strange problem when I launch multiple MPI processes (each multithreaded using TBB) on a single shared-memory machine (mainly for testing before moving to a cluster).

The problem is that my program finishes execution (it executes the last std::cout statement before the return statement of the main function) but segfaults during clean-up. The following is the gdb output. This does not happen every time, but it happens most frequently on my system when I launch 16 MPI processes with each process running 4 threads (my test system has 64 hardware threads).

[bash]Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x4081a940 (LWP 25208)]
0x0000000000bcbbe2 in __kmp_unregister_root_current_thread(int) ()
(gdb) where
#0  0x0000000000bcbbe2 in __kmp_unregister_root_current_thread(int) ()
#1  0x0000000000bcb795 in __kmp_internal_end_dest ()
#2  0x0000003afac05ad9 in __nptl_deallocate_tsd () from /lib64/libpthread.so.0
#3  0x0000003afac0674b in start_thread () from /lib64/libpthread.so.0
#4  0x0000003afa0d44bd in clone () from /lib64/libc.so.6
(gdb)
[/bash]


I tried to reproduce the behavior with simple code. In the following program,

[cpp]#include <omp.h>
#include <iostream>

#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/mutex.h"

using namespace std;
using namespace tbb;

int main( int argc, char* ap_args[] ) {
    task_scheduler_init init( 4 );
    long sum = 0;
#if 1 /* change to #if 0 to use the TBB mutex instead of the OpenMP lock */
    omp_lock_t lock;
    omp_init_lock( &lock );
#else
    mutex myMutex;
#endif

    parallel_for( blocked_range<long>( 0, 100000000L ),
                  [&]( const blocked_range<long>& r ) {
        for( long i = r.begin() ; i < r.end() ; i++ ) {
#if 1
            omp_set_lock( &lock );
#else
            mutex::scoped_lock myLock( myMutex );
#endif
            sum += i;
#if 1
            omp_unset_lock( &lock );
#endif
        }
    } );

#if 1
    omp_destroy_lock( &lock );
#endif

    cout << sum << endl;

    return 0;
}
[/cpp]

the segfault happens if I use omp_lock but does not happen if I use the TBB mutex (changing #if 1 to #if 0). So should I avoid using OpenMP locking primitives with TBB, or does anyone have an idea about this?

Also, is there any documentation about the TBB mutex's memory consistency model (e.g., explaining when variables in registers are flushed and so on)? My memory is fading, but what I remember is that TBB's mutex did not work for code I had written assuming the OpenMP locking mechanism (and its memory model), so I decided to use OpenMP locking primitives, as I am more familiar with their memory consistency model.

In short,

1) Is there any known issue with using OpenMP locking primitives with TBB?
2) Is there any documentation about the TBB mutex's memory consistency model?

Thank you!!!
RafSchietekat
Valued Contributor III
Could you use the OpenMP locks with plain threads? And what happens if you init and destroy the lock outside the lifetime of task_scheduler_init (try that first)?

What specifically would you want to know about the memory model? When you lock, state is acquired through the lock, and when you unlock, it is released for use by the next owner.
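For concreteness, a minimal pthread-based sketch of such a plain-threads test might look like this (the thread count, range split, and worker helper are illustrative assumptions, not code from this thread):

[cpp]#include <omp.h>
#include <pthread.h>
#include <stdio.h>

static omp_lock_t lock;
static long sum = 0;

// hypothetical worker: each thread sums one quarter of the range under the lock
void* worker( void* arg ) {
    long begin = (long)arg * 25000000L;
    long end   = begin + 25000000L;
    for( long i = begin ; i < end ; i++ ) {
        omp_set_lock( &lock );
        sum += i;
        omp_unset_lock( &lock );
    }
    return 0;
}

int main() {
    omp_init_lock( &lock );
    pthread_t threads[4];
    for( long n = 0 ; n < 4 ; n++ )
        pthread_create( &threads[n], 0, worker, (void*)n );
    for( long n = 0 ; n < 4 ; n++ )
        pthread_join( threads[n], 0 );
    omp_destroy_lock( &lock );
    printf( "%ld\n", sum );
    return 0;
}
[/cpp]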
jimdempseyatthecove
Honored Contributor III
Kang,

Your code should work. If you are using Intel's C++ compiler, I suggest you file a bug report, or at least post this on the Intel C++ forum.

You have a simple reproducer, which is helpful for someone else to test the code.

The only problem I can think of is:

1) the main() thread acquires the OpenMP lock
2) another thread attempts the lock and is blocked
3) upon failure, the other thread instantiates the OpenMP thread pool
4) the locks work properly until the loop ends
5) the main thread, on exit of scope, shuts down the TBB thread pool, destroying the thread context that is running an OpenMP thread pool

Try this as a hypothesis:

In main(), in front of the task_scheduler_init, insert the following:

[cpp]long sum = 0;
#pragma omp parallel
{
    sum = omp_get_thread_num(); // dummy code
}
sum = 0;
task_scheduler_init init( 4 );
...
[/cpp]

The purpose of the dummy code is to ensure that the compiler's optimizer doesn't remove your code.

What the above does is establish the OpenMP thread team from the main() thread (as opposed to from within one of the TBB threads).

Jim Dempsey
SergeyKostrov
Valued Contributor II
>>...The problem is my program finishes execution (executes the last std::cout statement before the return
>>statement of the main function) but segfaults on clean-up...

That is really strange, and it means that ALL threads (OpenMP, TBB, etc.) are destroyed and the application's primary process is being destroyed as well! This is clearly related to a problem with synchronization.

I would add an additional class to see and debug what is going on, something like:

[cpp]#include <stdio.h>

class CInitApp
{
public:
    CInitApp()
    {
        printf( "Application Started\n" );
    }
    virtual ~CInitApp()
    {
        printf( "Application Finished\n" );
    }
};

...
CInitApp ia; // It has to be declared outside of the main() function!
...
int main( ... )
{
    // Your Test-Code...
}
[/cpp]

and

I would try to set a handler with the 'atexit' CRT function in order to see when it is called. Please look at MSDN for how it works.
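For illustration, a minimal sketch of registering such a handler (standard C library atexit; the handler name is made up for the example):

[cpp]#include <stdio.h>
#include <stdlib.h>

// hypothetical handler: registered handlers run during normal exit,
// before the process finishes tearing down
void on_exit_handler() {
    printf( "atexit handler reached\n" );
}

int main( int argc, char* ap_args[] ) {
    atexit( &on_exit_handler );
    // Your Test-Code...
    return 0;
}
[/cpp]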

Best regards,
Sergey
RafSchietekat
Valued Contributor III
"And what happens if you init and destroy the lock outside the lifetime of task_scheduler_init (try that first)?"
So did you try that (note that I didn't)?

[cpp]#define USE_OMP_LOCK_T 1

int main( int argc, char* ap_args[] ) {
    long sum = 0;
#if USE_OMP_LOCK_T
    omp_lock_t lock;
    omp_init_lock( &lock );
#else
    mutex myMutex;
#endif

    /* start and end TBB in this block */ {
        task_scheduler_init init; // note: "init()" would declare a function, not an object
        parallel_for( blocked_range<long>( 0, 100000000L ),
                      [&]( const blocked_range<long>& r ) {
            for( long i = r.begin() ; i < r.end() ; i++ ) {
#if USE_OMP_LOCK_T
                omp_set_lock( &lock );
#else
                mutex::scoped_lock myLock( myMutex );
#endif
                sum += i;
#if USE_OMP_LOCK_T
                omp_unset_lock( &lock );
#endif
            }
        } );
    }

#if USE_OMP_LOCK_T
    omp_destroy_lock( &lock );
#endif

    cout << sum << endl;

    return 0;
}
[/cpp]
Seunghwa_Kang
Beginner
No difference if I destroy the lock outside the lifetime of task_scheduler_init (I did this as follows; let me know if this is not what you intended).

[cpp]int main( int argc, char* ap_args[] ) {
    omp_lock_t lock;

    omp_init_lock( &lock );

    {
        task_scheduler_init init( 4 );
        long sum = 0;

        parallel_for( blocked_range<long>( 0, 100000000L ),
                      [&]( const blocked_range<long>& r ) {
            for( long i = r.begin() ; i < r.end() ; i++ ) {
                omp_set_lock( &lock );
                sum += i;
                omp_unset_lock( &lock );
            }
        } );

        cout << sum << endl;
    }

    omp_destroy_lock( &lock );

    return 0;
}
[/cpp]


This works OK (no segfault) if I parallelize using #pragma omp parallel for (and the omp lock) instead of tbb parallel_for.
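For reference, the OpenMP-only variant described above would look roughly like this (a sketch, not necessarily the exact code used):

[cpp]#include <omp.h>
#include <stdio.h>

int main( int argc, char* ap_args[] ) {
    omp_lock_t lock;
    long sum = 0;

    omp_init_lock( &lock );

#pragma omp parallel for num_threads( 4 )
    for( long i = 0 ; i < 100000000L ; i++ ) {
        omp_set_lock( &lock );
        sum += i;
        omp_unset_lock( &lock );
    }

    omp_destroy_lock( &lock );

    printf( "%ld\n", sum );

    return 0;
}
[/cpp]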


Seunghwa_Kang
Beginner

The test program still segfaults with the dummy #pragma omp parallel routines, though this seems to lower the frequency of the bug.

Seunghwa_Kang
Beginner
The program segfaulted after executing the atexit function (I am using a Linux system, so I used man instead of MSDN :-)) and the destructor of CInitApp.
Seunghwa_Kang
Beginner
To add a few more observations:

1) TBB parallel_for works OK with TBB locking primitives (TBB mutex and atomic) and GNU __sync_fetch_and_add, but segfaults with OpenMP synchronization primitives (omp_lock, #pragma omp critical, #pragma omp atomic). OpenMP locking primitives work OK with #pragma omp parallel for. (A sketch of the working lock-free variants appears at the end of this post.)

2) This seems to have nothing to do with using mpiexec. I got segfaults both with

mpiexec -l -n 16 ./test

and with

./test&; ./test&; .... ; ./test& (16 processes in total)

3) In my original application, this happens even when the omp locking primitives are never executed.

For example,

[cpp]if( condition that is never true ) {
#pragma omp critical
    {
        ...
    }
}
[/cpp]

segfaults, but if I insert assert( 0 ), e.g.

[cpp]if( condition that is never true ) {
    assert( 0 );
#pragma omp critical
    {
        ...
    }
}
[/cpp]

my application does not segfault.

I guess something weird is happening between the TBB scheduler and the compiler (Intel icpc version 12.0.4).
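For reference, a sketch of the lock-free variants from observation 1) above (illustrative only; 'total' in the commented alternative would be a plain shared long):

[cpp]#include <iostream>

#include "tbb/atomic.h"
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

using namespace std;
using namespace tbb;

tbb::atomic<long> sum; // zero-initialized at namespace scope

int main( int argc, char* ap_args[] ) {
    task_scheduler_init init( 4 );
    parallel_for( blocked_range<long>( 0, 100000000L ),
                  [&]( const blocked_range<long>& r ) {
        for( long i = r.begin() ; i < r.end() ; i++ ) {
            sum += i; // atomic fetch-and-add, no lock needed
            // alternatively, with a plain shared long total:
            // __sync_fetch_and_add( &total, i ); // GNU builtin, full barrier
        }
    } );
    cout << sum << endl;
    return 0;
}
[/cpp]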

Seunghwa_Kang
Beginner
And one additional question about the TBB mutex.

The TBB reference manual, page 241, says mutex and recursive_mutex block on long waits and spin_mutex yields on long waits.

So here, does "block" mean the task trying to grab a lock blocks and returns its thread to the thread pool so another task can use the thread, or does the blocking task hold the (hardware) thread while waiting for the lock to be released?

And what does "yield" mean here? Is this something similar to the sched_yield() function?

And regarding the memory model, I was asking whether there is a document like http://www.nic.uoregon.edu/iwomp2005/Talks/hoeflinger.pdf (especially pages 16-18) for the TBB mutex.
RafSchietekat
Valued Contributor III
From the confirmation you provided I still guess that you wouldn't be able to use an OpenMP mutex with ordinary threads either, but that's not as trivial to try, and I leave that entirely up to you. I don't know nearly enough about OpenMP to know why that might be so, but maybe its mutexes are only meant to be used with OpenMP-controlled threads for administrative reasons? TBB mutexes are either based on user-space atomic operations and memory fences (without any external administration), or shallow wrappers around mutexes provided by the traditional threads API.

"Guess something wierd is happening between TBB scheduler and the compier (Intel icpc version 12.0.4)."
Looks more like something weird is going wrong with the OpenMP part, because TBB is conceived as compiler-independent, using only standard C++ features, and I don't suppose the compiler has started to second-guess TBB yet and do smart things that then went wrong. Still, there might be some runtime interaction... but you could dispel those thoughts by trying a different compiler, and you probably have the GNU one at your disposal. And even then it might not be related to TBB directly.

"And one additional question about tbb mutex.

tbb reference manual page 241 says mutex and recursive mutex block on long wait and spin_mutex yields on long wait.

So here, block means the task trying to grab a lock blocks and returns the thread to the thread pool so other task can use the thread or the blocking task will hold the (hardware) thread while waiting for the lock to be released.

And what does "yield" mean here? is this something similar to shced_yield() function?"
Block does not mean that the thread is given away (and that's not the way threads and tasks find each other, either); it means that the thread gets descheduled until the kernel knows it is fit to run again. Yielding, by contrast, means the thread is always given more opportunities, because the kernel doesn't know anything about what is going on; it just sees the yielding (and yes, sched_yield() may be the implementation, depending on the platform). It's a trade-off based on where the overhead lies, with the former more suitable for long waits and the latter for short waits that may not even get to the yield part.
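For illustration, a toy sketch of the two acquisition styles (not TBB's actual implementation, which varies by platform):

[cpp]#include <atomic>
#include <pthread.h>
#include <sched.h>

// "Yielding" style, roughly what a spin_mutex does on a long wait: the thread
// stays runnable and repeatedly offers its time slice back to the scheduler.
std::atomic_flag spin_flag = ATOMIC_FLAG_INIT;

void yielding_lock() {
    while( spin_flag.test_and_set( std::memory_order_acquire ) )
        sched_yield(); // the kernel just sees a yield; it knows nothing about the lock
}

void yielding_unlock() {
    spin_flag.clear( std::memory_order_release );
}

// "Blocking" style, roughly what mutex/recursive_mutex do on a long wait:
// the kernel deschedules the thread until the lock can be acquired.
pthread_mutex_t blocking_mutex = PTHREAD_MUTEX_INITIALIZER;

void blocking_lock() {
    pthread_mutex_lock( &blocking_mutex ); // thread sleeps in the kernel if contended
}

void blocking_unlock() {
    pthread_mutex_unlock( &blocking_mutex );
}
[/cpp]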

"And regarding the memory model, I was asking whether there is a document like http://www.nic.uoregon.edu/iwomp2005/Talks/hoeflinger.pdf for TBB mutex. (especially page 16-18)"
There's no such formal document that I know of. Just use what the normal threading API would provide, and add intuitive notions like implicit happens-before between the region before a parallel_for and its tasks, and the same between the tasks and the region after the parallel_for, but not between the tasks in the parallel_for. It gets a little trickier with raw tasks, continuations, etc., where more faith is required. :-) What would you want to know?
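For illustration only, a sketch of those intuitive notions (informal, not a formal specification):

[cpp]#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

int data = 0;

int main() {
    data = 42; // (1) written before the parallel_for starts
    tbb::parallel_for( tbb::blocked_range<long>( 0, 1000L ),
                       [&]( const tbb::blocked_range<long>& r ) {
        // (1) happens-before every task body: reading data here is safe
        int local = data;
        (void)local;
        // but a plain write to shared state here by one task is NOT
        // ordered with respect to sibling tasks: use a mutex or atomic
    } );
    // every task body happens-before this point: task effects are visible here
    return 0;
}
[/cpp]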

So first try the GNU compiler for some additional information, then try plain threads instead of TBB to help put the blame where it belongs, unless you just stick with TBB's mutexes to avoid the problem altogether.
Seunghwa_Kang
Beginner
Thank you for the great answer, and yes, I have decided to move to TBB's mutexes. This looks like a complicated problem and may take some time to be fixed.

Intel may want to spend a bit more time on this, as it may (or may not) affect TBB-OpenMP interoperability.
RafSchietekat
Valued Contributor III
"Intel may want to spend a bit more time as this may (or may not) affect TBB-OpenMP interoperability."
If you don't compare with g++ and/or plain threads instead of TBB, Intel is also not going to have much reason or facts to do anything.

Then again, with the Intel compiler the threads are supposed to be shared through the RML... so another wild guess is that things might change if you prime them for OpenMP use by first running an OpenMP loop.

But I'm just guessing based on what I do know, sorry. It might be more satisfying if somebody knew how OpenMP mutexes are implemented and/or what you are supposed to be able to do with them relating to non-OpenMP threads.
Seunghwa_Kang
Beginner
OK, I ran this using pthreads with OpenMP locking primitives, and this combination worked OK. I am looking for a machine with a version of gcc recent enough to support lambda expressions and OpenMP (or a site to download precompiled gcc binaries); if I find one, I will also test this using GCC.

Thanks!!!
RafSchietekat
Valued Contributor III
"OK, I ran this using pthread with OpenMP locking primitives and this combination worked OK."
That's quite surprising, and (therefore) also quite interesting. Maybe some code in governor::create_rml_server can be disabled to force a private server even with an Intel compiler? Maybe some code should verify that TBB's worker threads really are destroyed when the task_scheduler_init goes away? Maybe there's public source code for OpenMP mutexes? But I now defer to any insider opinion.