Re: OMP vs TBB

pvonkaenel · ‎05-14-2009

Hi all,

I've used OpenMP in a couple of projects in the past and gotten good results with it with minimal effort, and I've just started reading the Threading Building Blocks book and started wondering when to use one threading package over the other. I'm working in C++ on Windows, so the language and OS are not an issue. From what I've read so far, it looks like TBB has the same features as OMP, and adds a lot more on top of it - is this correct?

Basically, before I go back and re-implement my OMP code in TBB, I would be interested in a Pro/Con list for each. I know they can co-exist, but I'd rather not have two independent packages each creating an "optimal" number of threads that compete with each other for resources.

Any thoughts?

Thanks,
Peter

jimdempseyatthecove · ‎05-14-2009

Peter,

Some of the more experienced users of TBB could tell you about their experience of using TBB over OpenMP. My overview on the subject is:

OpenMP is good when your parallelization requirements are best expressed with a fan-out and fan-in diagram:

          /==========\          /==========\
---------<------------>--------<------------>--
          \==========/          \==========/

I do not know if the above diagrammed well.

The intent of the above is to illustrate a run of serial code, then parallel for a run, then serial, then parallel. OpenMP is more than suitable for this type of programming. The above diagram depicts a repetitive hourglass type of figure.
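
A minimal sketch of that hourglass shape, assuming a compiler with OpenMP support. The function name `scale_then_sum` and the data are made up for illustration. Without the OpenMP flag the pragmas are ignored and the same code runs serially.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: serial setup, parallel loop, serial step, parallel loop.
double scale_then_sum(std::size_t n) {
    // Serial section: allocate the data.
    std::vector<double> data(n);
    const long count = static_cast<long>(n);

    // Fan-out: iterations are divided among the thread team.
    #pragma omp parallel for
    for (long i = 0; i < count; ++i) {
        data[i] = static_cast<double>(i);
    }
    // Fan-in: implicit barrier, execution continues on one thread.

    const double factor = 2.0;  // serial section between the two bulges

    // Second fan-out, with a reduction to combine the per-thread partial sums.
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < count; ++i) {
        sum += data[i] * factor;
    }
    return sum;
}
```

The loop index is a signed type because older OpenMP implementations require it.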

TBB is a tasking programming paradigm. In a tasking system the parallel sections of the code (the bulges) are loosely coupled and can run (as far as your program is concerned) independently of other tasks. Tasking makes more efficient use of the cores when your application is suitable for description in tasks.

There are additional features of TBB, such as concurrent_vector, the scalable allocator, etc., but these can be obtained from TBB or elsewhere and incorporated into OpenMP code.

OpenMP can gain core utilization through nested levels, nowait and/or sections. Look at and experiment with a few of the sample applications before you undertake a conversion effort. What you learn from the sample applications will go a long way toward reducing your conversion effort and producing better results.
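
As a rough illustration of nowait and sections, here is a small sketch. The function `fill_two` and its arguments are hypothetical, and the nowait clauses are only safe because the two loops and the sections touch different data.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: two independent loops followed by two independent sections.
void fill_two(std::vector<int>& a, std::vector<int>& b, int& left, int& right) {
    const long na = static_cast<long>(a.size());
    const long nb = static_cast<long>(b.size());

    #pragma omp parallel
    {
        // nowait: a thread that finishes its share of this loop moves on
        // to the next construct instead of waiting at a barrier.
        #pragma omp for nowait
        for (long i = 0; i < na; ++i) a[i] = static_cast<int>(i);

        #pragma omp for nowait
        for (long i = 0; i < nb; ++i) b[i] = static_cast<int>(2 * i);

        // sections: each section is executed once, by one thread of the team.
        #pragma omp sections
        {
            #pragma omp section
            { left = 1; }
            #pragma omp section
            { right = 2; }
        }
    }
}
```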

Jim Dempsey

Alexey-Kukanov · ‎05-14-2009

Skim over the article that might be useful.

pvonkaenel · ‎05-14-2009

Thank you for the description and the article pointer: both have been very useful. So far I have been using OpenMP in a very simple fashion: I use "#pragma omp parallel for" to parallelize loops over pixels to get a speedup. It looks like I can do the same thing with TBB, and TBB has additional benefits such as pipelining which I will want to use in the future. One feature of OMP I'm using is the ability to completely disable it by omitting the /openmp compiler flag. Up until now I have been generating both sequential and threaded versions of a library I've written in case a non-threaded version is needed (it also makes debugging the non-threaded portions easier). Is there a way to do that with TBB?
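
For reference, a small sketch of the OpenMP behavior described here. The function names are made up. The `_OPENMP` macro is defined by the compiler only when OpenMP is enabled (for example with /openmp), so the same source builds as either a threaded or a sequential library.

```cpp
#include <cstddef>
#include <vector>

// Reports whether this translation unit was compiled with OpenMP enabled.
bool built_with_openmp() {
#ifdef _OPENMP
    return true;
#else
    return false;
#endif
}

// Hypothetical pixel operation: invert 8-bit pixels in place.
// Without the OpenMP flag the pragma is ignored and the loop runs serially.
void invert_pixels(std::vector<unsigned char>& pixels) {
    const long n = static_cast<long>(pixels.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        pixels[i] = static_cast<unsigned char>(255 - pixels[i]);
    }
}
```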

Thanks again,
Peter

pvonkaenel · ‎05-14-2009

Wow, the more I read, the more questions I have. At this point it may just be easier to enumerate them:

1) Can the threading be easily disabled as it can with OpenMP? I guess this leads to a more philosophical question: in this day of multi-core, are sequential versions of tools/libs even needed?
2) I noticed that the TBB DLLs are built with /MD. I tend to use /MT in my libs and applications. Would it be safe to rebuild the open source version of TBB with /MT?
3) In general, do people tend to use /MD or /MT? I'm going to assume that most people use /MD since it's the default and you can avoid certain memory allocation/deallocation issues across modules. Why are you using the linking model you're using?
4) I read that each thread must have its own scheduler instance. Does each module also need a separate scheduler?
5) If an application links with two DLLs each using TBB, will they share the same thread pool?

Peter

Alexey-Kukanov · ‎05-14-2009

Quoting - pvonkaenel

Up until now I have been generating both sequential and threaded versions of a library I've written in-case a non-threaded version is needed (also makes debugging the non-threaded portions easier). Is there a way to do that with TBB?

There is no out-of-the-box solution to completely switch TBB off at compile time, as TBB is more source-intrusive than OpenMP. There is a way to make TBB use just one thread (i.e. your thread that starts TBB algorithms) and no additional worker threads:
    task_scheduler_init my_TBB_init( 1 /*for single-threaded execution*/ );
Some TBB overhead will still be incurred. With some algorithms it can be reduced further, e.g. with parallel_for you could specify a grain size equal to the whole size of the iteration space, so that the space will not be chunked for the sake of (non-existent) parallelism:
    parallel_for( blocked_range<size_t>(0, N, N /*same as range size*/), my_body );
And auto_partitioner() will do just a few range splits if only one thread is available.

If you develop a library, however, this run-time approach might not necessarily work as desired: if the calling application also uses TBB and has already initialized it, then the extra (secondary) initialization done in the library does not affect the number of threads.
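
Because TBB has no built-in compile-time switch, one possible workaround is a thin wrapper of your own. The sketch below is an assumption, not something TBB provides: `USE_TBB`, `for_each_index` and `square_all` are hypothetical names. It uses `tbb::parallel_for` over a `tbb::blocked_range` when the macro is defined and a plain loop otherwise.

```cpp
#include <cstddef>
#include <vector>

#ifdef USE_TBB  // hypothetical macro you define yourself; not a TBB feature
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#endif

// Hypothetical helper: calls body(i) for every i in [begin, end).
template <typename Body>
void for_each_index(std::size_t begin, std::size_t end, const Body& body) {
#ifdef USE_TBB
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(begin, end),
        [&body](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i) body(i);
        });
#else
    // Sequential build: no TBB headers or libraries are needed.
    for (std::size_t i = begin; i < end; ++i) body(i);
#endif
}

// Example use: square every element in place.
void square_all(std::vector<int>& v) {
    for_each_index(0, v.size(), [&v](std::size_t i) { v[i] *= v[i]; });
}
```

This only covers simple loops. Containers, pipelines and the task API would each need their own fallback, which is part of why TBB is described as more source-intrusive.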

robert_jay_gould · ‎05-14-2009

Hi Peter,

Like you, I was first using OMP and later discovered TBB. And as Jim mentions, my OMP programs back then were structured as cascading forks, with a few tricks here and there. Being able to turn off threading was useful to debug stuff, easily prove the validity of my programs, and fall back to generic C/C++ if necessary (something TBB can't do). When I first found TBB I tried replacing my forks with "parallel for" and got about the same performance, but with more work than with OMP, and I could no longer completely remove threading dependencies like I did with OMP (those parallel fors don't just disappear). So I kind of decided TBB wasn't for me at that time, but I felt it was worth further investigation.

A few months down the road I began trying TBB's allocators in my application, which did give me performance I wasn't getting out of OMP. That led me to try out TBB's hash later on, when I needed a thread-safe hash container for another program.

After a few projects using bits and pieces of TBB here and there and playing around with its other features, like its pipeline, threads, task API, etc., I eventually began to understand the ideas behind TBB.

Now I really like TBB and keep it as an active tool in my belt. However, I still keep a little OMP in my belt too: sometimes it's the right tool to sprinkle just a little threading on an application, especially during prototyping stages. OMP is also useful when I need to work with C (as TBB is obviously not C99 compatible). Anyhow, when I have the chance I'll leverage TBB, and it doesn't disappoint: if you work together with it you can get really great performance you just can't get from OMP.

Alexey-Kukanov · ‎05-15-2009

1) Sequential versions still might be useful, at the very least for debugging purposes as you mentioned. Thus TBB, as well as OpenMP, supports and encourages what is usually called relaxed sequential execution, where parallelism is optional, not mandated. You may look at the post by the TBB chief architect about the importance of a sequential backbone.
2) We did not test it with /MT, and generally /MD is recommended. As far as I remember, TBB does not free any memory allocated in another module, nor does it allocate memory to be freed outside of TBB; so there is a chance that a version built with /MT will work fine.
4) No and yes. TBB does not require every module that uses it to have called something in advance. From a practical perspective, however, if you do not know whether the application that uses your DLL has already initialized TBB, the recommended scheme is to create one task_scheduler_init object that covers the whole lifespan of the DLL, and then create another object in each call that uses TBB. The second is necessary to ensure TBB is initialized in each thread that uses it, while the first is necessary to ensure TBB worker threads are created just once and not on every call (see below).
5) The constructor of the very first scheduler initializer creates the pool of threads, and the destructor of the last active object terminates the threads. So there could be different thread pools during the process lifetime, but only one is active (and used by all modules) at any given moment. So the answer to your question is: yes, the thread pool is shared, but it can change over time, as controlled by the application.

pvonkaenel · ‎05-15-2009

Hi Robert,

From reading the TBB O'Reilly book and the available soft documentation I'm coming to the same conclusion. There is more work involved in using TBB than OpenMP, but the available constructs more than make up for it.

Thanks for sharing your experiences,
Peter

pvonkaenel · ‎05-15-2009

2) I've added a feature request for an /MT build in Premier Support. I will also try changing the build of the open source project to try it out.

4&5) I was under the impression that creating the scheduler was a very expensive operation. Are subsequent inits cheaper than the initial one? From your description in 5 above, it sounds like every init will generate its own set of threads that all calls will use until the scheduler instance is destroyed. If this is the case, then it sounds like every scheduler init is expensive. Along these lines, if I use TBB in a DLL, then I should be able to create a global scheduler instance in DllMain() when a process attaches to it and destroy it when the process detaches. Likewise, I can create an instance per thread in DllMain() and store it in thread-local storage, which is freed when the thread detaches. Is this correct, and less costly than creating a scheduler per library call?

As a side question (let's call it question 6), why does each thread need to have its own scheduler?

Thanks for all the answers,
Peter

jimdempseyatthecove · ‎05-15-2009

Excellent post, Robert. Sometimes a blend of programming technologies works best. It is handy that many of the parallel constructs within TBB can be used outside of TBB. You can also blend other compatible parallel constructs into your OpenMP and/or other threaded application.

OpenMP is quite nice for having one source that can be compiled with or without parallelization. However, as you try to eke more and more core-use efficiency out of the application, you may introduce differing code paths, one for parallel and one for serial.

In the QuickThread programming system, parallel_for divides the iteration space across the available threads (similar to TBB). When the current execution thread (the one that issues the parallel_for) is a member of the thread team destined to execute the parallel_for task, it makes a direct function call to that task. Therefore, when the task pool is set to 1 thread, the parallel_for reduces to a slightly convoluted function call (i.e. it figures out that the parallel_for uses a thread pool of 1 and bypasses the code that schedules the additional threads). With the debugger, you can quickly step from the parallel_for into the loop (when the current execution thread is a member of the thread team destined to execute the parallel_for task). This works for many of the parallel constructs in QuickThread but not all, e.g. parallel_task enqueues an asynchronous task.
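
To make the idea concrete, here is a rough sketch in standard C++. It is not QuickThread's or TBB's API. The name `simple_parallel_for` and its signature are invented for illustration. The calling thread always runs the first chunk itself, so with one thread the call collapses to a direct function call that a debugger can step into.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: splits [0, n) into num_threads contiguous chunks and
// calls body(lo, hi) once per non-empty chunk.
template <typename Body>
void simple_parallel_for(std::size_t n, unsigned num_threads, const Body& body) {
    if (num_threads == 0) num_threads = 1;
    const std::size_t chunk = (n + num_threads - 1) / num_threads;

    // Helper threads take chunks 1 .. num_threads-1 (none when num_threads == 1).
    std::vector<std::thread> helpers;
    for (unsigned t = 1; t < num_threads; ++t) {
        const std::size_t lo = std::min(n, t * chunk);
        const std::size_t hi = std::min(n, lo + chunk);
        if (lo < hi) helpers.emplace_back([&body, lo, hi] { body(lo, hi); });
    }

    // The caller's own share is a plain function call.
    if (n > 0) body(0, std::min(n, chunk));

    for (std::thread& h : helpers) h.join();
}
```

The sketch does no exception handling and does not reuse threads. A real tasking library keeps a pool alive between calls.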

When you need to support two versions of the code, single-threaded and multi-threaded, it may be more suitable for support reasons to stick with OpenMP. Since most systems now have multiple cores, it might be a better strategy to make all code multi-threaded, and experiment to see if you need to force the minimum thread count to be 2 as opposed to letting it fall to 1. Then insert, at appropriate places, a test for the real number of HW threads and, if it is 1, call SwitchToThread(). An example would be one task producing data and another task consuming data, with not enough memory (buffering space) to hold the complete production. SwitchToThread() produces a co-routine type of effect so that both run intermittently.
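
A small sketch of that producer/consumer case, under some assumptions: `std::this_thread::yield()` stands in as a portable substitute for the Windows call SwitchToThread(), and `BoundedQueue` and `produce_and_consume` are hypothetical names. The capacity must be at least 1.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>

// Hypothetical bounded queue: try_push fails when full, try_pop fails when empty.
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    bool try_push(int value) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.size() >= capacity_) return false;
        items_.push_back(value);
        return true;
    }

    bool try_pop(int& value) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return false;
        value = items_.front();
        items_.pop_front();
        return true;
    }

private:
    std::size_t capacity_;
    std::deque<int> items_;
    std::mutex mutex_;
};

// Sends 0 .. count-1 through a small buffer and returns the consumer's sum.
long produce_and_consume(int count, std::size_t capacity) {
    BoundedQueue queue(capacity);

    std::thread producer([&queue, count] {
        for (int i = 0; i < count; ++i) {
            // Buffer full: give up the time slice so the consumer can drain it.
            while (!queue.try_push(i)) std::this_thread::yield();
        }
    });

    long sum = 0;
    for (int received = 0; received < count; ++received) {
        int value = 0;
        // Buffer empty: give up the time slice so the producer can refill it.
        while (!queue.try_pop(value)) std::this_thread::yield();
        sum += value;
    }

    producer.join();
    return sum;
}
```

On a single hardware thread the two loops hand control back and forth, which is the co-routine effect described above.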

Jim Dempsey

Alexey-Kukanov · ‎05-15-2009

It seems something is wrong with my descriptions, as others do not understand them correctly.
The thread pool is created by the very first init object in the process. Provided that at least one init object is still alive, all subsequent inits are much cheaper as they do not create threads. The first init object in a thread creates thread-local scheduler structures. Each subsequent init in the same thread is just a reference count increment.
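
The lifetime rule can be pictured with a toy model. This is only an illustration of the reference counting described above, not TBB's implementation. Every name in it is hypothetical, and it leaves out the per-thread scheduler structures.

```cpp
#include <mutex>

// Toy model only: mimics the lifetime rules in the post, not TBB's code.
struct ToyPoolState {
    std::mutex mutex;
    int live_inits = 0;        // init objects currently alive
    int pools_created = 0;     // how many times the "thread pool" was started
    bool pool_running = false;
};

inline ToyPoolState& toy_state() {
    static ToyPoolState state;
    return state;
}

// RAII handle playing the role of an init object.
class ToyInit {
public:
    ToyInit() {
        ToyPoolState& s = toy_state();
        std::lock_guard<std::mutex> lock(s.mutex);
        if (s.live_inits++ == 0) {  // first live init: the expensive path
            s.pool_running = true;
            ++s.pools_created;
        }                           // later inits: only the counter changes
    }

    ~ToyInit() {
        ToyPoolState& s = toy_state();
        std::lock_guard<std::mutex> lock(s.mutex);
        if (--s.live_inits == 0) {  // last live init: the pool shuts down
            s.pool_running = false;
        }
    }

    ToyInit(const ToyInit&) = delete;
    ToyInit& operator=(const ToyInit&) = delete;
};
```

In this model, one long-lived `ToyInit` keeps the pool alive, so every short-lived one created per call takes the cheap path.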

What you wrote about DllMain makes sense, and I actually recommended a similar thing, just in different words. But I only meant a single global instance, not one per thread. DllMain(DLL_THREAD_ATTACH) is not guaranteed to be called for every thread ("when a DLL is loaded using LoadLibrary, existing threads do not call the entry-point function of the newly loaded DLL"), so the per-thread instance is hard to provide for e.g. plugins. And the DLL_THREAD_DETACH notification is also sent only when a thread exits, not when the DLL unloads.

Re 6): each thread needs its own scheduler-related structures, the task pool first of all. TBB worker threads get them automatically, but external threads do not (yet).