Beginner

TBB and Boost

1. Is there a penalty for using boost shared pointers with the scalable allocator? I know that shared pointers require an extra level of indirection, but I am curious whether the atomic ref count in the shared pointer has an impact. I am using this in a pipeline, btw.

2. Is there anything similar to the boost threadpool in the tbb world (see http://threadpool.sourceforge.net/)? I have used this library with TBB without any problems but am not sure whether it is the fastest solution. Would it be better to reimplement it with tbb?

11 Replies
Valued Contributor I

Quoting - tbbnovice

1. Is there a penalty for using boost shared pointers with the scalable allocator?

There is no special penalty for exactly this combination... probably you wanted to ask a different question...


Quoting - tbbnovice

I know that shared pointers require an extra level of indirection

Nope. Storing a boost::shared_ptr is exactly the same as storing a raw pointer.

However, boost::shared_ptr requires an additional memory allocation for the counter, and that is a problem. If you need performance, you must consider switching to boost::intrusive_ptr; shared_ptr is a dumb tool for prototyping.
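To illustrate the point, here is a minimal hand-rolled sketch of the intrusive scheme (not boost::intrusive_ptr's actual API): the counter lives inside the object, so creating a shared object costs a single allocation.

```cpp
#include <atomic>

// The refcount is embedded in the object itself.
struct RefCounted {
    std::atomic<int> refs{0};
};

template <class T>
struct IntrusivePtr {
    T* p;
    explicit IntrusivePtr(T* q = nullptr) : p(q) {
        if (p) p->refs.fetch_add(1, std::memory_order_relaxed);
    }
    IntrusivePtr(const IntrusivePtr& o) : p(o.p) {
        if (p) p->refs.fetch_add(1, std::memory_order_relaxed);
    }
    IntrusivePtr& operator=(const IntrusivePtr&) = delete;  // omitted for brevity
    ~IntrusivePtr() {
        // acq_rel so the deleting thread sees all prior writes to the object
        if (p && p->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) delete p;
    }
    T* operator->() const { return p; }
};

struct Node : RefCounted { int value = 42; };

int use_count_demo() {
    IntrusivePtr<Node> a(new Node);  // one allocation: the object itself
    IntrusivePtr<Node> b(a);         // copying only bumps the embedded counter
    return a->refs.load();           // both pointers alive here
}
```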


Quoting - tbbnovice

but I am curious whether the atomic ref count in the shared pointer has an impact.

An atomic ref count can potentially have a HUGE performance impact, basically total destruction of performance and scalability. But it can also have no impact at all. It depends.

Valued Contributor I

Quoting - tbbnovice

2. Is there anything similar to the boost threadpool in the tbb world (see http://threadpool.sourceforge.net/)? I have used this library with TBB without any problems but am not sure whether it is the fastest solution. Would it be better to reimplement it with tbb?

Of course! The thread pool is one of the key components of the TBB scheduler. TBB is a thread pool, from a certain point of view.

If your tasks don't execute blocking operations, then you can use TBB as-is. Otherwise you have to set the number of threads in the TBB thread pool to something like number_of_processors * 32 (this requires tweaking); however, it's better to eliminate blocking from the tasks.

Would it be better? It depends on the size of your tasks.

Beginner

Quoting - Dmitriy Vyukov

There is no special penalty for exactly this combination... probably you wanted to ask a different question...


Quoting - tbbnovice

I know that shared pointers require an extra level of indirection

Nope. Storing a boost::shared_ptr is exactly the same as storing a raw pointer.

However, boost::shared_ptr requires an additional memory allocation for the counter, and that is a problem. If you need performance, you must consider switching to boost::intrusive_ptr; shared_ptr is a dumb tool for prototyping.


Quoting - tbbnovice

but I am curious whether the atomic ref count in the shared pointer has an impact.

An atomic ref count can potentially have a HUGE performance impact, basically total destruction of performance and scalability. But it can also have no impact at all. It depends.


I am missing something. Depends on what? I believe a shared pointer contains a raw pointer (and an atomic ref count). If I switch to the intrusive pointer, I guess I remove the ref count from the heap, but what would I gain? I have simply moved the ref counter from the pointer to the object. (Aside: would there be any impact if I used the tbb atomic count instead of the boost atomic count to implement the intrusive version?)

I will be generating millions of objects in my simulation so this seems important. Thanks for the help.
Beginner

Quoting - Dmitriy Vyukov

Of course! The thread pool is one of the key components of the TBB scheduler. TBB is a thread pool, from a certain point of view.

If your tasks don't execute blocking operations, then you can use TBB as-is. Otherwise you have to set the number of threads in the TBB thread pool to something like number_of_processors * 32 (this requires tweaking); however, it's better to eliminate blocking from the tasks.

Would it be better? It depends on the size of your tasks.


Unlike tbb::thread, there is no tbb::threadpool that I can use directly (please correct me if I am missing something here). I believe the task scheduler in tbb is meant for short-running, non-blocking tasks, while the boost threadpool can run any task.

If I have to implement my own threadpool, is this a good place to start?
http://software.intel.com/en-us/forums/showthread.php?t=62751
Valued Contributor II

Quoting - tbbnovice

Unlike tbb::thread, there is no tbb::threadpool that I can use directly (please correct me if I am missing something here). I believe the task scheduler in tbb is meant for short-running, non-blocking tasks, while the boost threadpool can run any task.

If I have to implement my own threadpool, is this a good place to start?
http://software.intel.com/en-us/forums/showthread.php?t=62751


You want to instantiate your own thread pool? In TBB you use tbb::task_scheduler_init to construct an object which contains a thread pool. TBB thread-pool threads are as useful for long-running or blocking tasks as any other thread; however, assigning such threads to such tasks takes them out of the mix for dealing with other scheduled tasks. Is it that you want multiple thread pools? Doing so runs the risk of overcommitment, which can lead to thrashing and other impediments to performance. If you need a couple of threads in the pool for blocking, you could always bump up the number of threads created by the task_scheduler_init, knowing that when those extra threads are not blocking, they might be competing for resources with other threads.
Can you describe your need for an explicit thread pool in more detail?

Beginner


You want to instantiate your own thread pool? In TBB you use tbb::task_scheduler_init to construct an object which contains a thread pool. TBB thread-pool threads are as useful for long-running or blocking tasks as any other thread; however, assigning such threads to such tasks takes them out of the mix for dealing with other scheduled tasks. Is it that you want multiple thread pools? Doing so runs the risk of overcommitment, which can lead to thrashing and other impediments to performance. If you need a couple of threads in the pool for blocking, you could always bump up the number of threads created by the task_scheduler_init, knowing that when those extra threads are not blocking, they might be competing for resources with other threads.
Can you describe your need for an explicit thread pool in more detail?

Thanks. Let me explain what I want to do. I am working on something like the capitalize-words-in-chunk example in the book, except that I have multiple files to read from (there is no need to merge the files, so the pipelines are completely independent). However, the first stage can block on I/O because it reads from a file, so I want to instantiate each pipeline in its own thread (earlier I was thinking of using a parallel_for, but because these are long-running, blocking tasks, I heard on this forum that I should not do that).

From Raf's post on the multiple pipelines thread (different application), the suggestion was:
>Add 1 worker thread per pipeline, and perhaps do the connection management in a tbb::thread instead.

I can create a tbb::thread and run my pipeline within that. But what if I want to capitalize 100 files? I don't want to create 100 tbb threads, right? boost::threadpool (mentioned above) lets me create a pool with 100 virtual threads and a max_limit of, say, 4, so only 4 threads run at any given time. I can add 100 tasks to the threadpool and call the run() method and need not worry about which tasks are running when; there is a FIFO policy inside the threadpool that takes care of running the tasks as and when a thread becomes available.

I was wondering if there is a similar capability within tbb; then I could eliminate the boost dependency, and hopefully tbb's version of a threadpool might work better with tbb::pipeline (or is this a wrong assumption?). If there is no built-in boost::threadpool-like capability in tbb, how should I design one that works as well as the boost threadpool, if not better?
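The pool semantics described above can be sketched in portable C++ with std::thread (a hand-rolled stand-in for illustration, not boost::threadpool's actual API): N workers drain a FIFO queue of arbitrarily many tasks.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// N worker threads drain a FIFO queue: any number of tasks can be
// scheduled, but only N run at a time.
class FifoPool {
    std::queue<std::function<void()>> q;
    std::mutex m;
    std::condition_variable cv;
    std::vector<std::thread> workers;
    bool done = false;
public:
    explicit FifoPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lk(m);
                        cv.wait(lk, [this] { return done || !q.empty(); });
                        if (q.empty()) return;        // shut down once drained
                        task = std::move(q.front());
                        q.pop();
                    }
                    task();                           // runs outside the lock
                }
            });
    }
    void schedule(std::function<void()> f) {
        { std::lock_guard<std::mutex> lk(m); q.push(std::move(f)); }
        cv.notify_one();
    }
    ~FifoPool() {                                     // waits for all queued tasks
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_all();
        for (auto& t : workers) t.join();
    }
};
```

Usage would be `FifoPool pool(4);` followed by 100 `pool.schedule(...)` calls, one per file.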
Employee


I can create a tbb::thread and run my pipeline within that. But what if I want to capitalize 100 files? I don't want to create 100 tbb threads, right? boost::threadpool (mentioned above) lets me create a pool with 100 virtual threads and a max_limit of, say, 4, so only 4 threads run at any given time. I can add 100 tasks to the threadpool and call the run() method and need not worry about which tasks are running when; there is a FIFO policy inside the threadpool that takes care of running the tasks as and when a thread becomes available.

I was wondering if there is a similar capability within tbb; then I could eliminate the boost dependency, and hopefully tbb's version of a threadpool might work better with tbb::pipeline (or is this a wrong assumption?). If there is no built-in boost::threadpool-like capability in tbb, how should I design one that works as well as the boost threadpool, if not better?

I think there is a way to do this in TBB. Use a single pipeline that has only a single stage. The stage should be parallel and process an entire file. (TBB 2.1 made some fixes so that a parallel input stage really runs in parallel.)

The input to the pipeline should be the list of files. Set the max_number_of_live_tokens parameter to the thread limit that you want. That will give you FIFO processing and limit the number of files being processed at any moment.

If the list of files is itself a serial stream, then use an initial serial stage to pop file names from the list, and feed each name into a subsequent parallel stage that processes the file.
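A sketch of this approach with the classic (TBB 2.x) pipeline API; the class names and file-processing details are illustrative, and the file names are owned by the vector so they outlive their tokens:

```cpp
#include <cstddef>
#include <string>
#include <vector>
#include "tbb/pipeline.h"
#include "tbb/task_scheduler_init.h"

// Serial input filter: pops the next file name, NULL ends the pipeline.
class FileNameSource : public tbb::filter {
    std::vector<std::string>& files;
    size_t next;
public:
    FileNameSource(std::vector<std::string>& f)
        : tbb::filter(/*is_serial=*/true), files(f), next(0) {}
    void* operator()(void*) {
        if (next < files.size())
            return &files[next++];   // item pointer flows to the next stage
        return NULL;                 // no more files: stop the pipeline
    }
};

// Parallel filter: processes a whole file per token.
class ProcessFile : public tbb::filter {
public:
    ProcessFile() : tbb::filter(/*is_serial=*/false) {}
    void* operator()(void* item) {
        const std::string& name = *static_cast<std::string*>(item);
        // ... open `name`, capitalize its words, write the result ...
        (void)name;
        return NULL;
    }
};

int main() {
    tbb::task_scheduler_init init;             // default number of workers
    std::vector<std::string> files(100, "input.txt");
    FileNameSource src(files);
    ProcessFile proc;
    tbb::pipeline p;
    p.add_filter(src);
    p.add_filter(proc);
    p.run(/*max_number_of_live_tokens=*/4);    // at most 4 files in flight, FIFO
    return 0;
}
```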


Quoting - tbbnovice

Thanks. Let me explain what I want to do. I am working on something like the capitalize-words-in-chunk example in the book, except I have multiple files to read from (there is no need to merge the files, so the pipelines are completely independent). However, the first stage can block on I/O because it reads from a file, so I want to instantiate each pipeline in its own thread (earlier, I was thinking of using a parallel_for but because these are long-running+blocking tasks, I heard on this forum that I should not be doing that).


It might contradict earlier advice you heard (including possibly my own), but for the described task I'd first try parallel_for over the files (using the simple partitioner, the default, and a grain size of 1 file) for outer-level parallelism, and a pipeline at the inner level. Well, I must add: "unless you really need simultaneous progress in processing each file".

Then, if you see that the system is undersubscribed due to blocking I/O, try oversubscribing it a little by initializing TBB to use more worker threads. Ask for the default number of threads (there is a static method of class task_scheduler_init), add somewhat more, or possibly multiply by some factor (depending on what the typical load is expected to be; e.g. for 50% load, try multiplying by two). You will likely end up oversubscribed for some periods (not too bad, especially if each file is processed independently of the others), possibly undersubscribed for other periods, and just fine at lucky periods :)
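The tweak above in code; default_num_threads() is the static method referred to, while the factor of 2 (for an assumed ~50% blocking load) is illustrative and needs tuning:

```cpp
#include "tbb/task_scheduler_init.h"

int main() {
    // Ask TBB how many workers it would create by default...
    int n = tbb::task_scheduler_init::default_num_threads();
    // ...and oversubscribe a little to cover time spent blocked on I/O.
    tbb::task_scheduler_init init(n * 2);
    // ... run the parallel_for / pipeline work here ...
    return 0;
}
```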

I think dealing with a thread pool will have essentially the same effect, but possibly with more effort. By starting every pipeline in its own thread, you effectively add one more master thread working with TBB, and that has almost the same effect as adding one more TBB worker thread, because every master will run a TBB scheduler and so potentially participate in completing the work of others. If you use a thread pool together with TBB initialized by default, you oversubscribe the system, just as if you added more TBB workers. And so on.

I also like Arch's idea of using a pipeline at the outer level. Still, if the number of files to process can vary, as well as the amount of work in each file (in particular, if there are just a few files with uneven work in each), I'd add inner-level parallelism, i.e. process each file by another (inner) pipeline started from the parallel filter of the outer pipeline.

To answer the question about a boost-like thread pool in TBB: for the moment, there is no such thing.



For the many-files-to-process scenario, I might suggest you consider a hybrid approach. Create as many TBB worker tasks as there are hardware threads (or fewer), each processing a mailbox (the feeding end of the pipeline). Create as many non-TBB threads as desired to process the file list. A non-TBB thread's job is to read the next buffer and pass a context pointer (file number, status, file pointer, buffer pointer, amount of data in the buffer) into an empty mailbox. The TBB tasks wait for an entry in the mailbox; when one is found, they extract the pointer, replace it with NULL, and process the packet. As the non-TBB threads finish a file, they try to start on the next file; when there is no next file, they decrement a counter of the number of threads processing files and then exit. The TBB threads can observe this counter and terminate when it goes to 0 (then tweak to exit earlier).

Other than adding a file context to the data going through the pipeline, there would be little change to the pipeline code. The non-TBB thread spawning as well as the mailbox are relatively trivial to write.
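A minimal sketch of the single-slot mailbox hand-off Jim describes (the FileContext fields are illustrative): the producer deposits a packet pointer into an empty (NULL) slot, and the consumer extracts it, atomically replacing it with NULL.

```cpp
#include <atomic>

// Illustrative packet; real code would carry status, file/buffer pointers,
// and the byte count as well.
struct FileContext { int file_number; };

class Mailbox {
    std::atomic<FileContext*> slot{nullptr};
public:
    bool try_put(FileContext* c) {     // succeeds only if the slot is empty
        FileContext* expected = nullptr;
        return slot.compare_exchange_strong(expected, c);
    }
    FileContext* try_take() {          // extract, replacing with NULL
        return slot.exchange(nullptr); // NULL result means "mailbox empty"
    }
};
```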

Jim Dempsey
Valued Contributor I

Quoting - tbbnovice

I am missing something. Depends on what? I believe a shared pointer contains a raw pointer (and an atomic ref count). If I switch to the intrusive pointer, I guess I remove the ref count from the heap, but what would I gain? I have simply moved the ref counter from the pointer to the object. (Aside: would there be any impact if I used the tbb atomic count instead of the boost atomic count to implement the intrusive version?)

I will be generating millions of objects in my simulation so this seems important. Thanks for the help.


It depends on many things: the frequency of reference-counting calls, mutual object placement, object layout, the number of objects, the mapping between threads and objects, etc.
Generally, atomic reference counting is not a very multi-core-friendly technique. The only situation where you can get not-too-bad performance and scalability (not counting the case when the number of reference-counting calls is negligible) is when every object sits in a dedicated cache line and there is a mapping between threads and objects such that every single object is accessed mainly by one thread.
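The "dedicated cache line" condition can be expressed directly in C++11; a 64-byte line is assumed here, and the right size is platform-dependent:

```cpp
#include <atomic>

// Padding the refcounted object to a full (assumed 64-byte) cache line
// keeps two objects' counters from ever sharing a line, so threads that
// mostly touch "their own" objects do not invalidate each other's caches.
struct alignas(64) CountedNode {
    std::atomic<int> refs;
    int value;
};

// With alignas(64), sizeof is padded up to a multiple of the alignment,
// so adjacent array elements land on distinct cache lines.
CountedNode nodes[2];
```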

Valued Contributor I

Quoting - tbbnovice

I am missing something. Depends on what? I believe a shared pointer contains a raw pointer (and an atomic ref count). If I switch to the intrusive pointer, I guess I remove the ref count from the heap, but what would I gain? I have simply moved the ref counter from the pointer to the object. (Aside: would there be any impact if I used the tbb atomic count instead of the boost atomic count to implement the intrusive version?)



Yes, you simply move the ref counter into the object. Thus you save one memory allocation and deallocation, and you combine the refcount and your object (or at least part of it) into a single cache line (this can result either in a scalability improvement or in degradation, depending on the usage pattern).