How to Control the Run-Time Behaviour of a Multi-threaded Program

timminn · 04-20-2008

The multi-threaded program uses TBB exclusively to implement its parallelism.

Suppose it will be running on a four-core CPU. It involves a large amount of low-speed I/O and some heavy computation. How can we assign one core (any one of the four) to perform all the I/O and the remaining cores to do the computation?

If you are certain that this is impossible, please confirm the impossibility. Otherwise, please specify how to do it in detail, and describe which component of TBB, such as the pipeline or task objects, should be used to solve this problem.

Thank you very much for your help

robert-reed · 04-21-2008

Intel Threading Building Blocks takes a specific approach to the application of parallelism to application code, trying as much as possible to abstract the notion of processors into a generic resource: a pool whose size may vary from execution to execution and grow as time goes on, a pool better left to be scheduled dynamically depending on the resources available and the moment-by-moment parallelism exposed in the application. Placing constraints such as designing for a specific number of cores or requiring that one of those cores does all the I/O can seriously limit Intel TBB's ability to do that dynamic scheduling and limit the potential scaling your application might be able to achieve.

Consider, for example, the Intel TBB goal to maximize local processing to make best use of data already in cache. Going to a single processing element for all the I/O guarantees that the computational PEs will have to pull data into their local caches from memory, whereas in an organization where each PE participates in I/O for computation that it might do, there's a chance some of those data may still be in a local cache (depending on how the I/O is managed and executed) and be available for computation faster than if they had to be pulled in from memory.

The parameters for the task you've outlined above are sketchy enough that it's hard to be more specific about a design. Is the I/O one input and one output file? Or multiple files? Is the I/O simple streaming through data, or is there some random access I/O required on one or more open files? Are the data ordered (requiring, for example, First In-First Out sequential processing) or can they be processed in any order? Is there some hardware reason why all I/O should be done by a single core or is that just an expectation of the current design?

On computational processing of streaming I/O, TBB has demonstrated good scaling in several applications, using among other things constructs such as the pipeline. Some results I blogged about last year show very good scaling by allowing each PE to repeatedly read, process and write its portion in sequence. So while it may be difficult to squeeze TBB into the shape you have in mind for your current design, thinking from a direction of TBB's strengths may lead you to a different and more efficient, scalable design.
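The idea of letting every processing element read, process and write its own portion can be sketched as follows. This is only an illustration using std::thread from the C++ standard library, not tbb::pipeline and not the code from the blog; read_chunk, process_chunk, write_chunk and run_read_process_write are hypothetical names, and the "I/O" here is just a copy to and from in-memory vectors.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the real input, computation and output steps.
inline int read_chunk(const std::vector<int>& input, std::size_t i) { return input[i]; }
inline int process_chunk(int value) { return value * value; }
inline void write_chunk(std::vector<int>& output, std::size_t i, int value) { output[i] = value; }

// Every worker repeatedly claims the next unprocessed chunk, then reads,
// processes and writes that chunk itself, so no core is reserved for I/O.
inline std::vector<int> run_read_process_write(const std::vector<int>& input,
                                               unsigned num_workers) {
    assert(num_workers > 0);
    std::vector<int> output(input.size());
    std::atomic<std::size_t> next{0};  // index of the next chunk to claim
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < num_workers; ++w) {
        workers.emplace_back([&] {
            for (;;) {
                const std::size_t i = next.fetch_add(1);
                if (i >= input.size()) break;  // nothing left to claim
                // Each index is claimed by exactly one worker, so the
                // writes go to distinct elements and do not race.
                write_chunk(output, i, process_chunk(read_chunk(input, i)));
            }
        });
    }
    for (auto& t : workers) t.join();
    return output;
}
```

Because the chunk a worker has just read is the one it goes on to process, the data has a chance of still being in that core's cache, which is the locality argument made above.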

Alexey-Kukanov · 04-22-2008

timminn,

While there are ways in TBB to approach the design you possibly have in mind (with one thread on a single core doing all I/O and other threads doing computation), I agree with Robert that such a design is better considered a last resort, after no better (more scalable and hardware-agnostic) design has been found.

If you provided just a high-level view of the problem and how you approach it, perhaps someone could help with design ideas. Even suggesting how to do the separate I/O thread right would require some additional information, such as how heavily you expect this thread to load the core it runs on, in order to understand whether it makes sense to share this core with a computational thread.

RafSchietekat · 04-23-2008

With low-speed I/O (which I assume means that it will be blocking a lot), you should consider doing it "yourself", and leaving all physical threads available for TBB to work on computation (cache locality takes second place to not blocking any physical thread): create a concurrent queue (of pointers to data, not the actual data), directly or indirectly spawn a task to read from it (creating further worker threads in a chain reaction), loop on doing I/O and writing into the queue, and at the end wait for the child.

If output is similarly blocking, again interface it from the tasks to the main thread with another concurrent queue and throttle input based on output (assuming the output/input ratio is bounded to a reasonable limit), otherwise just do it from the task that creates the output.
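A minimal sketch of that arrangement, assuming a blocking input source and using only the C++ standard library: PointerQueue is a hand-rolled stand-in for tbb::concurrent_queue, plain std::thread workers stand in for TBB tasks, and Block, PointerQueue and sum_blocks are hypothetical names. The calling thread plays the I/O role and hands pointers to data, not the data itself, to the computing threads.

```cpp
#include <cassert>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <numeric>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

using Block = std::vector<int>;  // hypothetical unit of input data

// Minimal blocking queue of pointers; a stand-in for tbb::concurrent_queue.
class PointerQueue {
public:
    void push(std::unique_ptr<Block> block) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(block));
        }
        ready_.notify_one();
    }
    // Blocks until an item arrives; a null pointer marks the end of input.
    std::unique_ptr<Block> pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty(); });
        std::unique_ptr<Block> block = std::move(items_.front());
        items_.pop();
        return block;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::unique_ptr<Block>> items_;
};

// The calling thread does the (simulated) I/O; num_workers threads compute.
inline long long sum_blocks(const std::vector<Block>& source, unsigned num_workers) {
    assert(num_workers > 0);
    PointerQueue queue;
    std::mutex total_mutex;
    long long total = 0;
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < num_workers; ++w) {
        workers.emplace_back([&] {
            // Consume blocks until the end-of-input marker is received.
            while (std::unique_ptr<Block> block = queue.pop()) {
                const long long partial =
                    std::accumulate(block->begin(), block->end(), 0LL);
                std::lock_guard<std::mutex> lock(total_mutex);
                total += partial;
            }
        });
    }
    // I/O loop: a real program would block on reads here instead of copying.
    for (const Block& b : source) queue.push(std::make_unique<Block>(b));
    for (unsigned w = 0; w < num_workers; ++w) queue.push(nullptr);  // one marker per worker
    for (auto& t : workers) t.join();  // "at the end wait for the child"
    return total;
}
```

Throttling input against output, as described above, would need a bounded queue or a second queue for results; this sketch leaves that out.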

The trick is to initialise the task scheduler with one more thread than is physically available, ~~for which there is no TBB-pure interface (you should hard-code it or get the number in an O.S.-specific way)~~ by constructing or initialising task_scheduler_init with argument default_num_threads()+1, or to start the task scheduler from a second O.S.-level thread (the "indirectly spawn a task" above), e.g., by using tbb_thread. One of these will also be a prerequisite for not blocking on a single-core system (TBB assumes relaxed sequential execution). Leave it to the TBB and O.S. schedulers to decide what executes where (maybe the I/O code will stick to a specific core, maybe it will get juggled around a bit, but it probably doesn't really matter).

This advice (?) should be validated/corrected/rejected by a TBB developer, who will also be able to comment whether and how task affinity plays a role, although it probably does not matter (a lot) if I/O truly is limited compared to the amount of computation, which should be the case to consider this technique in the first place (if you have so much I/O that you were willing to dedicate a core to it, it will probably be wrong to try this technique and you should follow Robert Reed's advice instead).

(Added) I stand corrected about default_num_threads() and tbb_thread, as edited above, see Alexey's reply below.

Alexey-Kukanov · 04-23-2008

Yes, exchanging data items via concurrent_queue as Raf suggested is one possible design if you believe you need a separate thread. On the other hand, tbb::pipeline used as outlined by Robert in his blog might be an alternative, unless your I/O has some context associated with a single thread. The I/O thread can either be borrowed from the TBB pool by spawning a long-running task as Raf suggested, or it can be explicitly created; for the latter, TBB now provides a tbb_thread wrapper class whose interface follows the std::thread proposed for C++0x as closely as is possible and practical at the moment (we would not even have bothered with this class if std::thread were ubiquitous).

Raf_Schietekat:
The trick is to initialise the task scheduler with one more thread than is physically available, for which there is no TBB-pure interface (you should hard-code it or get the number in an O.S.-specific way)

This has not been true for quite some time; task_scheduler_init::default_num_threads returns the number which, if passed to the constructor of task_scheduler_init, would be equivalent to default initialization. In other words, this call returns the number of worker threads created in TBB by default, plus one (which accounts for the calling master thread). So if you need one more or one less thread in the pool, you now know what to do :)
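The arithmetic can be sketched without TBB. In the snippet below, std::thread::hardware_concurrency() is only an assumed stand-in for task_scheduler_init::default_num_threads() (the two need not return the same value), and default_thread_count and pool_size_with_io_thread are hypothetical helper names.

```cpp
#include <cassert>
#include <thread>

// Assumed stand-in for task_scheduler_init::default_num_threads().
inline unsigned default_thread_count() {
    const unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1u : n;  // the standard allows 0 when the count is unknown
}

// One thread more than the default, so that a thread blocked in I/O
// does not leave a core without computational work.
inline unsigned pool_size_with_io_thread() {
    const unsigned size = default_thread_count() + 1;
    assert(size >= 2);
    return size;
}
```

With TBB itself, the value default_num_threads()+1 would be passed to the task_scheduler_init constructor, as described above.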

Raf_Schietekat:
Leave it to the TBB and O.S. schedulers to decide what executes where (maybe the I/O code will stick to a specific core, maybe it will get juggled around a bit, but it probably doesn't really matter).

Completely agree. The benefits of setting hard thread-to-core affinity often do not outweigh the losses if the affinity is set the wrong way. So we will not provide thread-to-core affinity in TBB, because there is no solution that would fit everyone's needs. And affinity is definitely not the first thing to try in the design and implementation of a parallel program; I would only start experimenting with it if I had solid reasons to think that it would improve performance significantly.