There is no support for message-passing in TBB. During the design of TBB, we investigated many different approaches. What we settled on, largely inspired by Cilk, is an approach that yields good cache locality, good space bounds, good load balancing, and full support for nested parallelism, and that could be implemented efficiently as a pure C++ library. Message passing, though definitely useful in some contexts, did not have all the properties we were looking for.
That's not meant to disparage message-passing. A language like Erlang that has compiler support for message passing can do it very well. And MPI's message passing is certainly appropriate for writing programs that run on distributed-memory machines, albeit at some cost in programmer labor.
Master-worker is generally not scalable, because eventually the number of workers outstrips the ability of the master to direct/feed them. I usually recommend hierarchical structures in such situations. But sometimes master-worker suffices and is easy to program. There's no one true way to do parallel programming.
Since TBB does not support all possible parallel programming paradigms, we did go to some effort to make it interoperate reasonably with native threads on each platform. So you can certainly mix TBB and native threads. Appendix B of the Tutorial describes how to do this. Basically, the requirement is that the native thread has to have an active task_scheduler_init object while it is running a TBB algorithm.
If there is some higher-level way to express your pattern than "master-worker", and it seems like a generally useful pattern, perhaps there is a way we could add the pattern to TBB. What's the general nature of your algorithm?
- Arch Robison
It might help to study the sequential version of the program. Another nice property of TBB is that a TBB program can be run sequentially (producer-consumer programs, by contrast, require parallelism just to run, which can be a bummer to debug). How would the tasks be sequenced in a sequential version of your program? If you can split the sequential execution, or a part thereof, into two independent parts, that might reveal a divide and conquer approach. Part of the advantage of divide and conquer, when a Cilk/TBB-style scheduler is used, is that all the potential parallelism is not turned loose at once. That tends to swamp a machine. Instead, the parallel depth-first execution tends to create just enough parallelism to keep the machine busy.
As far as loops being in different files, maybe there's a way to build some kind of registry of them at startup time? One way to approach it is to ask how the sequential version knows which loops to invoke when.