I've been running some tests with our renderer using TBB in combination with MPI, and the results are pretty poor. For instance, an 8-thread TBB implementation on a single 8-core machine handily outperforms an MPI/TBB setup with two 8-core/8-thread machines as slaves and one machine as master. Has anyone done this? Should the MPI communication run in a separate thread? Any advice is welcome.
Use MPI to "shotgun" the work to your 2 machines, then within each MPI process (receiving the load) run your TBB thread pool to do the work. If you can set up the app to have the MPI slave node(s) (if that is the correct word for it) wait in a listening loop for a message, then you can initialize the TBB thread pool once (outside the message loop).
Who calls whom? MPI at the top should be fine, but TBB assumes that tasks are always doing something useful, and so any waiting should be done in a separate thread.
Thanks, I realized after sending my post that any MPI communication must go in a separate thread. The new task::enqueue seems able to do the job. I presume subtasks spawned by an enqueued task won't end up in the original enqueuing thread. One question, though: it probably still means I'm down one thread that will be mostly waiting for MPI messages, so should I ask for one thread more than what TBB would give me by default?
"The new task::enqueue seems able to do the job. I presume subtasks spawned by an enqueued task won't end up in the original enqueuing thread." Enqueued tasks are executed by the normal worker threads, alongside spawned tasks and possibly in parallel with any number of other enqueued tasks or their descendants. There may be only a single queue, but each worker thread considers it, even in preference to stealing from another worker thread, and without prejudice against the worker thread that originally enqueued the task, so you may want to revise your assumptions here. I would still recommend tbb_thread (or your choice of API) to perform work that may incur significant delays.