Is there a BKM (best known method) for using tbb::task directly vs. using parallel_for/parallel_invoke?
For example, if I want to run two computations, X and Y, in parallel, I can use parallel_for with a loop size of 2, or I can create tasks and spawn them directly.
If there are benchmarks on this, that would be even better.
Well, if you use tbb::task, which is a thin wrapper over the underlying and OS-native thread creation process, you'll pay the cost of thread creation each time you launch the code. Using parallel_for or parallel_invoke from a thread registered with TBB will take advantage of the already created thread pool, which should have less overhead since the threads are already created and (hopefully) sitting, waiting for work.
Thanks for your response.
I have one test that I implemented first with parallel_for and then with tasks directly. Although the results are the same, using tasks directly scales better. I think that is because it has less overhead.
Use tasks with parallel_invoke(), which was recently added for this kind of thing. Previously it would have been most appropriate to use tasks directly: everything runs on top of tasks (including parallel_for and parallel_invoke), and a worker thread pool is used to execute them with minimal overhead (once the pool has been set up, which is why you should have at least one long-lived task_scheduler_init object to keep the scheduler airborne).
(Added) Actually, parallel_invoke() hides the task use, so you might even try lambdas (new in C++0x).
Sorry for my confused response yesterday. I was seeing tbb::task but thinking tbb_thread, which is the construct that wraps the OS-native thread interface (that should teach me about trying to respond to forum posts during lulls in other meetings ;-). (Thanks, Anton, for noting my confusion.)
While it's true that using tbb::task directly gives you access "closer to the metal" in the TBB call hierarchy, as Raf points out, parallel_for and parallel_invoke use the same mechanism for managing threads. Whether you write the task-generating code yourself or rely on parallel_for and parallel_invoke for convenience and readability, the overhead should be similar (I'd expect parallel_invoke, which takes the two functions directly, to be a little more efficient in the case you described than parallel_for, which would need to go through a range-splitting process to create a pair of tasks to handle your X and Y). How much better scaling are you seeing using the tbb::task interface?
Raf also mentions lambda constructs, which would further improve code readability and locality, and are available in the Intel C++ compiler v11 (enable the -Qstd=c++0x command line switch). I doubt this would benefit performance at all, but it would definitely simplify maintenance.
Somewhat irrelevant to the main topic's question, but important to note: when a compiler does the job of creating a functor from a lambda expression, it has more information about the context being captured than when a programmer does it manually in a named functor class. And this information can be used for good, e.g. to enable some optimization that would be conservatively rejected otherwise. With some simple examples I saw better performance from lambda-based parallel loops than from those using naive implementations of the loop body functor.
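For readers who have not seen the two styles next to each other, here is a rough sketch of what "named functor class" and "lambda" mean for the same loop body. The names ScaleBody, scale_with_functor and scale_with_lambda are made up for this illustration, and a serial std::for_each stands in for a parallel loop so the sketch needs nothing beyond the standard library. It only shows that the two forms are equivalent in behavior and makes no claim about which one a particular compiler optimizes better.

```cpp
#include <algorithm>
#include <vector>

// Hand-written loop body: the captured state is an explicit data member.
class ScaleBody {
    double factor_;
public:
    explicit ScaleBody(double factor) : factor_(factor) {}
    void operator()(double& value) const { value *= factor_; }
};

// Applies the named functor to every element of a copy of the input.
std::vector<double> scale_with_functor(std::vector<double> data, double factor) {
    std::for_each(data.begin(), data.end(), ScaleBody(factor));
    return data;
}

// Same loop body as a lambda: the compiler generates the closure type
// and knows exactly what is captured (here, factor by value).
std::vector<double> scale_with_lambda(std::vector<double> data, double factor) {
    std::for_each(data.begin(), data.end(),
                  [factor](double& value) { value *= factor; });
    return data;
}
```

Either callable could be passed as the body of a parallel loop in the same way; the difference discussed above is in what the compiler can see, not in what the code computes.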