Use parallel_invoke(), which was recently added for this kind of thing. Previously the most appropriate choice would have been to use tasks directly: everything runs on top of tasks (including parallel_for and parallel_invoke), and a worker thread pool executes them with minimal overhead once the pool has been set up. That setup cost is why you should keep at least one long-lived task_scheduler_init object around to keep the scheduler airborne.
(Added) Actually, parallel_invoke() hides the task use, so you might even try lambdas (new in C++0x).
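To make the calling shape concrete, here is a minimal sketch of two independent pieces of work (the X and Y from the question) passed as lambdas. It deliberately uses only the standard library, so invoke_both() is a hypothetical stand-in, not TBB code: with TBB you would call tbb::parallel_invoke(f, g) with the same two lambdas, and the work would run on the scheduler's worker pool rather than on a freshly launched thread. The function sum_x_and_y() and its vectors are likewise made up for illustration.

```cpp
#include <future>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical stand-in for tbb::parallel_invoke(f, g): runs f on another
// thread and g on the calling thread, and returns once both have finished.
// This is NOT how TBB implements it (TBB schedules tasks on a worker pool).
template <typename F, typename G>
void invoke_both(F&& f, G&& g) {
    auto pending = std::async(std::launch::async, std::forward<F>(f));
    g();            // second piece of work runs on the calling thread
    pending.get();  // join; rethrows any exception thrown by f
}

// Hypothetical workload: X and Y are two independent computations whose
// results are combined only after both have completed.
long sum_x_and_y(const std::vector<int>& xs, const std::vector<int>& ys) {
    long x = 0;
    long y = 0;
    invoke_both(
        [&] { x = std::accumulate(xs.begin(), xs.end(), 0L); },  // "X"
        [&] { y = std::accumulate(ys.begin(), ys.end(), 0L); }); // "Y"
    return x + y;  // safe: both lambdas are done by the time we get here
}
```

Each lambda writes to its own variable, so no locking is needed; the only synchronization point is the join at the end of invoke_both().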
Sorry for my confused response yesterday. I was seeing tbb::task but thinking tbb_thread, which is the construct that wraps the OS-native thread interface (that should teach me about trying to respond to forum posts during lulls in other meetings ;-). (Thanks, Anton, for noting my confusion.)
While it's true that using tbb::task directly gives you access "closer to the metal" in the TBB call hierarchy, as Raf points out, parallel_for and parallel_invoke use the same mechanism for managing threads. Whether you write the code to generate tasks yourself or rely on parallel_for and parallel_invoke for convenience and readability, the overhead should be similar. If anything, I'd expect parallel_invoke, which takes tasks directly, to be a little more efficient in the case you described than parallel_for, which would need to go through a task-splitting process to create a pair of tasks to handle your X and Y. How much better scaling are you seeing using the tbb::task interface?
Raf also mentions lambda constructs, which would further improve code readability and locality, and are available in the Intel C++ compiler v11 (enable the -Qstd=c++0x command line switch). I doubt this would benefit performance at all, but it would definitely simplify maintenance.