I am experimenting with different implementations of the parallel_while pattern for my specific application, which includes parallel graph traversal in topological order. I hope to find a more efficient implementation, but so far my implementations have only been worse than yours.
My question is: did you try different implementations, and what is the motivation for the current choice? What is the purpose of the 2-level task hierarchy used in TBB 2.1.0: while_group_task and while_iteration_task? At first glance it seemed I could implement more efficient parallel topological graph ordering without the extra classes, but I have failed so far...
The input to parallel_while is inherently serial. If you look at the documented requirements for a Stream class to be used with parallel_while, you will see that its only required method, pop_if_present, is guaranteed never to be invoked concurrently. Usually, streams are serial; think of reading a file, or of traversing a container that does not support random-access iterators. Thus, parallel_while has to deal with this serial inefficiency.
First, the algorithm assumes that the stream is already filled with data, so its pop_if_present does not block and returns data elements rather quickly. Thus it makes sense to pop a few elements at once, to decrease the overhead required to synchronize access to the stream; this job is done by while_task. The data processing presumably takes much longer, so it makes sense to have a separate task for each element for better load balancing; the per-element processing is done by while_iteration_task. However, these processing tasks have to be spawned for execution first, and this job is done by while_group_task. Why not by while_task, you might ask? The latter is re-executed to select the next few elements from the stream; there is only one instance of this task, which is implicitly used as the lock that serializes pop_if_present (you might think of getting this task from the pool as the lock acquisition, and re-spawning it back to the pool as the lock release). The "lock" should be held only as long as it is required to protect the stream, thus while_task delegates spawning to its successor. I hope the task hierarchy in parallel_while now makes more sense to you.
Another thing worth mentioning is the parallel_while::add method. It is intended to add more work on the fly, if you find it while processing the elements of the initial stream. This operation is rather straightforward; it just spawns a new while_iteration_task to process the given element. A natural implementation of DAG topological traversal could start with only the root element (or elements, if you have a forest of DAGs) in the initial stream, and processing of each element would add the elements it has edges to for further processing. This implementation uses while_task and while_group_task minimally, and the serial nature of the initial stream is well compensated when much more work is added on the fly. I believe the parallel_preorder example shipped with TBB implements something like this.