In 2006, Intel submitted to the C++ standards committee a "Proposal to Add Parallel Iteration to the STL" based on TBB: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2104.pdf which is very close to common STL programming practice.
Does Intel intend to pursue standardization, or rather continue to evolve TBB in parallel?
Another question: is there a plan to provide an entirely sequential form of the algorithms parallel_for, parallel_while, and parallel_reduce? This would ease debugging, and would allow software to be 'prepared' for future parallelization by giving it the right algorithmic structure from now on. Parallelizing means that race conditions are very often suspected, which makes it necessary to test without threads first. For this reason, I developed a library similar to TBB (RPA = Range Partition Adaptors, on SourceForge) where each component has a sequential counterpart (obtained by passing a dummy thread type as a template parameter). This allows testing to be split into two parts: first the sequential tests, and then extra tests with threads (... and with different types of threads and synchronization primitives, to be sure).
This technique could probably be used successfully for TBB: for example, nothing prevents a 'tbb::parallel_for' from being purely sequential, 'just to test, first'.
I know that questions of standardization have been discussed, but I don't know enough of the details, so I'll leave comment on that to a later discussion.
Regarding "entirely sequential," there is already a mechanism that more or less gives you that. If you instantiate the thread pool with only a single member (pass a processor count of 1 to the task_scheduler_init object), TBB will use a single thread (the main one) but with all the machinery for task handling. In fact, this technique is part of a method for finding the optimal grain size for blocked ranges when using the simple partitioner.
Using this technique, there is no separate sequential counterpart: all the code is the same, but all the tasks are scheduled onto a single thread. I'd fear using separate sequential code, which could get out of sync as the algorithm evolves.
This suggestion would certainly add to the flexibility of the thread-creation interface, but it seems to me to go against the guiding philosophy behind the TBB task scheduler. Because the number of available concurrent threads is unknown, we want to remove programmer concerns about that number as much as possible. If there were to be another template parameter, perhaps it might be a task scheduler object that could do something different with the partitionable task blocks, say on a NUMA architecture, which could have radically different scheduling priorities for optimal performance.
As it is now, at the creation of the task_scheduler_init object, a thread pool of user-controllable size (by default, one thread per concurrent hardware thread of execution) is created and retained for the life of the object. This avoids oversubscription, which can tie up resources and slow things down, especially when parallel functions call other parallel functions. We'd also like to avoid the extra overhead of routine thread creation and destruction.