Raf_Schietekat:Have you determined how much faster your memory-pools approach is than using TBB's allocator? For example, using a concurrent_queue will not help keep memory local to the cache of the thread that last used it, something TBB's allocator will do for you.
Raf_Schietekat:Assuming that using a pool of something (not necessarily memory) is a good thing, you could use a modified Singleton pattern with the static variable declared as thread-local, registering it with a global concurrent_vector for later cleanup. (I did not actually understand what you meant starting with "But this only [...]".)
A couple of side notes first.
- __declspec(thread) had some limitations when used together with dynamically loaded libraries, at least prior to Windows Vista and the latest Visual Studio; the relevant details are documented in MSDN. We found it more reliable to use the Tls* functions for thread-local storage.
- in Visual Studio 2003 and 2005, the default allocator (malloc) was awfully slow in multithreaded mode. The TBB scalable allocator is significantly faster for small objects and scales well with a growing number of threads. You might look at the tree_sum example in the TBB package; it compares the TBB scalable allocator with malloc.
Now to the point. It is not guaranteed that future versions of TBB will always keep a constant number of threads in the thread pool. We might apply techniques such as dynamically adjusting the number of active threads depending on total system load, temporarily substituting for worker threads that are blocked, and who knows what else in the future.
However, there is a solution to your initialization and clean-up problem. Recently a new class, task_scheduler_observer, was introduced into TBB. It lets you execute a user-defined function (specifically, one of two virtual methods in a class that inherits from task_scheduler_observer) each time a thread enters or exits the TBB scheduler. I think that might work well for your case.
I'll confirm the clarification that Raf suggested: when I said the count of threads available in the pool may vary from execution to execution, I meant that it is a run-time value that depends on the machine you happen to be running on; I typically bounce code back and forth between my two-PE laptop and my eight-PE workstation. In the quoted statement I was arguing against programming to a model that assumes a fixed number of available processors. The comments Alexey added make that suggestion even more relevant: the number of available Processing Elements will become more dynamic as we develop ever more adaptable means of scheduling the available and growing pool of PEs, as PE counts in the underlying hardware grow with new generations of processors.
One other comment: at one point Christian asks whether there is a way to disable task stealing. I'm uncertain of the motivation here, but it is my observation that task stealing is precisely how tasks created with the example's grainsize-1 parallel_for get distributed to the other pool threads. Moreover, it is during these task-stealing events that the scheduler checks whether a task_scheduler_observer call-back should be invoked to alert the application about TBB task scheduler activity. There is also test code for task_scheduler_observer that demonstrates how to use it to establish thread-local storage for each of the worker threads.