With the latest TBB version (tbb30_056oss) I observe that sometimes additional threads are created. For example, I create a task_scheduler_init object in the main function to request 4 threads. Then I use a thread-specific data structure to assign a unique id (0, 1, 2, 3) to each thread. But at run-time I occasionally see thread ids larger than 3, which means that more than 4 threads are created.
This breaks some existing code because I preallocate thread specific data in advance, for example:
task_scheduler_init init (4); thread_data data [4];
The thread id is used to index 'data'. This is mostly done for performance reasons, to avoid repeated allocation and deallocation of large data sets.
The TBB reference manual mentions that some implementations may create more workers than necessary. I also see that there are now new template classes (combinable and enumerable_thread_specific) that give access to thread specific storage. So I can rewrite my own code with these new primitives.
However, my company also offers an interface to our customers where we provide the thread id. We guarantee that the thread id always satisfies the condition 0 <= id < number_of_threads. But this does not work any more with the latest TBB version.
Is there a way to find out which threads are currently active and assign integer ids on that basis? Or is the idea of a simple thread id with 0 <= id < number_of_threads doomed with TBB 3.0?
With the help of ETS (enumerable_thread_specific), you can have global IDs the way you want. But for team-wide IDs, it's doomed. If there are multiple "foreign" threads (team masters) working with TBB, worker threads might migrate between them unpredictably.
We have support for TLS, and we have support for comparable "generic" thread IDs (std::thread::id). Besides that, what do you or your customers need thread IDs for?
So far we have used thread ids as a simplified interface to thread-local storage. In a typical situation we would pre-allocate storage for all threads:
data_type* data = new data_type [number_of_threads]; tbb::parallel_for (...); delete [] data;
Inside the parallel section each thread would use the thread id to access its own location in the array 'data'. In this way we can avoid repeated allocation/deallocation of data in the body of the parallel loop.
With the initial version of TBB this scheme worked as expected. But it fails as soon as additional threads appear.
By the way, in our application we have a single task_scheduler_init object in the main function. Therefore I would expect that only the requested number of threads will be created. What is the advantage of these extra threads? Isn't there additional overhead (CPU time as well as memory consumption) for each pthread_create call?
For the TLS interface, use enumerable_thread_specific. We designed it exactly for cases like the one you described, but it allows you to be agnostic of the total number of threads and of thread IDs.
If the application has a single thread that provides parallel work ("master thread" in TBB terminology), then only the requested number of threads will actively work at any given time. Additional threads can be created in some corner-case situations when it appears easier or faster to create a new thread than to synchronize with an existing one, but those extra threads will just sleep. This is a consequence of the highly asynchronous nature of the TBB scheduler (to put it differently, a consequence of the absence of internal barriers).
If your thread team may at times consist of all threads, then a TLS pointer to your buffer (initialized to NULL) can be tested at task start-up; if it is null, perform a one-time allocation.
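A minimal sketch of that lazy, once-per-thread allocation, using only the C++11 standard library. Here std::thread and thread_local stand in for TBB worker threads and whatever TLS mechanism you use; buffer_type, local_buffer, run_tasks and allocation_count are hypothetical names chosen for illustration, not TBB APIs.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical scratch-buffer type; the real data_type is not shown in the thread.
using buffer_type = std::vector<double>;

// Counts how many allocations actually happen (for illustration only).
std::atomic<int> allocation_count{0};

// Returns the calling thread's buffer, allocating it on the first call only.
buffer_type& local_buffer(std::size_t size) {
    // One pointer per thread, initially null; freed when the thread exits.
    thread_local std::unique_ptr<buffer_type> buffer;
    if (!buffer) {
        buffer.reset(new buffer_type(size));
        allocation_count.fetch_add(1);
    }
    return *buffer;
}

// Runs tasks_per_thread "tasks" on each of num_threads threads and
// returns the number of buffer allocations this run caused.
int run_tasks(int num_threads, int tasks_per_thread) {
    const int before = allocation_count.load();
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([tasks_per_thread] {
            for (int i = 0; i < tasks_per_thread; ++i) {
                buffer_type& buf = local_buffer(1024);
                assert(buf.size() == 1024);
                buf[0] += 1.0;  // stand-in for real work on the scratch space
            }
        });
    }
    for (std::thread& th : threads) th.join();
    return allocation_count.load() - before;
}
```

Calling run_tasks(4, 100) should perform one allocation per thread rather than one per task, and the code never needs to know how many threads exist in total.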
However, if your execution thread team is a known small subset of your total threads and the allocation is large, then consider a technique whereby, prior to launching the thread team, you reset an atomic variable to 0, and then at the start of each task perform an atomic exchange-add of 1 on the counter variable, using the returned value as the index into the preallocated data. There is a small penalty for the atomic exchange-add. You might want to add an assert in the _DEBUG build to verify that no more threads enter the task than allocations were made.
You can also use parallel_for_each to accomplish a similar thing (at slightly higher overhead).
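The atomic-counter idea can be sketched roughly as follows, again with plain std::thread standing in for the TBB thread team. The names run_team, next_slot and items_per_worker are hypothetical, and the sketch assumes each team member claims its slot exactly once per launch.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Launches team_size workers that each claim one preallocated slot
// and returns the per-slot results so they can be inspected.
std::vector<long> run_team(std::size_t team_size, int items_per_worker) {
    std::vector<long> data(team_size, 0);   // preallocated, one slot per team member
    std::atomic<std::size_t> next_slot{0};  // reset to 0 before every launch

    std::vector<std::thread> team;
    for (std::size_t t = 0; t < team_size; ++t) {
        team.emplace_back([&data, &next_slot, items_per_worker] {
            // Atomic exchange-add: every worker receives a distinct index.
            const std::size_t slot = next_slot.fetch_add(1);
            // Debug-build check: no more workers than allocated slots.
            assert(slot < data.size());
            for (int i = 0; i < items_per_worker; ++i) {
                data[slot] += 1;  // stand-in for real work on this slot
            }
        });
    }
    for (std::thread& th : team) th.join();
    return data;
}
```

The index obtained this way always satisfies 0 <= slot < team_size for the duration of one launch, but it is not tied to a particular OS thread across launches. If an extra thread ever joined the team, the assert would fire in a debug build.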