Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Access to Thread Local Storage via TBB

christian_roessel
Hi,

thanks for the great TBB library!

I have a timestep-based simulation where I use parallel_for every timestep. In order to reduce memory allocations I use several memory pools, stored in a concurrent_queue (actually, a pointer to each pool is stored in the queue). In operator(), every task pops a pool pointer from the queue, performs its operation over the range and pushes the pointer back to the queue.
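
The pop/use/push pattern described above might be sketched as follows. This is only an illustration under assumptions: Pool, LockedQueue, PoolLease and process_range are hypothetical names, and LockedQueue is a mutex-guarded stand-in that mirrors the try_pop/push shape of tbb::concurrent_queue so that the sketch compiles without TBB (older TBB releases named the non-blocking pop pop_if_present). The RAII lease is a design choice of the sketch, not something stated in the post; it returns the pointer to the queue even if the body throws.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <queue>

// Hypothetical stand-in for the poster's memory pool.
struct Pool {
    std::size_t uses = 0;
};

// Stand-in with the same try_pop/push shape as tbb::concurrent_queue.
template <typename T>
class LockedQueue {
public:
    void push(const T& value) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push(value);
    }
    bool try_pop(T& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return false;
        out = items_.front();
        items_.pop();
        return true;
    }
    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return items_.size();
    }
private:
    mutable std::mutex mutex_;
    std::queue<T> items_;
};

// Pops a pool pointer on construction and pushes it back on destruction.
template <typename Queue>
class PoolLease {
public:
    explicit PoolLease(Queue& queue) : queue_(queue), pool_(nullptr) {
        queue_.try_pop(pool_);  // pool_ stays null if the queue is empty
    }
    ~PoolLease() {
        if (pool_ != nullptr) queue_.push(pool_);
    }
    PoolLease(const PoolLease&) = delete;
    PoolLease& operator=(const PoolLease&) = delete;
    Pool* get() const { return pool_; }
private:
    Queue& queue_;
    Pool* pool_;
};

// What the body's operator() would do for one sub-range [begin, end).
template <typename Queue>
bool process_range(Queue& pools, std::size_t begin, std::size_t end) {
    PoolLease<Queue> lease(pools);
    if (lease.get() == nullptr) return false;  // no pool available
    for (std::size_t i = begin; i < end; ++i) {
        ++lease.get()->uses;  // placeholder for the real per-element work
    }
    return true;
}
```

Because PoolLease is a template over the queue type, the same lease should also work with a tbb::concurrent_queue<Pool*> in a TBB release that provides try_pop.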

Recently I heard about thread-local-storage that can be easily used in Visual Studio (via "__declspec( thread )"). In my case, it would be sufficient to have a pool for each TBB hardware thread. The access to the pool via the concurrent_queue could be omitted in this scenario.

But this only makes sense if the TBB thread pool is fixed after initialization. Does this assumption hold? If yes, I need to do initialization and clean-up operations on my memory pools. I can't do this in a ctor and dtor because one cannot declare objects with a ctor and dtor as thread-local (in Visual C++). If I use a parallel_for with grainsize 1 over the number of cores (the task scheduler is initialized with the same number), task stealing may happen and not all memory pools are accessed.

Is there a way to prevent task stealing? What do you think about the scenario in general? Are there alternatives?

Thanks and regards,
Christian

RafSchietekat
Valued Contributor III
Have you determined how much faster your memory-pools approach is than using TBB's allocator? For example, using a concurrent_queue will not help keep memory local to the cache of the thread that last used it, something TBB's allocator will do for you.

Assuming that using a pool of something (not necessarily memory) is a good thing, you could use a modified Singleton pattern with the static variable declared as thread-local, registering it with a global concurrent_vector for later cleanup. (I did not actually understand what you meant starting with "But this only [...]".)

christian_roessel
Raf_Schietekat:
Have you determined how much faster your memory-pools approach is than using TBB's allocator? For example, using a concurrent_queue will not help keep memory local to the cache of the thread that last used it, something TBB's allocator will do for you.


Our experience with the single-threaded version and the default Visual Studio malloc/new shows that the pool version is much faster. There are many allocations and deallocations of small objects each timestep. I could try to allocate a pool for each task (and deallocate it afterwards) in order to maintain cache locality, but I doubt that this is faster than one pool per thread that gets allocated once. We can also change our data structures to be more multithreading-friendly, but in legacy systems this may have undesirable side effects ... :-)


Raf_Schietekat:
Assuming that using a pool of something (not necessarily memory) is a good thing, you could use a modified Singleton pattern with the static variable declared as thread-local, registering it with a global concurrent_vector for later cleanup. (I did not actually understand what you meant starting with "But this only [...]".)



Ok, got it. I'll try this.

The "But this only [...]" section refers to part of a recent post:

mad reed:

Intel Threading Building Blocks takes a specific approach to the application of parallelism to application code, trying as much as is possible to abstract the notion of processors into a generic resource, a pool whose size may vary from execution to execution and evolve to larger numbers as time goes on, a pool better left to be scheduled dynamically depending on the resources available and the moment-by-moment parallelism exposed in the application.


If this resource pool varies from execution to execution, I had no idea how to do proper initialization and clean-up. With your suggested Singleton-like solution and its associated concurrent_vector this seems to be easy, even if the pool size isn't fixed.

Thanks,
Christian
RafSchietekat
Valued Contributor III
I don't see how the default Visual Studio allocator enters into the comparison (you can and probably should substitute TBB's). You also give no indication why the pool might perform better than a general allocator like TBB's, e.g., a lot of traffic in same-sized memory chunks that might have a chance of being better handled by a specialised pool (I presume), so you should probably make sure you try TBB's allocator as well. Another option might be pools of refurbishable objects, if that is cheaper than always creating new ones.

A correction: you can probably get away with a conventionally serialised std::vector for registration (it would only be accessed a few times).
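
The thread-local Singleton idea with a conventionally serialised std::vector for registration might look roughly like the sketch below. It is only an illustration under assumptions: it uses the C++11 thread_local keyword rather than __declspec(thread), plain std::thread workers in place of TBB worker threads, std::make_unique from C++14, and the names Pool, PoolRegistry, registry, local_pool and demo are all hypothetical. The thread-local variable is a raw pointer, so it needs no constructor or destructor; the registry owns the pools for later cleanup.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a memory pool.
struct Pool {
    std::size_t uses = 0;
};

// Owns every pool created so far. Each thread registers only once,
// so a mutex-guarded std::vector is sufficient.
class PoolRegistry {
public:
    Pool* create() {
        std::lock_guard<std::mutex> lock(mutex_);
        pools_.push_back(std::make_unique<Pool>());
        return pools_.back().get();
    }
    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return pools_.size();
    }
    // User-initiated cleanup. Call only when no live thread will touch
    // its pool again, because thread-local pointers are not reset.
    void clear() {
        std::lock_guard<std::mutex> lock(mutex_);
        pools_.clear();
    }
private:
    mutable std::mutex mutex_;
    std::vector<std::unique_ptr<Pool>> pools_;
};

inline PoolRegistry& registry() {
    static PoolRegistry instance;  // process-wide registry
    return instance;
}

// Returns the calling thread's pool, creating and registering it on first use.
inline Pool& local_pool() {
    thread_local Pool* pool = nullptr;
    if (pool == nullptr) pool = registry().create();
    return *pool;
}

// Demo: each of n_threads threads obtains its own pool; returns how many
// pools were registered, then cleans up.
inline std::size_t demo(unsigned n_threads) {
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < n_threads; ++i) {
        threads.emplace_back([] {
            Pool& first = local_pool();
            ++first.uses;
            assert(&first == &local_pool());  // same pool on every call in this thread
        });
    }
    for (auto& t : threads) t.join();
    const std::size_t registered = registry().size();
    registry().clear();  // safe here: every thread that used a pool has exited
    return registered;
}
```

As written, this only works if the threads that own pools either have exited or never touch their pool again before clear() is called.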

I'm sure Robert Reed didn't mean things would be unpredictable between runs on the same machine, but this quote certainly makes it sound that way... :-)

(Added after Alexey's reply below) My suggestion would only work as-is with long-lived worker threads (compared to the time until user-initiated cleanup). Variability in number of threads during execution actually makes more sense to me than variability between executions (on the same machine). task_scheduler_observer should therefore be used to manage a pool of pools, as it were.

Alexey-Kukanov
Employee

A couple of side notes first.
- __declspec(thread) had some limitations when used together with dynamically loaded libraries, at least prior to Windows Vista and the latest Visual Studio. The relevant info can be found in MSDN. We found it more reliable to use the Tls* functions for thread-local storage functionality.
- In Visual Studio 2003 and 2005, the default allocator (malloc) was awfully slow in multithreaded mode. The TBB scalable allocator is significantly faster for small objects, and scales well with a growing number of threads. You might look at the tree_sum example in the TBB packages; it compares the TBB scalable allocator with malloc.

Now to the point. It is not guaranteed that future versions of TBB will always keep a constant number of threads in the thread pool. We might apply techniques such as dynamic adjustment of the number of active threads depending on total system load, temporary substitution for worker threads that are blocked, and who knows what else in the future.

However, there is a solution to your initialization and clean-up problem. Recently, a new class called task_scheduler_observer was introduced into TBB. It provides functionality to execute a user-defined function (to be exact, one of two virtual methods in a class that inherits from task_scheduler_observer) each time a thread enters or exits the TBB scheduler. I think that might work well for your case.

christian_roessel
Yes, I read the note about __declspec(thread) and DLLs. In my particular case, however, this is not relevant. Thanks for the TLS* hint.

I made some quick tests that revealed that the TBB allocator is indeed extremely fast in my use case compared to VS2005 malloc, both in single-threaded mode. These tests also revealed a memory leak ... I have to investigate this further and will come back with some hard data. This will take some time, however.

Ok, the task_scheduler_observer will solve the initialization and clean-up problem, if I still have to use a pool.

Thanks for your help,
Christian

robert-reed
Valued Contributor II

I'll confirm the clarification that Raf suggested: when I said the execution-by-execution count of threads available in the pool may vary, I meant that it is a run-time value which may vary depending on the machine you happen to be running on. I typically bounce code back and forth between my two-PE laptop and my eight-PE workstation. In the quoted statement, I was arguing against programming to a model that assumes a fixed number of available processors. The comments Alexey added make that suggestion even more relevant: the number of available Processing Elements will become more dynamic as we develop ever more adaptable means to schedule the available and growing pool of PEs (as the PE counts in the underlying hardware evolve with new generations of processors).

One other comment: at one point Christian asks whether there's a way to disable task stealing. I'm uncertain of the motivation here, but it is my observation that task stealing is precisely the way that tasks created with the example grainsize 1 parallel_for get distributed to the other pool threads. Moreover, it is during these task stealing events that the task_scheduler_observer checks for call-backs to alert the application about TBB task scheduler activity. There is also test code for the task_scheduler_observer that demonstrates how to use it to establish Thread Local Storage for each of the worker threads.
