Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Best way to deal with thread specific workspace

Plagne__Laurent
Beginner

Dear TBB experts!

 

I would like to know the best solution to the following simple problem. I want to perform a parallel_for with a work functor that requires some workspace (e.g. an array that must not be accessed concurrently).

 

1) The simplest solution is to use an automatic vector inside the operator() of my functor:

struct WorkFunctor {
  void operator()(const tbb::blocked_range<size_t>& r) const {
    std::vector<double> workspace(2048);  // allocated and zero-initialised for every chunk
    // ... do some work
  }
};
 
Unfortunately, for small granularities this solution is slow because of the time spent allocating and deallocating the workspace array.
 

2) I may improve the situation by using a TBB allocator such as tbb::cache_aligned_allocator:

struct WorkFunctor {
  void operator()(const tbb::blocked_range<size_t>& r) const {
    std::vector<double, tbb::cache_aligned_allocator<double> > workspace(2048);
    // ... do some work
  }
};

3) I can improve performance a bit by using a fixed-size automatic array, but I have to be very careful with the stack size per thread (I have encountered erratic bugs due to this issue).

struct WorkFunctor {
  void operator()(const tbb::blocked_range<size_t>& r) const {
    double workspace[2048];  // lives on the worker thread's stack
    // ... do some work
  }
};

4) I wonder whether thread-specific storage (tbb::enumerable_thread_specific) is the appropriate solution. I have found very few examples of it.
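For concreteness, here is a minimal sketch of what I have in mind (the names and the 2048 size are just placeholders, and I do not know yet whether this pattern is safe):

#include <vector>
#include "tbb/enumerable_thread_specific.h"
#include "tbb/blocked_range.h"

typedef std::vector<double> Workspace;

// one workspace per worker thread, copied lazily from the exemplar
tbb::enumerable_thread_specific<Workspace> tls_workspace(Workspace(2048));

struct WorkFunctor {
  void operator()(const tbb::blocked_range<size_t>& r) const {
    Workspace& workspace = tls_workspace.local();  // this thread's private instance
    // ... do some work on r using workspace ...
  }
};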
 

Thank you in advance for your help.

Plagne__Laurent
Beginner

Thank you again for your support !

I confess that I was not able to follow the last posts about TLS (?), which fly a little far above my head ;)

As a non-TBB expert, my conclusions would be:

* The normal, safe pattern for dealing with a temporary workspace is to use automatic vector variables defined in the scope of the functor's operator(), together with the scalable TBB allocator. In most cases the allocation cost should be negligible.

* In order to remove the allocation cost completely, one could consider using the TBB TLS mechanism, but this implies completely controlling the context of the parallel loop.

At the present time I am not 100% satisfied with these conclusions, for two reasons:

* I still don't understand why the TLS construct is unsafe: maybe it is possible to build a simple example where this construct fails. I recall that the workspace obtained by the .local() method does not need to be specifically bound to a given thread. The only requirement is that the thread calling local() gets a private workspace instance (not accessible by other threads during the functor execution).

* I still wonder why my request seems to be so singular: I find the need for a structured workspace, private to each concurrent thread performing a piece of a parallel loop, rather natural. Since I have found neither a simple, safe TBB construct that fulfils this need, nor a clear explanation in the TBB documentation, tutorials or forum of why this need is unjustified, I am afraid that my request is based on a personal, fundamental misconception of the TBB programming paradigm, or more generally of shared-memory programming principles...

 

RafSchietekat
Valued Contributor III

Have you timed how long it takes to allocate just an array (to avoid initialisation by a vector), and what happens if granularity is increased somewhat?

Maybe with direct knowledge of the application it would be possible to trigger the problem, but it might not be easy, because a thread has to be robbed and then it has to steal from the robber, or a more complicated scenario (no clear opinion at this time about which is the less unlikely). The innermost task assumes it can just overwrite the workspace, but the original task would be sabotaged if that happens. It's not like a cache, or a reduction of a commutative operation, or something else where things may safely happen out of order. Do you want to live in fear that the user of your code uses a nested TBB algorithm and causes this to happen? Or is there a reason why this would be impossible?

This particular problem arises from the desire to second-guess the scalable allocator by declaring a TLS instance in a wider scope (outside the parallel_for instead of inside a Body), and I can't immediately think of another situation where you would need a construct with exactly the properties needed here (apparently shared but requiring isolation), out of the box. But you can use TLS with a pool of workspaces as described, where each Body takes out a new or free workspace, and where all code operating on a "chunk" (subrange executed by a Body) uses the same reference.
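Roughly like this (only a sketch, with invented names and untested): each thread keeps a pool of free workspaces in TLS; a Body checks one out on entry and returns it on exit, so a nested Body running on the same thread gets a different workspace.

#include <vector>
#include "tbb/enumerable_thread_specific.h"
#include "tbb/blocked_range.h"

typedef std::vector<double> Workspace;
typedef std::vector<Workspace*> Pool;            // per-thread stack of free workspaces
static tbb::enumerable_thread_specific<Pool> tls_pool;

struct WorkspaceLease {                          // checks a workspace out of the pool
  Pool& pool;
  Workspace* ws;
  WorkspaceLease() : pool(tls_pool.local()) {
    if (pool.empty()) {
      ws = new Workspace(2048);                  // allocate only when the pool is empty
    } else {
      ws = pool.back();                          // otherwise reuse a free one
      pool.pop_back();
    }
  }
  ~WorkspaceLease() { pool.push_back(ws); }      // return it for reuse by this thread
};                                               // (pooled workspaces are never freed in this sketch)

struct WorkFunctor {
  void operator()(const tbb::blocked_range<size_t>& r) const {
    WorkspaceLease lease;          // private to this chunk, even if a nested task runs here
    Workspace& workspace = *lease.ws;
    // ... do some work on r using workspace ...
  }
};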

Of course, errare humanum est...

jiri
New Contributor I

Trying to make a simple "conclusion" of the TLS discussion: there is a dangerous block in your code - the block that uses the local workspace obtained from TLS. If you make a dangerous call to TBB (I'll explain later) in the dangerous block, it could damage the local workspace and thus make your results incorrect. To make it worse, the error will most likely go undetected (no exception or assertion triggered) and occur only randomly from time to time. The question is "what is a dangerous call?". It is any call to TBB that would allow the TBB scheduler to spawn a different task on the calling thread: mainly running a different parallel algorithm, or invoking any of the functions that wait for tasks to complete (wait_for_all, spawn_and_wait_for_all, spawn_root_and_wait). Naturally, running a function (e.g., from some third-party library) that uses TBB internally is also dangerous.
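For illustration, a hypothetical example of such a dangerous call (the workspace contents and sizes are invented; only the structure matters):

#include <vector>
#include "tbb/enumerable_thread_specific.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

tbb::enumerable_thread_specific<std::vector<double> > tls(std::vector<double>(2048));

struct InnerBody {
  void operator()(size_t) const { /* unrelated inner work */ }
};

struct OuterBody {
  void operator()(const tbb::blocked_range<size_t>& r) const {
    std::vector<double>& w = tls.local();   // start of the dangerous block
    for (size_t i = r.begin(); i != r.end(); ++i)
      w[i % w.size()] = double(i);          // w now holds intermediate data

    // Dangerous call: while this thread waits for the inner loop to finish, the
    // scheduler may hand it another chunk of the OUTER loop; that body calls
    // tls.local(), gets the very same vector and overwrites the intermediate data.
    tbb::parallel_for(size_t(0), size_t(1000), InnerBody());

    // ... code here may read corrupted contents of w ...
  }
};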

If no dangerous call is made in the dangerous block, you should be OK. As long as nobody changes your code to include a dangerous call...

jimdempseyatthecove
Honored Contributor III

Using TLS (Thread Local Storage) is sound practice provided:

a) The Task (not thread) that produced the data within the TLS does not change thread context between/during/after the run of the task. In TBB this is the case; in Cilk++ it is not necessarily the case. I suggest you insert a #define ... filter that fails the compilation should you compile for a threading paradigm that does not meet this requirement. BTW I use TLS a lot, but I am careful.

b) The routine using the TLS is not recursive when the expectation (requirement) is that it is not.

Jim Dempsey

Plagne__Laurent
Beginner

Thank you,

Requirements a) and b) can be met in my case (TBB only), and the routine is not recursive.

I have not tried (yet) to use an uninitialized array. In this case I guess that I should use scalable_malloc from the C TBB interface.
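Something like this, I suppose (just a sketch, error handling omitted):

#include "tbb/scalable_allocator.h"
#include "tbb/blocked_range.h"

struct WorkFunctor {
  void operator()(const tbb::blocked_range<size_t>& r) const {
    // scalable_malloc returns uninitialised memory, so we only pay for the
    // allocation itself, not for zero-filling 2048 doubles
    double* workspace = static_cast<double*>(scalable_malloc(2048 * sizeof(double)));
    // ... do some work on r using workspace ...
    scalable_free(workspace);
  }
};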

RafSchietekat
Valued Contributor III

laurent.plagne wrote:

the routine is not recursive

If you insist on simple TLS (instead of thread-specific pools), make sure that you don't call TBB algorithms, or unknown code (including caller-provided code) that might call TBB algorithms, and put warnings all over the code you do use, to make clear to those who will maintain it what to avoid.

laurent.plagne wrote:

In this case I guess that I should use the scalable_malloc from the C tbb interface.

How about using the STACK_BASED Workspace from #7 as the element type in a vector of 1 element with the scalable allocator, which also takes care of cleaning up? You may also have to provide a default constructor for Workspace that doesn't initialise the array.
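Something along these lines, assuming the Workspace from #7 is essentially a struct wrapping a fixed-size array (it isn't repeated here, so the details are a guess):

#include <vector>
#include "tbb/scalable_allocator.h"
#include "tbb/blocked_range.h"

struct Workspace {
  double data[2048];
  Workspace() {}                 // deliberately leaves data uninitialised
};

struct WorkFunctor {
  void operator()(const tbb::blocked_range<size_t>& r) const {
    // one heap-allocated Workspace per chunk, released automatically when the
    // vector goes out of scope; the scalable allocator keeps the allocation cheap
    std::vector<Workspace, tbb::scalable_allocator<Workspace> > w(1);
    double* workspace = w[0].data;
    // ... do some work on r using workspace ...
  }
};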

BTW, I also noticed that you use (void) in the default constructors, but () is preferred.

Don't forget to experiment with the grainsize!

(2013-12-20 Added) I've observed on OS X that even an automatic array of floating-point values seems to get initialised, at first sight anyway. Very inconsistent results with integral types, depending on debug/release and/or size, and with mysteriously recurring non-zero values. Your mileage may vary, and I didn't try the exact suggestion, just various nonsystematic combinations of automatic/new and single/array, but it may be necessary to use scalable_malloc directly after all.
