Intel® oneAPI Threading Building Blocks

tbb::combinable::local() is too slow -- and suggestion

Peter_F_2
Beginner

see also

http://stackoverflow.com/questions/30407691/tbbcombinablelocal-is-too-slow/30585151#30585151

I would appreciate it if the following functionality were available:

1) Every thread in the TBB thread pool should have an integer index associated with it. This index is in the range 0..NumberOfThreadsInPool-1.

2) Every combinable consists of a vector of NumberOfThreadsInPool elements.

3) combinable::local() requires the index from 1) to be passed. It is simply an indexed access to the vector in 2).

4) Every function object passed to any of the parallel_foreach/parallel_for algorithms requires that the index from 1) be passed to it as an argument.

This would avoid the slow access of ThreadLocalStorage and whatever else is done in local().
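
Something along these lines (all the names below are made up, just to illustrate the idea; none of this exists in TBB):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container: one slot per pool thread, plain indexed access instead of a TLS lookup.
template <typename T>
class indexed_combinable {
    std::vector<T> slots_;                        // 2) one element per thread in the pool
public:
    explicit indexed_combinable(std::size_t pool_size) : slots_(pool_size) {}

    T& local(std::size_t thread_index) {          // 3) simply an indexed access
        return slots_[thread_index];
    }

    template <typename F>
    void combine_each(F f) {
        for (T& slot : slots_) f(slot);
    }
};

// 4) The parallel algorithms would pass the thread index from 1) to the body as an extra argument:
//
//   indexed_combinable<long> acc(pool_size);
//   hypothetical_parallel_for(0, n, [&](int i, std::size_t thread_index) {
//       acc.local(thread_index) += i;
//   });
```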

 

What do you think?!

3 Replies
RafSchietekat
Valued Contributor III

You might get away with this for TBB worker threads (assuming that they are preserved by your particular RML implementation!), but even TBB master threads can pop in and out of existence, leading to ever-increasing index values and array sizes. So not only can you not assume a fixed-size array, but you would also have to have TLS-specific index values to avoid large holes (allowing you to start from 0 each time), which probably defeats the purpose (see below). Imposing restrictions to get around that seems rather error-prone.

Since looking up the index value of a thread is just as slow as looking up any individual local() reference (right?), I must assume that you are accessing many TLS variables in each local scope, and you want to amortise the cost of one TLS lookup over all those TLS variables. But isn't it then easier to just flip the orientation from a bunch of TLS variables to a TLS of one bunch, where you can implement the bunch as a struct, or would that not be possible in your application?
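
For example (just a sketch of the "TLS of bunch" idea; the struct and its fields are invented here):

```cpp
#include <cstddef>
#include <tbb/blocked_range.h>
#include <tbb/combinable.h>
#include <tbb/parallel_for.h>

// One struct holding everything that would otherwise be a separate TLS variable.
struct Bunch {
    long   count = 0;
    double sum   = 0.0;
};

tbb::combinable<Bunch> tls;

void work(const double* data, std::size_t n) {
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n),
        [&](const tbb::blocked_range<std::size_t>& r) {
            Bunch& b = tls.local();               // one TLS lookup covers the whole bunch
            for (std::size_t i = r.begin(); i != r.end(); ++i) {
                b.count += 1;
                b.sum   += data[i];
            }
        });
    // Combine afterwards, e.g. tls.combine_each([](const Bunch& b) { /* reduce */ });
}
```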

Christophe_H_Intel

Hi, Peter,

Raf is right about one solution; you can use TLS directly if you are hitting thread-local storage a lot.

The implementation you described is almost exactly the implementation of enumerable_thread_specific (the storage for combinable). There are two things that differ (in the ets_no_key case):

  1. You cannot depend on the values TBB uses for thread IDs; each OS has its own version, and the only guarantee is that they don't change for a particular thread.  That is why ETS uses the thread ID to access a lock-free, expandable hash table that points to the thread's element.
  2. You cannot depend on the location of an element of a std::vector being constant.  If you add an element to a vector, it may allocate new space and copy the elements to it, invalidating every local() reference that is "in flight".  ETS uses concurrent_vector, which guarantees that if you take the address of element N, that address will not change as long as the ETS is not destroyed (see the small illustration after this list).  The downside is that it is segmented, so finding a particular element can take a little time.
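
To illustrate the address-stability point (my own small example, not from the TBB sources):

```cpp
#include <tbb/concurrent_vector.h>
#include <cassert>

int main() {
    tbb::concurrent_vector<int> cv;
    cv.push_back(42);
    int* p = &cv[0];                   // address of the first element

    for (int i = 0; i < 100000; ++i)
        cv.push_back(i);               // growth adds new segments...

    assert(p == &cv[0] && *p == 42);   // ...but never relocates existing elements
    return 0;
}
// A std::vector gives no such guarantee: reallocation may move every element.
```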

Raf had a good suggestion (basically rolling your own TLS-based structs.)  There are a couple other things you might try:

  1. In combinable.h the template parameter for the ETS component is ets_no_key.  The specification for combinable didn't include an implementation type for the ETS component, but you can try changing ets_no_key to ets_key_per_instance to see if it reduces the time spent in local().  (This was Arch's suggestion in the Stack Overflow comment.)
  2. Depending on how you are accessing the local instances, you can create a reference to the type and initialize it to local() at the start of the body you are executing.  (This was the suggestion by Rick in the first Stack Overflow answer.)  If this doesn't reduce the contribution of local() to your runtime, it may be that you are doing too little work per execution instance.  A rough sketch of both ideas follows.
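
Here is my own sketch combining both suggestions (modifying combinable.h itself is not shown; instead, enumerable_thread_specific is used directly so the key-usage policy can be chosen):

```cpp
#include <tbb/blocked_range.h>
#include <tbb/cache_aligned_allocator.h>
#include <tbb/enumerable_thread_specific.h>
#include <tbb/parallel_for.h>

// 1. ets_key_per_instance: this container gets its own native TLS key,
//    which can make local() cheaper than the default ets_no_key hashing.
using Counter = tbb::enumerable_thread_specific<
    long, tbb::cache_aligned_allocator<long>, tbb::ets_key_per_instance>;

Counter counter(0);

void count(int n) {
    tbb::parallel_for(tbb::blocked_range<int>(0, n),
        [&](const tbb::blocked_range<int>& r) {
            // 2. Call local() once per chunk and keep the reference.
            long& c = counter.local();
            for (int i = r.begin(); i != r.end(); ++i)
                c += 1;
        });
}
```

Keep in mind that ets_key_per_instance consumes a native TLS key per container, and those are a limited resource, so it is only appropriate for a small number of frequently used, long-lived containers.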

Regards,
Chris

RafSchietekat
Valued Contributor III

"basically rolling your own TLS-based structs"

I would still use tbb::combinable, but combine the instances (pun not intended), ...

"you can try changing ets_no_key to ets_key_per_instance"

...or tbb::enumerable_thread_specific, to be able to change the template parameter from the outside. (Hmm, ETS seems to have all the features of Combinable, so why is it not a subclass of Combinable instead of Combinable being a wrapper around ETS? Historic reasons?)

"If this doesn't reduce the contribution of local() to your runtime, it may be that you are doing to little work per execution instance."

This solution is ruled out because "I only call local() once for any one object", which I take at face value, i.e., the "worker function" is not just called in a straightforward parallel_for() where the implementation neglects to take references just once for each chunk; rather, something more complicated is going on, where a simple refactoring of the code without also refactoring the data is not an option.
