Switch off task stealing in parallel_for ?

I have sequential iterations where each iteration uses data from the previous one to make the result a bit more optimal.
The same iterations run with parallel_for benefit from the same optimization most of the time, because each thread processes consecutive chunks until its original share of 1/N of the range is done. Then it starts helping other threads (stealing tasks), and the optimization is no longer possible because the data from the previous iteration is not available.

Task stealing is a bit random between runs of the program, and the result is slightly different each time. This makes regression testing problematic.
I'd like to switch off task stealing for regression testing. Is there an easy way to do it?

So far I came up with the following scheme, which I am not very happy with.
After completing a chunk, I check whether the next sequential chunk has been started yet. If it has, this thread would start stealing tasks next, so I want to make the thread not return from its task until the other threads have finished all the work. I use an atomic counter to accumulate the length of the chunks done, and once it reaches the total length it signals the waiting threads that they can proceed to exit.
One problem I have with this is how to wait. I tried using tbb::mutex: I acquire it before parallel_for, release it when the work is done, and also acquire it with a scoped_lock to wait. This doesn't work so well when the main thread, which already holds the original lock, needs to wait: it doesn't wait but returns immediately (and also asserts in a debug build). Is there any other way to wait for the completion of other threads using the TBB API?
6 Replies

For regression testing you may try the following:

tbb::parallel_for( tbb::blocked_range<int>(LoopBegin, LoopEnd, (LoopEnd-LoopBegin)/NumberOfThreads), LoopBody, tbb::simple_partitioner() );

This way there will be just as many sub-ranges to process as there are threads, so no stealing happens after the initial work distribution. If the total amount of work is sufficiently large, each thread will get exactly one portion of the whole iteration space.

This code is not recommended for production, as it leaves no slack for load balancing.

Thanks Alexey,

This is exactly what I finally did and was just about to post.
Only I arrived at (LoopEnd-LoopBegin)/NumberOfThreads+1. Without the +1, e.g. with 2 threads, an odd number of elements would produce 3 chunks.

Alexey's approach is what I would do.

For the optimization, it might help to use parallel_reduce instead of parallel_for, because the logic for reusing information from the previous iteration is built into parallel_reduce. The abstract reduction operator is "return the right argument", which is trivially associative. Here is a sketch:

struct Body {
    bool start_from_scratch;   // True if "state" must be recomputed.
    State state;               // State from the previous iteration.

    void operator()( const tbb::blocked_range<int>& r ) {
        if( start_from_scratch ) {
            ...initialize state for case where we do not know previous iteration...
            start_from_scratch = false;
        }
        for( int i=r.begin(); i!=r.end(); ++i )
            ...process iteration i...
        ...update state to reflect knowledge about last iteration...
    }
    void join( Body& rhs ) {
        // Last iteration of joined subranges is last iteration of right subrange.
        state = rhs.state;
        start_from_scratch = false;  // Not really necessary, because start_from_scratch is always false here.
    }
    Body() : start_from_scratch(true) {}
    Body( Body&, tbb::split ) : start_from_scratch(true) {}
};

const int N = 1000;

int main() {
    Body b;
    tbb::parallel_reduce( tbb::blocked_range<int>(0,N), b );
}

tbb::parallel_reduce has an internal optimization where it splits only when stealing occurs, but the strategy above does not depend on that behavior. Alexey's suggestion would apply to it just the same.


Thanks Arch,

I can't see the advantage of parallel_reduce over parallel_for in my case.
As I understand it, join happens when both the left and right chunks are already done, so there is no work left for it to help with.
The algorithm is such that the right chunk may benefit from knowing where the left one ended,
but it needs to know this before starting its calculation.
Or am I missing anything?

What you may have missed is that parallel_reduce guarantees to reuse a Body across successive subranges, except at detectable events like splits and joins (which it tries to minimise). So the sequence is: event, invoke, invoke, ..., invoke, event. With Arch's suggestion, you don't need to devise and debug your own communication scheme.

Indeed, this makes sense now. Thanks for explaining.